<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: priteshsurana</title>
    <description>The latest articles on Forem by priteshsurana (@priteshsurana).</description>
    <link>https://forem.com/priteshsurana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3131472%2Fe8fe27c3-b58c-452e-8d0d-4d71bf353bc7.png</url>
      <title>Forem: priteshsurana</title>
      <link>https://forem.com/priteshsurana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/priteshsurana"/>
    <language>en</language>
    <item>
      <title>Cassandra Internals: LSM Tree, SSTables, and Compaction</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Wed, 15 Apr 2026 22:16:07 +0000</pubDate>
      <link>https://forem.com/priteshsurana/cassandra-internals-lsm-tree-sstables-and-compaction-2ai8</link>
      <guid>https://forem.com/priteshsurana/cassandra-internals-lsm-tree-sstables-and-compaction-2ai8</guid>
      <description>&lt;p&gt;Post 3 and 4 traced writes and reads through PostgreSQL and MongoDB. Both engines use B-Tree variants. Both optimize for reads - maintaining sorted indexes, linking leaf nodes, storing heap pointers or link to primary index and pay for that optimization with write complexity: page splits, locking, dead tuples, in-place update overhead.&lt;/p&gt;

&lt;p&gt;Cassandra makes the opposite bet. It never modifies anything on disk. Every write is an append. Every file, once written, is immutable until compaction removes it. The read path pays for this. It has to reconcile data across potentially many files to find the latest version of a row. Understanding Cassandra means understanding why that tradeoff is worth making for certain workloads, and how the engine manages the read cost through Bloom filters and compaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Two Tables instead of One
&lt;/h2&gt;

&lt;p&gt;Before any internals, the schema needs explaining. This post continues the series, so see the earlier parts for the original Orders table schema and why it has to change for Cassandra. Here it is again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orders_by_id&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orders_by_user&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PostgreSQL, you'd have one &lt;code&gt;orders&lt;/code&gt; table with a secondary index on &lt;code&gt;user_id&lt;/code&gt;. The engine maintains that index; you write a row once, and PostgreSQL handles the index update. In Cassandra, you write the data &lt;strong&gt;twice&lt;/strong&gt;, in two different shapes, into two different tables.&lt;/p&gt;

&lt;p&gt;This isn't a design quirk. It follows directly from how &lt;a href="https://dev.to/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c"&gt;LSM storage works&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;LSM's write strength is sequential appends to known partition keys. Given a partition key, Cassandra can write to the right Memtable instantly - no page to find, no B-Tree to traverse, no lock to acquire. But if you want to query by a different field than the partition key, you need a different table with that field as the partition key. Cassandra does have secondary indexes, but they're implemented as hidden tables under the hood and carry significant read cost - scanning across SSTables that weren't organized for that access pattern. For production workloads, the idiomatic solution is: &lt;strong&gt;write the data multiple times, once per access pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The consequence is that every INSERT into &lt;code&gt;orders&lt;/code&gt; triggers two writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT event:
  → write to orders_by_id   (keyed by order_id)
  → write to orders_by_user (keyed by user_id + created_at)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is write amplification - paying more on the write side to make reads efficient. The tradeoff is explicit and application-owned. PostgreSQL maintains its secondary indexes for you, transparently. Cassandra requires you to maintain your denormalized tables explicitly. If &lt;code&gt;orders_by_user&lt;/code&gt; gets out of sync with &lt;code&gt;orders_by_id&lt;/code&gt;, that's your problem, not the engine's.&lt;/p&gt;
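&lt;p&gt;A minimal sketch of what that application-owned dual write looks like. The helper name &lt;code&gt;record_order&lt;/code&gt; and the &lt;code&gt;FakeSession&lt;/code&gt; are illustrative, not a real driver API - only the common &lt;code&gt;execute(query, params)&lt;/code&gt; shape is assumed:&lt;/p&gt;

```python
# Hypothetical helper: the application, not the engine, keeps
# orders_by_id and orders_by_user in sync by writing both tables.
# "session" stands in for any CQL client exposing execute(query, params).

def record_order(session, order):
    """Write one logical order event into both query tables."""
    session.execute(
        "INSERT INTO orders_by_id (order_id, user_id, status, amount, created_at) "
        "VALUES (%s, %s, %s, %s, %s)",
        (order["order_id"], order["user_id"], order["status"],
         order["amount"], order["created_at"]))
    session.execute(
        "INSERT INTO orders_by_user (user_id, created_at, order_id, status, amount) "
        "VALUES (%s, %s, %s, %s, %s)",
        (order["user_id"], order["created_at"], order["order_id"],
         order["status"], order["amount"]))

class FakeSession:
    """Records statements instead of talking to a cluster."""
    def __init__(self):
        self.calls = []
    def execute(self, query, params):
        self.calls.append((query, params))
```

&lt;p&gt;If the second &lt;code&gt;execute&lt;/code&gt; fails, the two tables diverge - which is exactly the sync burden described above.&lt;/p&gt;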

&lt;p&gt;Why accept this burden? Because at high write throughput - millions of inserts per minute - Cassandra's append-only writes stay fast under load in a way that B-Tree engines struggle to match. The write amplification is a known, bounded cost. The alternative - secondary indexes on an LSM engine under heavy write load - is an unbounded performance hazard.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Five components do most of the work in Cassandra's LSM engine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o1xo520fxsl8x8dzpz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o1xo520fxsl8x8dzpz8.png" alt="Process" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;CommitLog&lt;/strong&gt; is the crash-safety log - write here first, before touching any data structure. The &lt;strong&gt;Memtable&lt;/strong&gt; is the in-memory sorted buffer that accumulates writes. When the Memtable fills, it flushes to disk as an &lt;strong&gt;SSTable&lt;/strong&gt; - an immutable, sorted, self-contained file. Each SSTable has a &lt;strong&gt;Bloom filter&lt;/strong&gt; (to quickly rule out keys that aren't in that file) and a &lt;strong&gt;partition index&lt;/strong&gt; (to locate specific keys within the file). &lt;strong&gt;Compaction&lt;/strong&gt; periodically merges SSTables, consolidating versions and reclaiming space from tombstones.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Write Path
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders_by_id&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Order for...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. Partition key hashing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra hashes the &lt;code&gt;order_id&lt;/code&gt; value using Murmur3 to produce a &lt;strong&gt;token&lt;/strong&gt; - a number in a fixed 64-bit range that determines where this row lives in the cluster's token space. On a single node, every token maps to the same node, so routing is trivial. In a multi-node cluster (Post 7), this hash determines which node receives the write. No coordinator needs a lookup table - any node can compute the owner from the hash.&lt;/p&gt;
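&lt;p&gt;A toy illustration of the routing idea - hash the partition key to a token, then walk the ring to the owning node. It uses MD5 as a stand-in because Python's standard library has no Murmur3; the mechanics, not the hash function, are the point:&lt;/p&gt;

```python
import hashlib

TOKEN_SPACE = 2**64  # Murmur3Partitioner's token range is 64-bit

def token_for(partition_key):
    # Stand-in hash: real Cassandra uses Murmur3, not MD5.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % TOKEN_SPACE

def owner_node(partition_key, ring):
    """ring: sorted list of (token, node_name) positions.
    The owner is the first node whose ring token is at or above the
    key's token, wrapping around to the first node otherwise."""
    t = token_for(partition_key)
    for ring_token, node in ring:
        if ring_token >= t:
            return node
    return ring[0][1]  # wrap around the ring
```

&lt;p&gt;Any node holding the same ring can run this computation independently - that is why no coordinator lookup table is needed.&lt;/p&gt;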

&lt;p&gt;&lt;strong&gt;2. CommitLog append&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the row touches the Memtable, Cassandra appends a record to the &lt;strong&gt;CommitLog&lt;/strong&gt; - a sequential, append-only file on disk. This is Cassandra's equivalent of PostgreSQL's WAL, but structurally simpler. There are no page boundaries, no page headers, no B-Tree node structures. It's a flat sequence of mutation records, written front to back. Cassandra can also compress CommitLog segments, reducing the I/O cost relative to PostgreSQL's uncompressed WAL writes.&lt;/p&gt;

&lt;p&gt;The CommitLog exists purely for crash recovery. If Cassandra crashes before the Memtable is flushed to an SSTable, the CommitLog lets it reconstruct the lost Memtable on restart.&lt;/p&gt;

&lt;p&gt;Cassandra offers two CommitLog sync modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Periodic&lt;/strong&gt; (default): the CommitLog is synced to disk every ~10 seconds. Writes are acknowledged before the sync, so up to 10 seconds of data can be lost in a hard crash - but it is fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt;: the CommitLog is synced before every acknowledgment. No data loss window. Slower. Every write pays a disk flush, similar to PostgreSQL's default &lt;code&gt;synchronous_commit = on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the benchmark in Post 6, the sync mode is called out explicitly because it significantly affects write throughput numbers.&lt;/p&gt;
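&lt;p&gt;The CommitLog's job is easy to sketch: append mutations to a flat file, optionally fsync on every write (the batch-mode behavior), and replay the file on restart. A toy version with hypothetical names, not Cassandra's actual on-disk format:&lt;/p&gt;

```python
import json
import os

class CommitLog:
    """Toy append-only mutation log: one JSON record per line."""
    def __init__(self, path, sync_each_write=False):
        self.f = open(path, "a", encoding="utf-8")
        self.sync_each_write = sync_each_write  # batch-mode analogue

    def append(self, mutation):
        self.f.write(json.dumps(mutation) + "\n")
        if self.sync_each_write:
            self.f.flush()
            os.fsync(self.f.fileno())  # pay a disk flush per write

    @staticmethod
    def replay(path):
        """On restart, rebuild lost Memtables by re-reading the log."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

&lt;p&gt;With &lt;code&gt;sync_each_write=False&lt;/code&gt; the OS decides when bytes hit disk - fast, but a hard crash can lose the unsynced tail, which is exactly the periodic-mode tradeoff.&lt;/p&gt;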

&lt;p&gt;&lt;strong&gt;3. Memtable write - the client is acknowledged here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the CommitLog append, the row is written to the &lt;strong&gt;Memtable&lt;/strong&gt; for &lt;code&gt;orders_by_id&lt;/code&gt;. The Memtable is a sorted in-memory data structure sorted by partition key, then by clustering key within each partition. For &lt;code&gt;orders_by_id&lt;/code&gt; with only a partition key, the sort is by &lt;code&gt;order_id&lt;/code&gt;. For &lt;code&gt;orders_by_user&lt;/code&gt; with &lt;code&gt;(user_id, created_at)&lt;/code&gt;, rows are sorted first by &lt;code&gt;user_id&lt;/code&gt;, then by &lt;code&gt;created_at&lt;/code&gt; within each user.&lt;/p&gt;

&lt;p&gt;Once the Memtable write is complete, &lt;strong&gt;Cassandra acknowledges the write to the client&lt;/strong&gt;. No disk access to a data file. No B-Tree traversal. No page split. No locking against other writers. The write touched a sequential log file and an in-memory structure. That's it. This is why Cassandra's single-threaded insert throughput in the Post 1 benchmark was ~18,000 rows/sec compared to PostgreSQL's ~8,500.&lt;/p&gt;
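&lt;p&gt;The Memtable's behavior - keep rows sorted on insert, resolve same-key writes by timestamp, hand back a sorted stream at flush time - can be sketched in a few lines (a toy model, not Cassandra's actual structure):&lt;/p&gt;

```python
import bisect

class Memtable:
    """Toy sorted write buffer: keys kept in sorted order on insert."""
    def __init__(self):
        self.keys = []   # partition keys, always sorted
        self.rows = {}   # key mapped to (timestamp, value)

    def put(self, key, value, timestamp):
        if key not in self.rows:
            bisect.insort(self.keys, key)        # insert at sorted position
        current = self.rows.get(key)
        if current is None or timestamp > current[0]:
            self.rows[key] = (timestamp, value)  # newest timestamp wins

    def items_sorted(self):
        """Flush order: already sorted, so no extra sort pass is needed."""
        return [(k,) + self.rows[k] for k in self.keys]
```

&lt;p&gt;Note there is no disk access, no tree traversal, and no lock in &lt;code&gt;put&lt;/code&gt; - the in-memory simplicity is the whole point of the write path.&lt;/p&gt;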

&lt;p&gt;&lt;strong&gt;4. The second Memtable write — write amplification in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The application now sends the second INSERT for the same order event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders_by_user&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;toTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This goes through the same CommitLog + Memtable path, but to the &lt;code&gt;orders_by_user&lt;/code&gt; Memtable. Two CommitLog appends, two Memtable writes, two eventual SSTable entries for one logical business event. The write amplification is real. It's the cost of Cassandra's access pattern design, and it's visible in disk space usage and write throughput measurements on multi-table schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Memtable flush → the SSTable is born&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the Memtable reaches its size threshold (configurable, typically 256MB–1GB), Cassandra flushes it to disk as a new &lt;strong&gt;SSTable&lt;/strong&gt;. The flush is a single sequential write pass from beginning to end. The Memtable is already sorted, so no sort step is needed: Cassandra writes the entire sorted buffer to a new file in one pass. Sequential. Fast. The best possible disk write pattern.&lt;/p&gt;

&lt;p&gt;The resulting SSTable is &lt;strong&gt;immutable&lt;/strong&gt;. It will never be modified after being written. If a row is later updated, the update goes to a new Memtable and eventually a new SSTable. The old SSTable keeps its old version.&lt;/p&gt;

&lt;p&gt;Three files are written alongside the SSTable data file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bloom filter&lt;/strong&gt;: a compact probabilistic structure that, given a partition key, can definitively answer "this key is NOT in this SSTable" (no false negatives). It answers "maybe yes" for keys that are there, and occasionally for keys that aren't (false positives). Kept in memory. Eliminates most unnecessary SSTable reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition index&lt;/strong&gt;: maps each partition key in this SSTable to its byte offset in the data file. Used to seek directly to a partition without reading the whole file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition summary&lt;/strong&gt;: a sparse sample of the partition index, kept in memory. Used to narrow down the range to read from the partition index itself, avoiding a full index scan.&lt;/li&gt;
&lt;/ul&gt;
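&lt;p&gt;The Bloom filter's guarantee - "definitely not here" is always right, "maybe here" is occasionally wrong - falls out of its construction. A minimal sketch (hypothetical class, using SHA-256 to derive the k hash positions):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array.
    Answers "definitely absent" or "maybe present" - never a false negative."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # All k bits set: maybe present. Any bit clear: definitely absent.
        return all(self.bits[p] for p in self._positions(key))
```

&lt;p&gt;A key that was added always sets all of its bits, so it can never be missed; an absent key only collides with already-set bits occasionally, which is the false-positive rate the read path tolerates.&lt;/p&gt;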

&lt;p&gt;Once the SSTable is written and fsynced, the CommitLog segments that covered those writes are eligible for deletion as the data is now safe in the SSTable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's durable at each step:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Moment&lt;/th&gt;
&lt;th&gt;What's safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;After CommitLog append&lt;/td&gt;
&lt;td&gt;The write survives a crash, replay reconstructs the Memtable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After Memtable write&lt;/td&gt;
&lt;td&gt;Same guarantee, Memtable is in RAM, CommitLog is the safety net&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After SSTable flush&lt;/td&gt;
&lt;td&gt;Doubly safe, SSTable on disk, CommitLog segment now disposable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After compaction&lt;/td&gt;
&lt;td&gt;Cleaned up, old versions and tombstones removed, read path faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Updates and Deletes: The Immutability Consequence
&lt;/h2&gt;

&lt;p&gt;Since SSTables are never modified, Cassandra cannot update or delete a row the way PostgreSQL does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt; write a new version. If you update &lt;code&gt;status&lt;/code&gt; from &lt;code&gt;'shipped'&lt;/code&gt; to &lt;code&gt;'delivered'&lt;/code&gt; on &lt;code&gt;order_id = X&lt;/code&gt;, Cassandra writes a new row to the current Memtable with &lt;code&gt;status = 'delivered'&lt;/code&gt; and a newer timestamp. The old row with &lt;code&gt;status = 'shipped'&lt;/code&gt; still exists in an older SSTable. Before compaction runs, &lt;strong&gt;both versions are on disk&lt;/strong&gt;. Reads resolve this by comparing timestamps and newest wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deletes&lt;/strong&gt; write a &lt;strong&gt;tombstone&lt;/strong&gt;, a special record that marks a partition key (or specific row or column) as deleted at a particular timestamp. The tombstone goes to the Memtable and eventually an SSTable. The original data still sits in its original SSTable. Before compaction, the data is still there on disk; reads see the tombstone, find that it's newer than the data, and return nothing.&lt;/p&gt;

&lt;p&gt;This means a table that has seen many updates looks like this on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSTable-1 (oldest):
  order X: status=shipped,   ts=1000
  order Y: status=pending,   ts=1001

SSTable-3 (newer):
  order X: status=delivered, ts=1500  ← newer version
  order Z: [tombstone]       ts=1600  ← delete marker

SSTable-5 (newest):
  order Y: status=shipped,   ts=2000  ← another update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The disk footprint of an &lt;code&gt;orders&lt;/code&gt; table with a busy update pattern is larger than the logical size of the data. Every historical version of every row exists in some SSTable until compaction removes it. Monitoring SSTable count and disk amplification is part of running Cassandra in production.&lt;/p&gt;

&lt;p&gt;Compare this to PostgreSQL's dead tuples: PostgreSQL also keeps old row versions around (in heap pages) and cleans them with VACUUM. Different mechanism, same root cause - both engines must keep old versions available for concurrent readers or crash recovery, and both accumulate waste that a background process cleans up. The specifics differ, but neither is free.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Read Path
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario A: &lt;code&gt;SELECT * FROM orders_by_id WHERE order_id = ?&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Check the Memtable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The read path starts in memory. Is the target &lt;code&gt;order_id&lt;/code&gt; in the current Memtable? If yes, return that version. It's the newest possible. If no, continue to SSTables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Bloom filter check for each SSTable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra checks the Bloom filter for every SSTable on disk. A Bloom filter check is a memory operation - the filters are loaded into RAM. For each SSTable, the result is one of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Definitely not here&lt;/strong&gt;: skip this SSTable entirely. No disk read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maybe here&lt;/strong&gt;: proceed to check the partition index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With 20 SSTables and a Bloom filter false positive rate of ~1%, most SSTables are eliminated with zero disk I/O. The few that pass the filter get a partition index lookup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Partition index lookup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each SSTable that passes its Bloom filter, Cassandra checks the &lt;strong&gt;partition summary&lt;/strong&gt; (in memory) to narrow the range, then reads the relevant portion of the &lt;strong&gt;partition index&lt;/strong&gt; from disk to find the exact byte offset of this &lt;code&gt;order_id&lt;/code&gt; in the SSTable data file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Read the partition from the SSTable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra reads the partition data from the byte offset identified in step 3. This is the actual disk read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Merge versions across SSTables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the key appeared in multiple SSTables, Cassandra now has multiple versions of rows or cells with different timestamps. It merges them: for each column, the version with the highest timestamp wins. Tombstones suppress any data with an older timestamp. The result is the most recent consistent version of the row.&lt;/p&gt;
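&lt;p&gt;The merge rule itself is small enough to sketch directly - newest timestamp wins per cell, and a winning tombstone suppresses the value (a toy model of the reconciliation, not Cassandra's code):&lt;/p&gt;

```python
TOMBSTONE = object()  # sentinel delete marker

def reconcile(versions):
    """versions: list of (timestamp, value) pairs for one cell, gathered
    from the Memtable and every SSTable that held the key.
    The version with the highest timestamp wins; if that winner is a
    tombstone, the cell reads as deleted."""
    if not versions:
        return None
    ts, value = max(versions, key=lambda v: v[0])
    return None if value is TOMBSTONE else value
```

&lt;p&gt;Run against the SSTable-1/3/5 example above: order X resolves to &lt;code&gt;delivered&lt;/code&gt; (ts=1500 beats ts=1000), and order Z resolves to nothing because the tombstone is newest.&lt;/p&gt;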

&lt;p&gt;&lt;strong&gt;What read amplification looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After 1 flush (1 SSTable):
  → 1 Bloom filter check
  → 1 partition index lookup
  → 1 data read
  → No merge needed
  ≈ fast

After 20 flushes (20 SSTables), before compaction:
  → 20 Bloom filter checks (memory, fast)
  → ~1-3 partition index lookups (most filtered by Bloom)
  → 1-3 data reads (disk)
  → Merge step across matched versions
  ≈ slower, and the p99 tail grows with SSTable count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the compaction effect Post 6's benchmark will put exact numbers on: the same cold read query, the same hardware, the same data - and a p99 latency that drops by more than 7× once the SSTable count falls from 8 to 1.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario B: &lt;code&gt;SELECT * FROM orders_by_user WHERE user_id = ?&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This query goes to the &lt;code&gt;orders_by_user&lt;/code&gt; table - a completely separate set of SSTables with &lt;code&gt;user_id&lt;/code&gt; as the partition key. The read path is identical to Scenario A: Memtable check, Bloom filters, partition index, data read, merge.&lt;/p&gt;

&lt;p&gt;Here's the thing to notice: &lt;strong&gt;this is a primary key lookup on &lt;code&gt;orders_by_user&lt;/code&gt;, not a secondary index lookup&lt;/strong&gt;. The cost was paid at write time, when the application wrote to both tables. The read is as efficient as any partition key read on any table. There's no equivalent of PostgreSQL's heap fetch, no secondary B-Tree traversal, no ctid resolution step.&lt;/p&gt;

&lt;p&gt;This is the core architectural contrast with PostgreSQL:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Cassandra&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write one order&lt;/td&gt;
&lt;td&gt;1 heap write + N index updates&lt;/td&gt;
&lt;td&gt;2 table writes (explicit, application-owned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read by &lt;code&gt;order_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Primary index → heap fetch&lt;/td&gt;
&lt;td&gt;Partition lookup on &lt;code&gt;orders_by_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read by &lt;code&gt;user_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Secondary index → heap fetch&lt;/td&gt;
&lt;td&gt;Partition lookup on &lt;code&gt;orders_by_user&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who pays for the secondary access&lt;/td&gt;
&lt;td&gt;Engine, at read time&lt;/td&gt;
&lt;td&gt;Application, at write time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PostgreSQL does the secondary access work at read time. The heap fetch and index maintenance are handled transparently. Cassandra moves that cost to write time - you write twice, but both reads are primary lookups. Same total work, different distribution across the write/read boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compaction: Where the Magic Happens
&lt;/h2&gt;

&lt;p&gt;Every post about Cassandra mentions compaction. Most treat it as an operational detail. It's not. Compaction is the corrective force that makes the LSM design sustainable. Without it, SSTables would accumulate indefinitely, reads would get progressively slower, and tombstones would never be reclaimed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why compaction exists
&lt;/h3&gt;

&lt;p&gt;Every Memtable flush produces a new SSTable. Updates and deletes produce additional versions and tombstones in newer SSTables. Over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSTable count grows → read amplification grows&lt;/li&gt;
&lt;li&gt;Disk space grows → old versions and tombstones consume space that no longer represents live data&lt;/li&gt;
&lt;li&gt;Read latency grows → more files to check, more merging at read time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compaction is the mechanism that reverses all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens during compaction
&lt;/h3&gt;

&lt;p&gt;Cassandra selects a set of SSTables to compact (which ones depends on the strategy). Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open all selected SSTables simultaneously and read them in sorted partition key order. Because each SSTable is individually sorted, merging them is an efficient multi-way merge.&lt;/li&gt;
&lt;li&gt;For each partition key, collect all versions and cells from all selected SSTables.&lt;/li&gt;
&lt;li&gt;For each cell, keep only the version with the highest timestamp.&lt;/li&gt;
&lt;li&gt;For tombstones: if the tombstone is older than &lt;code&gt;gc_grace_seconds&lt;/code&gt; (default: 10 days), drop both the tombstone and the data it deletes. If it's newer, keep the tombstone in the output, because it may still be needed.&lt;/li&gt;
&lt;li&gt;Write the merged, deduplicated, tombstone-cleaned result as a new SSTable.&lt;/li&gt;
&lt;li&gt;Delete the input SSTables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output is a single, clean SSTable with no duplicate versions, no stale tombstones, and a fresh Bloom filter and partition index reflecting only live data.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gc_grace_seconds&lt;/code&gt; window exists for multi-node safety: in a cluster, a tombstone needs time to propagate to all replicas. If compaction removed a tombstone before all replicas saw it, a replica that missed the deletion could serve the deleted data as if it were live. Ten days is the conservative window to ensure propagation completes.&lt;/p&gt;
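&lt;p&gt;The six steps above can be condensed into a sketch: merge the inputs by key, keep the newest version per key, and drop tombstones (plus the data they shadow) once they are past the grace window. A toy model with flattened inputs, not Cassandra's implementation:&lt;/p&gt;

```python
TOMBSTONE = "TOMBSTONE"  # toy delete marker

def compact(sstables, now, gc_grace_seconds):
    """sstables: list of dicts, each mapping key to (timestamp, value).
    Returns the single merged output SSTable as one dict."""
    out = {}
    for key in sorted(set().union(*sstables)):       # merge pass, key order
        versions = [t[key] for t in sstables if key in t]
        ts, value = max(versions, key=lambda v: v[0])  # newest wins
        if value == TOMBSTONE and now - ts > gc_grace_seconds:
            continue  # tombstone expired: drop it and the data it shadowed
        out[key] = (ts, value)
    return out
```

&lt;p&gt;The output has one version per live key and no stale tombstones - which is exactly why the post-compaction read path is shorter.&lt;/p&gt;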

&lt;h3&gt;
  
  
  The direct effect on reads
&lt;/h3&gt;

&lt;p&gt;After compaction, the SSTable count drops. Fewer SSTables means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer Bloom filter checks per read&lt;/li&gt;
&lt;li&gt;Fewer partition index lookups&lt;/li&gt;
&lt;li&gt;Less data to merge&lt;/li&gt;
&lt;li&gt;Shorter, faster read path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the benchmark result from Post 1 made concrete. Cassandra read p99 went from ~4.1ms to ~1.4ms after compaction. The engine cleaned up after itself, and reads got proportionally faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three compaction strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;STCS - Size-Tiered Compaction Strategy&lt;/strong&gt; (the default)&lt;/p&gt;

&lt;p&gt;Groups SSTables by size and merges groups of similarly-sized ones together. Think of it as bins: small SSTables merge into medium ones, medium into large, large into very large. Write-optimized: each byte of data participates in few compaction passes. The downside: at any moment you can have many SSTables at the small tier, which means read amplification spikes under heavy write load before a tier-level merge runs.&lt;/p&gt;

&lt;p&gt;Use STCS for write-heavy workloads where compaction I/O budget is limited.&lt;/p&gt;
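&lt;p&gt;The bin idea can be sketched as grouping sizes into tiers of similar magnitude, with a tier becoming a compaction candidate once it holds enough SSTables. This is a toy version of the bucketing, not Cassandra's exact STCS algorithm or its default parameters:&lt;/p&gt;

```python
def size_tiered_buckets(sstable_sizes, ratio=2.0, min_threshold=4):
    """Group SSTable sizes into tiers where each member is within
    `ratio` of the bucket's running average. Tiers holding at least
    min_threshold SSTables are candidates for one compaction pass."""
    buckets = []
    for size in sorted(sstable_sizes):
        placed = False
        for b in buckets:
            avg = sum(b) / len(b)
            if size >= avg / ratio and avg * ratio >= size:
                b.append(size)      # similar size: same tier
                placed = True
                break
        if not placed:
            buckets.append([size])  # start a new tier
    return [b for b in buckets if len(b) >= min_threshold]
```

&lt;p&gt;Under heavy write load the smallest tier fills fastest, which is where the read amplification spike described above comes from.&lt;/p&gt;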

&lt;p&gt;&lt;strong&gt;LCS - Leveled Compaction Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizes SSTables into levels (L0, L1, L2, ...), where each level is 10× larger than the previous. From L1 onward &lt;strong&gt;no two SSTables at the same level overlap in key range&lt;/strong&gt;. A read needs to check at most one SSTable per level from L1 up. With 5 levels, that's at most 5 SSTables for any read, regardless of how many total SSTables exist.&lt;/p&gt;

&lt;p&gt;The exception is L0. L0 receives Memtable flushes directly, and SSTables here &lt;em&gt;can&lt;/em&gt; have overlapping key ranges - they arrive as flushed, sorted internally but unsorted relative to each other. A read must check every L0 SSTable. This is why L0 SSTable count is the critical operational metric for LCS: under heavy write load, L0 accumulates faster than compaction can promote files to L1, and read amplification rises until the backlog clears. A healthy LCS table keeps L0 small - typically under 4 files.&lt;/p&gt;

&lt;p&gt;Read-optimized above L0. Bounded read amplification once L0 is under control. The cost: compaction is more frequent and more I/O-intensive because every write eventually needs to be organized into non-overlapping ranges at each level.&lt;/p&gt;

&lt;p&gt;Use LCS for read-heavy workloads where predictable read latency matters more than compaction overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TWCS - Time-Window Compaction Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Divides SSTables into time windows (e.g., one window per day). SSTables within a window are compacted together; windows don't compact across each other. When a window's TTL expires, the entire SSTable for that window is deleted as a unit - no need to read the data, just drop the file.&lt;/p&gt;

&lt;p&gt;Built for time-series data with TTL. Extremely efficient for the "write once, read briefly, expire in bulk" pattern. Breaks down for workloads that update historical data, because updates write new timestamps that cross window boundaries.&lt;/p&gt;

&lt;p&gt;Use TWCS for time-series tables, event logs, anything with uniform TTL.&lt;/p&gt;

&lt;h3&gt;
  
  
  The operational cost
&lt;/h3&gt;

&lt;p&gt;During compaction, both the input SSTables (being read) and the output SSTable (being written) exist on disk simultaneously. Peak disk usage during a compaction can be roughly 2× the size of the data being compacted. Cassandra nodes should never run above ~50% disk utilization, or compaction may fail for lack of space.&lt;/p&gt;

&lt;p&gt;Compaction also competes with live reads and writes for disk I/O. Cassandra has throughput throttling for compaction, but under heavy write load that produces SSTables faster than compaction can consume them, read amplification can climb even with throttling. Monitoring SSTable count per table is a core Cassandra operational metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cassandra Production Footgun: Tombstones
&lt;/h2&gt;

&lt;p&gt;Here's the scenario that causes real production incidents.&lt;/p&gt;

&lt;p&gt;Your application deletes all orders with &lt;code&gt;status = 'cancelled'&lt;/code&gt; from a partition - say, all orders for a specific user in &lt;code&gt;orders_by_user&lt;/code&gt;. Each delete writes a tombstone. The data still exists in the older SSTables. For the next &lt;code&gt;gc_grace_seconds&lt;/code&gt; (10 days by default), &lt;strong&gt;every read of that partition must process every tombstone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now imagine a partition with 100,000 cancelled orders, all tombstoned. A read for that user's current orders must scan through 100,000 tombstones to find the handful of live rows. Even if the application considers those orders "deleted," Cassandra is reading every tombstone at query time. Reads that should return 5 rows in milliseconds take seconds because of tombstone scanning.&lt;/p&gt;
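&lt;p&gt;The shape of the problem is simple to demonstrate: the read must walk every tombstoned cell on the way to the few live rows. A toy model (mirroring the scan cost, not Cassandra's actual read code):&lt;/p&gt;

```python
def read_partition(cells, warn_threshold=1000):
    """cells: (key, value) pairs for one partition, value None meaning
    a tombstone. Returns the live rows plus how many tombstones the
    scan had to touch, and whether the warn threshold was crossed."""
    live, tombstones_scanned = [], 0
    for key, value in cells:
        if value is None:
            tombstones_scanned += 1   # dead weight, still costs work
        else:
            live.append((key, value))
    warned = tombstones_scanned > warn_threshold
    return live, tombstones_scanned, warned
```

&lt;p&gt;Returning 2 live rows cost a 100,000-element scan in the test below - that gap between result size and work done is the production incident.&lt;/p&gt;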

&lt;p&gt;Cassandra guards against this with a &lt;code&gt;tombstone_warn_threshold&lt;/code&gt; (default: 1,000 tombstones per read) and a &lt;code&gt;tombstone_failure_threshold&lt;/code&gt; (default: 100,000). Hitting the failure threshold aborts the read entirely. Both thresholds show up in real production incidents at companies running Cassandra at scale.&lt;/p&gt;
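&lt;p&gt;A toy sketch of that threshold behavior. The constant names mirror the cassandra.yaml settings, but the scan logic and the exception name are simplified stand-ins:&lt;/p&gt;

```python
# Defaults from cassandra.yaml; the scan below is a simplification.
TOMBSTONE_WARN_THRESHOLD = 1_000
TOMBSTONE_FAILURE_THRESHOLD = 100_000

class TombstoneOverwhelmingError(Exception):
    """Stand-in for Cassandra's tombstone-overwhelming read failure."""

def scan_partition(cells):
    """Return live cells, counting every tombstone touched on the way."""
    live, tombstones = [], 0
    for value, is_tombstone in cells:
        if is_tombstone:
            tombstones += 1
            if tombstones > TOMBSTONE_FAILURE_THRESHOLD:
                raise TombstoneOverwhelmingError(tombstones)
        else:
            live.append(value)
    if tombstones > TOMBSTONE_WARN_THRESHOLD:
        print(f"WARN: read {tombstones} tombstones")
    return live

# 100,000 cancelled (tombstoned) orders hiding 5 live rows:
cells = [(None, True)] * 100_000 + [(f"order{i}", False) for i in range(5)]
print(len(scan_partition(cells)))  # 5 live rows, after touching 100,000 tombstones
```

Five rows come back, but only after the read touches every one of the 100,000 tombstones; one more tombstone and the read fails outright.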

&lt;p&gt;The mitigation: set TTLs on data instead of deleting it, use TWCS so expired data is dropped as whole SSTables rather than tombstoned row by row, and monitor tombstone metrics actively. The problem doesn't appear in development because development data volumes are too small to trigger it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Engines, One Comparison
&lt;/h2&gt;

&lt;p&gt;Now that you've seen all three storage paths, here's how they line up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the write lands first&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB both write to a sequential log first (WAL / WiredTiger journal), then modify in-memory page structures (shared_buffers / WiredTiger cache). Cassandra also writes to a sequential log first (CommitLog), then writes to an in-memory sorted buffer (Memtable). The durability pattern is the same - log first, then memory, then data files. The difference is what the in-memory structure is: a page cache holding B-Tree nodes (PG/Mongo) vs a sorted write buffer that will become an immutable file (Cassandra).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes writes fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB writes involve finding the right position in a B-Tree, acquiring page or document locks, potentially splitting pages, and writing WAL records for modified pages. Under sustained write load, page splits and locking create latency variance. Cassandra writes append to a log and insert into a sorted RAM buffer. No page to find, no lock to acquire, no split to handle. The write path is maximally simple. The price is paid elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes reads complex&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB reads follow a tree to a single authoritative location. The heap or the B-Tree leaf. The data is there, in one place. Cassandra reads must check multiple SSTables, each of which may contain a version of the requested row. Bloom filters eliminate most checks, but the merge step is always present when multiple versions exist. Read complexity grows with SSTable count and shrinks after compaction.&lt;/p&gt;
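&lt;p&gt;The merge step can be sketched as last-write-wins reconciliation over per-column timestamps. This is a simplification of Cassandra's actual cell-level resolution:&lt;/p&gt;

```python
# Minimal sketch of read-time reconciliation: each SSTable may hold an
# older version of the row; the newest write timestamp per column wins.
def reconcile(versions):
    """versions: list of (write_timestamp, {column: value}) pairs from
    the memtable and every SSTable the Bloom filters did not rule out."""
    merged = {}  # column -> (timestamp, value)
    for ts, columns in versions:
        for col, val in columns.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

row = reconcile([
    (100, {"status": "placed", "amount": 149.99}),  # oldest SSTable
    (200, {"status": "shipped"}),                   # newer SSTable
    (300, {"status": "delivered"}),                 # memtable
])
print(row)  # {'status': 'delivered', 'amount': 149.99}
```

Note that the result combines columns from different files: the row a client sees may never have existed whole in any single SSTable.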

&lt;p&gt;&lt;strong&gt;Who owns the secondary access pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL maintains secondary indexes automatically. MongoDB maintains them automatically. The application writes once and the engine handles multiple access paths. In Cassandra, the application owns the secondary access pattern by writing to multiple tables. This is more work for the application developer and more disk space consumed. But both reads end up as efficient primary key lookups on their respective tables, which is not true of B-Tree secondary indexes that require heap fetches.&lt;/p&gt;
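&lt;p&gt;The dual-write pattern, sketched with dicts standing in for tables. Table and column names are illustrative:&lt;/p&gt;

```python
# Sketch of Cassandra-style denormalization: the application writes the
# same order into two tables, each keyed for one access path.
orders_by_id = {}    # partition key: order_id
orders_by_user = {}  # partition key: user_id, one row per order

def insert_order(order):
    # Two writes, one per access pattern; in real CQL this would be a
    # logged batch or two independent INSERTs.
    orders_by_id[order["order_id"]] = order
    orders_by_user.setdefault(order["user_id"], []).append(order)

insert_order({"order_id": "a1b2", "user_id": "u9x8", "status": "shipped"})

# Both reads are now primary-key lookups on their own table:
print(orders_by_id["a1b2"]["status"])  # shipped
print(len(orders_by_user["u9x8"]))     # 1
```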




&lt;h2&gt;
  
  
  What's Next: The Numbers
&lt;/h2&gt;

&lt;p&gt;Posts 2, 3, 4 and 5 have been entirely about understanding &lt;em&gt;why&lt;/em&gt; each engine behaves the way it does. Post 6 is where that understanding meets measurement.&lt;/p&gt;

&lt;p&gt;Real C++ clients. 1 million rows. Identical hardware. Cold reads and warm reads. Write throughput under sustained load. Pre-compaction and post-compaction latency distributions. The full picture.&lt;/p&gt;

&lt;p&gt;The most surprising result in the benchmark is not the one you'd predict from the theory alone. You'll have to read the next post to find out what it is.&lt;/p&gt;

</description>
      <category>database</category>
      <category>cassandra</category>
      <category>backend</category>
      <category>performance</category>
    </item>
    <item>
      <title>MongoDB Internals: Inside the Storage Engine and How It Differs from PostgreSQL</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Thu, 09 Apr 2026 00:05:04 +0000</pubDate>
      <link>https://forem.com/priteshsurana/mongodb-internals-inside-the-storage-engine-2c9b</link>
      <guid>https://forem.com/priteshsurana/mongodb-internals-inside-the-storage-engine-2c9b</guid>
      <description>&lt;p&gt;Post 3 explained the flow of INSERT and SELECT from PostgreSQL lense. Now its time for &lt;code&gt;insertOne&lt;/code&gt;/&lt;code&gt;insertMany&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  MongoDB
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How MongoDB is different before you start
&lt;/h3&gt;

&lt;p&gt;There are three major differences from PostgreSQL, and this section visits each of them.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;WiredTiger is a separate, pluggable storage engine&lt;/strong&gt; underneath MongoDB. PostgreSQL's storage is tightly integrated with the query engine. WiredTiger is a standalone embeddable key-value store that MongoDB sits on top of. This matters because WiredTiger has its own caching, its own journal, its own compression, and its own concurrency model, somewhat independent of MongoDB's query layer.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;documents are stored as BSON&lt;/strong&gt;: a binary encoding in which field names are stored as strings inside every document on disk. PostgreSQL's heap rows contain only values; column names live once in the catalog. BSON's field-name overhead matters at scale.&lt;/p&gt;

&lt;p&gt;Third, MongoDB provides &lt;strong&gt;document-level concurrency&lt;/strong&gt;, implemented using WiredTiger’s &lt;strong&gt;optimistic concurrency control and fine-grained locking&lt;/strong&gt;, not page-level locking. Two concurrent writes to different documents in the &lt;code&gt;orders&lt;/code&gt; collection never block each other, even if they land on the same internal storage page. PostgreSQL's page-level LWLocks can cause contention between concurrent writers targeting the same page.&lt;/p&gt;




&lt;h3&gt;
  
  
  The architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1cuur7r2vxx7j1vcire.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1cuur7r2vxx7j1vcire.png" alt="Process" width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The write path: &lt;code&gt;insertOne&lt;/code&gt; into orders
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;a1b2...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;u9x8...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;149.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Order for...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. BSON serialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before anything reaches WiredTiger, the document is serialized to BSON. In BSON, each field is encoded as: a type byte, the field name as a null-terminated string, then the value. For our &lt;code&gt;orders&lt;/code&gt; document with six fields, the field names themselves (&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;) add roughly 50–70 bytes of overhead per document.&lt;/p&gt;

&lt;p&gt;That overhead exists for every document in the collection. For 1 million orders, that's 50–70MB of field name data that PostgreSQL simply doesn't have, because PostgreSQL stores column names once in &lt;code&gt;pg_attribute&lt;/code&gt;. For a collection with short values and many fields, BSON overhead is a meaningful fraction of total storage. For a collection dominated by large field values (like the 500-character &lt;code&gt;description&lt;/code&gt;), it's a smaller percentage but never zero.&lt;/p&gt;
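&lt;p&gt;The estimate above can be checked with back-of-the-envelope arithmetic: each BSON element carries one type byte, the field name bytes, and a trailing NUL. This sketch ignores the value encodings and the document length header:&lt;/p&gt;

```python
# Per-document field-name overhead for the six orders fields:
# one type byte + the name as a C string (name bytes + trailing NUL).
field_names = ["order_id", "user_id", "status",
               "amount", "description", "created_at"]

overhead = sum(1 + len(name) + 1 for name in field_names)  # type + name + NUL
print(overhead)                    # 60 bytes per document
print(overhead * 1_000_000 / 1e6)  # 60.0 MB across 1 million orders
```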

&lt;p&gt;This is a fundamental consequence of the schema-free document model: the schema travels with the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. WiredTiger cache - the document lands here first&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger maintains its own in-memory cache (configured via &lt;code&gt;wiredTigerCacheSizeGB&lt;/code&gt;). This is conceptually similar to PostgreSQL's shared_buffers - a pool of in-memory pages that buffer both reads and writes.&lt;/p&gt;

&lt;p&gt;The key difference: &lt;strong&gt;WiredTiger stores data compressed on disk but uncompressed in cache&lt;/strong&gt;. When a document is written to the WiredTiger cache, it lives there uncompressed. When it's evicted to disk (during a checkpoint), WiredTiger compresses it using Snappy by default. When it's read back from disk (a cold read), it's decompressed as it loads into cache.&lt;/p&gt;

&lt;p&gt;This means your configured cache size represents uncompressed data, while your disk usage reflects compressed data. A 4GB WiredTiger cache might correspond to 8–12GB of data on disk, depending on compression ratio. Cold reads pay a decompression cost that PostgreSQL doesn't have in its default configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Journal write - durability before acknowledgment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger has its own journal, conceptually equivalent to PostgreSQL's WAL. It's a sequential, append-only log that describes changes before they're applied to data files.&lt;/p&gt;

&lt;p&gt;The key behavioral difference from PostgreSQL is in the default durability setting. PostgreSQL's default &lt;code&gt;synchronous_commit = on&lt;/code&gt; fsyncs the WAL before every commit. WiredTiger's default journal sync interval is &lt;strong&gt;100 milliseconds&lt;/strong&gt;. Acknowledgment can happen before the journal is fsynced, accepting up to 100ms of potential data loss in a hard crash.&lt;/p&gt;

&lt;p&gt;MongoDB exposes this to the application as &lt;strong&gt;write concern&lt;/strong&gt;. With &lt;code&gt;j: false&lt;/code&gt;, MongoDB acknowledges the write as soon as WiredTiger's cache accepts it. With &lt;code&gt;j: true&lt;/code&gt;, MongoDB waits for the journal to be fsynced before acknowledging. The latency difference between these two settings is measurable; &lt;code&gt;j: true&lt;/code&gt; adds the cost of a synchronous disk flush to every write, similar to what PostgreSQL pays by default.&lt;/p&gt;

&lt;p&gt;For the benchmark in Post 6, write concern settings will be explicitly called out because they significantly affect the numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Primary B-Tree index update on &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger updates the &lt;code&gt;order_id&lt;/code&gt; B-Tree index. The traversal is the same pattern as PostgreSQL's B+Tree - root to internal nodes to leaf, finding the right position for the new UUID. The same random-UUID page-split problem applies: UUIDs land at random positions, any leaf can be full, splits are frequent.&lt;/p&gt;

&lt;p&gt;One architectural distinction worth noting: in MongoDB, the &lt;strong&gt;collection storage itself is a WiredTiger B-Tree keyed by &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;. There isn't a separate heap file and a separate primary index. The collection B-Tree serves as both. The document data lives in the B-Tree's leaf nodes. This is different from PostgreSQL, where the heap is unordered storage and the primary index is a separate B+Tree pointing into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Secondary index updates on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB secondary indexes differ from PostgreSQL's in one important design choice: &lt;strong&gt;MongoDB secondary index entries contain the document's &lt;code&gt;order_id&lt;/code&gt; value, not a physical storage location&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, a secondary index leaf entry holds a &lt;code&gt;ctid&lt;/code&gt;- a literal page number and slot number. This is a direct physical pointer into the heap. It's fast to follow at read time, but it becomes stale if the row moves (which can happen during certain heap operations).&lt;/p&gt;

&lt;p&gt;MongoDB chose to store &lt;code&gt;order_id&lt;/code&gt; in secondary index entries instead. The lookup then requires a second step: use the &lt;code&gt;order_id&lt;/code&gt; to look up the document in the collection's primary B-Tree. This is effectively two B-Tree traversals for a secondary index read.&lt;/p&gt;

&lt;p&gt;The reason for this choice: documents in WiredTiger can move within storage during compaction and internal page restructuring. If secondary indexes contained physical locations, every document move would require updating every secondary index entry pointing to it which is potentially expensive. Storing &lt;code&gt;order_id&lt;/code&gt; instead means document moves never invalidate secondary index entries. The tradeoff is paid at read time with the double traversal.&lt;/p&gt;
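&lt;p&gt;A sketch of the two-step lookup, with dicts standing in for the B-Trees. The point is what the secondary index entry stores: a logical key, not a location:&lt;/p&gt;

```python
# Dicts stand in for WiredTiger B-Trees; names are illustrative.
collection = {"a1b2": {"user_id": "u9x8", "status": "shipped"}}  # primary tree
secondary = {"u9x8": ["a1b2"]}  # MongoDB-style entry: order_id, not a location

def find_by_user(user_id):
    # Step 1: the secondary index yields order_ids.
    # Step 2: each order_id is resolved through the primary tree.
    return [collection[oid] for oid in secondary.get(user_id, [])]

docs = find_by_user("u9x8")
print(docs[0]["status"])  # shipped
```

If the document moves within storage, the secondary entry stays valid because it names the key, not the slot; the cost is the second traversal on every read.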

&lt;p&gt;&lt;strong&gt;6. Document-level locking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When two concurrent inserts arrive, WiredTiger acquires a document-level lock for each, not a page-level lock. If the two documents happen to land on the same internal B-Tree page, they still don't block each other. WiredTiger's concurrency model ensures that writes to independent documents proceed in parallel even when they share a storage page.&lt;/p&gt;

&lt;p&gt;PostgreSQL's LWLocks are page-level: if two concurrent inserts target the same B+Tree leaf page, one must wait for the other to release the lock before proceeding. Under high concurrent write load to the same key range (like sequential timestamps), this becomes measurable contention.&lt;/p&gt;




&lt;h3&gt;
  
  
  The read path: primary and secondary key
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Scenario A: &lt;code&gt;db.orders.findOne({ order_id: "a1b2..." })&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. B-Tree traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB traverses the &lt;code&gt;order_id&lt;/code&gt; B-Tree from root to the leaf containing the target UUID. Same depth as PostgreSQL for the same data volume - 3–4 levels for 1M documents. The query planner identifies this as a primary key lookup and routes it through the &lt;code&gt;order_id&lt;/code&gt; index without deliberation.&lt;/p&gt;

&lt;p&gt;MongoDB caches which index to use per &lt;strong&gt;query shape&lt;/strong&gt;: the structure of the filter, sort, and projection, without the specific values. If you've run &lt;code&gt;findOne({ order_id: ... })&lt;/code&gt; before, the planner uses the cached plan. PostgreSQL re-evaluates cost-based plans per query (with plan caching for prepared statements); MongoDB's trial-based plan caching behaves differently when the data distribution changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Document fetch and decompression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the collection data lives in the &lt;code&gt;order_id&lt;/code&gt; B-Tree itself (not a separate heap), the document is retrieved directly from the leaf node. There's no separate heap fetch step. The B-Tree traversal ends with the document.&lt;/p&gt;

&lt;p&gt;If the page is in the WiredTiger cache, the document is already uncompressed in memory. If not, a cold read - WiredTiger reads the compressed page from disk, decompresses it into cache, and returns the document. The decompression step adds CPU work to cold reads that PostgreSQL's default configuration doesn't have.&lt;/p&gt;

&lt;p&gt;For the benchmark's cold read numbers, this decompression cost is part of why MongoDB's p99 is slightly higher than PostgreSQL's - the disk I/O is similar, but MongoDB adds decompression.&lt;/p&gt;




&lt;h4&gt;
  
  
  Scenario B: &lt;code&gt;db.orders.find({ user_id: "u9x8..." })&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Secondary index traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB traverses the &lt;code&gt;user_id&lt;/code&gt; secondary index B-Tree to find all entries matching the target UUID. Each matching leaf entry contains &lt;code&gt;(user_id_value, order_id_value)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Document lookups by &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each matching &lt;code&gt;order_id&lt;/code&gt;, MongoDB performs a second B-Tree traversal, this time on the collection's &lt;code&gt;order_id&lt;/code&gt; B-Tree, to retrieve the full document. If a user has 20 orders, that's one secondary index scan yielding 20 entries plus 20 &lt;code&gt;order_id&lt;/code&gt; lookups, each a full tree traversal from root to leaf.&lt;/p&gt;

&lt;p&gt;This double-traversal is the same fundamental cost as PostgreSQL's secondary index + heap fetch, but the mechanism differs. PostgreSQL follows a ctid (direct physical pointer, one disk seek). MongoDB follows an &lt;code&gt;order_id&lt;/code&gt; (logical key, full tree traversal). The logical key approach is more robust to storage reorganization; the physical pointer approach is faster per lookup.&lt;/p&gt;

&lt;p&gt;Both databases pay the secondary-index penalty. The performance gap between primary key reads and secondary index reads is visible in both engines.&lt;/p&gt;




&lt;h3&gt;
  
  
  The MongoDB surprise: WiredTiger cache and compression
&lt;/h3&gt;

&lt;p&gt;Engineers who come from PostgreSQL often assume that "cache size" and "data size" are in the same units. In WiredTiger, they're not.&lt;/p&gt;

&lt;p&gt;WiredTiger compresses data on disk (Snappy by default achieves 2–4× compression on typical BSON documents). The WiredTiger cache holds data &lt;strong&gt;uncompressed&lt;/strong&gt;. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The WiredTiger cache holds data uncompressed, but disk stores it compressed.&lt;/strong&gt; This means the cache must be sized for the &lt;em&gt;uncompressed&lt;/em&gt; working set, which is 2–4× larger than the on-disk footprint. A working set that occupies 4GB compressed on disk expands to 8–16GB in the WiredTiger cache. Under-sizing the cache relative to this uncompressed working set is a common MongoDB performance issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every cold read involves a decompression step.&lt;/strong&gt; Reading from disk means reading compressed bytes, then spending CPU cycles to decompress them into cache. On modern hardware this is fast (Snappy decompresses at multiple GB/s), but it's not free, and it doesn't exist in PostgreSQL's default storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache pressure is measured in uncompressed bytes.&lt;/strong&gt; To hold 8GB of compressed on-disk data in the WiredTiger cache, you need roughly 8GB × compression_ratio of cache, somewhere between 16GB and 32GB. More cache is required than the raw disk size suggests, not less. This is the inverse of what most engineers assume when first sizing a MongoDB deployment.&lt;/li&gt;
&lt;/ul&gt;
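&lt;p&gt;The sizing arithmetic from the list above, as a sketch; the 2–4× ratio is the typical Snappy range quoted earlier, not a guarantee:&lt;/p&gt;

```python
# WiredTiger cache is sized in uncompressed bytes, so the cache needed
# to hold a compressed on-disk working set is disk size * compression ratio.
def cache_needed_gb(on_disk_gb, compression_ratio):
    return on_disk_gb * compression_ratio

print(cache_needed_gb(8, 2))  # 16 GB of cache at 2x compression
print(cache_needed_gb(8, 4))  # 32 GB of cache at 4x compression
```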

&lt;p&gt;The compression tradeoff is generally positive. You get more effective cache and less disk I/O. But it changes the cost model for cold reads in a way that's easy to overlook when sizing hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Direct comparison
&lt;/h2&gt;

&lt;p&gt;After tracing both engines, here's where they converge and where they diverge:&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the first write lands
&lt;/h3&gt;

&lt;p&gt;Both PostgreSQL (WAL) and MongoDB (journal) write to a sequential log before touching data structures. The principle is identical - write-ahead logging for crash safety. The difference is the default sync behavior: PostgreSQL fsyncs per commit by default; WiredTiger's journal syncs every 100ms by default. PostgreSQL's default is safer; MongoDB's default is faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Locking granularity
&lt;/h3&gt;

&lt;p&gt;PostgreSQL uses page-level LWLocks - concurrent writes to the same B+Tree leaf page serialize. WiredTiger uses document-level locks - concurrent writes to different documents never block each other. Under high concurrent write load to overlapping key ranges, WiredTiger's finer granularity shows up as better throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secondary index design
&lt;/h3&gt;

&lt;p&gt;PostgreSQL secondary indexes store ctids - physical heap pointers. Fast to follow, but tied to physical location. MongoDB secondary indexes store &lt;code&gt;order_id&lt;/code&gt; - logical keys. Requires a second B-Tree traversal to fetch the document, but immune to document moves. Both choices are deliberate; both have real costs at read time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage overhead
&lt;/h3&gt;

&lt;p&gt;PostgreSQL stores column names once in the system catalog; heap rows contain only values. MongoDB embeds field names in every BSON document on disk. For our &lt;code&gt;orders&lt;/code&gt; collection with six fields and 1 million documents, BSON overhead adds roughly 50–70MB that PostgreSQL doesn't have. For collections with larger values, this is a smaller percentage; for collections with many small fields, it's significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where each engine is faster
&lt;/h3&gt;

&lt;p&gt;PostgreSQL's linked B+Tree leaf nodes make range scans faster - follow the list, read sequentially. MongoDB's document-level locking makes high-concurrency writes more scalable. PostgreSQL's direct ctid heap fetch is faster per secondary index lookup than MongoDB's double B-Tree traversal. WiredTiger's compression means more data fits in cache per GB of RAM.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next: when B-Tree assumptions break down
&lt;/h2&gt;

&lt;p&gt;Both PostgreSQL and MongoDB are built on B-Tree variants. They organize data in sorted pages, they update those pages in place, they use WAL or journal for crash recovery, and they handle MVCC by keeping old versions around for concurrent readers. The details differ, but the fundamental bias is the same: &lt;strong&gt;optimize for reads at the cost of write complexity and in-place update overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next introduces Cassandra, which starts from the opposite premise. Its storage is append-only and immutable. There are no pages to split, no tuples to vacuum, no in-place updates. Writes are always sequential appends to a log and an in-memory buffer. Reads are more complex because data may be spread across multiple immutable files.&lt;/p&gt;

&lt;p&gt;Every performance characteristic that differs between Cassandra and the two engines you just traced flows from that single architectural inversion.&lt;/p&gt;




</description>
      <category>database</category>
      <category>postgres</category>
      <category>mongodb</category>
      <category>performance</category>
    </item>
    <item>
      <title>PostgreSQL Internals: Inside the Storage Engine</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Sun, 05 Apr 2026 04:29:42 +0000</pubDate>
      <link>https://forem.com/priteshsurana/postgresql-internals-inside-the-storage-engine-1bhm</link>
      <guid>https://forem.com/priteshsurana/postgresql-internals-inside-the-storage-engine-1bhm</guid>
      <description>&lt;p&gt;Post 2 gave you the data structures: B+Tree, B-Tree, LSM Tree - how they're shaped and the tradeoffs they do.&lt;/p&gt;

&lt;p&gt;This post traces an &lt;code&gt;INSERT&lt;/code&gt; and a &lt;code&gt;SELECT&lt;/code&gt; through PostgreSQL, step by step, using the same &lt;code&gt;orders&lt;/code&gt; schema throughout.  &lt;/p&gt;




&lt;h2&gt;
  
  
  PostgreSQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The architecture
&lt;/h3&gt;

&lt;p&gt;Before tracing any queries, map the terrain. PostgreSQL has four major components you'll encounter on every operation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbti2igcu2sl020y1xdah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbti2igcu2sl020y1xdah.png" alt="PostgreSQL Process" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;shared_buffers&lt;/strong&gt; is PostgreSQL's in-memory page cache; every read and write touches it first. The &lt;strong&gt;WAL&lt;/strong&gt; is an append-only sequential log on disk; it's how PostgreSQL survives crashes. The &lt;strong&gt;heap file&lt;/strong&gt; is where rows actually live, in 8KB pages, in roughly insertion order. The &lt;strong&gt;index files&lt;/strong&gt; are separate B+Tree structures, also stored as 8KB pages, pointing into the heap. Writes go to the WAL first, then into shared_buffers, and eventually to the heap and index files on disk. Reads check shared_buffers first; if the page isn't there, it comes from disk.&lt;/p&gt;




&lt;h3&gt;
  
  
  The write path: &lt;code&gt;INSERT INTO orders&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a1b2...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'u9x8...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Order for...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. Parsing and planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL parses the SQL into an AST, resolves table and column names against the catalog, and produces a trivial plan: "insert one row into the orders heap, update the primary index on &lt;code&gt;order_id&lt;/code&gt;, update secondary indexes on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;." No interesting planning happens for a simple insert; the executor takes over immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The WAL write happens before anything else&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before PostgreSQL touches shared_buffers, before it modifies a single heap page, it writes a WAL record describing this insert. The WAL record contains enough information to reconstruct the change: which relation, which page, what was written.&lt;/p&gt;

&lt;p&gt;Why first? Because disk writes are not atomic. If PostgreSQL wrote to the heap file and then crashed before finishing, you'd have a partially written page with no way to know what it should contain. The WAL is written sequentially. On crash, PostgreSQL replays WAL to reconstruct any changes that didn't make it to the heap.&lt;/p&gt;

&lt;p&gt;The WAL write is fsynced to disk before the transaction is acknowledged to your application. That fsync is real I/O and it is one of the biggest contributors to PostgreSQL write latency.&lt;/p&gt;

&lt;p&gt;This behavior is controlled by &lt;code&gt;synchronous_commit&lt;/code&gt;. The default (&lt;code&gt;on&lt;/code&gt;) fsyncs the WAL before acknowledging. Setting it to &lt;code&gt;off&lt;/code&gt; lets PostgreSQL acknowledge before the fsync, reducing write latency significantly, but accepting up to &lt;code&gt;wal_writer_delay&lt;/code&gt; (default 200ms) of potential data loss on a hard crash. In the benchmark in Post 5, you'll see exactly how much latency this setting saves. The difference is substantial.&lt;/p&gt;
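&lt;p&gt;The write-ahead contract can be sketched in a few lines: log first, apply in memory, rebuild from the log after a crash. This is the idea only, not PostgreSQL's record format or recovery code:&lt;/p&gt;

```python
# Minimal write-ahead logging sketch.
wal = []   # append-only, fsynced before acknowledging the commit
heap = {}  # data pages, flushed lazily by the checkpointer

def insert(key, row):
    wal.append(("insert", key, row))  # 1. WAL record first (then fsync)
    heap[key] = row                   # 2. modify the in-memory page

def recover(log):
    """After a crash the heap may be stale or empty; replaying the log
    reconstructs every acknowledged change."""
    rebuilt = {}
    for op, key, row in log:
        if op == "insert":
            rebuilt[key] = row
    return rebuilt

insert("a1b2", {"status": "shipped"})
heap.clear()                 # simulate losing unflushed pages in a crash
print(recover(wal)["a1b2"])  # the insert survives via WAL replay
```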

&lt;p&gt;&lt;strong&gt;3. The row lands in shared_buffers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL now needs a heap page with enough free space for the new row. The &lt;code&gt;orders&lt;/code&gt; table with a 500-character &lt;code&gt;description&lt;/code&gt; field has rows of roughly 600–700 bytes. An 8KB page holds about 10–12 of these rows.&lt;/p&gt;

&lt;p&gt;PostgreSQL consults the &lt;strong&gt;Free Space Map (FSM)&lt;/strong&gt;, a structure that tracks how much free space exists in each heap page, to find a suitable page. It loads that page into shared_buffers if it isn't already there, and writes the new row into the page's free space. The page is now &lt;strong&gt;dirty&lt;/strong&gt;; its in-memory version differs from what's on disk. It will be flushed to the heap file eventually, by the checkpointer background process.&lt;/p&gt;

&lt;p&gt;The heap is unordered by design. PostgreSQL doesn't store rows sorted by &lt;code&gt;order_id&lt;/code&gt; or any other key. New rows go wherever there's space. This is what enables fast inserts; you never need to find a sorted position in the heap. The tradeoff is that reads by non-indexed fields require a full sequential scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what's durable right now:&lt;/strong&gt; The WAL record is on disk. If the server crashes at this exact moment, PostgreSQL will replay the WAL on restart and re-apply this insert. The heap page is only in shared_buffers. But that's fine, because the WAL has it covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The primary B+Tree index update on &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL now updates the &lt;code&gt;order_id&lt;/code&gt; index. It traverses the B+Tree from the root page to find the leaf page where the new UUID belongs.&lt;/p&gt;

&lt;p&gt;For a table with 1 million rows, the B+Tree is typically 3–4 levels deep. Each level is a page read. So if the page is in shared_buffers, it's a memory access. If not, it's a disk read. Root and upper internal pages stay hot in shared_buffers because they're accessed on every operation; leaf pages are the cold part.&lt;/p&gt;

&lt;p&gt;UUID keys are random. They don't arrive in sorted order, so each new key lands at a random position in the leaf level. Because of this, &lt;strong&gt;any leaf page can be the target of any insert&lt;/strong&gt; and any leaf page can be full. Page splits happen frequently with UUID primary keys. When a leaf page is full, it splits: half its entries move to a new sibling page, and the parent node gets a new routing key. This is two page writes instead of one, plus a parent modification. Under high insert load with UUID keys, this is a real source of write amplification and p99 latency spikes.&lt;/p&gt;

&lt;p&gt;Sequential or time-ordered keys (monotonically increasing integers) avoid this almost entirely. Splits only happen at the rightmost leaf as the tree grows forward. If write performance matters for your schema, key selection matters.&lt;/p&gt;
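&lt;p&gt;A toy simulation makes the key-choice effect concrete. The leaf model below - fixed-capacity sorted buckets plus a simplified rightmost-leaf fastpath for ascending keys - is an illustrative sketch, not real B-Tree code:&lt;/p&gt;

```python
import bisect
import random

LEAF_CAPACITY = 100

def insert_all(keys):
    leaves = [[]]      # sorted leaf pages
    bounds = []        # bounds[j] = first key of leaves[j + 1]
    splits = 0
    for k in keys:
        i = bisect.bisect_right(bounds, k)      # route to the target leaf
        leaf = leaves[i]
        bisect.insort(leaf, k)
        if len(leaf) > LEAF_CAPACITY:           # page overflow: split
            splits += 1
            if i == len(leaves) - 1 and k == leaf[-1]:
                # rightmost-leaf fastpath: ascending keys just grow a new page
                leaves.append([leaf.pop()])
                bounds.append(leaves[-1][0])
            else:
                mid = len(leaf) // 2            # normal split: half the entries move
                leaves[i] = leaf[:mid]
                leaves.insert(i + 1, leaf[mid:])
                bounds.insert(i, leaf[mid])
    return splits

n = 20_000
seq_splits = insert_all(range(n))                             # time-ordered keys
rnd_splits = insert_all([random.random() for _ in range(n)])  # UUID-like random keys
print(seq_splits, rnd_splits)  # random keys split more pages, scattered across the tree
```

&lt;p&gt;Sequential keys only ever split the rightmost page (which also stays full), while random keys leave every leaf about two-thirds full and split pages all over the file.&lt;/p&gt;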

&lt;p&gt;&lt;strong&gt;5. Secondary index updates on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each secondary index is a separate B+Tree on disk. After the heap write, PostgreSQL updates both of them. The &lt;code&gt;user_id&lt;/code&gt; index leaf entries contain &lt;code&gt;(user_id_value, ctid)&lt;/code&gt;, the indexed value plus a &lt;strong&gt;tuple ID&lt;/strong&gt; pointing to the physical location of the row in the heap (page number + slot number). The &lt;code&gt;created_at&lt;/code&gt; index works identically.&lt;/p&gt;

&lt;p&gt;This means a single &lt;code&gt;INSERT&lt;/code&gt; into &lt;code&gt;orders&lt;/code&gt; touches: 1 heap page + 1 primary index leaf page + 1 &lt;code&gt;user_id&lt;/code&gt; index leaf page + 1 &lt;code&gt;created_at&lt;/code&gt; index leaf page + WAL records for all of them. That's the minimum. Page splits add more.&lt;/p&gt;

&lt;p&gt;This is why more secondary indexes means slower writes. Each index is another B+Tree traversal, another page modification, another WAL record. The cost is O(k) in the number of indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's durable right now:&lt;/strong&gt; WAL records for all modifications have been fsynced. All four sets of page changes are in shared_buffers, not yet on disk in the heap and index files. None of that matters for durability; the WAL has everything. The checkpointer will flush the pages to disk in the background, at which point the corresponding WAL segments become eligible for recycling.&lt;/p&gt;




&lt;h3&gt;
  
  
  The read path: primary key and secondary index
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Scenario A: &lt;code&gt;SELECT * FROM orders WHERE order_id = &amp;lt;some order id&amp;gt;&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The query planner sees a filter on &lt;code&gt;order_id&lt;/code&gt;, which is the primary key. It knows there's a B+Tree index on this column. For an equality predicate on an indexed column with high selectivity (one specific UUID), an index scan is the obvious choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. B+Tree traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL starts at the root page of the &lt;code&gt;order_id&lt;/code&gt; index. Let's say with 1 million rows and 8KB pages, the tree is about 3–4 levels deep. At each level, it reads the node and follows the pointer toward the target UUID. This takes 3–4 page reads to reach the leaf.&lt;/p&gt;

&lt;p&gt;Each page read checks shared_buffers first. Root and upper internal pages are almost always warm; they're tiny (a few pages) and accessed constantly. Leaf pages may or may not be cached depending on your workload and how recently this specific range was accessed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The heap fetch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The leaf node entry contains a &lt;strong&gt;ctid&lt;/strong&gt;, a physical pointer to a specific page and slot in the heap file. PostgreSQL takes that ctid and fetches the heap page. This is a second disk access (or shared_buffers hit) beyond the index traversal.&lt;/p&gt;

&lt;p&gt;This two-step structure - index lookup to get a pointer, then heap fetch to get the actual row - is the fundamental cost. And it's worth understanding clearly: even the primary index doesn't contain the row data. Rows live in the heap. Indexes are always pointers into the heap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MVCC - finding the right version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When PostgreSQL finds the row in the heap page, it may find multiple versions of the same logical row. This is MVCC (Multi-Version Concurrency Control). Every row version has two hidden fields: &lt;code&gt;xmin&lt;/code&gt; (the transaction ID that created this version) and &lt;code&gt;xmax&lt;/code&gt; (the transaction ID that deleted or superseded it, or 0 if still live).&lt;/p&gt;

&lt;p&gt;PostgreSQL checks these against your transaction's snapshot - the set of transaction IDs that were committed when your query started. If &lt;code&gt;xmin&lt;/code&gt; is committed and visible to your snapshot, and &lt;code&gt;xmax&lt;/code&gt; is 0 or not yet committed, this is your row. If another transaction is currently updating this row, you'll find its old version without blocking: MVCC means readers never wait for writers.&lt;/p&gt;
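&lt;p&gt;The visibility rule can be sketched in a few lines. The snapshot model below is heavily simplified (a bare set of committed transaction IDs); the field names just mirror the hidden columns:&lt;/p&gt;

```python
# A minimal MVCC visibility sketch: a version is visible if its creator
# (xmin) committed before the snapshot and its deleter (xmax), if any,
# had not yet committed.

def visible(version, snapshot):
    """snapshot: the set of transaction IDs committed when the query began."""
    if version["xmin"] not in snapshot:
        return False                         # creator not committed yet: invisible
    if version["xmax"] == 0:
        return True                          # live row, never superseded
    return version["xmax"] not in snapshot   # deleter not committed: still visible

# Two versions of one logical row: txn 20 is mid-update, not yet committed.
versions = [
    {"xmin": 10, "xmax": 20, "status": "pending"},    # old version
    {"xmin": 20, "xmax": 0,  "status": "delivered"},  # new, uncommitted
]
snapshot = {10}  # txn 20 had not committed when our query started
rows = [v for v in versions if visible(v, snapshot)]
print(rows)  # the reader sees the old 'pending' version, without blocking
```

&lt;p&gt;Once txn 20 commits and a new query starts with it in the snapshot, the same check flips: the old version becomes invisible and the new one is returned.&lt;/p&gt;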

&lt;p&gt;&lt;strong&gt;Warm vs cold read:&lt;/strong&gt; If the heap page is in shared_buffers, the whole operation takes microseconds. If it's not, it's a cold read: PostgreSQL reads the page from disk, which is where that ~1.2ms p99 in the benchmark comes from. The OS page cache may have it buffered below the database level, which is faster than physical disk but slower than shared_buffers.&lt;/p&gt;




&lt;h4&gt;
  
  
  Scenario B: &lt;code&gt;SELECT * FROM orders WHERE user_id = $1&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Planning and index choice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user_id&lt;/code&gt; is not the primary key — it's a secondary index column. The planner estimates how many rows match this &lt;code&gt;user_id&lt;/code&gt; value. If it's highly selective (one user with a few orders out of a million), an index scan on the &lt;code&gt;user_id&lt;/code&gt; B+Tree is the right call. If it's low selectivity (a user with 50,000 orders), a sequential scan of the heap might actually be faster because random heap fetches at scale are slower than a sequential read.&lt;/p&gt;

&lt;p&gt;One parameter the planner uses is &lt;code&gt;random_page_cost&lt;/code&gt; — the estimated cost of a random page read relative to a sequential read. The default is 4.0, which reflects spinning disk characteristics. On SSDs, it should be closer to 1.1–1.5. If &lt;code&gt;random_page_cost&lt;/code&gt; is set too high for your hardware, the planner over-penalizes index scans and may choose a sequential scan when an index scan would be faster. This is a common tuning issue on SSD-backed databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Secondary index traversal + heap fetches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL traverses the &lt;code&gt;user_id&lt;/code&gt; B+Tree to find all leaf entries matching the target UUID. Each matching entry contains a ctid. For a user with 20 orders, that's 20 ctids. PostgreSQL then fetches each corresponding heap page.&lt;/p&gt;

&lt;p&gt;The critical issue: those 20 heap pages are likely scattered randomly across the heap file, because rows were inserted in time order, not user order. That's 20 potentially non-sequential disk reads. This is why secondary index reads are more expensive than primary key reads at scale. Not because the index traversal is slower, but because the heap fetches are random.&lt;/p&gt;

&lt;p&gt;PostgreSQL has an optimization called a &lt;strong&gt;bitmap index scan&lt;/strong&gt; for this case: it collects all matching ctids first, sorts them by physical page order, then fetches heap pages in order. This converts random reads into something closer to sequential reads. The planner chooses this strategy automatically when it estimates enough matching rows to make the sort worthwhile.&lt;/p&gt;
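&lt;p&gt;The core of the bitmap trick fits in a few lines. The ctids below are invented page/slot pairs; this is a sketch of the ordering idea, not the planner's actual bitmap machinery:&lt;/p&gt;

```python
# Sketch of the bitmap-scan idea: collect all matching ctids first, sort
# them by physical page, then fetch each heap page exactly once, in page
# order - converting scattered random reads into near-sequential I/O.

def bitmap_fetch_order(ctids):
    """ctids: (page, slot) pairs as returned by the index, in key order."""
    by_page = {}
    for page, slot in sorted(ctids):          # the sort is the whole trick
        by_page.setdefault(page, []).append(slot)
    return by_page                            # each heap page read once

# Matches scattered across the heap (the index returned them in key order):
matches = [(812, 3), (14, 1), (812, 7), (530, 2), (14, 9), (3021, 4)]
plan = bitmap_fetch_order(matches)
print(list(plan))   # pages visited in ascending physical order: [14, 530, 812, 3021]
print(plan[812])    # both slots on page 812 served by one page read: [3, 7]
```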




&lt;h3&gt;
  
  
  Crash recovery in summary
&lt;/h3&gt;

&lt;p&gt;When PostgreSQL restarts after a crash, it reads the control file to find the position of the last successful &lt;strong&gt;checkpoint&lt;/strong&gt; - a moment when all dirty shared_buffers pages were flushed to the heap and index files. From that position forward, PostgreSQL replays every WAL record, re-applying all changes that happened after the checkpoint. Any transaction with a WAL commit record is replayed to completion; any transaction without a commit record is effectively rolled back. When replay finishes, the heap and index files are in a consistent state and the database opens for connections.&lt;/p&gt;




&lt;h3&gt;
  
  
  The PostgreSQL surprise: dead tuples
&lt;/h3&gt;

&lt;p&gt;Here's something that surprises most engineers when they first encounter it: &lt;strong&gt;PostgreSQL never updates a row in place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you execute &lt;code&gt;UPDATE orders SET status = 'delivered' WHERE order_id = 1234&lt;/code&gt;, PostgreSQL does not find the existing row and modify it. It writes a &lt;strong&gt;new version&lt;/strong&gt; of the row into the heap (in a free slot on the same or a different page), sets &lt;code&gt;xmax&lt;/code&gt; on the old version to the current transaction ID, and leaves the old version in the heap page. The old version is now a &lt;strong&gt;dead tuple&lt;/strong&gt; - invisible to future transactions but still occupying space.&lt;/p&gt;

&lt;p&gt;The heap page now contains more data than it represents. Over time, on a table with frequent updates, heap pages can be mostly dead tuples. This is called &lt;strong&gt;heap bloat&lt;/strong&gt;. It wastes disk space and, more importantly, causes reads to do more I/O to find live rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VACUUM&lt;/strong&gt; is the background process that reclaims dead tuples. It scans heap pages, identifies tuples whose &lt;code&gt;xmax&lt;/code&gt; is old enough that no active transaction could ever need them, and marks their space as reusable. &lt;code&gt;autovacuum&lt;/code&gt; runs this automatically, but under heavy update load it can fall behind.&lt;/p&gt;

&lt;p&gt;The reason PostgreSQL writes new versions rather than modifying in place is MVCC. Concurrent readers may need the old version of a row while a writer is updating it. Both versions need to coexist in the heap until no reader needs the old one. The dead tuple overhead is the cost of non-blocking reads.&lt;/p&gt;
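&lt;p&gt;A toy heap page shows the whole lifecycle: the update creates a dead tuple, and VACUUM later reclaims it. The model is heavily simplified - &lt;code&gt;xmin&lt;/code&gt;/&lt;code&gt;xmax&lt;/code&gt; are plain numbers and the reclaim cutoff is a single "oldest active transaction" ID:&lt;/p&gt;

```python
# Toy heap page: UPDATE writes a new version and marks the old one with
# xmax; VACUUM removes versions no active transaction could still need.

def update(page, slot, new_fields, txid):
    old = page[slot]
    old["xmax"] = txid      # old version marked superseded, not erased
    page.append({**old, **new_fields, "xmin": txid, "xmax": 0})

def vacuum(page, oldest_active_txid):
    # A version is reclaimable once its deleter committed before every active txn.
    return [v for v in page if v["xmax"] == 0 or v["xmax"] >= oldest_active_txid]

page = [{"order_id": 1234, "status": "pending", "xmin": 10, "xmax": 0}]
update(page, 0, {"status": "delivered"}, txid=20)
print(len(page))                     # 2 - a dead tuple and the live version coexist
page = vacuum(page, oldest_active_txid=30)
print(len(page), page[0]["status"])  # 1 delivered - the dead tuple was reclaimed
```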

&lt;p&gt;Cassandra has a different but equivalent cost: it also writes new versions and marks deletions with tombstones, and the cleanup (compaction) is also asynchronous.&lt;/p&gt;




</description>
      <category>database</category>
      <category>postgres</category>
      <category>storage</category>
      <category>performance</category>
    </item>
    <item>
      <title>B+Tree vs LSM Tree: Why Your Database's Data Structure Is Everything</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Wed, 01 Apr 2026 23:45:41 +0000</pubDate>
      <link>https://forem.com/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c</link>
      <guid>https://forem.com/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/priteshsurana/what-actually-happens-when-you-call-insert-a93"&gt;Post 1&lt;/a&gt;, we looked at a benchmark result where Cassandra wrote 2× faster than PostgreSQL — then read 3× slower before compaction ran. Same hardware. Same data. Wildly different numbers in both directions.&lt;/p&gt;

&lt;p&gt;That result is a direct consequence of the data structures each engine is built on. Cassandra and PostgreSQL made opposite choices at the foundation level, and those choices ripple through every read, every write, and every latency number you'll ever measure.&lt;/p&gt;

&lt;p&gt;This post explains those choices.  &lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem All Database Indexes Must Solve
&lt;/h2&gt;

&lt;p&gt;Before we look at any specific data structure, let's talk about why this problem is hard.&lt;/p&gt;

&lt;p&gt;You want three things from a database index:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fast writes.&lt;/strong&gt; When you insert an order, the index should update quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast reads.&lt;/strong&gt; When you query by &lt;code&gt;order_id&lt;/code&gt;, finding the right row should take as few disk operations as possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient range scans.&lt;/strong&gt; When you query orders between two dates, the engine should be able to find the start of the range and read forward — not scatter-gather across random disk locations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every storage engine is making a tradeoff between these goals, and the tradeoff it makes determines its entire performance profile.&lt;/p&gt;

&lt;p&gt;Data stored in sorted order is fast to read sequentially and fast to scan in ranges. But keeping data sorted as new writes arrive requires finding the right position for each new key - which means touching existing data structures and reading before you can write, and that adds latency. If instead you just append new data without sorting, writes are fast but reads become expensive because you have to search unsorted data.&lt;/p&gt;

&lt;p&gt;Every storage engine resolves this tension differently. Let's look at each approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  B+Tree - The Structure That Powers PostgreSQL
&lt;/h2&gt;

&lt;p&gt;The B+Tree is the dominant data structure in relational databases and has been for decades. PostgreSQL uses it for every index. It's a tree, and to understand it, you need to picture what that tree actually looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes, leaves, and the shape of the tree
&lt;/h3&gt;

&lt;p&gt;Imagine you're storing &lt;code&gt;order_id&lt;/code&gt; values (UUIDs) in an index. The B+Tree organizes these into a hierarchy of &lt;strong&gt;nodes&lt;/strong&gt;. Each node holds a sorted list of keys and pointers.&lt;/p&gt;

&lt;p&gt;There are two kinds of nodes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal nodes&lt;/strong&gt; hold keys and pointers to child nodes. They exist purely for navigation - you use them to route your search toward the right leaf. They don't hold the actual row data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaf nodes&lt;/strong&gt; hold the actual data (or pointers to it). &lt;br&gt;
In PostgreSQL's case, leaf nodes in a secondary index hold &lt;code&gt;(key, heap tuple ID)&lt;/code&gt; pairs - the key you indexed, and a pointer to where the actual row lives in the heap file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hu7l4hb9psngzagcar6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hu7l4hb9psngzagcar6.png" alt="Tree" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The root and internal nodes are small. The leaves are where the bulk of the data lives. For a table with millions of rows, the tree might be only 3-4 levels deep. That's 3-4 node reads to find any row, regardless of table size. This is what makes B+Tree reads fast.&lt;/p&gt;
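&lt;p&gt;The depth claim is easy to sanity-check. Assuming a fanout of roughly 300 keys per 8KB internal page (the real number depends on key and pointer size):&lt;/p&gt;

```python
# Back-of-envelope for "3-4 levels deep": a tree of depth d with fanout f
# addresses about f**d leaf entries, so the depth grows with log_f(rows).
import math

def levels_needed(rows, fanout):
    return math.ceil(math.log(rows, fanout))

print(levels_needed(1_000_000, 300))      # 3 page reads reach any of 1M keys
print(levels_needed(1_000_000_000, 300))  # a billion rows still needs only 4
```

&lt;p&gt;This logarithm is why lookup cost barely moves as the table grows: multiplying the row count by 1,000 adds roughly one level.&lt;/p&gt;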
&lt;h3&gt;
  
  
  Leaf nodes form a linked list - and this matters for range scans
&lt;/h3&gt;

&lt;p&gt;Here's the detail that makes B+Trees especially good for range queries: &lt;strong&gt;all leaf nodes are linked together in a doubly linked list&lt;/strong&gt;, in sorted key order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A-B] ↔ [C-D] ↔ [E-F] ↔ [G-H] ↔ [I-J] → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why does this matter? Consider a query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engine traverses the tree from the root to find the leaf containing &lt;code&gt;2026-01-01&lt;/code&gt;. It reads that leaf. Then, instead of going back to the root to find the next range, it just follows the linked list pointer to the next leaf page, then the next, reading forward sequentially until it passes &lt;code&gt;2026-03-31&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Range scans on a B+Tree are essentially sequential reads through a linked list once you've found the starting point. On modern storage, sequential reads are dramatically faster than random reads. This is a core reason B+Trees are the default for databases with complex querying needs.&lt;/p&gt;
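&lt;p&gt;The scan pattern can be sketched with a handful of linked leaf pages. The &lt;code&gt;Leaf&lt;/code&gt; class and key values below are illustrative, and the root-to-leaf descent that finds the starting page is elided:&lt;/p&gt;

```python
# Sketch of a range scan over linked leaves: one tree descent finds the
# starting page, then the scan follows next-pointers instead of
# re-entering the tree for each successive page.

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None     # pointer to the next leaf in sorted order

def link(leaves):
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves[0]

def range_scan(first_leaf, start, stop):
    leaf, out = first_leaf, []
    while leaf is not None:
        for k in leaf.keys:
            if k >= start and stop >= k:
                out.append(k)
        if leaf.keys and leaf.keys[-1] >= stop:
            break            # passed the end of the range: stop following pointers
        leaf = leaf.next
    return out

head = link([Leaf(["A", "B"]), Leaf(["C", "D"]), Leaf(["E", "F"]), Leaf(["G", "H"])])
print(range_scan(head, "B", "E"))  # ['B', 'C', 'D', 'E'] - pure pointer-chasing
```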

&lt;h3&gt;
  
  
  Pages - the unit of disk I/O
&lt;/h3&gt;

&lt;p&gt;Each node in the tree maps to a &lt;strong&gt;page&lt;/strong&gt; — a fixed-size chunk of data, typically 8KB in PostgreSQL. A page is the smallest unit the database reads from or writes to disk. If you need one row from a leaf node, you read the entire 8KB page that contains it.&lt;/p&gt;

&lt;p&gt;This is important for understanding write cost. When you modify a node — say, inserting a new key into a leaf — you read the 8KB page, modify it in memory, and write the full 8KB page back to disk. Even if your change was one row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Page splits - the hidden cost of inserts
&lt;/h3&gt;

&lt;p&gt;Here's where B+Tree write performance gets interesting.&lt;/p&gt;

&lt;p&gt;Every leaf page has a finite capacity. When a leaf page is full and a new key must be inserted into it, the page must &lt;strong&gt;split&lt;/strong&gt;: the existing entries are divided between the old page and a new sibling page, and the parent internal node gets a new key pointing to the new sibling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dq23w9fe8pxyztoqkxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dq23w9fe8pxyztoqkxv.png" alt="Split" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This split writes two pages instead of one, and also modifies the parent node. If the parent is also full, it splits too — and the cascade can propagate up to the root. Root splits are rare but expensive.&lt;/p&gt;

&lt;p&gt;For sequential integer keys (1, 2, 3, ...), splits only happen at the rightmost leaf - the tree just grows a new page at the end. Predictable and cheap.&lt;/p&gt;

&lt;p&gt;For random UUID keys, each insert lands at a random position in the key space. Any leaf can be the target. Any leaf can be full. &lt;strong&gt;Splits happen frequently and unpredictably.&lt;/strong&gt; This is why p99 write latency for UUID primary keys sits well above p50 - most inserts are fast, but the splits that hit full pages cause latency spikes that show up in the tail.&lt;/p&gt;

&lt;p&gt;This is also write amplification: one logical insert can cause two or more physical page writes.&lt;/p&gt;
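&lt;p&gt;The split step itself can be sketched in isolation - one full leaf, one new sibling, one new routing key in the parent. This simplifies heavily (real splits also rewrite page headers and sibling pointers):&lt;/p&gt;

```python
# One leaf split: the full page divides its entries with a new sibling,
# and the parent internal node gains a routing key for the sibling.

def split_leaf(leaf, parent_keys):
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    parent_keys.append(right[0])   # parent now routes keys from right[0] to the sibling
    parent_keys.sort()
    return left, right             # two page writes where one would have sufficed

full_leaf = ["c1", "c2", "c3", "c4", "c5", "c6"]
parent = ["c1", "g1"]              # existing routing keys in the internal node
left, right = split_leaf(full_leaf, parent)
print(left, right)   # ['c1', 'c2', 'c3'] ['c4', 'c5', 'c6']
print(parent)        # ['c1', 'c4', 'g1'] - new routing key for the new sibling
```

&lt;p&gt;If appending the routing key overflows the parent as well, the same operation repeats one level up - that's the cascade described above.&lt;/p&gt;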

&lt;h3&gt;
  
  
  Why B+Tree reads are fast, writes have variance
&lt;/h3&gt;

&lt;p&gt;To summarize the B+Tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reads:&lt;/strong&gt; 3–4 page reads to find any row in a million-row table. Fast, predictable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range reads:&lt;/strong&gt; Follow the leaf linked list sequentially. Very fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes:&lt;/strong&gt; Usually fast, but page splits cause write amplification and latency spikes on random keys. The variance shows up at p99.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  B-Tree - What MongoDB Uses and How It Differs
&lt;/h2&gt;

&lt;p&gt;MongoDB's WiredTiger storage engine uses a B-Tree, and most engineers use the terms B-Tree and B+Tree interchangeably. They're not the same - but the difference that matters for WiredTiger isn't the one most textbooks describe.&lt;/p&gt;

&lt;p&gt;The key difference is in the leaf layer.&lt;/p&gt;

&lt;p&gt;In a &lt;strong&gt;B+Tree&lt;/strong&gt;, all leaf nodes are linked together in a doubly linked list in sorted key order. Once you find the start of a range, you follow the chain forward page by page without re-entering the tree. This is what makes PostgreSQL range scans fast: a &lt;code&gt;BETWEEN&lt;/code&gt; query traverses the tree once to find the starting leaf, then reads forward along the linked list to the end of the range.&lt;/p&gt;

&lt;p&gt;In WiredTiger's B-Tree, leaf nodes hold the actual document data, same as a B+Tree so far. The structural difference is that &lt;strong&gt;leaf nodes are not linked&lt;/strong&gt;. There is no chain to follow between adjacent leaves. To advance from one leaf to the next during a range scan, the engine must re-enter the tree from a higher level to find the next page.&lt;/p&gt;

&lt;p&gt;What this means in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Range scans are more expensive per step.&lt;/strong&gt; A PostgreSQL range scan follows linked leaf pages sequentially. A WiredTiger range scan re-traverses the tree structure to reach each successive page. For small range scans the difference is minimal. For large ones, the linked list wins — which is why the direct benchmark comparison in a later post shows PostgreSQL's linked B+Tree leaf nodes as a range-scan advantage over MongoDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Point reads are equivalent.&lt;/strong&gt; Both structures find a single key in O(log n) page reads. Neither has a meaningful advantage for typical table sizes.&lt;/p&gt;

&lt;p&gt;For most transactional workloads — point lookups, small range scans, mixed reads and writes — the practical performance difference between WiredTiger's B-Tree and PostgreSQL's B+Tree is small. WiredTiger's implementation is heavily optimized and the engine adds its own caching and concurrency mechanisms on top.&lt;/p&gt;

&lt;p&gt;The important comparison is not B-Tree vs B+Tree. It's both of them versus what comes next.&lt;/p&gt;




&lt;h2&gt;
  
  
  LSM Tree - The Structure That Powers Cassandra
&lt;/h2&gt;

&lt;p&gt;The Log Structured Merge Tree (LSM Tree) starts from a completely different premise: what if we never modified data on disk at all?&lt;/p&gt;

&lt;p&gt;This is the key insight that makes Cassandra's write throughput possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The core insight: sequential writes beat random writes
&lt;/h3&gt;

&lt;p&gt;On any storage medium — spinning disk, SSD, NVMe — sequential writes are faster than random writes. On a spinning disk, the difference is enormous (the head doesn't need to seek). On an SSD, it's smaller but real (flash write amplification is lower for sequential patterns). The gap narrows on modern NVMe, but it never disappears.&lt;/p&gt;

&lt;p&gt;B+Tree writes are fundamentally random: each insert must find its exact position in the tree and modify the page at that location. The page could be anywhere on disk.&lt;/p&gt;

&lt;p&gt;LSM asks: what if we turned all writes into sequential appends?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Memtable -- all writes go here first
&lt;/h3&gt;

&lt;p&gt;When Cassandra receives an insert for an &lt;code&gt;orders&lt;/code&gt; row, it writes to the &lt;strong&gt;Memtable&lt;/strong&gt;, an in-memory sorted data structure. Think of it as a sorted list living entirely in RAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memtable (in memory):
┌──────────────────────────────────────────────┐
│ order: aaa-111, user: x, status: pending ... │
│ order: bbb-222, user: y, status: shipped ... │
│ order: ccc-333, user: x, status: delivered.. │
│ order: ddd-444, user: z, status: pending ... │
│                  ↑ sorted by order_id        │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing to the Memtable is fast because it's just RAM. It's sorted because reads need to find data efficiently. New inserts go to their sorted position in memory - no random disk I/O in the write path at all, just the sequential append to the CommitLog (Cassandra's crash-recovery log), which is extremely fast.&lt;/p&gt;
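&lt;p&gt;A Memtable can be sketched as a sorted array plus a sequential log. The class below is illustrative - the names and structure are assumptions for the sketch, not Cassandra's actual implementation:&lt;/p&gt;

```python
# Memtable sketch: every write appends to a commit log (sequential) and
# lands in sorted position in memory; a flush emits the whole structure
# in one sorted pass.
import bisect

class Memtable:
    def __init__(self):
        self.keys, self.rows = [], []
        self.commit_log = []                 # stands in for the sequential CommitLog

    def write(self, key, row):
        self.commit_log.append((key, row))   # durability: sequential append
        i = bisect.bisect_left(self.keys, key)
        if i != len(self.keys) and self.keys[i] == key:
            self.rows[i] = row               # newer version replaces older in memory
        else:
            self.keys.insert(i, key)
            self.rows.insert(i, row)

    def flush(self):
        """One sequential pass produces an immutable, sorted SSTable."""
        return list(zip(self.keys, self.rows))

mt = Memtable()
mt.write("ccc-333", {"user": "x"})
mt.write("aaa-111", {"user": "y"})
mt.write("bbb-222", {"user": "z"})
print([k for k, _ in mt.flush()])  # ['aaa-111', 'bbb-222', 'ccc-333'] - sorted on flush
```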

&lt;h3&gt;
  
  
  The SSTable - immutable and sorted
&lt;/h3&gt;

&lt;p&gt;When the Memtable fills up, Cassandra flushes it to disk as an &lt;strong&gt;SSTable&lt;/strong&gt; (Sorted String Table). This flush is a single sequential write: the entire sorted Memtable gets written from beginning to end in one pass. No random writes. No page modifications. Just a stream of sorted data to a new file.&lt;/p&gt;

&lt;p&gt;Once written, &lt;strong&gt;an SSTable is never modified&lt;/strong&gt;. It is immutable. Future writes don't touch it. This immutability is what makes the write path so clean: you never need to find a specific location on disk and modify it. You only ever write new files.&lt;/p&gt;

&lt;p&gt;The immutability has an important consequence for updates and deletes. In a B+Tree, an update modifies the existing row in place. In an LSM Tree, an update writes a new version of the row to the current Memtable, which eventually becomes a new SSTable. Both the old version and the new version coexist on disk until compaction runs. Similarly, a delete doesn't remove data; it writes a &lt;strong&gt;tombstone record&lt;/strong&gt; that marks the row as deleted. The old data persists until compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  The read problem — read amplification
&lt;/h3&gt;

&lt;p&gt;Over time, you accumulate multiple SSTables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disk state after several Memtable flushes:

SSTable-1 (oldest): [aaa-111] [ccc-333] [eee-555] [ggg-777]
SSTable-2:          [bbb-222] [ddd-444] [fff-666]
SSTable-3:          [aaa-111] [hhh-888]  ← updated version of aaa-111
SSTable-4 (newest): [iii-999] [jjj-000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you query for &lt;code&gt;order_id = aaa-111&lt;/code&gt;. Where is the latest version?&lt;/p&gt;

&lt;p&gt;It could be in any SSTable. SSTable-1 has an old version. SSTable-3 has a newer version. To find the latest, the read path must check all four SSTables, compare the timestamps on any matching rows, and return the most recent one. This is &lt;strong&gt;read amplification&lt;/strong&gt; — one logical read requires multiple physical reads across multiple files.&lt;/p&gt;

&lt;p&gt;Before compaction, with many SSTables, reads are slow. A later post's benchmark puts a precise number on exactly this: Cassandra's cold read p99 before compaction is dramatically higher than PostgreSQL's — a gap that closes almost entirely once compaction runs and SSTable count drops to one.&lt;/p&gt;
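&lt;p&gt;The reconciliation rule itself is simple: check every candidate file, and the highest write timestamp wins. A sketch, with SSTables modeled as plain dicts:&lt;/p&gt;

```python
# Read reconciliation across SSTables: every file that may hold the key
# is checked, and the version with the highest write timestamp wins.

def read(key, sstables):
    best = None
    for table in sstables:                   # every candidate file is a disk touch
        row = table.get(key)
        if row is not None:
            if best is None or row["ts"] > best["ts"]:
                best = row                   # highest timestamp wins
    return best

sstables = [
    {"aaa-111": {"status": "pending",   "ts": 1}},   # SSTable-1 (oldest)
    {"bbb-222": {"status": "shipped",   "ts": 2}},   # SSTable-2
    {"aaa-111": {"status": "delivered", "ts": 3}},   # SSTable-3 (newer version)
]
print(read("aaa-111", sstables))  # {'status': 'delivered', 'ts': 3}
```

&lt;p&gt;The loop over every SSTable is the read amplification; the next section shows how Bloom filters let most of those checks be skipped.&lt;/p&gt;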

&lt;h3&gt;
  
  
  Bloom filters - the read shortcut
&lt;/h3&gt;

&lt;p&gt;Reading every SSTable for every query would be unacceptably slow. Cassandra uses &lt;strong&gt;Bloom filters&lt;/strong&gt; to short-circuit most of these checks.&lt;/p&gt;

&lt;p&gt;A Bloom filter is a small, probabilistic data structure that answers one question: is this key &lt;em&gt;definitely not&lt;/em&gt; in this SSTable?&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;definitely not&lt;/em&gt;. A Bloom filter can give you a false positive (it says "maybe yes" when the key isn't there) but it never gives you a false negative (if it says "definitely not," you can trust it). So before reading an SSTable, Cassandra checks the Bloom filter for that SSTable — if it says the key isn't there, you skip the entire SSTable. No disk read required.&lt;/p&gt;

&lt;p&gt;In practice, Bloom filters eliminate most SSTable checks for most reads. Instead of reading 10 SSTables to find a key, you might read 1 or 2, the ones whose Bloom filters said "maybe." The filters live in memory, so checking them costs microseconds.&lt;/p&gt;

&lt;p&gt;Bloom filters don't eliminate read amplification entirely, but they dramatically reduce it. They're why LSM reads are "slow but not catastrophic" rather than "completely unusable."&lt;/p&gt;
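&lt;p&gt;A minimal Bloom filter fits in a dozen lines. The two salted hashes below are a simplification - real filters size the bit array and hash count to a target false-positive rate:&lt;/p&gt;

```python
# Toy Bloom filter: adding a key sets a few bit positions; a lookup that
# finds any of those bits clear proves the key was never added.

class BloomFilter:
    def __init__(self, bits=1024):
        self.bits = [False] * bits

    def _positions(self, key):
        n = len(self.bits)
        return [hash((salt, key)) % n for salt in (1, 2)]  # two salted hashes

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means definitely absent; True means "maybe" (false positives possible)
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("aaa-111")
print(bf.might_contain("aaa-111"))  # True - added keys always report "maybe"
# a miss lets the read path skip this SSTable entirely, with no disk read:
print(bf.might_contain("zzz-999"))  # almost certainly False
```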

&lt;h3&gt;
  
  
  Compaction - the background process that makes reads fast again
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compaction&lt;/strong&gt; is the merge step that gives LSM its name. At regular intervals, Cassandra picks a set of SSTables and merges them into a single new SSTable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before compaction:
SSTable-1: [aaa-111 v1] [ccc-333]
SSTable-3: [aaa-111 v2] [hhh-888]

After compaction:
SSTable-merged: [aaa-111 v2] [ccc-333] [hhh-888]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During compaction, for any key that appears in multiple SSTables, only the latest version (highest timestamp) survives. Tombstones are eventually resolved and the deleted data is removed. Old SSTables are deleted after the merge completes.&lt;/p&gt;

&lt;p&gt;After compaction, there are fewer SSTables, so reads check fewer files. Bloom filters cover fewer files. Read latency drops.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind the benchmark number: Cassandra's read latency fell from 4.1ms to 1.4ms after compaction. Same data, same hardware, fewer SSTables to check. The engine cleaned up after itself, and reads got faster as a result.&lt;/p&gt;

&lt;p&gt;There are different compaction strategies with different tradeoffs — some optimize for write throughput, others for read consistency, others for time-series data. Post 4 covers these in detail.&lt;/p&gt;
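&lt;p&gt;The merge rule can be sketched directly: per key, the newest timestamp wins, and keys whose newest version is a tombstone drop out entirely. SSTables are modeled as plain dicts, and strategy selection is ignored:&lt;/p&gt;

```python
# Compaction sketch: merge several SSTables into one, keeping only the
# latest version of each key and resolving tombstones.

def compact(sstables):
    merged = {}
    for table in sstables:
        for key, row in table.items():
            current = merged.get(key)
            if current is None or row["ts"] > current["ts"]:
                merged[key] = row                # highest timestamp survives
    live = {k: v for k, v in merged.items() if not v.get("tombstone")}
    return dict(sorted(live.items()))            # the new SSTable is in key order

before = [
    {"aaa-111": {"status": "pending", "ts": 1}, "ccc-333": {"status": "paid", "ts": 1}},
    {"aaa-111": {"status": "delivered", "ts": 2}, "hhh-888": {"status": "new", "ts": 2}},
    {"ccc-333": {"tombstone": True, "ts": 3}},   # a delete, recorded as a tombstone
]
after = compact(before)
print(list(after))                 # ['aaa-111', 'hhh-888'] - one file, latest versions only
print(after["aaa-111"]["status"])  # delivered
```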




&lt;h2&gt;
  
  
  The Tradeoff Table
&lt;/h2&gt;

&lt;p&gt;Here's how the three structures compare across the dimensions that matter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;B+Tree (PostgreSQL)&lt;/th&gt;
&lt;th&gt;B-Tree (MongoDB)&lt;/th&gt;
&lt;th&gt;LSM Tree (Cassandra)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate - random page writes&lt;/td&gt;
&lt;td&gt;Moderate - similar to B+Tree&lt;/td&gt;
&lt;td&gt;High - sequential appends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write variance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - page splits cause p99 spikes&lt;/td&gt;
&lt;td&gt;High - same mechanism&lt;/td&gt;
&lt;td&gt;Low - appends are uniform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Point read speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast - log(n) tree traversal&lt;/td&gt;
&lt;td&gt;Fast - similar, sometimes shortcut&lt;/td&gt;
&lt;td&gt;Variable - depends on SSTable count and compaction state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Range scan speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast - linked leaf list&lt;/td&gt;
&lt;td&gt;Moderate - no linked leaves&lt;/td&gt;
&lt;td&gt;Moderate - needs Bloom filter + SSTable scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-place modification&lt;/td&gt;
&lt;td&gt;In-place modification&lt;/td&gt;
&lt;td&gt;New write (old version persists until compaction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deletes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-place (mark deleted)&lt;/td&gt;
&lt;td&gt;In-place (mark deleted)&lt;/td&gt;
&lt;td&gt;Tombstone write (data persists until compaction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk space&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compact, matches live data&lt;/td&gt;
&lt;td&gt;Compact, matches live data&lt;/td&gt;
&lt;td&gt;Can exceed live data size (historical versions + tombstones)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read after heavy writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;td&gt;Degrades until compaction runs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No single structure is universally better. The table describes tradeoffs, not rankings.&lt;/p&gt;
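&lt;p&gt;The read-path rows in that table can be made concrete. Here's a minimal sketch in plain Python - illustrative only, not Cassandra's actual implementation - of why LSM point reads get slower as SSTables accumulate: every file that compaction hasn't merged away is one more place the key might live.&lt;/p&gt;

```python
# Illustrative sketch of an LSM point read - not Cassandra's real code.
# A read checks the in-memory memtable first, then each on-disk SSTable
# from newest to oldest, stopping at the first hit.

memtable = {"order:3": "shipped"}
sstables = [  # newest first; each is an immutable key-to-value snapshot
    {"order:2": "paid"},
    {"order:1": "created", "order:2": "created"},  # stale version of order:2
]

def lsm_get(key):
    if key in memtable:
        return memtable[key]
    for table in sstables:          # every extra SSTable adds one more lookup
        if key in table:
            return table[key]       # the newest version wins
    return None

print(lsm_get("order:2"))  # "paid" - the older "created" is never reached
```

&lt;p&gt;After compaction merges those two SSTables into one, the worst-case lookup touches a single file - which is exactly the read-latency recovery the table describes.&lt;/p&gt;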




&lt;h2&gt;
  
  
  Why This Matters for Your Application
&lt;/h2&gt;

&lt;p&gt;The theory translates directly to system design choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your workload is read-heavy with complex queries&lt;/strong&gt; - e-commerce product search, financial reporting, analytics over order history. Users query by multiple fields, run date range scans, join across tables. B+Tree wins here. PostgreSQL's linked leaf nodes make range scans fast. The write cost is acceptable because writes are infrequent relative to reads. The predictable read latency matters more than maximum write throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your workload is write-heavy with known, fixed access patterns&lt;/strong&gt; - event ingestion, IoT sensor data, activity logs, append-heavy order pipelines. You're inserting millions of rows and reading them back by a known key pattern. LSM wins here. Cassandra's sequential write path handles sustained high-throughput inserts without the page-split variance that B+Tree would introduce. You pay in read complexity, but if your reads are simple key lookups or partition scans, Bloom filters and compaction keep that cost manageable.&lt;/p&gt;

&lt;p&gt;Most real workloads are somewhere in the middle, which is why MongoDB exists in the space between the two extremes - a B-Tree engine with flexible documents, good write performance, and reasonable query flexibility.&lt;/p&gt;

&lt;p&gt;Post 6 will put real numbers on these tradeoffs. When you see Cassandra's write throughput compared to PostgreSQL's on the same hardware, the gap will be exactly what the data structure analysis predicts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: The Theory Gets Concrete
&lt;/h2&gt;

&lt;p&gt;You now have the foundation: B+Tree for reads, LSM for writes, and a B-Tree sitting comfortably in between. You know why page splits cause latency spikes. You know why Cassandra reads degrade with SSTable accumulation and recover after compaction.&lt;/p&gt;

&lt;p&gt;The next post takes everything you just learned and shows it working inside real database engines.&lt;/p&gt;

&lt;p&gt;We're going inside PostgreSQL and MongoDB - from the moment a &lt;code&gt;SELECT&lt;/code&gt; arrives, through the buffer pool, down the B+Tree, out to the heap file, and back. We'll trace a write through the WAL and watch it land in shared_buffers. We'll see what MVCC actually looks like inside a heap page, and why a PostgreSQL UPDATE doesn't modify a row - it writes a new one.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>cassandra</category>
      <category>performance</category>
    </item>
    <item>
      <title>What Actually Happens When You Call INSERT?</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Tue, 31 Mar 2026 01:56:30 +0000</pubDate>
      <link>https://forem.com/priteshsurana/what-actually-happens-when-you-call-insert-a93</link>
      <guid>https://forem.com/priteshsurana/what-actually-happens-when-you-call-insert-a93</guid>
      <description>&lt;p&gt;You call &lt;code&gt;INSERT&lt;/code&gt;. The database says &lt;code&gt;OK&lt;/code&gt;. You move on.&lt;/p&gt;

&lt;p&gt;That acknowledgment feels instant. It feels like the database just... wrote something down. But between your &lt;code&gt;INSERT&lt;/code&gt; and that &lt;code&gt;OK&lt;/code&gt;, at minimum four distinct things happened that most engineers who use databases every day have never thought about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The write was recorded in a sequential log before it touched any data structure, so a crash wouldn't lose it&lt;/li&gt;
&lt;li&gt;At least one index was updated - and on some engines that update is more expensive than the insert itself&lt;/li&gt;
&lt;li&gt;A decision was made about whether to hit the disk synchronously or defer it; a tradeoff with real latency consequences&lt;/li&gt;
&lt;li&gt;The data was placed into a structure that was chosen years ago by the database designers, and that choice explains almost every performance characteristic you've ever observed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This series is about those four things.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment of Insertion
&lt;/h2&gt;

&lt;p&gt;Let's start with the most basic question: where does the data actually go first?&lt;/p&gt;

&lt;p&gt;The answer is different for each database, and the differences are not superficial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; receives your &lt;code&gt;INSERT&lt;/code&gt; and immediately writes a record to something called the &lt;strong&gt;WAL&lt;/strong&gt; - the Write-Ahead Log. This is a sequential append-only file on disk. Only after that WAL record is safely written does PostgreSQL modify the in-memory structure called &lt;strong&gt;shared_buffers&lt;/strong&gt;, and eventually write the row to its final resting place: a &lt;strong&gt;heap file&lt;/strong&gt;. The heap is just what it sounds like: rows stored roughly in the order they arrived, in fixed 8KB pages. The primary key index is a completely separate structure, updated separately.&lt;/p&gt;
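&lt;p&gt;The ordering rule here - log first, data structure second - is the whole idea of write-ahead logging. A toy sketch in Python (illustrative, nothing like PostgreSQL's real code) shows why that ordering makes crashes survivable:&lt;/p&gt;

```python
# Minimal write-ahead-logging sketch - illustrative, not PostgreSQL's code.
# The rule: the log record must be durable on disk BEFORE the in-memory
# page changes, so a crash can always be repaired by replaying the log.
import json, os, tempfile

LOG = os.path.join(tempfile.mkdtemp(), "wal.log")
pages = {}  # stands in for shared_buffers

def insert(key, row):
    record = json.dumps({"key": key, "row": row})
    with open(LOG, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # durable before any data structure changes
    pages[key] = row            # only now touch the in-memory page

def recover():
    # After a crash, replaying the log rebuilds every acknowledged write.
    recovered = {}
    with open(LOG) as wal:
        for line in wal:
            rec = json.loads(line)
            recovered[rec["key"]] = rec["row"]
    return recovered

insert("order:1", {"amount": 42})
pages.clear()    # simulate losing all in-memory state in a crash
print(recover()) # the acknowledged insert survives
```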

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt; receives your document and routes it through &lt;strong&gt;WiredTiger&lt;/strong&gt;, its storage engine. WiredTiger writes to its own &lt;strong&gt;journal&lt;/strong&gt; (similar in spirit to PostgreSQL's WAL but different in format and behavior) and places the document into the &lt;strong&gt;WiredTiger cache&lt;/strong&gt;, an in-memory buffer. Here's the twist: MongoDB's primary storage structure is a B-Tree, and unlike PostgreSQL's heap, the document data lives &lt;em&gt;inside&lt;/em&gt; the B-Tree itself. There is no separate heap. The collection and the primary index are the same structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra&lt;/strong&gt; does something neither of the others does. It writes to a &lt;strong&gt;CommitLog&lt;/strong&gt; (its crash-recovery log) and then places the data into a &lt;strong&gt;Memtable&lt;/strong&gt;, a sorted in-memory buffer that accumulates writes until it's full, at which point it gets flushed to disk as an immutable file called an &lt;strong&gt;SSTable&lt;/strong&gt;. Cassandra never modifies existing files. Every write is an append. Every update is a new version. Every delete is a new record saying "this data is gone."&lt;/p&gt;
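&lt;p&gt;A toy version of that memtable-flush cycle, in Python (illustrative only - Cassandra's real memtable is a concurrent sorted structure, and real SSTables carry indexes and metadata alongside the data):&lt;/p&gt;

```python
# Illustrative memtable-flush sketch - not Cassandra's real implementation.
# Writes land in an in-memory buffer; when it exceeds a size threshold it
# is frozen, written out as an immutable sorted file (an "SSTable"), and
# a fresh empty memtable takes its place. Disk files are never modified.

FLUSH_THRESHOLD = 3
memtable = {}
sstables = []  # immutable sorted snapshots, in flush order

def write(key, value):
    memtable[key] = value                        # never touches disk files
    if len(memtable) >= FLUSH_THRESHOLD:
        frozen = dict(sorted(memtable.items()))  # SSTable = Sorted String Table
        sstables.append(frozen)
        memtable.clear()

for i in range(1, 5):
    write(f"sensor:{i}", i * 10)

print(len(sstables), list(sstables[0]))  # one flushed file, keys sorted
```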

&lt;p&gt;Three databases. Three fundamentally different answers to the question "where does the data go first."&lt;/p&gt;

&lt;p&gt;Why does this matter? Because each approach has consequences that ripple through everything - write speed, read speed, crash recovery time, compaction behavior, and what happens to your p99 latency under load. Posts 2 through 4 go deep on each engine. But first, let's look at the structure that sits underneath all of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Index Is Not What You Think
&lt;/h2&gt;

&lt;p&gt;Most engineers think about indexes when they're writing queries. "This query is slow, I need an index." The mental model is: indexes exist to make reads fast.&lt;/p&gt;

&lt;p&gt;That's not wrong. But it's half the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every index is updated on every write.&lt;/strong&gt; Throughout this series we'll use a concrete example: an orders table with fields like order_id, user_id, status, amount, and created_at. When you insert an order, PostgreSQL doesn't just write the row to the heap; it also updates the B+Tree index on &lt;code&gt;order_id&lt;/code&gt;, the B+Tree index on &lt;code&gt;user_id&lt;/code&gt;, and the B+Tree index on &lt;code&gt;created_at&lt;/code&gt;. Three indexes mean four write operations - one heap write plus three index updates - for a single &lt;code&gt;INSERT&lt;/code&gt;. The index update cost is not a footnote. On tables with several indexes under heavy write load, index maintenance is often the dominant cost of the write path.&lt;/p&gt;
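&lt;p&gt;You can see the amplification in a toy model. This sketch (plain Python, purely illustrative - real B+Tree maintenance is far more involved) treats each index as a separate sorted list that must be maintained on every insert:&lt;/p&gt;

```python
# Illustrative sketch of index write amplification - not PostgreSQL's code.
# Every secondary index is a separate sorted structure, and every INSERT
# must update all of them, not just the heap.
import bisect

heap = []                                    # rows in arrival order
indexes = {"order_id": [], "user_id": [], "created_at": []}

def insert(row):
    heap.append(row)                         # 1 heap write
    pos = len(heap) - 1
    for col, idx in indexes.items():         # plus 1 write per index
        bisect.insort(idx, (row[col], pos))  # keep each index sorted
    return 1 + len(indexes)                  # total write operations

row = {"order_id": 7, "user_id": 3, "created_at": "2026-03-31", "amount": 42}
print(insert(row))  # 4 - one INSERT, four write operations
```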

&lt;p&gt;Now here's the part that most engineers have never considered: the three databases use fundamentally different data structures for those indexes, and the choice of structure explains almost everything about their performance profiles.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses a &lt;strong&gt;B+Tree&lt;/strong&gt;. MongoDB's WiredTiger also uses a &lt;strong&gt;B-Tree&lt;/strong&gt; (a close relative). Cassandra uses an &lt;strong&gt;LSM Tree&lt;/strong&gt;, a completely different class of structure that doesn't modify anything in place, ever.&lt;/p&gt;

&lt;p&gt;The B+Tree and B-Tree are optimized for reads. You can find any key in a handful of disk reads, and the structure stays balanced automatically. The cost is that every write has to find the right place in the tree and modify it, which can mean reading pages from disk, modifying them, and writing them back. Random writes. Slow on spinning disks, less slow on SSDs, but never free.&lt;/p&gt;

&lt;p&gt;The LSM Tree is optimized for writes. New data always goes to the end of something — a log, a buffer, a new file. There are no random writes. The cost is that reads become more complex, because the data you're looking for might be in any of several files that were written at different times.&lt;/p&gt;

&lt;p&gt;This is the reason Cassandra writes faster than PostgreSQL under sustained load. And it's the reason Cassandra reads &lt;em&gt;slower&lt;/em&gt; before a process called &lt;strong&gt;compaction&lt;/strong&gt; runs.&lt;/p&gt;

&lt;p&gt;Post 2 breaks this down completely. What each structure looks like, how it works, and why the choice of structure is the single most important decision a database designer makes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "Durability" Actually Costs
&lt;/h2&gt;

&lt;p&gt;When someone says a database is durable, they mean: if the server loses power the instant after your &lt;code&gt;INSERT&lt;/code&gt; is acknowledged, your data will still be there when the server comes back.&lt;/p&gt;

&lt;p&gt;That's a strong guarantee. And it has a real cost.&lt;/p&gt;

&lt;p&gt;To survive a power failure, data has to reach physical storage - magnetic disk or flash cells - before the acknowledgment goes out. Not the OS's buffer. Not the CPU cache. The actual hardware. The system call that forces this is called &lt;code&gt;fsync&lt;/code&gt;, and it is one of the most expensive operations a database performs. On a typical NVMe SSD, a single &lt;code&gt;fsync&lt;/code&gt; takes on the order of tens to hundreds of microseconds. On a networked file system or a busy system under load, it can take milliseconds. And some workloads trigger thousands of them per second.&lt;/p&gt;
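&lt;p&gt;You can feel this cost directly. The snippet below (plain Python; the absolute numbers depend entirely on your hardware) compares a buffered write, which returns as soon as the OS page cache accepts it, with a write followed by &lt;code&gt;fsync&lt;/code&gt;:&lt;/p&gt;

```python
# Rough illustration of what fsync costs relative to a buffered write.
# Absolute numbers vary wildly by hardware and filesystem; the point is
# the ratio - the buffered write never waits for physical storage.
import os, time, tempfile

fd = os.open(os.path.join(tempfile.mkdtemp(), "wal"), os.O_WRONLY | os.O_CREAT)

t0 = time.perf_counter()
os.write(fd, b"record\n")   # lands in the OS page cache only
buffered = time.perf_counter() - t0

t0 = time.perf_counter()
os.write(fd, b"record\n")
os.fsync(fd)                # forces the data down to physical storage
synced = time.perf_counter() - t0
os.close(fd)

print(f"buffered: {buffered*1e6:.0f}us  with fsync: {synced*1e6:.0f}us")
```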

&lt;p&gt;Here's where the databases diverge sharply.&lt;/p&gt;

&lt;p&gt;PostgreSQL's default behavior is &lt;strong&gt;synchronous durability&lt;/strong&gt;: every committed transaction waits for its WAL record to be &lt;code&gt;fsync&lt;/code&gt;'d to disk before the acknowledgment goes out. You get the strongest possible guarantee. You also pay for every write with a disk flush.&lt;/p&gt;

&lt;p&gt;MongoDB's default behavior has historically been more relaxed. By default, the journal syncs every 100 milliseconds. Your write is acknowledged as soon as WiredTiger's in-memory cache accepts it. You get lower latency. You accept that up to 100ms of writes could be lost in a hard crash. This is configurable - &lt;code&gt;j: true&lt;/code&gt; forces a journal flush per write. But the default trades some durability for speed.&lt;/p&gt;

&lt;p&gt;Cassandra's CommitLog syncs periodically too, every ~10 seconds by default. Same tradeoff, pushed even further toward throughput. If you need per-write durability in Cassandra, you configure it, and you pay the latency cost.&lt;/p&gt;
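&lt;p&gt;The arithmetic behind these policies is simple enough to sketch. This toy calculation (illustrative - real group commit batches concurrent transactions more cleverly than this) counts how many flushes a ten-second burst of writes triggers under each policy:&lt;/p&gt;

```python
# Illustrative comparison of durability policies - not real database code.
# Per-write sync resembles PostgreSQL's default posture; periodic sync
# resembles MongoDB's ~100ms journal interval and Cassandra's ~10s
# CommitLog window.

def fsyncs_needed(num_writes, writes_per_second, sync_interval_ms):
    if sync_interval_ms == 0:        # sync on every commit
        return num_writes
    duration_ms = num_writes / writes_per_second * 1000
    return max(1, int(duration_ms / sync_interval_ms))

writes = 100_000
rate = 10_000  # writes/sec, i.e. a 10-second run
print("per-write:   ", fsyncs_needed(writes, rate, 0))       # 100000 fsyncs
print("100ms window:", fsyncs_needed(writes, rate, 100))     # 100 fsyncs
print("10s window:  ", fsyncs_needed(writes, rate, 10_000))  # 1 fsync
```

&lt;p&gt;Five orders of magnitude fewer flushes is where the periodic-sync throughput comes from - and the window between flushes is exactly the data you're agreeing to lose in a hard crash.&lt;/p&gt;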

&lt;p&gt;These aren't just configuration details. They're architectural philosophies. The benchmark numbers in later posts will show you exactly what each philosophy costs in microseconds. The difference between synchronous and asynchronous durability shows up plainly in the latency measurements — and the gap is bigger than most engineers expect.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark Teaser
&lt;/h2&gt;

&lt;p&gt;Here's the shape of what you'll see when you run 1 million inserts and then reads against all three databases on identical hardware. These numbers are illustrative - directionally accurate - and the real C++ benchmark results are covered in a later post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-threaded insert throughput
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Rows/sec (approx)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;~18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;~12,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;~8,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cassandra is more than 2× faster than PostgreSQL on pure insert throughput. If you know about LSM Trees, this makes sense. If you don't, it looks like magic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold primary key read latency (p99)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;p99 latency (approx)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;~1.2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;~1.8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra (before compaction)&lt;/td&gt;
&lt;td&gt;~4.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra (after compaction)&lt;/td&gt;
&lt;td&gt;~1.4ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now look at Cassandra. Before compaction, it's the slowest reader by a factor of 3. After compaction, it's competitive with PostgreSQL. Same database. Same data. Same query. The difference is whether a background merge process has run.&lt;/p&gt;

&lt;p&gt;This is the central tension of LSM-based storage: you pay for write speed with read complexity, and you recover that complexity through compaction. The "after compaction" number is not a cheat — it reflects real production behavior. But it tells you something important: Cassandra's read latency is not a fixed property. It depends on what state the engine is in.&lt;/p&gt;
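&lt;p&gt;Compaction itself is conceptually simple, even though real strategies differ in which files they pick and when. A toy merge in Python (illustrative only - not any real Cassandra compaction strategy):&lt;/p&gt;

```python
# Illustrative compaction sketch - not a real Cassandra strategy.
# Compaction merges several SSTables into one: the newest version of
# each key wins, and tombstones let deleted keys drop out entirely.

TOMBSTONE = object()

def compact(sstables_newest_first):
    merged = {}
    for table in sstables_newest_first:
        for key, value in table.items():
            merged.setdefault(key, value)  # first-seen (newest) version wins
    # Tombstoned keys are purged, reclaiming their space.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}

sstables = [                   # newest first
    {"order:1": TOMBSTONE},    # order:1 was deleted
    {"order:1": "paid", "order:2": "paid"},
    {"order:2": "created", "order:3": "created"},
]
print(compact(sstables))  # {'order:2': 'paid', 'order:3': 'created'}
```

&lt;p&gt;Three files become one, stale versions vanish, and a point read that previously had to consult every file now checks a single one - which is why the p99 recovers.&lt;/p&gt;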

&lt;p&gt;These numbers also tell you something about the write-read tradeoff that runs through this entire series. The engine that writes fastest reads slowest before it cleans up after itself. The engine that reads fastest — PostgreSQL — is also the one paying the highest cost per write to keep its B+Tree in a consistent, readable state.&lt;/p&gt;

&lt;p&gt;The real numbers from the C++ benchmark, with methodology, hardware specs, and full latency distributions, are in a later post. But this is the shape of what you'll find.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Series Will Cover
&lt;/h2&gt;

&lt;p&gt;Here's the full map. Each post answers one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 1 - What is actually happening when you insert a row?&lt;/strong&gt;&lt;br&gt;
This post. The wide-angle view. Why the insert isn't simple, and why the differences between engines matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 2 - Why do databases use B+Tree, B-Tree, or LSM and what's the real difference?&lt;/strong&gt;&lt;br&gt;
The data structures underneath the storage engines, explained from first principles. Why the choice of structure determines almost everything about read and write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 3 and 4 - How do PostgreSQL and MongoDB store and retrieve data internally?&lt;/strong&gt;&lt;br&gt;
Deep dive into the heap file, shared_buffers, WAL, WiredTiger's B-Tree, the journal, MVCC, and what a real insert and read look like step by step in each engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 5 - How does Cassandra's LSM-based storage work end to end?&lt;/strong&gt;&lt;br&gt;
Memtables, SSTables, Bloom filters, compaction strategies, tombstones, and why deleting data in Cassandra is more complicated than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 6 - What do the benchmark numbers actually show - and why?&lt;/strong&gt;&lt;br&gt;
The C++ benchmark: 1 million records, controlled hardware, full methodology. Write throughput, read latency distributions, the compaction effect, and what explains every number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 7 - How does everything change when you go multi-node?&lt;/strong&gt;&lt;br&gt;
Replication, sharding, consistency levels, and why the single-node storage engine behavior is just the beginning. CAP theorem applied to real production decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Before we can explain why Cassandra writes twice as fast as PostgreSQL but reads three times slower before compaction, we need to understand the structures that cause those numbers.&lt;/p&gt;

&lt;p&gt;The next post is entirely about data structures - B+Tree, B-Tree, and LSM Tree. You'll understand why B+Trees are the default for read-heavy databases, why LSM Trees dominate write-heavy systems, and why the choice of structure is the single decision that every other database behavior flows from.&lt;/p&gt;

&lt;p&gt;If you've ever wondered why adding a sixth index to a table hurt write performance more than the fifth index did, the next post will have the answer.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>mongodb</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
