Forem: Yu Qian Yang

How We Built a Disaggregated Storage Layer for a Columnar MPP Database (And What Broke in Production)

Yu Qian Yang — Thu, 16 Apr 2026 10:30:27 +0000

The Problem

Enterprise data warehouses are expensive to run. When your database stores data on local SSDs, you're paying SSD prices for data that might only be queried once a month. For a bank managing petabytes of historical transaction data, this adds up fast.

The standard answer in 2022 was "move cold data to object storage" — S3, Alibaba OSS, Huawei OBS, take your pick. Object storage is cheap. The problem is that it's also slow: latency in the hundreds of milliseconds per request, compared to microseconds for local NVMe.

Our task: make object storage fast enough that a columnar MPP database could use it as a primary storage backend without destroying query performance.

The result was UniStore, a disaggregated storage layer that sits between the query engine and cloud object storage. This is the architecture we built, the tradeoffs we made, and the bugs we found the hard way.

Architecture

UniStore runs as a separate process on each database node. From the query engine's perspective, it looks like a local filesystem. Under the hood, it manages a local cache (around 1TB per node) backed by object storage.

Query Engine
     |
     | (POSIX-like file API)
     v
  UniStore Frontend
     |
     | (internal protocol)
     v
  UniStore Backend
     |          |
     v          v
Local Cache   Object Storage
(NVMe, ~1TB)  (S3/OSS/OBS/HDFS)

The frontend handles file open/read/write calls from the query engine. The backend manages the cache, coordinates uploads to object storage, and handles prefetching. They communicate over a Unix socket.

Object files are split into fixed-size blocks. Each block is independently cached, uploaded, and tracked. The metadata — which blocks exist, where they are, their cache state — is stored in RocksDB for persistence across restarts.
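As a rough illustration of the block layout, the sketch below maps a byte range of a file onto block indices. The 4 MiB block size and the names `BLOCK_SIZE`, `block_range`, and `blocks_for_read` are assumptions made for the example; the article does not state UniStore's actual block size or API.

```c
#include <stdint.h>

/* Hypothetical block size; the real value is not given in the article. */
#define BLOCK_SIZE ((uint64_t)4 * 1024 * 1024)

/* Inclusive range of block indices touched by a byte range. */
typedef struct {
    uint64_t first;
    uint64_t last;
} block_range;

/* Map a read of `len` bytes starting at `offset` to block indices.
 * `len` must be greater than zero. */
static block_range blocks_for_read(uint64_t offset, uint64_t len)
{
    block_range r;
    r.first = offset / BLOCK_SIZE;
    r.last = (offset + len - 1) / BLOCK_SIZE;
    return r;
}
```

With fixed-size blocks, a read that straddles a boundary simply touches two consecutive indices, and each index can be looked up, cached, or fetched on its own.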

Key Engineering: The Prefetch Engine

Cold reads from object storage are slow (~200ms per request vs ~100μs for local NVMe — roughly 2000x slower). The only way to make this tolerable is to have data in cache before the query asks for it.

UniStore's prefetch engine works by predicting sequential scan patterns. When the query engine opens a file and starts reading blocks sequentially, the backend detects the pattern and begins fetching ahead.

Two mechanisms drive this. The first is a sequential scan predictor: given the current read position and recent access history, it estimates which blocks will be needed next and issues prefetch requests proactively. The second is a pre-registration API: when the query optimizer knows in advance which files it will access, it registers this list before execution starts, and the backend begins warming the cache early.

To handle variable object storage latency, the prefetch timing uses a sliding-window average of recent round-trip times. If the storage backend is responding slowly, the prefetch window expands to compensate.
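One way to picture the sliding-window idea is the sketch below. It keeps the last few round-trip times in a ring buffer and derives how many blocks to keep in flight. The window length, the `prefetch_depth` formula, and all names are assumptions for illustration, not UniStore's actual tuning.

```c
#include <stddef.h>
#include <stdint.h>

#define RTT_WINDOW 8   /* hypothetical number of samples kept */

typedef struct {
    uint32_t samples_us[RTT_WINDOW]; /* most recent round-trip times */
    size_t   count;                  /* valid samples, at most RTT_WINDOW */
    size_t   next;                   /* slot to overwrite next */
    uint64_t sum_us;                 /* running sum of the valid samples */
} rtt_window;

/* Record one observed round-trip time, dropping the oldest when full. */
static void rtt_record(rtt_window *w, uint32_t rtt_us)
{
    if (w->count == RTT_WINDOW)
        w->sum_us -= w->samples_us[w->next];
    else
        w->count++;
    w->samples_us[w->next] = rtt_us;
    w->sum_us += rtt_us;
    w->next = (w->next + 1) % RTT_WINDOW;
}

static uint32_t rtt_average(const rtt_window *w)
{
    return w->count ? (uint32_t)(w->sum_us / w->count) : 0;
}

/* Blocks to keep in flight so a fetch issued now lands before the scan
 * reaches it: ceil(avg_rtt / scan time per block), clamped to [min, max].
 * consume_us_per_block must be greater than zero. */
static uint32_t prefetch_depth(const rtt_window *w, uint32_t consume_us_per_block,
                               uint32_t min_depth, uint32_t max_depth)
{
    uint32_t avg = rtt_average(w);
    uint32_t depth = (avg + consume_us_per_block - 1) / consume_us_per_block;
    if (depth < min_depth) depth = min_depth;
    if (depth > max_depth) depth = max_depth;
    return depth;
}
```

In this sketch, when the average round trip doubles, the computed depth doubles as well, which is the "window expands to compensate" behavior described above.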

Performance result: Cold start (empty cache) is roughly 20x slower than SSD. Warm cache (data already prefetched) performs on par with SSD. In practice, for repeated workloads on the same dataset, queries run at SSD speed despite the underlying storage being object storage.

Key Engineering: The Cache Layer

The cache uses a block-level LRU policy with pin/unpin support. Blocks referenced by active queries are pinned — they cannot be evicted until the query releases them. This prevents a pathological case where a long-running scan evicts its own prefetched data.

Eviction is triggered when cache utilization crosses a threshold. The eviction policy walks the LRU list, skipping pinned blocks, and frees the least recently used eligible blocks.
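A minimal sketch of that eviction walk, assuming an intrusive doubly linked LRU list and a per-block pin count. The struct layout and function names are hypothetical; the real cache also persists state to RocksDB, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct cache_block {
    struct cache_block *prev;  /* toward most recently used */
    struct cache_block *next;  /* toward least recently used */
    uint64_t size;
    int pin_count;             /* > 0 while an active query references it */
    int resident;              /* 1 while the block occupies cache space */
} cache_block;

typedef struct {
    cache_block *mru;
    cache_block *lru;
    uint64_t used_bytes;
} lru_cache;

static void lru_unlink(lru_cache *c, cache_block *b)
{
    if (b->prev) b->prev->next = b->next; else c->mru = b->next;
    if (b->next) b->next->prev = b->prev; else c->lru = b->prev;
    b->prev = b->next = NULL;
}

static void lru_insert_mru(lru_cache *c, cache_block *b)
{
    b->prev = NULL;
    b->next = c->mru;
    if (c->mru) c->mru->prev = b; else c->lru = b;
    c->mru = b;
    b->resident = 1;
    c->used_bytes += b->size;
}

/* Walk from the LRU end, skipping pinned blocks, until usage drops to
 * target_bytes or the list is exhausted. Returns the bytes freed. */
static uint64_t lru_evict(lru_cache *c, uint64_t target_bytes)
{
    uint64_t freed = 0;
    cache_block *b = c->lru;
    while (b && c->used_bytes > target_bytes) {
        cache_block *toward_mru = b->prev;   /* save before unlinking */
        if (b->pin_count == 0) {
            lru_unlink(c, b);
            b->resident = 0;
            c->used_bytes -= b->size;
            freed += b->size;
        }
        b = toward_mru;
    }
    return freed;
}
```

Note that if every remaining block is pinned, the walk ends without reaching the target; callers have to decide whether to wait or fail in that case.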

One design decision worth noting: we track metadata (block location, cache state, reference counts) in RocksDB rather than in-memory only. This means cache metadata survives process restarts — after a crash, UniStore knows exactly which blocks are in the local cache without having to scan the filesystem.

Multi-Cloud Backend Abstraction

The customer required support for multiple cloud providers simultaneously, with the ability to switch backends or add redundancy without query engine changes.

We implemented a worker abstraction: separate backend implementations for each storage type (OSS-compatible, HDFS, and local filesystem) all satisfy the same interface. The frontend doesn't know or care which backend is active. A factory component selects the right implementation based on configuration.
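The shape of such an abstraction can be sketched in C with a table of function pointers and a lookup function acting as the factory. The interface below (`storage_worker`, `put`, `get`, `worker_for`) and the stub backends are illustrative assumptions, not UniStore's real interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical worker interface: every backend fills in the same operations. */
typedef struct storage_worker {
    const char *type;
    int (*put)(const char *path, const void *data, size_t len);
    int (*get)(const char *path, void *buf, size_t len);
} storage_worker;

/* Stub backends. Real ones would call the provider SDK, the HDFS client,
 * or the local filesystem. */
static int oss_put(const char *p, const void *d, size_t n)  { (void)p; (void)d; (void)n; return 0; }
static int oss_get(const char *p, void *b, size_t n)        { (void)p; (void)b; (void)n; return 0; }
static int hdfs_put(const char *p, const void *d, size_t n) { (void)p; (void)d; (void)n; return 0; }
static int hdfs_get(const char *p, void *b, size_t n)       { (void)p; (void)b; (void)n; return 0; }
static int fs_put(const char *p, const void *d, size_t n)   { (void)p; (void)d; (void)n; return 0; }
static int fs_get(const char *p, void *b, size_t n)         { (void)p; (void)b; (void)n; return 0; }

static const storage_worker workers[] = {
    { "oss",   oss_put,  oss_get  },
    { "hdfs",  hdfs_put, hdfs_get },
    { "local", fs_put,   fs_get   },
};

/* Factory: return the worker named in configuration, or NULL if unknown. */
static const storage_worker *worker_for(const char *type)
{
    for (size_t i = 0; i < sizeof workers / sizeof workers[0]; i++)
        if (strcmp(workers[i].type, type) == 0)
            return &workers[i];
    return NULL;
}
```

Callers hold only a `const storage_worker *`, so swapping backends is a configuration change rather than a code change in the caller.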

This turned out to be more useful than we expected. During deployment, we needed to migrate data between cloud providers while keeping the database online. Because the worker abstraction was clean, we could run two backends simultaneously and drain one while filling the other — all transparent to the query engine.

What Broke in Production

Bug 1: Silent Data Corruption on Large File Writes

This one was subtle.

Large files are written as a sequence of blocks. The frontend issues completion signals one by one as each block finishes. The backend adds these to an async task queue — so in theory, blocks could be processed out of order.

The bug: when the last block was uploaded first (out of queue ordering), the multipart upload part number assignment collided with earlier blocks. The last block's data silently overwrote part of an earlier block. No error was returned. The file appeared to complete successfully, but the data was corrupt.

The root cause was in the upload type selection logic. The code was checking only buffer size to decide between single-upload and multipart-upload mode — but for files with multiple blocks, this check was wrong for any block beyond the first. It chose single-upload mode when it should have chosen multipart, which conflicted with an already-in-progress multipart upload session.

Fix: Gate the upload type decision on block position as well as buffer size. For any block beyond the first, force multipart mode.
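The fix can be sketched as follows. The threshold value and the function names are hypothetical. Deriving the part number from the block's position rather than from completion order is an additional assumption of this sketch; it relies on the fact that S3-style multipart part numbers start at 1.

```c
#include <stddef.h>

typedef enum { UPLOAD_SINGLE, UPLOAD_MULTIPART } upload_mode;

/* Hypothetical size above which a buffer is sent as a multipart upload. */
#define MULTIPART_THRESHOLD ((size_t)8 * 1024 * 1024)

/* Pre-fix logic: looks only at the buffer size, so a short trailing block
 * of a large file is wrongly sent as a single upload. */
static upload_mode choose_mode_size_only(size_t buf_len)
{
    return buf_len >= MULTIPART_THRESHOLD ? UPLOAD_MULTIPART : UPLOAD_SINGLE;
}

/* Post-fix logic: any block beyond the first is forced into multipart mode. */
static upload_mode choose_mode(size_t block_index, size_t buf_len)
{
    if (block_index > 0)
        return UPLOAD_MULTIPART;
    return choose_mode_size_only(buf_len);
}

/* Part number tied to block position, independent of processing order. */
static unsigned part_number_for(size_t block_index)
{
    return (unsigned)block_index + 1;
}
```

With the position-gated version, a 1 KiB final block at index 3 is still treated as part 4 of the multipart session, regardless of when the queue happens to process it.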

We also added a fault injection flag that forces the last block to be uploaded first, making the race condition deterministically reproducible in CI. This is the kind of test infrastructure we should have built before shipping — async upload pipelines have inherent ordering ambiguity, and that should have been a first-class test scenario from day one.

Bug 2: Cache Deadlock from Tmpfile Leak

The original write path worked like this: when writing a file, data was first buffered into a local temporary file, and a corresponding cache object was created. When the file write completed normally, the temporary file was transitioned to a state where it could eventually be evicted.

The problem: if an exception occurred before the file write completed — say, if object storage became temporarily unavailable — the temporary file and its associated cache objects remained in a state where they could not be evicted. Cache space leaked. Given enough retries (which the query engine would do automatically), the entire cache filled up with unevictable incomplete objects. New queries blocked waiting for cache space that would never be freed.

A secondary problem: block identifiers are allocated serially, so retrying the same failed write created new identifiers and new temporary files with each attempt — accelerating the leak.

Fix: Added a configuration flag that bypasses the local cache entirely for writes, streaming data directly to object storage via dedicated write implementations for each backend type. In the worst case, only one temporary file leaks (named by object path rather than block identifier).

The tradeoff: writes no longer populate the cache, so data written and immediately read will hit object storage instead of cache. For our workload (write-once, read-later analytics), this was acceptable.

Lessons

1. Test your failure modes, not just your happy path.
The data corruption bug only manifested when the task queue processed blocks out of order. This was always possible, but we didn't test for it. Fault injection infrastructure should be built alongside the feature, not after the bug is found.

2. Resource cleanup ordering in async systems needs explicit contracts.
Both bugs above were variations of the same mistake: implicit assumptions about ordering in async code. The tmpfile bug assumed writes would always complete cleanly. The data corruption bug assumed blocks would always be processed in submission order. Neither assumption was documented, and neither was enforced.

3. Metadata persistence matters.
Storing cache metadata in RocksDB rather than memory-only was the right call. It added complexity (serialization, recovery logic), but it meant that a crashed UniStore process could restart without invalidating the entire cache or losing track of in-flight uploads.

Results

Deployed at a major bank's big data center, managing thousands of PBs across multiple cloud providers (AWS S3, Alibaba Cloud OSS, Huawei Cloud OBS, Tencent Cloud). Cache size per node: ~1TB. Concurrent client connections supported: up to 17,500.

Cold start performance: ~20x slower than SSD (dominated by object storage round-trip latency).
Warm cache performance: on par with SSD.

For the bank's analytics workloads — where the same datasets are queried repeatedly within a business day — warm cache performance means object storage costs with SSD-equivalent query speed.

If you're working on similar storage infrastructure problems — disaggregated storage, object storage integration, or database performance engineering — feel free to reach out.

How I Used Bit Manipulation to Speed Up Float-to-Int Conversion in a Storage Engine

Yu Qian Yang — Wed, 15 Apr 2026 02:36:14 +0000

In a columnar time-series database, one of the most effective compression tricks is
deceptively simple: if a float value is actually an integer, store it as one.

Why Integers Compress Better Than Floats

Integer compression algorithms like Delta-of-Delta, ZigZag, and Simple8b work by
exploiting predictable bit patterns — small deltas between adjacent values, values
that fit in fewer than 64 bits, and so on. They can pack multiple values into a
single 64-bit word.

Floats don't cooperate with these schemes. Even 1.0 and 2.0 have completely
different IEEE 754 bit representations (0x3FF0000000000000 and 0x4000000000000000).
Their XOR is large, their delta is meaningless as an integer, and bit-packing is useless.
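To see those bit patterns directly, a small helper can reinterpret a double as a 64-bit integer. The helper name `double_bits` is made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 bit pattern of a double (memcpy avoids aliasing UB). */
static uint64_t double_bits(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits;
}
```

`double_bits(1.0) ^ double_bits(2.0)` is 0x7FF0000000000000, whereas the integers 1 and 2 XOR to 3. That gap is what makes the integer path worth the detour.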

So when a column is declared as FLOAT but actually contains values like 12.0,
18.0, 25.0 — which happens more often than you'd expect, either because the schema
was designed generically or because the upstream system always emits .0 values — you're
leaving significant compression headroom on the table.

The fix: detect these integer-valued floats at encode time, convert them losslessly
to integers, and route them through the integer compression path.

A temperature sensor that reports 21.0, 21.5, 22.0 is a good example. Multiply
by 10 and you get 210, 215, 220 — plain integers with small, predictable deltas.
Delta-of-Delta or Simple8b will compress these far more efficiently than any
float-specific scheme.
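The temperature example works out as in the sketch below, which scales each reading and emits first-order deltas. The function name is hypothetical, and the rounding expression is a dependency-free stand-in for llround that assumes the scaled value fits in an int64_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale each reading, round to the nearest integer, and write first-order
 * deltas into out. out[0] holds the first scaled value itself. */
static void scaled_deltas(const double *in, size_t n, double scaler, int64_t *out)
{
    int64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        double s = in[i] * scaler;
        /* Round half away from zero; assumes the result fits in int64_t. */
        int64_t cur = (int64_t)(s < 0 ? s - 0.5 : s + 0.5);
        out[i] = cur - prev;
        prev = cur;
    }
}
```

For the readings 21.0, 21.5, 22.0 with a scaler of 10, the output is 210, 5, 5: one base value followed by small constant deltas.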

The challenge: before converting, you need to check whether the scaled value can be
losslessly represented as an integer. The naive check — std::isnan + range comparison —
works but it's slower than it needs to be on the hot encoding path.

Here's the faster approach I implemented, using nothing but bit manipulation.

The Setup: Scaling Floats to Integers

The encoding scheme works in two steps:

Scale: multiply the float by 10^scale (configurable per column)
Convert: round the scaled value to an integer with llround

For example, with scale = 2:

1.23 → 1.23 * 100 = 123.0 → 123
45.678 → 45.678 * 100 = 4567.8 → not a whole number, so converting would lose precision

Step 2 only makes sense if the scaled value actually fits in the target integer type.
That's the overflow check.

The Overflow Check

The function takes a pointer to the raw float bytes and the target integer width in bytes.
It returns non-zero if the value would overflow.

It is called before every conversion. If it fires, the encoder skips the integer path
and falls back to float encoding.

The key insight: you can determine whether a float overflows a given integer type
purely from the float's exponent bits, without doing any arithmetic.

Here's why.

IEEE 754 in One Paragraph

A double-precision float is stored as 64 bits:

[ sign: 1 bit ][ exponent: 11 bits ][ fraction: 52 bits ]

The value is: 1.fraction × 2^(exponent − 1023)

The 1023 is the bias — it allows the 11-bit exponent field to represent negative
exponents. The real exponent is stored_exponent − 1023.

For 32-bit floats: 8 exponent bits, bias 127, fraction 23 bits.

Extracting the Exponent

For a double:

uint64_t bits;
memcpy(&bits, src, 8);                        /* safe type-pun, no UB */
int16_t real_exp = (int16_t)((bits >> 52) & 0x07ff) - 1023;

Step by step:

memcpy into a uint64_t — reinterpret the 8 bytes as a 64-bit integer (no arithmetic, just bits)
>> 52 — shift right past the 52 fraction bits, bringing the exponent to the low end
& 0x07ff — mask off the sign bit, keep only the 11 exponent bits
- 1023 — subtract the bias to get the real exponent

For a float:

uint32_t bits;
memcpy(&bits, src, 4);
int16_t real_exp = (int16_t)((bits >> 23) & 0xff) - 127;

Same logic: shift past 23 fraction bits, mask 8 exponent bits, subtract bias 127.

The Overflow Condition

Once you have the real exponent, the overflow check is one comparison:

is_overflow = real_exp > int_typewidth * 8 - 2;

Where does - 2 come from?

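−1 for the sign bit: a signed integer of N bits can hold magnitudes up to 2^(N-1) - 1
−1 for the implicit leading 1: in IEEE 754, the significand is 1.fraction, not 0.fraction

A float with real exponent E has a magnitude in [2^E, 2^(E+1)), so its integer part
needs E + 1 bits (the implicit 1 plus the E bits below it). For it to fit in a signed
N-bit integer, you need E + 1 ≤ N - 1, which simplifies to E ≤ N - 2.

NaN and infinities store an all-ones exponent field, so their real exponent (1024 for a
double) always exceeds N - 2 and the same comparison rejects them. The check is slightly
conservative at the bottom of the range: exactly -2^(N-1) fits in N bits but is reported
as overflow. One caveat for targets narrower than the significand: the check runs before
rounding, so a value such as 32767.5 passes for a 16-bit target yet rounds up to 32768.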

Full implementation in C:

#include <stdint.h>
#include <string.h>

/* Returns 1 if the double at src overflows a signed integer of int_bytes bytes. */
static inline int double_overflow_check(const char *src, int int_bytes)
{
    uint64_t bits;
    memcpy(&bits, src, 8);
    int16_t real_exp = (int16_t)((bits >> 52) & 0x07ff) - 1023;
    return real_exp > int_bytes * 8 - 2;
}

/* Returns 1 if the float at src overflows a signed integer of int_bytes bytes. */
static inline int float_overflow_check(const char *src, int int_bytes)
{
    uint32_t bits;
    memcpy(&bits, src, 4);
    int16_t real_exp = (int16_t)((bits >> 23) & 0xff) - 127;
    return real_exp > int_bytes * 8 - 2;
}

Total cost: one memcpy, one shift, one AND, one subtract, one compare.
No floating-point arithmetic, no branches on the value itself.
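The boundary behavior is easy to probe with a by-value variant of the same comparison. The two helpers below are written for this example only: `double_overflows` mirrors the check above but takes a double directly, and `double_from_bits` builds NaN and infinity from raw bit patterns.

```c
#include <stdint.h>
#include <string.h>

/* Same comparison as the pointer-based check, taking the double by value. */
static int double_overflows(double v, int int_bytes)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);
    int real_exp = (int)((bits >> 52) & 0x07ff) - 1023;
    return real_exp > int_bytes * 8 - 2;
}

/* Build a double from a raw bit pattern, used here for NaN and infinity. */
static double double_from_bits(uint64_t bits)
{
    double v;
    memcpy(&v, &bits, sizeof v);
    return v;
}
```

For an 8-byte target, 2^62 passes while 2^63, -2^63, NaN (0x7FF8000000000000), and infinity (0x7FF0000000000000) are all reported as overflow. Zero and subnormals have a stored exponent of 0, so their real exponent is -1023 and they always pass.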

How It Fits into the Encoder

The encoder scales the value first, then calls the overflow check on the scaled result:

double scaled = orig * scaler;               /* scale: e.g. orig * 100.0 */
if (double_overflow_check((char *)&scaled, sizeof(int64_t)))
    return ENCODE_OVERFLOW;                  /* fall back to float encoding */

int64_t result = llround(scaled);            /* safe: overflow already ruled out */

The scale factor is stored in the column header so the decoder can reverse the
operation: decoded = (double)stored_integer / pow(10, scale).
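Putting the pieces together, an encode/decode round trip might look like the sketch below. Several parts are assumptions for illustration: the `ENCODE_OK` and `ENCODE_INEXACT` codes, the helper names, verifying losslessness by decoding and comparing against the original, and the rounding expression, which is a dependency-free stand-in for llround.

```c
#include <stdint.h>
#include <string.h>

/* ENCODE_OVERFLOW comes from the article; the other two codes are hypothetical. */
enum { ENCODE_OK = 0, ENCODE_OVERFLOW = 1, ENCODE_INEXACT = 2 };

/* 10^scale for small non-negative scales, computed without libm. */
static double pow10_scale(int scale)
{
    double p = 1.0;
    while (scale-- > 0)
        p *= 10.0;
    return p;
}

/* Scale, check the exponent, round, then verify the value survives decoding. */
static int encode_scaled(double orig, int scale, int64_t *out)
{
    double scaler = pow10_scale(scale);
    double scaled = orig * scaler;
    uint64_t bits;
    memcpy(&bits, &scaled, sizeof bits);
    if ((int)((bits >> 52) & 0x07ff) - 1023 > 62)
        return ENCODE_OVERFLOW;            /* also catches NaN and infinity */
    /* Round half away from zero; safe because |scaled| < 2^63 here. */
    int64_t v = (int64_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
    if ((double)v / scaler != orig)
        return ENCODE_INEXACT;             /* not lossless at this scale */
    *out = v;
    return ENCODE_OK;
}

static double decode_scaled(int64_t stored, int scale)
{
    return (double)stored / pow10_scale(scale);
}
```

With scale = 2, the value 1.23 encodes to 123 and decodes back to exactly 1.23, while 45.678 is rejected as inexact and 1e300 as overflow.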

Why Not Just Use std::isnan + Range Check?

The conventional approach:

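if (std::isnan(value)) return false;
/* (double)INT64_MAX rounds up to 2^63, so compare against 2^63 with >= */
if (value >= 9223372036854775808.0 || value < -9223372036854775808.0) return false;
return true;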

This involves floating-point comparisons, which on many architectures require the
value to be loaded into a float register before comparison. On a hot encoding path
processing millions of values, the difference adds up.

The bit manipulation approach operates entirely on integer registers. The float's
bytes are reinterpreted as an integer, and no floating-point unit is involved until the
final llround conversion, which only happens once overflow has been ruled out.

What This Enables

This check is the entry gate for the full encoding chain:

float column
    ↓
float_overflow_check / double_overflow_check   ← this article
    ↓ (passes)
float → integer cast
    ↓
Delta+ZigZag encoding
    ↓
Simple8b bit-packing

Without a cheap overflow gate, the chain can't run on untrusted float data. With it,
each value costs one check before entering the integer compression path — which can
achieve far better compression ratios than float-specific schemes on "integer-like"
time-series data.

What's Next

This article is part of a series on compression engineering in time-series databases:

Part 1: Runtime adaptive compression — how the system selects the best algorithm without scanning all data (published)
Part 3: Chained encoding — the full float-to-integer → Delta+ZigZag → Simple8b pipeline
Part 4: An improved floating-point compression algorithm based on ELF

I'm currently available for freelance work on backend systems, storage engineering,
and systems integration. Feel free to reach out.

How I Built a Runtime Adaptive Compression System That Selects the Best Algorithm Without Scanning All Data

Yu Qian Yang — Tue, 14 Apr 2026 13:31:07 +0000

When I was working on a columnar time-series database built on Greenplum, we faced a
problem that every time-series storage engineer eventually hits:

No single compression algorithm works best for all data.

A sensor reading temperature every second looks nothing like a financial tick stream.
A column of integer IDs compresses completely differently from a column of floating-point
measurements. And in the real world, the same column can change character over time —
steady signal for hours, then suddenly a burst of noise.

The naive answer is: let the user pick an algorithm. But that puts the burden on the user,
and in practice they almost always pick wrong — or just leave the default.

So I designed automode: a runtime analyzer that samples incoming data, detects its
statistical character, and selects the best encoding chain automatically.

Here's how it works.

The Encoding Chain

Before talking about the analyzer, I need to briefly explain what it's selecting between.

We built a set of encoding algorithms, each optimized for a specific data pattern:

Run-Length Encoding — all values are identical
Delta-of-Delta — slowly changing values (timestamps, smooth sensors)
Delta + ZigZag — values oscillating around a baseline
Bit Packing — small integers that can be bit-packed
Gorilla — floating-point values with small XOR between adjacent values

The insight that made automode possible: these algorithms operate on fundamentally
different statistical features of data. If I can identify the dominant feature of a
data stream, I can map it directly to the right algorithm.

The Design Problem

The obvious approach — scan all the data and compute statistics — doesn't work in a
storage engine. The whole point of compression is to be fast. Scanning every value
before encoding defeats the purpose.

I needed a way to characterize a data stream with minimal overhead.

The solution: sample windows.

Instead of scanning the full column, I divide the input stream into fixed-size windows
and analyze only a small number of items from each window. Within each window, the
sample start position is randomized to avoid being fooled by local patterns.

Why randomize? Consider this sequence:

[1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 2 3 2 3 2 3 2 3]

If you always sample from the start, you'd classify this as "all equal" and pick
Run-Length Encoding. But the second half of the stream is an alternating pattern, on
which RLE would be terrible. Random sampling within the window gives a much more
representative picture.
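The window sampling described above could be sketched like this. The function name, the parameters, and the use of a small xorshift generator (chosen so the example is reproducible) are all assumptions; the article does not say how automode draws its random offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32); state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* For each full window of `window` items in a stream of `n` items, choose a
 * random start index such that `sample_len` consecutive items fit inside the
 * window. Writes absolute start indices and returns how many were written. */
static size_t pick_sample_starts(size_t n, size_t window, size_t sample_len,
                                 uint32_t seed, size_t *starts, size_t max_starts)
{
    size_t count = 0;
    uint32_t state = seed ? seed : 1u;
    if (sample_len == 0 || sample_len > window)
        return 0;
    for (size_t base = 0; base + window <= n && count < max_starts; base += window) {
        size_t slack = window - sample_len;   /* valid offsets: 0..slack */
        starts[count++] = base + xorshift32(&state) % (slack + 1);
    }
    return count;
}
```

Only `sample_len` items per window are then inspected, so the analysis cost scales with the number of windows rather than the number of values.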

The Five Features

Each sampled window is analyzed for five statistical features:

All Equal — all sampled values are identical
Values bitwise small — the values themselves fit in few bits
1st-order delta small — differences between adjacent values are small
2nd-order delta small — differences between differences are small (delta-of-delta)
Float XOR small — XOR of adjacent float values is small (Gorilla pattern)

For each window, the analyzer computes 0th, 1st, and 2nd order differentials, plus
XOR pairs for float data. "Small" is defined bitwise: a value is small if its
significant bit length is below a configurable threshold. This makes the check
robust against occasional spike values — a single outlier doesn't ruin the
classification.
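A bitwise "small" test might look like the following sketch. Two details are assumptions of the example rather than statements about automode: signed values are ZigZag-mapped first so that small negative deltas count as small, and robustness to spikes is modeled as an explicit outlier budget.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of significant bits in v (0 for v == 0). */
static unsigned bit_length(uint64_t v)
{
    unsigned n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* ZigZag mapping: 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ... */
static uint64_t zigzag(int64_t v)
{
    uint64_t shifted = (uint64_t)v << 1;
    return v < 0 ? ~shifted : shifted;
}

/* Returns 1 if all but at most max_outliers values fit in max_bits bits. */
static int mostly_small(const int64_t *vals, size_t n, unsigned max_bits,
                        size_t max_outliers)
{
    size_t outliers = 0;
    for (size_t i = 0; i < n; i++)
        if (bit_length(zigzag(vals[i])) > max_bits && ++outliers > max_outliers)
            return 0;
    return 1;
}
```

The same predicate can be applied to the raw values, their first-order deltas, or their second-order deltas to obtain the corresponding features.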

Strong vs. Weak Features

Not all features are equal. I introduced a Strong/Weak classification:

Strongest: All Equal → Run-Length Encoding
Strong: 2nd-order delta small → Delta-of-Delta
Strong: Float XOR small → Gorilla
Weak: Values bitwise small → Bit Packing
Weak: 1st-order delta small → Delta + ZigZag

Why does this matter? Consider a case where Delta-of-Delta and Bit Packing score
similarly. Delta-of-Delta is fundamentally better for smoothly varying time-series
data, even with a slightly lower score. The weight system ensures strong methods
are preferred when they're competitive.
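A weighted vote is one plausible way to realize this preference. In the sketch below, the numeric weights, the enum names, and the scoring rule (window hits multiplied by weight) are all hypothetical; the article does not give automode's actual weights.

```c
/* Features in the order listed above. FEAT_COUNT doubles as "no feature". */
typedef enum {
    FEAT_ALL_EQUAL,
    FEAT_DOD_SMALL,
    FEAT_XOR_SMALL,
    FEAT_BITS_SMALL,
    FEAT_DELTA_SMALL,
    FEAT_COUNT
} feature;

/* Hypothetical weights: strongest = 4, strong = 3, weak = 2. */
static const unsigned feature_weight[FEAT_COUNT] = { 4, 3, 3, 2, 2 };

/* hits[f] is the number of sample windows in which feature f was detected.
 * Returns the feature with the highest weighted score, or FEAT_COUNT if no
 * feature was detected in any window. */
static feature dominant_feature(const unsigned hits[FEAT_COUNT])
{
    feature best = FEAT_COUNT;
    unsigned best_score = 0;
    for (int f = 0; f < FEAT_COUNT; f++) {
        unsigned score = hits[f] * feature_weight[f];
        if (score > best_score) {
            best_score = score;
            best = (feature)f;
        }
    }
    return best;
}
```

With these example weights, Delta-of-Delta detected in 9 windows (score 27) beats Bit Packing detected in 10 (score 20), while a weak feature still wins when the strong one is clearly not competitive.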

One Edge Case Worth Noting

The alternating pattern [A B A B A B] is a trap.

It looks like "values bitwise small", but Bit Packing performs poorly on alternating
sequences. A general-purpose compressor actually handles this pattern better.

I added an explicit check: if odd-indexed and even-indexed items within the sample
are each internally equal, it's an alternating pattern — skip the Bit Packing
classification and fall back to general compression.
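The alternating check reduces to comparing each item with the one two positions earlier. The sketch below is a minimal version; the function name and the minimum sample length of 4 are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the sample has the form A B A B ... with A != B.
 * Even-indexed items all equal and odd-indexed items all equal is the same
 * as s[i] == s[i - 2] for every i >= 2. All-equal input returns 0, since
 * that case belongs to Run-Length Encoding. */
static int is_alternating(const int64_t *s, size_t n)
{
    if (n < 4)
        return 0;
    for (size_t i = 2; i < n; i++)
        if (s[i] != s[i - 2])
            return 0;
    return s[0] != s[1];
}
```

A sample that returns 1 here would skip the Bit Packing classification and go to the general-purpose fallback.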

This kind of edge case only appears when you test against real production data. In our
case, it showed up in sensor data from a deployment at a large financial institution.

Feature → Algorithm Mapping

Once features are aggregated across all sample windows, the dominant feature maps to
an algorithm:

All Equal → Run-Length Encoding
2nd-order delta small → Delta-of-Delta
Values bitwise small → Bit Packing
1st-order delta small → Delta + ZigZag
Float XOR small → Gorilla
No clear pattern → General Compression (fallback)

For floating-point columns, the encoding chain adds a float-to-integer conversion pass
first — converting floats that happen to be integers (like 1.0, 2.0) into actual
integers before passing to the integer encoding path. This is worth a separate article.

Two Modes

The system supports two optimization priorities:

Compression-rate mode: maximize storage efficiency
Speed mode: maximize decompression speed

In practice, compression-rate mode is used for cold storage and analytics workloads,
and speed mode for hot data that's frequently queried. The same feature detection runs
in both cases — only the final algorithm selection weights differ.

Results

The system is deployed in production at scale, including a large-scale installation
at a major financial institution managing thousands of PBs of time-series data. In
testing, automode consistently matched or outperformed manually-tuned single-algorithm
compression across diverse real-world datasets.

The key insight that made it work: you don't need to scan all the data to understand
its character. A few dozen samples, taken randomly from fixed-size windows, are
enough to make a reliable classification — as long as you handle the edge cases.

What's Next

This article is part of a series on compression engineering in time-series databases:

Part 2: Float-to-integer encoding — how I used IEEE 754 bit manipulation to detect convertibility with near-zero overhead
Part 3: Chained encoding — a unified float compression pipeline
Part 4: An improved floating-point compression algorithm based on ELF

I'm currently available for freelance work on backend systems, storage engineering,
and systems integration. Feel free to reach out.
