<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Galadd</title>
    <description>The latest articles on Forem by Galadd (@galadd).</description>
    <link>https://forem.com/galadd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1080374%2Fac21e2e0-edc3-4b7b-ab69-91345352929b.jpeg</url>
      <title>Forem: Galadd</title>
      <link>https://forem.com/galadd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/galadd"/>
    <language>en</language>
    <item>
      <title>Optimizing Persistent Storage with State Deltas</title>
      <dc:creator>Galadd</dc:creator>
      <pubDate>Thu, 22 Jan 2026 11:20:28 +0000</pubDate>
      <link>https://forem.com/galadd/optimizing-persistent-storage-with-state-deltas-4emp</link>
      <guid>https://forem.com/galadd/optimizing-persistent-storage-with-state-deltas-4emp</guid>
      <description>&lt;p&gt;Optimizing persistent storage through deltas might sound like a general solution—but this article is intentionally specific. This approach worked because I had full control over the data model and could make assumptions about how the data evolves over time.&lt;/p&gt;

&lt;p&gt;That won’t always be the case. You might not be able to reshape your data this way, or your workload might look completely different. Still, optimization often comes down to understanding your data deeply and being willing to look at it from unconventional angles. Even if you don’t apply this exact approach, I hope it offers a useful perspective.&lt;/p&gt;

&lt;p&gt;The idea itself is simple—almost offensively so. There’s nothing particularly novel or academically complex here. But sometimes, stepping away from heavy abstractions and focusing on the shape of your data can yield outsized wins.&lt;/p&gt;

&lt;p&gt;So I’ll get straight to the point.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Every 3.5 hours, a full system state was written to disk. Even after compression, each snapshot was roughly &lt;strong&gt;140 MB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This works, but it doesn’t scale. A new node syncing the network would need to download and store every historical snapshot. Over time, this balloons into terabytes of required disk space.&lt;/p&gt;

&lt;p&gt;To put concrete numbers on it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt;: ~10 TB of required storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt;: ~2 TB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s roughly an &lt;strong&gt;80% reduction&lt;/strong&gt; in disk usage.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;em&gt;Why Not Just Snapshot Less Often?&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;Although a new state exists every 12 seconds, not all states are equal. Each state contains specific information that allows reconstruction of prior states and verification of the chain’s history.&lt;/p&gt;

&lt;p&gt;As snapshots get farther apart, reconstructing intermediate states becomes increasingly expensive. The farther a target state is from the nearest snapshot, the more work is required to replay or reconstruct it. This directly impacts sync time and verification latency.&lt;/p&gt;

&lt;p&gt;Snapshots therefore can’t be spaced arbitrarily far apart without shifting cost elsewhere. There’s a practical lower bound on snapshot frequency if you want reasonable reconstruction times.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Core Idea: State Deltas
&lt;/h3&gt;

&lt;p&gt;The key observation was simple: consecutive full states written only 3.5 hours apart are extremely similar.&lt;/p&gt;

&lt;p&gt;Instead of storing the entire state every time, I switched to storing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One &lt;strong&gt;base state&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Followed by &lt;strong&gt;deltas&lt;/strong&gt;, which encode only the differences between states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a full state is needed, the base state is loaded and the deltas are applied sequentially.&lt;/p&gt;
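&lt;p&gt;That reconstruction step is essentially a fold over the deltas. As a minimal sketch (the &lt;code&gt;State&lt;/code&gt; and &lt;code&gt;Delta&lt;/code&gt; types here are toy stand-ins; the real ones carry far more fields):&lt;/p&gt;

```rust
// Toy sketch of delta-based reconstruction. `State` and `Delta` are
// placeholders for the real, much richer types.
#[derive(Clone, Debug, PartialEq)]
struct State {
    counter: u64, // stand-in for real state fields
}

struct Delta {
    counter_change: i64, // only the difference is stored
}

fn apply_delta(state: &State, delta: &Delta) -> State {
    State {
        counter: (state.counter as i64 + delta.counter_change) as u64,
    }
}

/// Load the base state, then apply each delta in order to reach the target.
fn reconstruct(base: &State, deltas: &[Delta]) -> State {
    deltas.iter().fold(base.clone(), |s, d| apply_delta(&s, d))
}
```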

&lt;p&gt;This slightly changes read complexity—from a single &lt;code&gt;O(x)&lt;/code&gt; load to &lt;code&gt;O(x + delta_application)&lt;/code&gt; for each delta between the base and the target state—but in practice, the tradeoff is well worth it. Applying a delta took roughly &lt;strong&gt;200 ms&lt;/strong&gt;, which was acceptable for this workload.&lt;/p&gt;

&lt;p&gt;The result is dramatically less data written to disk and significantly less data read during sync.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Avoiding Disk Reads: Caching the Base State
&lt;/h4&gt;

&lt;p&gt;Initially, delta computation was slower than expected. The reason was obvious in hindsight: computing a delta requires access to both the base state and the new state.&lt;/p&gt;

&lt;p&gt;My first implementation read the base state from disk every time. That was a mistake.&lt;/p&gt;

&lt;p&gt;If something is accessed repeatedly and fits comfortably in memory, it should be cached.&lt;/p&gt;

&lt;p&gt;Each base state is reused for roughly &lt;strong&gt;7 subsequent deltas&lt;/strong&gt;, making caching an easy win. I introduced a simple in-memory cache using an &lt;code&gt;Arc&amp;lt;RwLock&amp;lt;Option&amp;lt;_&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;base_state_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;RwLock&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the base state is cached, reuse it&lt;/li&gt;
&lt;li&gt;Otherwise, load it once from disk and cache it&lt;/li&gt;
&lt;li&gt;Subsequent accesses should not hit disk again
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;load_base_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;H256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BeaconState&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.load_cached_base_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.state_by_block_root&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.set_cached_base_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this change, both read and write performance improved significantly.&lt;/p&gt;

&lt;p&gt;Overall, the new approach reduced time spent on state storage by roughly &lt;strong&gt;70%&lt;/strong&gt;, while also drastically reducing disk usage.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  A Closer Look: Optimizing Balances
&lt;/h3&gt;

&lt;p&gt;One field deserved special treatment: &lt;strong&gt;balances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Balances are stored as a Merkle tree containing over &lt;strong&gt;2,000,000&lt;/strong&gt; &lt;code&gt;u64&lt;/code&gt; &lt;strong&gt;values&lt;/strong&gt;, and roughly &lt;strong&gt;1,000,000 of them change&lt;/strong&gt; between states. At first glance, this suggests that storing half the values is unavoidable—but that assumption turns out to be wrong.&lt;/p&gt;

&lt;p&gt;The key lesson here is simple: &lt;strong&gt;know your data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After analyzing balance changes, four patterns emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Many values don’t change&lt;/li&gt;
&lt;li&gt;Many values change to zero&lt;/li&gt;
&lt;li&gt;Many values change, but only by a small amount&lt;/li&gt;
&lt;li&gt;Very few values change by a large amount&lt;/li&gt;
&lt;/ol&gt;
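&lt;p&gt;A hypothetical classifier for these four patterns might look like the following. The &lt;code&gt;SMALL_DIFF_LIMIT&lt;/code&gt; threshold is an assumption chosen so that small diffs fit an &lt;code&gt;i32&lt;/code&gt;, not necessarily the exact cutoff used in the real code:&lt;/p&gt;

```rust
// Illustrative classifier for balance transitions. The threshold is an
// assumption: anything whose signed difference fits an i32 counts as "small".
const SMALL_DIFF_LIMIT: i64 = i32::MAX as i64;

#[derive(Debug, PartialEq)]
enum Change {
    NoChange,        // value is identical between states
    SetToZero,       // value dropped to exactly zero
    SmallDiff(i32),  // difference fits a compact signed integer
    FullValue(u64),  // difference too large; store the new value outright
}

fn classify(old: u64, new: u64) -> Change {
    if new == old {
        Change::NoChange
    } else if new == 0 {
        Change::SetToZero
    } else {
        // Fine for balance-sized values; extreme u64s would need a wider type.
        let diff = new as i64 - old as i64;
        if diff.abs() <= SMALL_DIFF_LIMIT {
            Change::SmallDiff(diff as i32)
        } else {
            Change::FullValue(new)
        }
    }
}
```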

&lt;p&gt;Once this distribution is clear, storing every changed value as a full &lt;code&gt;u64&lt;/code&gt; becomes wasteful.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Encoding Changes with 2 Bits
&lt;/h4&gt;

&lt;p&gt;Each balance change can be classified into one of four categories, meaning it can be represented using &lt;strong&gt;2 bits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I used a compact tag vector where each entry encodes the change type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;00&lt;/code&gt; → no change&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;10&lt;/code&gt; → set to zero&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;11&lt;/code&gt; → apply a small diff&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;01&lt;/code&gt; → set to a full target value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four balance entries fit into a single byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;BitTagVec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// 4 entries per byte&lt;/span&gt;
    &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;BitTagVec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="nf"&gt;.div_ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;#[inline]&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;shift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;0b11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;#[inline]&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;shift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;0b11&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For ~2,000,000 balances, this tag vector alone takes roughly &lt;strong&gt;500 KB&lt;/strong&gt; (2 bits × 2,000,000 entries = 500,000 bytes).&lt;/p&gt;

&lt;p&gt;The associated data is stored separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;BalanceDiffs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BitTagVec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;small_diffs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Gwei&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;new_balances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Gwei&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;small_diffs&lt;/strong&gt; store small balance changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;target_values&lt;/strong&gt; store large or special-case updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;new_balances&lt;/strong&gt; represent newly added balances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applying the delta is a linear pass over the tag vector, applying the corresponding operation to each entry. The logic is predictable, cache-friendly, and fast.&lt;/p&gt;
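&lt;p&gt;To make that apply pass concrete, here is a simplified sketch. For readability it takes each 2-bit tag as a full byte rather than reading it from the packed vector, and the function and cursor names are mine, not the PR's:&lt;/p&gt;

```rust
// Simplified apply pass over a balance delta. Tags are pre-unpacked into one
// byte each; the real implementation reads them from the packed 2-bit vector.
const NO_CHANGE: u8 = 0b00;
const FULL_VALUE: u8 = 0b01;
const SET_ZERO: u8 = 0b10;
const SMALL_DIFF: u8 = 0b11;

fn apply_balance_diffs(
    balances: &mut Vec<u64>,
    tags: &[u8],
    small_diffs: &[i32],
    target_values: &[u64],
    new_balances: &[u64],
) {
    // Cursors into the side arrays; each tag consumes from at most one of them.
    let (mut di, mut ti) = (0, 0);
    for (i, &tag) in tags.iter().enumerate() {
        match tag {
            NO_CHANGE => {}
            SET_ZERO => balances[i] = 0,
            SMALL_DIFF => {
                balances[i] = (balances[i] as i64 + small_diffs[di] as i64) as u64;
                di += 1;
            }
            FULL_VALUE => {
                balances[i] = target_values[ti];
                ti += 1;
            }
            _ => unreachable!("tags are 2 bits wide"),
        }
    }
    // Balances added since the base state are appended wholesale.
    balances.extend_from_slice(new_balances);
}
```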

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This work started as a practical optimization and became a reminder of something fundamental: effective storage optimization often has less to do with clever algorithms and more to do with understanding your data.&lt;/p&gt;

&lt;p&gt;By switching from full-state storage to delta-based storage—and by tailoring the delta format to how the data actually changes—I was able to significantly reduce both disk usage and runtime costs.&lt;/p&gt;

&lt;p&gt;If you have ideas on how this approach could be improved, or alternative ways you would model these deltas, I’d love to hear your thoughts. The full PR is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/grandinetech/grandine/pull/523/files" rel="noopener noreferrer"&gt;https://github.com/grandinetech/grandine/pull/523/files&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing this was also an attempt to write more and engage more openly. Constructive critique—technical or otherwise—is very welcome.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>database</category>
    </item>
  </channel>
</rss>
