<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: iamradioactive</title>
    <description>The latest articles on Forem by iamradioactive (@iamradioactive).</description>
    <link>https://forem.com/iamradioactive</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F189047%2F3f095458-38a9-4820-af4f-ccac1f83ec89.jpg</url>
      <title>Forem: iamradioactive</title>
      <link>https://forem.com/iamradioactive</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iamradioactive"/>
    <language>en</language>
    <item>
      <title>Inside Airbnb’s Mussel v2: Rebuilding a Petabyte-Scale Key-Value Store</title>
      <dc:creator>iamradioactive</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:56:34 +0000</pubDate>
      <link>https://forem.com/iamradioactive/inside-airbnbs-mussel-v2-rebuilding-a-petabyte-scale-key-value-store-ho9</link>
      <guid>https://forem.com/iamradioactive/inside-airbnbs-mussel-v2-rebuilding-a-petabyte-scale-key-value-store-ho9</guid>
      <description>&lt;p&gt;Airbnb recently published a detailed write-up on Mussel v2, the next generation of their internal key-value store for serving derived data. The system sits between offline data processing and online applications, turning large batch outputs and streaming updates into low-latency lookups used by systems like fraud detection, recommendations, and pricing.&lt;/p&gt;

&lt;p&gt;The new architecture is a substantial redesign. It introduces a stateless dispatch layer, range-based sharding, stronger consistency controls, and a new storage backend designed to handle both large bulk loads and high-volume streaming updates in the same cluster. Migrating to it required moving more than a petabyte of data across thousands of tables, all while keeping production traffic online.&lt;/p&gt;

&lt;p&gt;This article walks through both the original Mussel design and the new architecture, focusing on the engineering decisions behind the redesign.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Mussel Is - and Why It's a Hard Problem
&lt;/h2&gt;

&lt;p&gt;Airbnb's fraud detection, recommendation engine, pricing models, and personalization systems all share a common need: take data computed offline - in warehouses and batch pipelines - and make it readable online at millisecond latency.&lt;/p&gt;

&lt;p&gt;This is what engineers call the derived data problem. You can't query a data warehouse in real time. You can't fit 100TB into Redis. You need a system in the middle: one that can absorb large batch loads from offline compute, accept real-time streaming updates, and serve all of it under low-latency read paths.&lt;/p&gt;

&lt;p&gt;Most teams end up with three separate systems for this: an object store for batch ingestion, a cache for reads, and a message queue for streaming writes. Each is optimized for its job, but keeping them in sync adds complexity and operational overhead.&lt;/p&gt;

&lt;p&gt;By the time v1 was showing its limits, Mussel wasn't a small system. It was already serving over 4,000 client tables, managing roughly 130TB of data, and handling 800k+ read QPS and 35k write QPS with p95 read latency under 8ms. This was a mature, production-critical system being replaced, not a prototype being retired.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Breaking in v1
&lt;/h2&gt;

&lt;p&gt;v1's problems weren't dramatic. They were the kind that build up slowly until they become a ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations were too manual.&lt;/strong&gt; v1's storage layer was built on HRegion - the region server component from Apache HBase, powered by LSM trees, running locally on each Mussel storage node. HBase is capable, but it carries significant operational weight. Scaling v1 meant running multi-step Chef scripts on EC2 instances and carefully coordinating node additions in a system that wasn't designed for elastic scaling. None of this was automated. Every scaling event was a manual process that required institutional knowledge and took hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hash partitioning created hotspots.&lt;/strong&gt; v1 split data across 1024 logical shards by hashing primary keys. Apache Helix managed the mapping of those shards to physical nodes, and a leaderless replication model meant any replica of a shard could serve reads - which was good for read throughput. But hash partitioning gives you no control over where data actually lands. When access patterns aren't uniform - and they never are in production - certain nodes end up overloaded. Latency spikes follow. And fixing it requires rebalancing, which is more manual work.&lt;/p&gt;
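&lt;p&gt;A minimal sketch of that failure mode, in illustrative Python (not Airbnb's code): hashing fixes where every key lands, so a hot key's traffic is pinned to whatever shard the hash chose.&lt;/p&gt;

```python
# Illustrative sketch of v1-style hash partitioning; names are hypothetical.
import hashlib

NUM_SHARDS = 1024  # v1's fixed number of logical shards

def shard_for(key):
    # Hash the primary key and map it onto one of the logical shards.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Placement is decided entirely by the hash: adjacent keys typically scatter
# across shards, and there is no knob to move a hot key away from a busy
# node short of rebalancing shard-to-node assignments.
hot = shard_for("listing:12345")
```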

&lt;p&gt;&lt;strong&gt;Resource usage was invisible.&lt;/strong&gt; v1 had no per-namespace quotas, no cost dashboards, no way for teams to see what their tables were consuming. At scale that becomes a reliability problem, not just a billing one. One team's runaway bulk load job can silently degrade the read latency of another team's fraud detection system.&lt;/p&gt;

&lt;p&gt;These three problems - manual operations, uncontrollable hotspots, and opaque resource usage - aren't unique to Airbnb.&lt;/p&gt;




&lt;h2&gt;
  
  
  What v2 Changed and Why
&lt;/h2&gt;

&lt;p&gt;Here's how the full v2 system is laid out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│                       Clients                        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                Dispatcher (stateless)                │
│       routing · rate limiting · throttling           │
│   dual-write · shadow-read · per-table migration     │
└────────┬─────────────────┬──────────────────┬────────┘
         │                 │                  │
         ▼                 ▼                  ▼
  ┌─────────────┐   ┌─────────────┐   ┌───────────────────────┐
  │    Reads    │   │  Write path │   │      Bulk Load        │
  └──────┬──────┘   └──────┬──────┘   │  Airflow → S3         │
         │                 │          │  → StatefulSet Workers│
         │                 ▼          └──────────┬────────────┘
         │          ┌─────────────┐              │
         │          │    Kafka    │              │
         │          └──────┬──────┘              │
         │                 │                     │
         │                 ▼                     │
         │      ┌─────────────────────┐          │
         │      │  Replayer + Write   │          │
         │      │  Dispatcher         │          │
         │      └──────────┬──────────┘          │
         │                 │                     │
         └────────┬─────────┘                    │
                  │          ┌───────────────────┘
                  ▼          ▼
       ┌──────────────────────────────────────────┐
       │              NewSQL Backend              │
       │     range sharding · replicas · TTL      │
       └──────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Stateless Dispatcher as the Central Gateway
&lt;/h2&gt;

&lt;p&gt;The most consequential architectural decision in v2 is the Dispatcher - a stateless, horizontally scalable Kubernetes service that sits between all clients and the storage backend. Every read and write goes through it.&lt;/p&gt;

&lt;p&gt;Making the gateway stateless means you can scale it independently of the storage layer. Features like dual-write, shadow reads, rate limiting, and throttling all live in one place rather than being scattered across clients. And crucially, the team can control exactly where reads are served from - per table, dynamically - without changing a single line of code in any application that uses Mussel. That last capability turned out to be central to how the migration worked.&lt;/p&gt;
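&lt;p&gt;Conceptually, that per-table control can be pictured as a small lookup the Dispatcher consults on every read - a hypothetical sketch, not Airbnb's actual API:&lt;/p&gt;

```python
# Hypothetical per-table routing config; table names are invented.
ROUTE_TABLE = {
    "fraud_scores": "v2",       # already migrated
    "pricing_features": "v1",   # still mid-migration
}

def route_read(table):
    # Flipping where a table's reads are served is a config change in the
    # Dispatcher, not a code change in any client application.
    return ROUTE_TABLE.get(table, "v1")
```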

&lt;h2&gt;
  
  
  Kafka's Role - Old and New
&lt;/h2&gt;

&lt;p&gt;One thing the v2 article doesn't make explicit: Kafka was already in v1. It wasn't a new addition. In v1, each storage node consumed from the Kafka partitions corresponding to its shards. This gave the system consistent write ordering across leaderless replicas without needing a leader node - nodes handling the same shard consume from the same partitions in the same order, so they all converge to the same state.&lt;/p&gt;

&lt;p&gt;What changed in v2 is what Kafka is used for. In v1, it was primarily a replication mechanism between storage nodes. In v2, it's a durability and decoupling layer between the Dispatcher and the NewSQL backend. Writes land in Kafka first for durability, then a Replayer and Write Dispatcher apply them to the backend in order. This decoupling means the system can absorb write bursts without putting direct pressure on storage, and keeps writes safe even if the backend temporarily slows down. The tradeoff is that routing writes through Kafka adds some latency before they're visible.&lt;/p&gt;
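&lt;p&gt;The v2 write path can be sketched as a durable, ordered log drained by a replayer - a toy model with in-memory stand-ins for Kafka and the backend:&lt;/p&gt;

```python
from collections import deque

log = deque()   # stands in for a Kafka topic partition
backend = {}    # stands in for the NewSQL backend

def write(key, value):
    # A write is acknowledged once it is durable in the log, before the
    # backend has applied it - hence the visibility-latency tradeoff.
    log.append((key, value))

def replay(batch_size=100):
    # The Replayer drains the log in order, so the backend converges to a
    # consistent state even if it briefly fell behind during a burst.
    applied = 0
    while log and applied != batch_size:
        key, value = log.popleft()
        backend[key] = value
        applied += 1
    return applied
```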

&lt;h2&gt;
  
  
  Range-Based Sharding
&lt;/h2&gt;

&lt;p&gt;Moving from hash-based to range-based partitioning is the foundational storage change in v2. With range sharding, each shard owns a known range of keys, which means tables can be pre-split before loading data into them. This distributes write pressure evenly across nodes and prevents hotspots during bulk ingestion - something hash partitioning gave you no control over. The tradeoff is that poorly chosen shard boundaries can still create hotspots. Airbnb's solution is to sample the source data before creating a table in v2, understand the key distribution, and pre-split accordingly. More work upfront, but it results in predictable, controllable data distribution.&lt;/p&gt;
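&lt;p&gt;The sampling step amounts to choosing split keys at evenly spaced quantiles of the observed key distribution. A minimal sketch, assuming the sample fits in memory:&lt;/p&gt;

```python
def split_points(sampled_keys, num_shards):
    # Sort the sample and take evenly spaced quantiles so that each range
    # shard receives roughly the same share of the data at load time.
    keys = sorted(sampled_keys)
    step = len(keys) // num_shards
    return [keys[i * step] for i in range(1, num_shards)]

# Pre-split a 4-shard table from a sampled key space.
sample = ["user:%06d" % i for i in range(0, 100000, 7)]
boundaries = split_points(sample, 4)  # three split keys define four ranges
```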

&lt;h2&gt;
  
  
  TTL Without Relying on Compaction
&lt;/h2&gt;

&lt;p&gt;Data expiration in v1 was handled by HRegion's compaction cycle. When HRegion's in-memory store fills up, it flushes to disk. Compaction periodically merges those files and cleans up expired records in the process. It works at moderate scale, but at petabyte scale the compaction process can't keep up - data accumulates faster than it gets cleaned.&lt;/p&gt;

&lt;p&gt;v2 replaces this with a dedicated expiration service. It splits data namespaces into range-based subtasks and processes them concurrently across multiple workers, scanning and deleting expired records in parallel. Jobs are scheduled to avoid peak read traffic, and write-heavy tables get max-version enforcement with targeted deletes to prevent unbounded growth. The result is the same retention behavior as v1, but implemented as a purpose-built service rather than a side effect of storage compaction.&lt;/p&gt;
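&lt;p&gt;The shape of that expiration service - split the keyspace into range subtasks, scan them in parallel, delete what's expired - can be sketched like this (an in-memory toy, not the production implementation):&lt;/p&gt;

```python
import concurrent.futures
import time

def expire_subtask(store, keys, now):
    # One range subtask: scan its disjoint slice of keys and delete
    # records whose TTL has elapsed.
    removed = 0
    for key in keys:
        if now >= store[key]["expires_at"]:
            del store[key]
            removed += 1
    return removed

def run_expiration(store, workers=4):
    # Split the keyspace into contiguous slices and process them in
    # parallel, mirroring v2's range-based expiration subtasks.
    now = time.time()
    keys = sorted(store)
    chunk = max(1, len(keys) // workers)
    slices = [keys[i:i + chunk] for i in range(0, len(keys), chunk)]
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        return sum(pool.map(lambda s: expire_subtask(store, s, now), slices))
```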




&lt;h2&gt;
  
  
  The Migration: Moving a Petabyte Without Downtime
&lt;/h2&gt;

&lt;p&gt;Migrating data across storage engines while keeping production traffic running is where the interesting engineering happens. Airbnb's approach was methodical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bootstrapping Challenge
&lt;/h2&gt;

&lt;p&gt;The first obstacle: v1 didn't have table-level snapshots or Change Data Capture (CDC) streams. CDC is typically how you seed a new database during a migration - stream every change from the old system into the new one until they're in sync. Without it, Airbnb had to build a custom bootstrap pipeline for each table.&lt;/p&gt;

&lt;p&gt;The process for each table: download v1 backup data, sample it to understand key distribution, create a pre-split v2 table based on that distribution, load the backup data into v2, verify via checksum, then replay any Kafka messages that accumulated during the load to catch up on writes that happened while bootstrapping was running.&lt;/p&gt;

&lt;p&gt;For large tables, the bootstrap phase alone took hours or days. To handle interruptions gracefully, the team used Kubernetes StatefulSets - which provide persistent local state across pod restarts - to checkpoint progress periodically. If a job died partway through, it could resume from the last checkpoint rather than starting over.&lt;/p&gt;

&lt;p&gt;The tradeoff is operational complexity - StatefulSets are harder to manage than stateless jobs. Each worker maintains local state that must be monitored, and failed pods need careful handling to avoid replaying already-completed work.&lt;/p&gt;
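&lt;p&gt;The checkpointing idea is simple to state: persist the index of the last completed batch, and on restart resume from there. A hypothetical sketch (the checkpoint file stands in for a StatefulSet pod's persistent volume):&lt;/p&gt;

```python
import json
import os

CHECKPOINT = "/tmp/bootstrap_checkpoint.json"  # illustrative path

def load_checkpoint():
    # Return the index of the next batch to load, or 0 on a fresh start.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_batch"]
    return 0

def bootstrap(batches, apply_batch):
    # Resume from the last recorded batch rather than reloading everything;
    # a pod that dies mid-load loses at most one batch of progress.
    for i in range(load_checkpoint(), len(batches)):
        apply_batch(batches[i])
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_batch": i + 1}, f)
```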

&lt;h2&gt;
  
  
  Four Stages, All Reversible
&lt;/h2&gt;

&lt;p&gt;The migration moved through four stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue&lt;/strong&gt; - all traffic on v1. The stable baseline while bootstrapping happens in the background.&lt;br&gt;
&lt;strong&gt;Shadow&lt;/strong&gt; - v1 continues serving traffic, but v2 processes reads and writes in parallel, silently. Responses from v2 are checked for correctness but never returned to callers. This lets the team verify v2's behavior without any risk to production.&lt;br&gt;
&lt;strong&gt;Reverse&lt;/strong&gt; - v2 takes over live traffic, with v1 kept on standby. Automatic circuit breakers monitor v2's error rate and replication lag; if either crosses a threshold, traffic snaps back to v1 instantly.&lt;br&gt;
&lt;strong&gt;Cutover&lt;/strong&gt; - v1 is decommissioned for that table.&lt;/p&gt;

&lt;p&gt;Every stage could be reversed. And migration happened one table at a time, so different tables could sit at different stages simultaneously. Lower-risk tables went first, building operational confidence before the team tackled high-criticality ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
                         ┌──────────────┐
                         │   Clients    │
                         └──────┬───────┘
                                │
                                ▼
                     ┌────────────────────┐
                     │     Dispatcher     │
                     │--------------------│
                     │ • read routing     │
                     │ • dual-write ctrl  │
                     │ • stage config     │
                     └─────────┬──────────┘
                               │ writes
                               ▼
                        ┌───────────────┐
                        │     Kafka     │
                        │───────────────│
                        │ replayable    │
                        │ write log     │
                        └───────┬───────┘
                                │
                    ┌───────────┴───────────┐
                    │                       │
                    ▼                       ▼
            ┌───────────────┐       ┌───────────────┐
            │   Mussel v1   │       │   Mussel v2   │
            │  (HRegion)    │       │  (NewSQL)     │
            └───────────────┘       └───────────────┘
                    ▲                       ▲
                    └───────────┬───────────┘
                                │
                         read routing
                       (Dispatcher above
                        controls which
                        system responds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Each table is bootstrapped into v2 before entering Shadow stage.
┌──────────────────────────────────────────────────────────────────┐
│ Stage    │ Write Path              │ Read Path                      
├──────────┼─────────────────────────┼─────────────────────────────┤
│ BLUE     │ Kafka → v1              │ v1 serves                      
│          │                         │ v2 bootstrapping in background
├──────────┼─────────────────────────┼─────────────────────────────┤
│ SHADOW   │ Kafka → v1 + v2         │ v1 serves                      
│          │ (dual-write)            │ v2 responses checked,          
│          │                         │ not returned to callers        
├──────────┼─────────────────────────┼─────────────────────────────┤
│ REVERSE  │ Kafka → v1 + v2         │ v2 serves                      
│          │                         │ v1 on standby,                 
│          │                         │ circuit breaker active         
├──────────┼─────────────────────────┼─────────────────────────────┤
│ CUTOVER  │ Kafka → v2 only         │ v2 serves                      
│          │                         │ v1 decommissioned              
└──────────────────────────────────────────────────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
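&lt;p&gt;The stage progression above behaves like a small, reversible state machine - sketched here in Python, with names invented for illustration:&lt;/p&gt;

```python
STAGES = ["BLUE", "SHADOW", "REVERSE", "CUTOVER"]

def advance(stage):
    # Move one stage forward; CUTOVER is terminal.
    i = STAGES.index(stage)
    return STAGES[min(i + 1, len(STAGES) - 1)]

def roll_back(stage):
    # Any pre-cutover stage can snap back toward BLUE, e.g. when a circuit
    # breaker trips on v2's error rate or replication lag.
    i = STAGES.index(stage)
    return STAGES[max(i - 1, 0)]
```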



&lt;h2&gt;
  
  
  Read Traffic Migration - The Underappreciated Half
&lt;/h2&gt;

&lt;p&gt;Most migration stories focus on moving the data. Moving read traffic is a separate problem, and Airbnb's approach here is worth understanding on its own.&lt;/p&gt;

&lt;p&gt;Once dual-write was established and data consistency confirmed, the Dispatcher handled read traffic migration in three phases. First, reads were served from v1 while shadow requests were simultaneously sent to v2 - responses compared but v2 never returned results to callers. Second, the roles flipped: v2 served reads while reverse shadow requests went to v1, providing a safety net to catch any instability. Third, full cutover with v1 decommissioned per table.&lt;/p&gt;

&lt;p&gt;What makes read traffic migration more manageable than write migration is that reads are ephemeral. If a read returns a wrong result during shadowing, there's no lasting consequence - you catch it, investigate, and fix it. Writes are durable; if something goes wrong in a write path, you need a recovery story. Airbnb's pipeline handled these separately for this reason.&lt;/p&gt;
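&lt;p&gt;A shadow read is essentially a silent comparison. A minimal sketch, with dict-backed stand-ins for the two systems:&lt;/p&gt;

```python
def shadow_read(key, v1_store, v2_store, mismatches):
    # v1's answer is what the caller sees; v2's answer is only compared.
    primary = v1_store.get(key)
    shadow = v2_store.get(key)
    if shadow != primary:
        # Record the divergence for investigation - never surface v2's result.
        mismatches.append((key, primary, shadow))
    return primary
```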

&lt;p&gt;Running shadow reads effectively doubles the read load on both systems during the shadow and reverse phases. At Airbnb's scale - where Mussel v1 already handled over 800k read QPS - this means the infrastructure must temporarily sustain roughly twice the total read traffic across both systems. That additional load is a non-trivial operational cost that must be accounted for during migration planning.&lt;/p&gt;

&lt;p&gt;The Dispatcher made this fine-grained control possible. Because all read routing decisions lived in one stateless service, the team could change where any table's reads were served from - in real time, per table - without touching the applications consuming Mussel. Without a centralized routing layer, this would have required coordinating changes across every service using Mussel, which is a much harder problem across hundreds of tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Worth Taking Away
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consistency flexibility has a cost.&lt;/strong&gt; v1 offered only eventual consistency. v2 introduced the option for stronger consistency guarantees per namespace. Airbnb had to build write deduplication, hotkey blocking, and lazy write repair to handle the edge cases that surfaced. If you're planning a similar migration, the storage layer change is usually the easier part. The surprises tend to show up in application behavior.&lt;/p&gt;
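&lt;p&gt;Of those mechanisms, write deduplication is the most broadly reusable idea: tag each write with an idempotency key so that a replayed message becomes a no-op. A hedged sketch (a real system would presumably persist the seen-set in durable, bounded storage rather than memory):&lt;/p&gt;

```python
seen_ids = set()  # in production this would need durable, bounded storage
store = {}

def apply_write(write_id, key, value):
    # Replay from a log can deliver the same message twice; the idempotency
    # key makes the second application a harmless no-op.
    if write_id in seen_ids:
        return False
    seen_ids.add(write_id)
    store[key] = value
    return True
```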

&lt;p&gt;&lt;strong&gt;Presplitting isn't optional with range sharding.&lt;/strong&gt; Migrating to a range-based backend without understanding your key distribution first will produce hotspots that look identical to the hash-partitioning hotspots you were trying to escape. Sampling the source data and pre-splitting tables accordingly isn't overhead - it's what makes the migration safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture decisions today shape migration options tomorrow.&lt;/strong&gt; Because Kafka was already the write path in v1, Airbnb could leverage the same stream for dual-write and catch-up replay during migration. This wasn't planned specifically for the migration - it was a property of v1's design that turned out to be the right foundation for v2. The decisions you make today directly affect how much flexibility you'll have when you need to change the system later.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Mussel v2 Actually Delivers
&lt;/h2&gt;

&lt;p&gt;At the end of their article, Airbnb makes a specific claim: Mussel v2 can simultaneously ingest tens of terabytes in bulk, sustain 100k+ streaming writes per second, and keep p99 reads under 25ms - all in the same cluster, with a single API for callers.&lt;/p&gt;

&lt;p&gt;What makes this possible is that all three workloads share a common access contract: writes are eventually applied, reads are point or range lookups, and callers can tolerate a configurable staleness window. That shared contract is what lets one system serve all three patterns without them interfering with each other. Engineers no longer need to stitch together separate systems for each ingestion pattern and spend energy keeping them in sync - Mussel provides those guarantees out of the box.&lt;/p&gt;

&lt;p&gt;What strengthens the claim most is that Mussel v2 has been running in production for a year at this scale. Engineering blog posts are full of capability claims. A year of production evidence at Airbnb's traffic levels is harder to dismiss.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Story is Really About
&lt;/h2&gt;

&lt;p&gt;Pull back from the specific technology choices - the NewSQL backend, the Kubernetes manifests, the topology-aware TTL service - and the most transferable part of this story comes into focus.&lt;/p&gt;

&lt;p&gt;Airbnb's ability to migrate a petabyte without downtime was shaped not just by what they built in v2, but by how v1 was designed years earlier. Kafka as the write path. Leaderless replication with topic-based ordering. Those decisions were made to solve different problems at the time. But the properties they created - a durable, replayable log at the center of every write, consistent ordering without a leader - turned out to be exactly what the v2 migration needed. The dual-write pipeline, the catch-up replay, the consistency guarantees during transition: all of it was more tractable because of how v1 worked.&lt;/p&gt;

&lt;p&gt;Good infrastructure tends to have this quality. It solves the problem in front of you, but it also leaves options open for when the problem changes. The v1 engineers weren't building migration infrastructure. But the choices they made left those options available.&lt;/p&gt;

&lt;p&gt;A useful question to ask about any storage system you own: not "can we migrate this?" but "if we needed to migrate this, what properties do we already have that would help?" Most teams only ask that question once a migration is already underway. Asking it earlier - and making deliberate architectural choices based on the answer - is what separates systems that are easy to evolve from systems that aren't.&lt;/p&gt;




&lt;p&gt;References: &lt;a href="https://airbnb.tech/infrastructure/building-a-next-generation-key-value-store-at-airbnb/" rel="noopener noreferrer"&gt;Building a Next-Generation Key-Value Store at Airbnb · Mussel V1 (2022)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
