<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Luis Cossío</title>
    <description>The latest articles on Forem by Luis Cossío (@coszio).</description>
    <link>https://forem.com/coszio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1271141%2F394b2433-d425-45b2-8c5d-a2f25a23be6a.jpeg</url>
      <title>Forem: Luis Cossío</title>
      <link>https://forem.com/coszio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/coszio"/>
    <language>en</language>
    <item>
      <title>Introducing Gridstore: Qdrant's Custom Key-Value Store</title>
      <dc:creator>Luis Cossío</dc:creator>
      <pubDate>Thu, 06 Feb 2025 12:07:39 +0000</pubDate>
      <link>https://forem.com/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</link>
      <guid>https://forem.com/qdrant/introducing-gridstore-qdrants-custom-key-value-store-2li9</guid>
      <description>&lt;h2&gt;
  
  
  Why We Built Our Own Storage Engine
&lt;/h2&gt;

&lt;p&gt;Databases need a place to store and retrieve data. That’s what Qdrant's &lt;a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database" rel="noopener noreferrer"&gt;&lt;strong&gt;key-value storage&lt;/strong&gt;&lt;/a&gt; does—it links keys to values.&lt;/p&gt;

&lt;p&gt;When we started building Qdrant, we needed to pick something ready for the task. So we chose &lt;a href="https://rocksdb.org" rel="noopener noreferrer"&gt;&lt;strong&gt;RocksDB&lt;/strong&gt;&lt;/a&gt; as our embedded key-value store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwriweh16glaxmgrwt8m.png" alt="Image description" width="730" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over time, we ran into issues. RocksDB’s &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSM-tree&lt;/a&gt; architecture requires compaction, which caused random latency spikes. It handles generic keys, while we only use sequential IDs. Its many configuration options make it versatile, but tuning them accurately was a headache. Finally, interoperating with C++ slowed us down (although we will still support it for quite some time 😭).&lt;br&gt;
While there are already some good options written in Rust that we could leverage, nothing out there fit our needs the way we wanted, so we needed something custom. We didn’t require generic keys. We wanted full control over when and which data is written and flushed. Our system already has crash-recovery mechanisms built in, and online compaction isn’t a priority because our optimizers already handle that. Debugging misconfigurations was not a great use of our time.&lt;br&gt;
So we built our own storage. As of &lt;a href="https://qdrant.tech/blog/qdrant-1.13.x/"&gt;&lt;strong&gt;Qdrant Version 1.13&lt;/strong&gt;&lt;/a&gt;, we use Gridstore for &lt;strong&gt;payload and sparse vector storage&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg5t7nghedw9zqxlv03l.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  In this article, you’ll learn about:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How Gridstore works&lt;/strong&gt; – a deep dive into its architecture and mechanics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why we built it this way&lt;/strong&gt; – the key design decisions that shaped it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rigorous testing&lt;/strong&gt; – how we ensured the new storage is production-ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance benchmarks&lt;/strong&gt; – official metrics that demonstrate its efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our first challenge?&lt;/strong&gt; Figuring out the best way to handle sequential keys and variable-sized data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Gridstore Architecture: Three Main Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng6hr3zi52diiqwdx4sj.png" alt="Image description" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores values in fixed-sized blocks and retrieves them using a pointer-based lookup system for efficient access.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mask Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maintains a bitmask to track block usage, distinguishing between allocated and available blocks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gaps Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manages block availability at a higher level, optimizing space allocation and reuse.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  1. The Data Layer for Fast Retrieval
&lt;/h3&gt;

&lt;p&gt;At the core of Gridstore is &lt;strong&gt;The Data Layer&lt;/strong&gt;, which stores and retrieves values quickly by key while supporting variable-sized data. Its two main components are &lt;strong&gt;The Tracker&lt;/strong&gt; and &lt;strong&gt;The Data Grid&lt;/strong&gt;.&lt;br&gt;
Since internal IDs are always sequential integers (0, 1, 2, 3, 4, ...), the tracker is an array of pointers, where each pointer tells the system exactly where a value starts and how long it is. &lt;/p&gt;
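&lt;p&gt;As a rough sketch of the idea (the type and field names here are hypothetical, not Qdrant’s actual code), the tracker can be modeled as a plain vector indexed by key:&lt;/p&gt;

```rust
/// Hypothetical tracker entry: where a value lives and how long it is.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ValuePointer {
    page_id: u32,      // which page file holds the value
    block_offset: u32, // first block of the value within that page
    length: u32,       // value length in bytes
}

/// Because keys are sequential integers, the tracker is just a vector:
/// key N lives at index N, so a lookup is a single array access.
struct Tracker {
    pointers: Vec<Option<ValuePointer>>, // None = deleted or never written
}

impl Tracker {
    fn get(&self, key: u32) -> Option<ValuePointer> {
        self.pointers.get(key as usize).copied().flatten()
    }

    fn set(&mut self, key: u32, ptr: ValuePointer) {
        let idx = key as usize;
        if idx >= self.pointers.len() {
            self.pointers.resize(idx + 1, None);
        }
        self.pointers[idx] = Some(ptr);
    }
}
```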

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko72b7r8luo39rj2apfo.png" alt="Image description" width="800" height="271"&gt;&lt;/a&gt;&lt;br&gt;
The Data Layer uses an array of pointers to quickly retrieve data.&lt;/p&gt;

&lt;p&gt;This makes lookups incredibly fast. For example, finding key 3 is just a matter of jumping to index 3 of the tracker and following the pointer to find the value in the data grid. &lt;/p&gt;

&lt;p&gt;However, because values are of variable size, the data itself is stored separately in a grid of fixed-sized blocks, which are grouped into larger page files. The fixed size of each block is usually 128 bytes. When inserting a value, Gridstore allocates one or more consecutive blocks to store it, ensuring that each block only holds data from a single value.&lt;/p&gt;
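&lt;p&gt;Since each block only holds data from a single value, the number of blocks a value occupies is just a ceiling division. A minimal sketch:&lt;/p&gt;

```rust
/// Block size mentioned in the article; Gridstore's actual value is configurable.
const BLOCK_SIZE: usize = 128;

/// Number of consecutive blocks needed to store a value of `len` bytes.
/// Each block belongs to a single value, so partial blocks round up.
fn blocks_needed(len: usize) -> usize {
    len.div_ceil(BLOCK_SIZE)
}
```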
&lt;h3&gt;
  
  
  2. The Mask Layer Reuses Space
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Mask Layer&lt;/strong&gt; helps Gridstore handle updates and deletions without the need for expensive data compaction. Instead of maintaining complex metadata for each block, Gridstore tracks usage with a bitmask, where each bit represents a block, with 1 for used, 0 for free.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2hhaftlk0xddax0lcwu.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;br&gt;
The bitmask efficiently tracks block usage.&lt;/p&gt;

&lt;p&gt;This makes it easy to determine where new values can be written. When a value is removed, its pointer is soft-deleted and the corresponding blocks are marked as available in the bitmask. Similarly, when a value is updated, the new version is written elsewhere and the old blocks are freed in the bitmask.&lt;br&gt;
This approach ensures that Gridstore doesn’t waste space. As the storage grows, however, scanning the entire bitmask for available blocks can become computationally expensive.&lt;/p&gt;
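&lt;p&gt;A toy version of such a bitmask (not Gridstore’s implementation) shows both the cheap mark/unmark operations and the linear scan that becomes expensive at scale:&lt;/p&gt;

```rust
/// Minimal block-usage bitmask: one bit per block, 1 = used, 0 = free.
struct BlockMask {
    words: Vec<u64>,
    num_blocks: usize,
}

impl BlockMask {
    fn new(num_blocks: usize) -> Self {
        Self { words: vec![0; num_blocks.div_ceil(64)], num_blocks }
    }
    fn is_used(&self, block: usize) -> bool {
        self.words[block / 64] >> (block % 64) & 1 == 1
    }
    fn set_used(&mut self, block: usize) {
        self.words[block / 64] |= 1 << (block % 64);
    }
    fn set_free(&mut self, block: usize) {
        self.words[block / 64] &= !(1 << (block % 64));
    }
    /// Find `n` consecutive free blocks by scanning the whole mask.
    /// This linear scan is exactly the cost the gaps layer avoids.
    fn find_free_run(&self, n: usize) -> Option<usize> {
        let mut run = 0;
        for block in 0..self.num_blocks {
            if self.is_used(block) {
                run = 0;
            } else {
                run += 1;
                if run == n {
                    return Some(block + 1 - n);
                }
            }
        }
        None
    }
}
```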
&lt;h3&gt;
  
  
  3. The Gaps Layer for Effective Updates
&lt;/h3&gt;

&lt;p&gt;To further optimize update handling, Gridstore introduces &lt;strong&gt;The Gaps Layer&lt;/strong&gt;, which provides a higher-level view of block availability. &lt;br&gt;
Instead of scanning the entire bitmask, Gridstore splits the bitmask into regions and keeps track of the largest contiguous free space within each region, known as &lt;strong&gt;The Region Gap&lt;/strong&gt;. By also storing the leading and trailing gaps of each region, the system can efficiently combine multiple regions when needed for storing large values.&lt;/p&gt;
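&lt;p&gt;The per-region bookkeeping can be sketched as follows (a simplified model with hypothetical names):&lt;/p&gt;

```rust
/// Hypothetical per-region summary of free-block runs.
#[derive(Debug, PartialEq)]
struct RegionGaps {
    leading: usize,  // free run touching the region's left edge
    trailing: usize, // free run touching the region's right edge
    max: usize,      // largest free run anywhere in the region
}

/// Compute the gaps for one region, given its blocks as bools (true = used).
fn region_gaps(blocks: &[bool]) -> RegionGaps {
    let leading = blocks.iter().take_while(|&&used| !used).count();
    let trailing = blocks.iter().rev().take_while(|&&used| !used).count();
    let mut max = 0;
    let mut run = 0;
    for &used in blocks {
        if used {
            run = 0;
        } else {
            run += 1;
            max = max.max(run);
        }
    }
    RegionGaps { leading, trailing, max }
}
```

&lt;p&gt;When a value is larger than any single region’s gap, the trailing gap of one region can be combined with the leading gap of the next to form a longer free run across the boundary.&lt;/p&gt;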

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wski880wmo2ed91syno.png" alt="Image description" width="800" height="563"&gt;&lt;/a&gt;&lt;br&gt;
The complete architecture of Gridstore&lt;/p&gt;

&lt;p&gt;This layered approach allows Gridstore to locate available space quickly, scaling down the work required for scans while keeping memory overhead minimal. With this system, finding storage space for new values requires scanning only a tiny fraction of the total metadata, making updates and insertions highly efficient, even in large segments.&lt;br&gt;
With the default configuration, the gaps layer is roughly a millionth of the actual storage size: for each 1GB of data, finding free space requires scanning only about 6KB of gaps metadata. With this mechanism, the other operations can be executed in virtually constant time.&lt;/p&gt;
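&lt;p&gt;To see why the scan stays small, here is a back-of-the-envelope calculation. The block size comes from the article, but the region and entry sizes are made-up placeholders, so the resulting ratio is illustrative rather than Qdrant’s actual figure:&lt;/p&gt;

```rust
// Illustrative numbers only; Qdrant's real defaults may differ.
const BLOCK_SIZE: u64 = 128;         // bytes per block (from the article)
const BLOCKS_PER_REGION: u64 = 1024; // hypothetical region size
const GAP_ENTRY_SIZE: u64 = 4;       // hypothetical bytes per region entry

/// Bytes of gaps-layer metadata needed to cover `storage_bytes` of data.
fn gaps_metadata_bytes(storage_bytes: u64) -> u64 {
    let blocks = storage_bytes.div_ceil(BLOCK_SIZE);
    let regions = blocks.div_ceil(BLOCKS_PER_REGION);
    regions * GAP_ENTRY_SIZE
}
```

&lt;p&gt;With these placeholder parameters, 1GiB of data needs only 32KiB of gaps metadata; the exact ratio depends on the block, region, and entry sizes.&lt;/p&gt;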
&lt;h2&gt;
  
  
  Gridstore in Production: Maintaining Data Integrity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahj36tq3wy0484431sdx.png" alt="Image description" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gridstore’s architecture introduces multiple interdependent structures that must remain in sync to ensure data integrity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Data Layer&lt;/strong&gt; holds the data and associates each key with its location in storage, including page ID, block offset, and the size of its value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mask Layer&lt;/strong&gt; keeps track of which blocks are occupied and which are free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Gaps Layer&lt;/strong&gt; provides an indexed view of free blocks for efficient space allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time a new value is inserted or an existing value is updated, all of these components need to be modified in a coordinated way.&lt;/p&gt;
&lt;h3&gt;
  
  
  When Things Break in Real Life
&lt;/h3&gt;

&lt;p&gt;Real-world systems don’t operate in a vacuum. Failures happen: software bugs cause unexpected crashes, memory exhaustion forces processes to terminate, disks fail to persist data reliably, and power losses can interrupt operations at any moment. &lt;br&gt;
&lt;em&gt;The critical question is: what happens if a failure occurs while updating these structures?&lt;/em&gt;&lt;br&gt;
If one component is updated but another isn’t, the entire system could become inconsistent. Worse, if an operation is only partially written to disk, it could lead to orphaned data, unusable space, or even data corruption.&lt;/p&gt;
&lt;h3&gt;
  
  
  Stability Through Idempotency: Recovering With WAL
&lt;/h3&gt;

&lt;p&gt;To guard against these risks, Qdrant relies on a &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;. Before committing an operation, Qdrant ensures that it is at least recorded in the WAL. If a crash happens before all updates are flushed, the system can safely replay operations from the log. &lt;/p&gt;

&lt;p&gt;This recovery mechanism introduces another essential property: &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;&lt;strong&gt;idempotence&lt;/strong&gt;&lt;/a&gt;. &lt;br&gt;
The storage system must be designed so that reapplying the same operation after a failure leads to the same final state as if the operation had been applied just once.&lt;/p&gt;
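&lt;p&gt;A toy model of why idempotence makes WAL replay safe: if every log entry is an upsert, replaying entries that were already applied leaves the state unchanged:&lt;/p&gt;

```rust
use std::collections::HashMap;

/// Toy model of WAL replay, assuming each entry is an upsert of (key, value).
/// Upserts are idempotent: applying an entry twice leaves the same state as
/// applying it once, so replaying an already-applied suffix after a crash
/// is harmless.
fn replay(state: &mut HashMap<u32, String>, log: &[(u32, String)]) {
    for (key, value) in log {
        state.insert(*key, value.clone());
    }
}
```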
&lt;h3&gt;
  
  
  The Grand Solution: Lazy Updates
&lt;/h3&gt;

&lt;p&gt;To achieve this, &lt;strong&gt;Gridstore completes updates lazily&lt;/strong&gt;, prioritizing the most critical part of the write: the data itself. &lt;/p&gt;

&lt;p&gt;👉 Instead of immediately updating all metadata structures, it writes the new value first while keeping lightweight pending changes in a buffer. &lt;br&gt;
👉 The system only finalizes these updates when explicitly requested, ensuring that a crash never results in marking data as deleted before the update has been safely persisted. &lt;br&gt;
👉 In the worst-case scenario, Gridstore may need to write the same data twice, leading to a minor space overhead, but it will never corrupt the storage by overwriting valid data. &lt;/p&gt;
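&lt;p&gt;The write path can be sketched like this (hypothetical names, heavily simplified): the new value is persisted first, and freeing the old version’s blocks is deferred to an explicit flush. A crash before the flush wastes a little space, since the old blocks stay marked as used, but never overwrites still-valid data:&lt;/p&gt;

```rust
use std::collections::HashMap;

/// Hypothetical sketch of lazy metadata updates.
struct LazyStore {
    values: HashMap<u32, (usize, Vec<u8>)>, // key -> (block, bytes)
    next_block: usize,
    pending_free: Vec<usize>, // old blocks awaiting the flush
    free: Vec<usize>,         // blocks actually reusable
}

impl LazyStore {
    fn new() -> Self {
        Self {
            values: HashMap::new(),
            next_block: 0,
            pending_free: Vec::new(),
            free: Vec::new(),
        }
    }

    fn put(&mut self, key: u32, bytes: Vec<u8>) {
        // Write the new value to a fresh (or previously freed) block first.
        let block = match self.free.pop() {
            Some(b) => b,
            None => {
                let b = self.next_block;
                self.next_block += 1;
                b
            }
        };
        if let Some((old_block, _)) = self.values.insert(key, (block, bytes)) {
            // The old version is not freed yet, only scheduled.
            self.pending_free.push(old_block);
        }
    }

    /// Finalize metadata: only now do the old blocks become reusable.
    fn flush(&mut self) {
        self.free.append(&mut self.pending_free);
    }
}
```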
&lt;h2&gt;
  
  
  How We Tested the Final Product
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy8rcpa29k45ja0d5u8n.png" alt="Image description" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  First... Model Testing
&lt;/h3&gt;

&lt;p&gt;Gridstore can be tested efficiently using model testing, which compares its behavior to a simple in-memory hash map. Since Gridstore should function like a persisted hash map, this method quickly detects inconsistencies.&lt;/p&gt;

&lt;p&gt;The process is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize a Gridstore instance and an empty hash map.&lt;/li&gt;
&lt;li&gt;Run random operations (put, delete, update) on both.&lt;/li&gt;
&lt;li&gt;Verify that results match after each operation.&lt;/li&gt;
&lt;li&gt;Compare all keys and values to ensure consistency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach provides high test coverage, exposing issues like incorrect persistence or faulty deletions. Running large-scale model tests ensures Gridstore remains reliable in real-world use.&lt;/p&gt;

&lt;p&gt;Here is a naive way to generate operations in Rust.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PointOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;point_offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..=&lt;/span&gt;&lt;span class="n"&gt;max_point_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.gen_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="nf"&gt;.random_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_factor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="nn"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;point_offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;unreachable!&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
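&lt;p&gt;These generated operations can then be replayed against both the storage under test and the reference model. A self-contained sketch of that loop, with a second hash map standing in for the Gridstore instance:&lt;/p&gt;

```rust
use std::collections::HashMap;

/// Simplified operation type matching the generator above.
enum Operation {
    Put(u32, String),
    Delete(u32),
    Update(u32, String),
}

fn apply(map: &mut HashMap<u32, String>, op: &Operation) {
    match op {
        Operation::Put(k, v) | Operation::Update(k, v) => {
            map.insert(*k, v.clone());
        }
        Operation::Delete(k) => {
            map.remove(k);
        }
    }
}

/// Run the same operations against both sides and check they agree
/// after every step. In the real test, `storage` would be the actual
/// Gridstore instance rather than a second hash map.
fn model_test(ops: &[Operation]) -> bool {
    let mut model = HashMap::new();   // the trusted in-memory model
    let mut storage = HashMap::new(); // stand-in for the storage under test
    for op in ops {
        apply(&mut model, op);
        apply(&mut storage, op);
        if model != storage {
            return false;
        }
    }
    model == storage
}
```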



&lt;p&gt;Model testing is a high-value way to catch bugs, especially when your system mimics a well-defined component like a hash map. If your component behaves the same as another one, using model testing brings a lot of value for a bit of effort.&lt;/p&gt;

&lt;p&gt;We could have tested against RocksDB, but simplicity matters more. A simple hash map lets us run massive test sequences quickly, exposing issues faster.&lt;/p&gt;

&lt;p&gt;For even sharper debugging, property-based testing adds automated test generation and shrinking: it pinpoints failures with minimized test cases, making bug hunting faster and more effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash Testing: Can Gridstore Handle the Pressure?
&lt;/h3&gt;

&lt;p&gt;Designing for crash resilience is one thing, and proving it works under stress is another. To push Qdrant’s data integrity to the limit, we built &lt;a href="https://github.com/qdrant/crasher" rel="noopener noreferrer"&gt;&lt;strong&gt;Crasher&lt;/strong&gt;&lt;/a&gt;, a test bench that brutally kills and restarts Qdrant while it handles a heavy update workload.&lt;/p&gt;

&lt;p&gt;Crasher runs a loop that continuously writes data, then randomly crashes Qdrant. On each restart, Qdrant replays its &lt;a href="https://qdrant.tech/documentation/concepts/storage/"&gt;&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/a&gt;, and we verify that data integrity holds. Possible anomalies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing data (points, vectors, or payloads)&lt;/li&gt;
&lt;li&gt;Corrupt payload values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aggressive yet simple approach has uncovered real-world issues when run for extended periods. While we also use chaos testing for distributed setups, Crasher excels at fast, repeatable failure testing in a local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Gridstore Performance: Benchmarks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83z3krx9wuz8xosjmctg.png" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To measure the impact of our new storage engine, we used &lt;a href="https://github.com/jonhoo/bustle" rel="noopener noreferrer"&gt;&lt;strong&gt;Bustle, a key-value storage benchmarking framework&lt;/strong&gt;&lt;/a&gt;, to compare Gridstore against RocksDB. We tested three workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Type&lt;/th&gt;
&lt;th&gt;Operation Distribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95% reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Insert-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80% inserts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update-heavy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50% updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  The results speak for themselves:
&lt;/h4&gt;

&lt;p&gt;Average latency for all kinds of workloads is lower across the board, particularly for inserts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ww9g05p05zru1jaqb5.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows a clear boost in performance. As we can see, the investment in Gridstore is paying off.&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-End Benchmarking
&lt;/h3&gt;

&lt;p&gt;Now, let’s test the impact on a real Qdrant instance. So far, we’ve only integrated Gridstore for &lt;a href="https://qdrant.tech/documentation/concepts/payload/"&gt;&lt;strong&gt;payloads&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://qdrant.tech/documentation/concepts/vectors/#sparse-vectors"&gt;&lt;strong&gt;sparse vectors&lt;/strong&gt;&lt;/a&gt;, but even this partial switch should show noticeable improvements.&lt;/p&gt;

&lt;p&gt;For benchmarking, we used our in-house &lt;a href="https://github.com/qdrant/bfb" rel="noopener noreferrer"&gt;&lt;strong&gt;bfb tool&lt;/strong&gt;&lt;/a&gt; to generate a workload. Our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;bfb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--max-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-vectors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--set-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--on-disk-payload&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--sparse-dim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--bool-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--keywords&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--float-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--int-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--text-payload-length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--skip-field-indices&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;--jsonl-updates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;./rps.jsonl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This benchmark upserts 1 million points twice. Each point has: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A medium to large payload&lt;/li&gt;
&lt;li&gt;A tiny dense vector (dense vectors use a different storage type)&lt;/li&gt;
&lt;li&gt;A sparse vector&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Additional configuration:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;The test we conducted updated payload data separately in another request.&lt;/li&gt;
&lt;li&gt;There were no payload indices, which ensured we measured pure ingestion speed.&lt;/li&gt;
&lt;li&gt;Finally, we gathered request latency metrics for analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ran this against Qdrant 1.12.6, toggling between the old and new storage backends. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final Result
&lt;/h3&gt;

&lt;p&gt;Data ingestion is &lt;strong&gt;twice as fast and with a smoother throughput&lt;/strong&gt; — a massive win! 😍&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxb9k7bspj5ne9895c7g.png" alt="Image description" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We optimized for speed, and it paid off—but what about storage size?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gridstore: 2333MB&lt;/li&gt;
&lt;li&gt;RocksDB: 2319MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strictly speaking, RocksDB is slightly smaller, but the difference is negligible compared to the 2x faster ingestion and more stable throughput. A small trade-off for a big performance gain! &lt;/p&gt;

&lt;h2&gt;
  
  
  Trying Out Gridstore
&lt;/h2&gt;

&lt;p&gt;Gridstore represents a significant advancement in how Qdrant manages its &lt;strong&gt;key-value storage&lt;/strong&gt; needs. It offers great performance and streamlined updates tailored specifically for our use case. We have managed to achieve faster, more reliable data ingestion while maintaining data integrity, even under heavy workloads and unexpected failures. It is already used as a storage backend for on-disk payloads and sparse vectors.&lt;/p&gt;

&lt;p&gt;👉 It’s important to note that Gridstore remains tightly integrated with Qdrant and, as such, has not been released as a standalone crate. &lt;br&gt;
Its API is still evolving, and we are focused on refining it within our ecosystem to ensure maximum stability and performance. That said, we recognize the value this innovation could bring to the wider Rust community. In the future, once the API stabilizes and we decouple it enough from Qdrant, we will consider publishing it as a contribution to the community ❤️.&lt;/p&gt;

&lt;p&gt;For now, Gridstore continues to drive improvements in Qdrant, demonstrating the benefits of a custom-tailored storage engine designed with modern demands in mind. Stay tuned for further updates and potential community releases as we keep pushing the boundaries of performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwbbltk1kbymxiqexsku.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
Simple, efficient, and designed just for Qdrant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>Discovery needs context</title>
      <dc:creator>Luis Cossío</dc:creator>
      <pubDate>Thu, 01 Feb 2024 13:55:36 +0000</pubDate>
      <link>https://forem.com/qdrant/discovery-needs-context-45l</link>
      <guid>https://forem.com/qdrant/discovery-needs-context-45l</guid>
      <description>&lt;p&gt;When Christopher Columbus and his crew sailed to cross the Atlantic Ocean, they were not looking for America. They were looking for a new route to India, and they were convinced that the Earth was round. They didn’t know anything about America, but since they were going west, they stumbled upon it.&lt;/p&gt;

&lt;p&gt;They couldn’t reach their target because the geography didn’t let them, but once they realized it wasn’t India, they claimed it as a new “discovery” for their crown. If we consider that sailors need water to sail, then we can establish a context which is positive on the water and negative on land. Once the sailors’ search was stopped by the land, they could go no further, and a new route was found. Let’s keep these concepts of target and context in mind as we explore the new functionality of Qdrant: &lt;strong&gt;Discovery search.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In version 1.7, Qdrant &lt;a href="https://qdrant.tech/articles/qdrant-1.7.x/"&gt;released&lt;/a&gt; this novel API that lets you constrain the space in which a search is performed, relying only on pure vectors. This is a powerful tool that lets you explore the vector space in a more controlled way. It can be used to find points that are not necessarily closest to the target, but are still relevant to the search.&lt;/p&gt;

&lt;p&gt;You can already select which points are available to the search by using payload filters. This by itself is very versatile because it allows us to craft complex filters that deterministically show only the points satisfying their criteria. However, the payload associated with each point is arbitrary and cannot tell us anything about its position in the vector space. In other words, filtering out irrelevant points can be seen as creating a mask, rather than a hyperplane cutting between the positive and negative vectors in the space.&lt;/p&gt;

&lt;p&gt;This is where a &lt;strong&gt;vector&lt;/strong&gt; &lt;em&gt;context&lt;/em&gt; can help. We define context as a list of pairs. Each pair is made up of a positive and a negative vector. With a context, we can define hyperplanes within the vector space, which always prefer the positive over the negative vectors. This effectively partitions the space where the search is performed. After the space is partitioned, we then need a target to return the points that are more similar to it.&lt;/p&gt;
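&lt;p&gt;To make the hyperplane intuition concrete, here is a minimal Python sketch (purely illustrative, with hypothetical names like &lt;code&gt;side&lt;/code&gt;; this is not Qdrant's internal code): each positive-negative pair splits the space according to which example a point is more similar to, and several pairs together partition it into cells.&lt;/p&gt;

```python
def sim(a, b):
    # Dot product as a stand-in similarity (assumes comparable vectors).
    return sum(x * y for x, y in zip(a, b))

def side(point, positive, negative):
    # Which side of the pair's hyperplane the point falls on:
    # True when it is more similar to the positive example.
    return sim(point, positive) > sim(point, negative)

# Two pairs partition 2D space into up to four cells.
pairs = [((1.0, 0.0), (-1.0, 0.0)),   # prefer "east" over "west"
         ((0.0, 1.0), (0.0, -1.0))]   # prefer "north" over "south"

cell = tuple(side((0.7, -0.2), pos, neg) for pos, neg in pairs)
# The point is east (True) of the first hyperplane but south (False) of the second.
```

&lt;p&gt;With n pairs, a point can land in one of up to 2^n cells; the all-positive cell is the one the search prefers.&lt;/p&gt;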

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs03wz2e4vepw5ng9zy9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs03wz2e4vepw5ng9zy9d.png" alt="" width="720" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While positive and negative vectors might suggest the use of the &lt;a href="https://qdrant.tech/documentation/concepts/explore/#recommendation-api"&gt;recommendation interface&lt;/a&gt;, in the case of context they must be paired up in a positive-negative fashion. This is inspired by the machine-learning concept of &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Triplet_loss"&gt;triplet loss&lt;/a&gt;&lt;/em&gt;, where you have three vectors: an anchor, a positive, and a negative. Triplet loss evaluates how much closer the anchor is to the positive than to the negative vector, so that learning happens by “moving” the positive and negative points to get a better evaluation. During discovery, however, we consider the positive and negative vectors as static points, and we search the whole dataset for the “anchors”, or result candidates, which best fit this characteristic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff68nwfa5m6wvfvb7pb5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff68nwfa5m6wvfvb7pb5s.png" alt="" width="720" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/articles/discovery-search/#discovery-search"&gt;Discovery search&lt;/a&gt;, then, is made up of two main inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;target&lt;/strong&gt;: the main point of interest&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;context&lt;/strong&gt;: the pairs of positive and negative points we just defined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this is not the only way to use it. Alternatively, you can provide &lt;strong&gt;only&lt;/strong&gt; a context, which invokes a Context Search. This is useful when you want to explore the space defined by the context but don’t have a specific target in mind. But hold your horses, we’ll get to that later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery search
&lt;/h3&gt;

&lt;p&gt;Let’s talk about the first case: context with a target.&lt;/p&gt;

&lt;p&gt;To understand why this is useful, let’s take a look at a real-world example: using a multimodal encoder like CLIP to search for images, from text and images. &lt;a href="https://openai.com/blog/clip/"&gt;CLIP&lt;/a&gt; is a neural network that can embed both images &lt;strong&gt;and&lt;/strong&gt; text into the same vector space. This means that you can search for images using either a text query or an image query. For this example, we’ll reuse our &lt;a href="https://food-discovery.qdrant.tech/"&gt;food recommendations&lt;/a&gt; demo by typing “burger” in the text input:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fen3iop9l6tywisgdiol4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fen3iop9l6tywisgdiol4.png" alt="" width="720" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is basically nearest neighbor search, and while technically we have only images of burgers, one of them is a logo representation of a burger. We’re looking for actual burgers, though. Let’s try to exclude images like that by adding it as a negative example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntn6r1ccgmrgdcqeeeu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntn6r1ccgmrgdcqeeeu5.png" alt="" width="720" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait a second, what has just happened? These pictures have nothing to do with burgers, and still, they appear on the first results. Is the demo broken?&lt;/p&gt;

&lt;p&gt;Turns out, multimodal encoders &lt;a href="https://modalitygap.readthedocs.io/en/latest/"&gt;might not work how you expect them to&lt;/a&gt;. Images and text are embedded in the same space, but they are not necessarily close to each other. This means that we can create a mental model of the distribution as two separate planes, one for images and one for text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7c1bl5a647x2wz734l1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7c1bl5a647x2wz734l1.png" alt="" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where discovery excels, because it allows us to constrain the space considering the same mode (images) while using a target from the other mode (text).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpokfyopr2q7wr3amym1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpokfyopr2q7wr3amym1.png" alt="" width="720" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Discovery also lets us keep giving feedback to the search engine in the shape of more context pairs, so we can keep refining our search until we find what we are looking for.&lt;/p&gt;

&lt;p&gt;Another intuitive example: imagine you’re looking for a fish pizza, but pizza names can be confusing, so you can just type “pizza”, and prefer a fish over meat. Discovery search will let you use these inputs to suggest a fish pizza… even if it’s not called fish pizza!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggtp32bkem5lzs97sbda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggtp32bkem5lzs97sbda.png" alt="" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Context search
&lt;/h3&gt;

&lt;p&gt;Now, second case: only providing context.&lt;/p&gt;

&lt;p&gt;Ever been caught in the same recommendations on your favourite music streaming service? This may be caused by getting stuck in a similarity bubble. As user input gets more complex, diversity becomes scarce, and it becomes harder to force the system to recommend something different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgregvggwx3xksy005xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgregvggwx3xksy005xd.png" alt="" width="720" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context search&lt;/strong&gt; solves this by de-focusing the search around a single point. Instead, it selects points randomly from within a zone in the vector space. This search is the most influenced by triplet loss, as the score can be thought of as answering: “how much closer is this point to a negative than to a positive vector?”. If it is closer to the positive one, then its score will be zero, the same as any other point within the same zone. But if it is on the negative side, it will be assigned an increasingly negative score the further it gets.&lt;/p&gt;
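&lt;p&gt;The scoring just described can be sketched in Python as follows (an illustration of the behavior above with hypothetical names, not Qdrant's exact implementation): each pair contributes zero when the candidate is on its positive side, and a growing penalty the deeper it sits on the negative side.&lt;/p&gt;

```python
def sim(a, b):
    # Dot product as a stand-in similarity.
    return sum(x * y for x, y in zip(a, b))

def context_score(candidate, pairs):
    # Zero inside the "positive" zone; increasingly negative outside it.
    return sum(min(sim(candidate, pos) - sim(candidate, neg), 0.0)
               for pos, neg in pairs)

pairs = [((1.0, 0.0), (-1.0, 0.0))]
context_score((0.5, 0.8), pairs)   # 0.0: on the positive side
context_score((-0.5, 0.8), pairs)  # -1.0: penalized, and more so further out
```

&lt;p&gt;All points in the positive zone tie at zero, which is exactly what lets context search return a diverse selection from within that zone instead of crowding around one point.&lt;/p&gt;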

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gfbk52ed1zmc5bc1wna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gfbk52ed1zmc5bc1wna.png" alt="" width="720" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Representing complex tastes in a high-dimensional space becomes easier, since you can just add more context pairs to the search. This way, you can constrain the space enough to select points from a per-search “category” created purely from the context in the input.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8za3avfosqsiao463nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8za3avfosqsiao463nl.png" alt="" width="720" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way you can serve refreshing recommendations, while still staying in control by providing positive and negative feedback, or even by trying out different permutations of pairs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping up
&lt;/h3&gt;

&lt;p&gt;Discovery search is a powerful tool that lets you explore the vector space in a more controlled way. It can be used to find points that are not necessarily close to the target, but are still relevant to the search. It can also be used to represent complex tastes, and break out of the similarity bubble. Check out the &lt;a href="https://qdrant.tech/documentation/concepts/explore/#discovery-api"&gt;documentation&lt;/a&gt; to learn more about the math behind it and how to use it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
