<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rafael Camargo Leite</title>
    <description>The latest articles on Forem by Rafael Camargo Leite (@rcmgleite).</description>
    <link>https://forem.com/rcmgleite</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1373846%2F74e0512b-154f-4b2a-8b42-01a6bcd1d9d7.jpeg</url>
      <title>Forem: Rafael Camargo Leite</title>
      <link>https://forem.com/rcmgleite</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rcmgleite"/>
    <language>en</language>
    <item>
      <title>Build your own Dynamo-like key/value database - Part 1 - TCP Server</title>
      <dc:creator>Rafael Camargo Leite</dc:creator>
      <pubDate>Tue, 12 Nov 2024 20:13:44 +0000</pubDate>
      <link>https://forem.com/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop</link>
      <guid>https://forem.com/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;This is the second post in the &lt;code&gt;Build your own Dynamo-like key/value database&lt;/code&gt; series. You can find Part 0 &lt;a href="https://dev.to/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1 - Exposing APIs to clients
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Background
&lt;/h3&gt;

&lt;p&gt;Every database needs to expose its APIs for clients to consume somehow.&lt;br&gt;
This is not strictly relevant to our goal of studying the internals of a database system, but we nevertheless need an entry point for clients to interact with, even if only to let us write proper end-to-end tests.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Requirements
&lt;/h3&gt;
&lt;h4&gt;
  
  
  2.1. Non-functional
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;readability&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;This is going to be a requirement for every component, but since this is our first one, better to state it clearly: if, at any point, going through the code base yields too many "wtf"s, we messed up.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal amount of dependencies&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;This is another cornerstone of every component - our goal is to learn how all of these data structures and algorithms are implemented. If we just pull in dependencies for all of them, what's the point of even building this database!?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity and extensibility&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Similar to requirement 1: making components as simple as possible definitely helps with readability.&lt;/li&gt;
&lt;li&gt;Let's make sure adding new commands doesn't require lots of changes/boilerplate.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  2.2. Functional
&lt;/h4&gt;

&lt;p&gt;The only functional requirement I'll add is the ability to trace requests based on request_ids - i.e., for every request our database receives, we have to be able to trace it for debugging purposes.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Design
&lt;/h3&gt;

&lt;p&gt;Different databases provide different interfaces for clients to interact with. A few examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt; - a SQL database that runs embedded in the process that consumes it. For this reason, there's a C client library that exposes all the APIs needed for database management and query execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; - an in-memory database that provides a wide variety of APIs to access different data structures over TCP. It uses a custom protocol on top of TCP called &lt;a href="https://redis.io/docs/latest/develop/reference/protocol-spec/" rel="noopener noreferrer"&gt;RESP&lt;/a&gt;, so clients that want to integrate with Redis have to implement RESP on top of TCP to be able to issue commands against it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CouchDB&lt;/strong&gt; - exposes a RESTful HTTP API for clients to consume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As previously stated, in our case the actual interface is less relevant - our goal is to study the internals of storage systems, not to write a production-ready, performant database for broad consumption. For that reason, I'll lean on non-functional requirements 2 (as few dependencies as possible) and 3 (simplicity) and choose to expose our APIs via TCP using a thin serialization protocol based on JSON.&lt;/p&gt;

&lt;p&gt;Our serialization protocol will be based on &lt;code&gt;Message&lt;/code&gt; - a &lt;code&gt;struct&lt;/code&gt; composed of a small binary header followed by an optional JSON payload that represents the &lt;code&gt;Command&lt;/code&gt; we want to execute.&lt;br&gt;
After parsing a &lt;code&gt;Message&lt;/code&gt; from the network, we will then build a &lt;code&gt;Command&lt;/code&gt; which, in turn, can be executed.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Implementation
&lt;/h3&gt;

&lt;p&gt;We will start by defining what a &lt;code&gt;Command&lt;/code&gt; is.&lt;br&gt;
Think of a &lt;code&gt;Command&lt;/code&gt; as a &lt;code&gt;Controller&lt;/code&gt; in an &lt;a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller" rel="noopener noreferrer"&gt;MVC&lt;/a&gt; application. A command interprets the request a client issued and generates the appropriate response by calling the internal methods where the business logic lives.&lt;br&gt;
Let's get a bit more concrete by implementing &lt;code&gt;Ping&lt;/code&gt;.&lt;br&gt;
From a client's perspective, &lt;code&gt;Ping&lt;/code&gt; is as simple as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends a Ping request over TCP,&lt;/li&gt;
&lt;li&gt;Cluster node parses the received bytes into a &lt;code&gt;Message&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;Cluster node constructs a &lt;code&gt;Ping&lt;/code&gt; command from the &lt;code&gt;Message&lt;/code&gt; parsed in step 2,&lt;/li&gt;
&lt;li&gt;Cluster node replies to the client with a &lt;em&gt;"PONG"&lt;/em&gt; message.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;Ping&lt;/code&gt; command can be found &lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/ping.rs" rel="noopener noreferrer"&gt;here&lt;/a&gt; in &lt;code&gt;rldb&lt;/code&gt;. The core of its implementation is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
#[derive(Debug)]
pub struct Ping;

#[derive(Serialize, Deserialize)]
pub struct PingResponse {
    pub message: String,
}

impl Ping {
    pub async fn execute(self) -&amp;gt; Result&amp;lt;PingResponse&amp;gt; {
        Ok(PingResponse {
            message: "PONG".to_string(),
        })
    }
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, how do we construct a &lt;code&gt;Ping&lt;/code&gt; &lt;code&gt;Command&lt;/code&gt; from a TCP connection?&lt;br&gt;
That's where the &lt;code&gt;Message&lt;/code&gt; definition comes into play. &lt;code&gt;Message&lt;/code&gt; is our serialization protocol on top of TCP.&lt;br&gt;
It defines the format/frame of the bytes a client has to send in order for our server to be able to interpret them.&lt;br&gt;
Think of it as an intermediate abstraction that enables our database server to properly route requests to the specific &lt;code&gt;Command&lt;/code&gt; based on the arguments provided.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Message&lt;/code&gt; is defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub struct Message {
    /// Used as a way of identifying the format of the payload for deserialization
    pub cmd_id: CommandId,
    /// A unique request identifier - used for request tracing and debugging
    /// Note that this has to be encoded as utf8 otherwise parsing the message will fail
    pub request_id: String,
    /// the Request payload
    pub payload: Option&amp;lt;Bytes&amp;gt;,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every field prior to the payload can be thought of as the header of the message, while the payload is what actually encodes the &lt;code&gt;Command&lt;/code&gt; we are trying to execute. Many other fields could be included as part of the header. Two fields will most likely be added in the future:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a checksum of the payload (maybe one including the overall message as well?)&lt;/li&gt;
&lt;li&gt;the timestamp of the message (just for debugging purposes)&lt;/li&gt;
&lt;/ol&gt;
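
&lt;p&gt;Putting it together, the frame a client writes on the wire can be visualized like this (length fields are big-endian, matching how they are read below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| cmd_id (u8) | request_id_len (u32) | request_id (utf-8) | payload_len (u32) | payload (JSON, optional) |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;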

&lt;p&gt;Now let's construct a &lt;code&gt;Message&lt;/code&gt; from a TCP connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;impl Message {
    pub async fn try_from_async_read(tcp_connection: &amp;amp;mut TcpStream) -&amp;gt; Result&amp;lt;Self&amp;gt; {
        let cmd_id = CommandId::try_from(tcp_connection.read_u8().await?)?;
        let request_id_length = tcp_connection.read_u32().await?;
        let request_id = {
            let mut buf = vec![0u8; request_id_length as usize];
            tcp_connection.read_exact(&amp;amp;mut buf).await?;
            String::from_utf8(buf).map_err(|_| {
                Error::InvalidRequest(InvalidRequest::MessageRequestIdMustBeUtf8Encoded)
            })?
        };

        let payload_length = tcp_connection.read_u32().await?;

        let payload = if payload_length &amp;gt; 0 {
            let mut buf = vec![0u8; payload_length as usize];
            tcp_connection.read_exact(&amp;amp;mut buf).await?;
            Some(buf.into())
        } else {
            None
        };

        Ok(Self {
            cmd_id,
            request_id,
            payload,
        })
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I removed most of the error handling from this snippet to make it more readable. Read the actual implementation &lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/message.rs#L68" rel="noopener noreferrer"&gt;here&lt;/a&gt; if curious.&lt;/li&gt;
&lt;li&gt;If you check the real implementation, you will see that the signature of the function is slightly different. Instead of
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub async fn try_from_async_read(tcp_connection: &amp;amp;mut TcpStream) -&amp;gt; Result&amp;lt;Message&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub async fn try_from_async_read&amp;lt;R: AsyncRead + Unpin&amp;gt;(reader: &amp;amp;mut R) -&amp;gt; Result&amp;lt;Self&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an important distinction that is worth discussing. If we took a &lt;code&gt;TcpStream&lt;/code&gt; as the argument, every test would require that we set up a real TCP server, create TCP connections, and send/receive bytes via localhost.&lt;br&gt;
Not only would this make tests slow, it would also make writing them much more complicated than they have to be.&lt;/p&gt;

&lt;p&gt;Here is an example of a test that we can write without setting up any TCP connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#[tokio::test]
    async fn test_max_message_size_exceeded() {
        let mut reader = MaxMessageSizeExceededAsyncRead::default();
        let err = Message::try_from_async_read(&amp;amp;mut reader)
            .await
            .err()
            .unwrap();

        match err {
            Error::InvalidRequest(InvalidRequest::MaxMessageSizeExceeded { max, got }) =&amp;gt; {
                assert_eq!(max, MAX_MESSAGE_SIZE);
                assert_eq!(got, MAX_MESSAGE_SIZE + 1);
            }
            _ =&amp;gt; {
                panic!("Unexpected error: {}", err);
            }
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, I chose to implement &lt;a href="https://docs.rs/tokio/latest/tokio/io/trait.AsyncRead.html" rel="noopener noreferrer"&gt;AsyncRead&lt;/a&gt; for my custom struct &lt;code&gt;MaxMessageSizeExceededAsyncRead&lt;/code&gt;. But we could've used &lt;code&gt;UnixStream&lt;/code&gt;s or any of the many other types that impl &lt;code&gt;AsyncRead&lt;/code&gt; to inject the bytes we are interested in.&lt;/p&gt;
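
&lt;p&gt;To illustrate the flexibility this buys us, here is a small, self-contained sketch (not taken from rldb) that performs the same kind of reads against an in-memory &lt;code&gt;std::io::Cursor&lt;/code&gt;, which tokio also treats as an &lt;code&gt;AsyncRead&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::io::Cursor;
use tokio::io::AsyncReadExt;

#[tokio::main]
async fn main() {
    // A hand-crafted frame prefix: cmd_id = 1, request_id_len = 3 (big-endian u32)
    let mut reader = Cursor::new(vec![1u8, 0, 0, 0, 3]);

    // The same reads try_from_async_read performs, but against memory
    let cmd_id = reader.read_u8().await.unwrap();
    let request_id_len = reader.read_u32().await.unwrap();

    assert_eq!(cmd_id, 1);
    assert_eq!(request_id_len, 3);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;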

&lt;ol start="3"&gt;
&lt;li&gt;You will likely notice that no timeouts are set on any of the TCP connection interactions. This is something that would have to be included if we were building a production-ready service, but I chose to skip it at this point as it is not the focus of our work. That said, if you plan on deploying anything that interacts with the network to a production environment, handling timeouts/slow requests is mandatory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we can build a &lt;code&gt;Message&lt;/code&gt; out of a &lt;code&gt;TcpStream&lt;/code&gt;, we must also be able to serialize a &lt;code&gt;Message&lt;/code&gt; into bytes so that it can be sent over the network. The snippet below depicts how this can be done (again without error handling):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;impl Message {
    pub fn serialize(self) -&amp;gt; Bytes {
        let payload_len = self.payload.as_ref().map_or(0, |payload| payload.len());
        let mut buf = BytesMut::with_capacity(
            self.request_id.len() + payload_len + 2 * size_of::&amp;lt;u32&amp;gt;() + size_of::&amp;lt;u8&amp;gt;(),
        );

        buf.put_u8(self.cmd_id as u8);
        buf.put_u32(self.request_id.len() as u32);
        buf.put(self.request_id.as_bytes());
        buf.put_u32(payload_len as u32);
        if let Some(payload) = self.payload {
            buf.put(payload);
        }

        assert_eq!(buf.capacity(), buf.len());
        buf.freeze()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's go over some notes here as well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You will quickly see that I rely on the &lt;a href="https://crates.io/crates/bytes" rel="noopener noreferrer"&gt;bytes&lt;/a&gt; crate heavily throughout this project. Bytes is a very useful tool for writing network-related applications: it lets you work with contiguous byte buffers while avoiding memory copies almost entirely. It also provides some neat traits like &lt;a href="https://docs.rs/bytes/1.6.1/bytes/buf/trait.BufMut.html" rel="noopener noreferrer"&gt;BufMut&lt;/a&gt;, which gives us methods like &lt;code&gt;put_u32&lt;/code&gt;. Please refer to its documentation for more information.&lt;/li&gt;
&lt;li&gt;I included an assertion on &lt;code&gt;buf&lt;/code&gt;'s len and capacity to make sure that an important invariant holds: if we allocate a buffer of size &lt;code&gt;X&lt;/code&gt;, we must completely fill it without ever resizing it. This guarantees that we are properly computing the buffer size prior to filling it (which is important if we care about performance, for example).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, given that we have a &lt;code&gt;cmd_id&lt;/code&gt; field in &lt;code&gt;Message&lt;/code&gt;, we can easily choose how to serialize the payload field for each specific Command and vice versa.&lt;/p&gt;
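
&lt;p&gt;To make that routing concrete, here is a minimal, hypothetical sketch of dispatching on &lt;code&gt;cmd_id&lt;/code&gt; - the type names mirror rldb's but the code is illustrative, not its actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative stand-ins for rldb's real types, which differ in detail
enum CommandId {
    Ping,
    Get,
}

#[derive(serde::Deserialize)]
struct Get {
    key: String,
}

enum Command {
    Ping,
    Get(Get),
}

fn command_from(cmd_id: CommandId, payload: Option&amp;lt;&amp;amp;[u8]&amp;gt;) -&amp;gt; Result&amp;lt;Command, String&amp;gt; {
    match cmd_id {
        // Ping carries no payload - nothing to deserialize
        CommandId::Ping =&amp;gt; Ok(Command::Ping),
        // cmd_id tells us the payload should parse as Get's JSON format
        CommandId::Get =&amp;gt; {
            let bytes = payload.ok_or("Get requires a payload")?;
            let get: Get = serde_json::from_slice(bytes).map_err(|e| e.to_string())?;
            Ok(Command::Get(get))
        }
    }
}

fn main() {
    let cmd = command_from(CommandId::Get, Some(br#"{"key":"foo"}"#.as_slice())).unwrap();
    if let Command::Get(get) = cmd {
        println!("GET {}", get.key);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding a new command then means adding a variant and one match arm - exactly the low-boilerplate extensibility that non-functional requirement 3 asked for.&lt;/p&gt;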

&lt;p&gt;For Ping, a Request &lt;code&gt;Message&lt;/code&gt; would look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message {
    cmd_id: 1,
    request_id: &amp;lt;some string&amp;gt;,
    payload: None 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a response message would look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message {
    cmd_id: 1,
    request_id: &amp;lt;some string&amp;gt;,
    payload: Some(Bytes::from(serde_json::to_string(&amp;amp;PingResponse { message: "PONG".to_string() }).unwrap()))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it: a client only needs to know about &lt;code&gt;Message&lt;/code&gt; in order to interact with our database.&lt;/p&gt;

&lt;p&gt;If you want to walk through the code yourself, here are pointers to the four larger components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/mod.rs#L38" rel="noopener noreferrer"&gt;The TcpListener&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/message.rs" rel="noopener noreferrer"&gt;Message&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/mod.rs" rel="noopener noreferrer"&gt;Command&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/ping.rs" rel="noopener noreferrer"&gt;Ping&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  The IntoMessage trait and request_id (covers the functional requirement around tracing requests)
&lt;/h4&gt;

&lt;p&gt;For any type that we want to be convertible into a &lt;code&gt;Message&lt;/code&gt;, we can implement the &lt;code&gt;IntoMessage&lt;/code&gt; trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub trait IntoMessage {
    /// Same as [`Message::cmd_id`]
    fn cmd_id(&amp;amp;self) -&amp;gt; CommandId;
    /// Same as [`Message::payload`]
    fn payload(&amp;amp;self) -&amp;gt; Option&amp;lt;Bytes&amp;gt; {
        None
    }
    fn request_id(&amp;amp;self) -&amp;gt; String {
        REQUEST_ID
            .try_with(|rid| rid.clone())
            .unwrap_or("NOT_SET".to_string())
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that this trait has two default method implementations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;payload -&amp;gt; Commands like Ping don't have a payload, so Ping doesn't have to implement this specific function and can just rely on the default behavior.&lt;/li&gt;
&lt;li&gt;request_id -&amp;gt; This is the more interesting one: every message has to have a request_id associated with it. This is going to be extremely important once we have to analyze logs/traces for requests received by the database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The way request_id is handled by &lt;em&gt;rldb&lt;/em&gt; at the time of this writing is: the request_id is injected either by the client or by the server into a &lt;a href="https://docs.rs/tokio/latest/tokio/macro.task_local.html" rel="noopener noreferrer"&gt;tokio::task_local&lt;/a&gt; as soon as the &lt;code&gt;Message&lt;/code&gt; is parsed from the network.&lt;br&gt;
If you think about it, it's a bit strange that we let the client set a request_id that is internal to the database. We allow this (at least for now) because rldb nodes &lt;strong&gt;talk to each other&lt;/strong&gt;. For requests like &lt;code&gt;Get&lt;/code&gt;, an rldb cluster node will have to talk to at least 2 other nodes to issue internal GET requests. In order for us to be able to connect all of the logs generated for a single client request, we have to provide a mechanism for an rldb node to inject the request_id being executed when issuing requests to other nodes in the cluster.&lt;/p&gt;
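
&lt;p&gt;The mechanism can be sketched with a minimal, self-contained example of &lt;code&gt;tokio::task_local&lt;/code&gt; (the names here are illustrative and simplified, not rldb's exact definitions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokio::task_local! {
    // One request_id per task handling a client connection
    static REQUEST_ID: String;
}

async fn do_work() -&amp;gt; String {
    // Deep inside the request, read the id without threading it through arguments
    REQUEST_ID
        .try_with(|rid| rid.clone())
        .unwrap_or_else(|_| "NOT_SET".to_string())
}

#[tokio::main]
async fn main() {
    // Set right after the Message is parsed; visible to everything awaited inside
    let traced = REQUEST_ID.scope("req-42".to_string(), do_work()).await;
    assert_eq!(traced, "req-42");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;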

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;This post covered how &lt;code&gt;rldb&lt;/code&gt; handles incoming client requests. It described how requests are serialized/deserialized and how request ids are propagated throughout the request lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next chapter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 3 - Bootstrapping our cluster: Node discovery and failure detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will cover how gossip protocols work and walk through the &lt;code&gt;rldb&lt;/code&gt; gossip implementation.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>distributedsystems</category>
      <category>database</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build your own Dynamo-like key/value database - Part 0 - Intro</title>
      <dc:creator>Rafael Camargo Leite</dc:creator>
      <pubDate>Sat, 20 Jul 2024 16:11:22 +0000</pubDate>
      <link>https://forem.com/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh</link>
      <guid>https://forem.com/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh</guid>
      <description>&lt;h1&gt;
  
  
  1. Background
&lt;/h1&gt;

&lt;p&gt;Most developers have, at some point in time, interacted with storage systems. Databases like &lt;a href="https://github.com/redis/redis" rel="noopener noreferrer"&gt;redis&lt;/a&gt; are almost guaranteed to be part of every tech stack in active use today.&lt;/p&gt;

&lt;p&gt;When using storage systems like this, understanding their nitty-gritty is really key to properly integrating them with your application and, most importantly, operating them correctly. One of the greatest resources on how storage systems work is the book &lt;a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann. This book is a comprehensive compilation of most of the algorithms and data structures that power modern storage systems.&lt;/p&gt;

&lt;p&gt;Now you may ask: why write a deep dive series if the book covers all the relevant topics already? Well, books like this are great but they lack concrete implementations. It's hard to tell if one actually understood all the concepts and how they are applied just by reading about them.&lt;/p&gt;

&lt;p&gt;To close this gap between reading and building, I decided to write my own little storage system - &lt;a href="https://github.com/rcmgleite/rldb" rel="noopener noreferrer"&gt;rldb&lt;/a&gt; - a dynamo-like key/value database - i.e., a key/value database that implements &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Amazon's Dynamo paper&lt;/a&gt; from 2007.&lt;/p&gt;

&lt;p&gt;This is the first part of a blog post series that will go through every component described in the Dynamo paper, discuss the rationale behind their design, analyze trade-offs, list possible solutions, and then walk through concrete implementations in &lt;a href="https://github.com/rcmgleite/rldb" rel="noopener noreferrer"&gt;rldb&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.1. Reader requirement
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The code will be written in Rust&lt;/strong&gt;, so the reader is expected to understand the basics of the Rust language and have familiarity with asynchronous programming. Other topics, like networking and any other specific algorithm/data structure, will be at least briefly introduced when required (and relevant links will be included for further reading).&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2. What is &lt;em&gt;rldb&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;rldb&lt;/em&gt; is a dynamo-like distributed key/value database that provides PUT, GET and DELETE APIs over TCP. Let's break that apart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dynamo-like&lt;/strong&gt; - Our database will be based on the &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Dynamo paper&lt;/a&gt;, so almost every requirement listed in the paper is a requirement of our implementation as well (aside from efficiency/SLAs, which will be ignored for now).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;distributed&lt;/strong&gt; - Our database will be comprised of multiple processes/nodes connected to each other via network. This means that the data we store will be spread across multiple nodes instead of a single one. This is what creates most of the complexity around our implementation. In later posts we will have to understand trade-offs related to strong vs eventual consistency, conflicts and versioning, partitioning strategies, quorum vs sloppy quorum, etc.. All of these things will be explained in detail in due time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;key/value&lt;/strong&gt; - Our database will only know how to store and retrieve data based on its associated key. There won't be any secondary indexes, schemas etc...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APIs over TCP&lt;/strong&gt; - The way our clients will be able to interact with our database is through &lt;a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol" rel="noopener noreferrer"&gt;TCP&lt;/a&gt; messages. We will have a very thin framing/codec layer on top of TCP to help us interpret the requests and responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  2. Dissecting the dynamo paper
&lt;/h1&gt;

&lt;p&gt;Let's go through dynamo's requirements and architecture and make sure we can answer the following question: &lt;code&gt;what is a dynamo-like database&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;I have to start by saying: the AWS DynamoDB offering IS &lt;strong&gt;NOT&lt;/strong&gt; based on the Amazon Dynamo paper. This can make things extra confusing once we start looking at the APIs we are going to create, so I want to state it clearly here at the beginning to avoid confusion later.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1. The dynamo use-case
&lt;/h2&gt;

&lt;p&gt;One of the most relevant use-cases that led to the development of Dynamo was the Amazon shopping cart.&lt;br&gt;
The most important aspect of the shopping cart is: whenever a customer tries to add an item to the cart (mutate it), the operation &lt;strong&gt;HAS&lt;/strong&gt; to succeed. This means maximizing write availability is a key aspect of a dynamo database.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2. Dynamo requirements and design
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.2.1. Write availability
&lt;/h3&gt;

&lt;p&gt;As explained in the previous section, one of the key aspects of a dynamo database is how important write availability is.&lt;br&gt;
According to the &lt;a href="https://en.wikipedia.org/wiki/CAP_theorem" rel="noopener noreferrer"&gt;CAP theorem&lt;/a&gt;, when a system is in the presence of a network partition, it has to choose between Consistency (C) and Availability (A).&lt;/p&gt;

&lt;p&gt;Availability is the ability of a system to respond to requests successfully (&lt;code&gt;availability = ((n_requests - n_errors) / n_requests) * 100&lt;/code&gt;)&lt;/p&gt;
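
&lt;p&gt;As a quick sanity check, availability as a success percentage can be computed like so (a sketch, not part of rldb):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/// Percentage of requests that completed successfully
fn availability_pct(n_requests: u64, n_errors: u64) -&amp;gt; f64 {
    ((n_requests - n_errors) as f64 / n_requests as f64) * 100.0
}

fn main() {
    // 1_000_000 requests with 100 failures yields 99.99% availability
    let pct = availability_pct(1_000_000, 100);
    assert!((pct - 99.99).abs() &amp;lt; 1e-9);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;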

&lt;p&gt;Consistency is related to the following question: &lt;code&gt;Is a client guaranteed to see the most recent value for a given key when it issues a GET?&lt;/code&gt; (also known as read-after-write consistency).&lt;/p&gt;

&lt;p&gt;In a dynamo-like database, Availability is always prioritized over consistency, making it an eventually-consistent database.&lt;/p&gt;

&lt;p&gt;The way dynamo increases write (and get) availability is by using a technique called &lt;code&gt;leaderless replication&lt;/code&gt; in conjunction with &lt;code&gt;sloppy quorums&lt;/code&gt; and &lt;code&gt;hinted hand-offs&lt;/code&gt; (&lt;code&gt;sloppy quorum&lt;/code&gt; and &lt;code&gt;hinted hand-off&lt;/code&gt; will be explained in future posts).&lt;/p&gt;

&lt;p&gt;In leaderless replicated systems, multiple nodes in the cluster can accept writes (as opposed to leader-based replication), and consistency guarantees can be configured (to some extent) by leveraging &lt;code&gt;Quorum&lt;/code&gt;. To describe &lt;code&gt;Quorum&lt;/code&gt;, let's go through an example in which the quorum configuration is &lt;code&gt;replicas: 3, reads: 2, writes: 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F950m4txgjfezr56s0gcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F950m4txgjfezr56s0gcd.png" alt="leaderless_replication_quorum" width="784" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, a client sends a PUT to &lt;code&gt;Node A&lt;/code&gt;. In order for the PUT to be considered successful under the stated quorum configuration, 2 out of 3 replicas need to acknowledge it &lt;strong&gt;synchronously&lt;/strong&gt;. So &lt;code&gt;Node A&lt;/code&gt; stores the data locally (the first acknowledged put) and then forwards the request to another 2 nodes (for a total of 3 replicas, as per the quorum configuration). If either of the 2 other nodes acknowledges the replicated PUT, &lt;code&gt;Node A&lt;/code&gt; can respond with success to the client.&lt;/p&gt;

&lt;p&gt;A similar algorithm is used for reads. In the configuration from our example, a read is only successful if 2 nodes respond to the read request successfully.&lt;/p&gt;

&lt;p&gt;When deciding what your quorum configuration should look like, the trade-off being evaluated is consistency guarantees vs performance (request latency).&lt;/p&gt;

&lt;p&gt;If you want stronger consistency guarantees, the inequality you have to satisfy is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reads + writes &amp;gt; replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our example, &lt;code&gt;reads = 2, writes = 2, replicas = 3&lt;/code&gt; -&amp;gt; we satisfy this inequality and are therefore opting for stronger consistency guarantees while sacrificing performance - every read and write requires at least 2 nodes to respond. When we discuss &lt;code&gt;sloppy quorums&lt;/code&gt; and &lt;code&gt;hinted hand-offs&lt;/code&gt; our consistency analysis will become much more nuanced, but I'll table that discussion for brevity's sake.&lt;/p&gt;
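
&lt;p&gt;The inequality is easy to encode as a quick check (a sketch, not rldb code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/// True when the read and write sets are guaranteed to overlap
/// in at least one node, i.e. reads + writes &amp;gt; replicas
fn is_read_after_write_consistent(replicas: u32, reads: u32, writes: u32) -&amp;gt; bool {
    reads + writes &amp;gt; replicas
}

fn main() {
    // Our example configuration: replicas: 3, reads: 2, writes: 2
    assert!(is_read_after_write_consistent(3, 2, 2));
    // replicas: 2, reads: 1, writes: 1 offers no such guarantee
    assert!(!is_read_after_write_consistent(2, 1, 1));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;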

&lt;p&gt;The problem with leaderless replication is that we are now open to version conflicts due to concurrent writes.&lt;br&gt;
The following image depicts a possible scenario in which this would happen. Assume the quorum configuration to be &lt;code&gt;replicas: 2, reads: 1, writes: 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw6g359ibxvx1j8uyw5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw6g359ibxvx1j8uyw5g.png" alt="leaderless_replication_conflict" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To detect conflicts, dynamo databases use techniques like vector clocks (or version vectors). Vector clocks will be explained in great detail in future posts.&lt;br&gt;
If a conflict is detected, both conflicting values are stored and a conflict resolution process needs to happen at a later stage. In the dynamo case, conflicts are handled by clients during &lt;code&gt;read&lt;/code&gt;: when a client issues a GET for a key that has conflicting values, both values are sent as part of the response and the client has to issue a subsequent PUT to resolve the conflict with whatever value it wants to store. Again, more details on conflict resolution will be part of future posts on the Get and Put API implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2.2. System Scalability
&lt;/h3&gt;

&lt;p&gt;Quoting &lt;em&gt;Designing Data-Intensive Applications - chapter 1&lt;/em&gt;:&lt;br&gt;
"Scalability is the term used to describe a system's ability to cope with increased load."&lt;br&gt;
For the dynamo paper, this means that when the number of read/write operations increases, the database should&lt;br&gt;
still be able to operate at the same availability and performance levels.&lt;/p&gt;

&lt;p&gt;To achieve scalability, dynamo-like databases rely on several different techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replication

&lt;ul&gt;
&lt;li&gt;scales reads&lt;/li&gt;
&lt;li&gt;as mentioned in the previous section, replication is implemented via leaderless-replication with version vectors for conflict detection and resolution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Partitioning - spreading the dataset amongst multiple nodes

&lt;ul&gt;
&lt;li&gt;scales writes&lt;/li&gt;
&lt;li&gt;Dynamo relies on a technique called consistent-hashing to decide which nodes should own which keys. Consistent hashing and its pros and cons will be explained in future posts.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
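
&lt;p&gt;Since consistent hashing gets its own post later, here is only a deliberately simplified ring to illustrate the partitioning idea (no virtual nodes, illustrative names): each key hashes to a position, and the first node at or after that position owns it.&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash anything hashable to a position on a u64 ring.
fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// A key is owned by the first node whose ring position is >= the key's
// hash; if none exists, we wrap around to the first node on the ring.
fn owner<'a>(ring: &'a [(u64, &'a str)], key: &str) -> &'a str {
    let k = hash_of(&key);
    ring.iter()
        .find(|(pos, _)| *pos >= k)
        .map(|(_, node)| *node)
        .unwrap_or(ring[0].1) // wrap around
}

fn main() {
    let mut ring: Vec<(u64, &str)> = ["node_a", "node_b", "node_c"]
        .iter()
        .map(|n| (hash_of(n), *n))
        .collect();
    ring.sort(); // ring positions must be ordered for the lookup to work

    // Ownership is deterministic: the same key always maps to one node.
    let o = owner(&ring, "my-key");
    assert_eq!(o, owner(&ring, "my-key"));
    assert!(["node_a", "node_b", "node_c"].contains(&o));
}
```

&lt;p&gt;The property that makes this "consistent" - only a small fraction of keys move when a node joins or leaves - is exactly what the future partitioning post will dig into.&lt;/p&gt;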

&lt;h3&gt;
  
  
  2.2.3. Data Durability
&lt;/h3&gt;

&lt;p&gt;"Durability is the ability of stored data to remain intact, complete, and uncorrupted over time, ensuring long-term accessibility." (&lt;a href="https://www.purestorage.com/knowledge/what-is-data-durability.html" rel="noopener noreferrer"&gt;ref&lt;/a&gt;).&lt;br&gt;
Storage systems like GCP object storage describe durability in terms of &lt;a href="https://cloud.google.com/storage/docs/availability-durability" rel="noopener noreferrer"&gt;how many nines of durability they guarantee over a year&lt;/a&gt;.&lt;br&gt;
For GCP, durability is guaranteed at 11 nines - ie: GCP won't lose any more than 0.000000001 percent of your data in a year. We won't be calculating how many nines of durability our database will be able to provide, but we will apply many different techniques that increase data durability in multiple different components.&lt;/p&gt;

&lt;p&gt;Most relevant durability techniques that we will go over are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replication -&amp;gt; Adds redundancy to the data stored&lt;/li&gt;
&lt;li&gt;Checksums and checksum bracketing -&amp;gt; guarantee that corruptions (whether on the network or in memory) don't silently lead to data loss&lt;/li&gt;
&lt;li&gt;Anti-entropy / read repair -&amp;gt; whenever a node is missing data that it should have (eg: it was offline for a deployment while writes were happening), our system has to be able to back-fill it.&lt;/li&gt;
&lt;/ol&gt;
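
&lt;p&gt;The checksum technique (item 2) can be sketched quickly: store a checksum next to the value on write, recompute and compare on read, and treat a mismatch as corruption to be repaired rather than data to be served. FNV-1a is used here purely as a simple stand-in for whatever checksum a real implementation would pick (eg: a CRC); all names are illustrative.&lt;/p&gt;

```rust
// FNV-1a: a tiny non-cryptographic hash, standing in for a real checksum.
fn fnv1a(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, b| {
        (h ^ *b as u64).wrapping_mul(0x100000001b3)
    })
}

struct StoredValue {
    payload: Vec<u8>,
    checksum: u64,
}

// On write: compute the checksum and persist it alongside the payload.
fn put(payload: &[u8]) -> StoredValue {
    StoredValue { payload: payload.to_vec(), checksum: fnv1a(payload) }
}

// On read: recompute and compare. A mismatch means the bytes changed
// after the checksum was taken - don't serve them, repair them.
fn get(v: &StoredValue) -> Result<&[u8], &'static str> {
    if fnv1a(&v.payload) == v.checksum {
        Ok(&v.payload)
    } else {
        Err("corruption detected: trigger read repair / anti-entropy")
    }
}

fn main() {
    let mut v = put(b"hello");
    assert!(get(&v).is_ok());
    v.payload[0] ^= 0xff; // simulate a bit flip on disk or in memory
    assert!(get(&v).is_err());
}
```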

&lt;h3&gt;
  
  
  2.2.4. Node discovery and failure detection
&lt;/h3&gt;

&lt;p&gt;Dynamo databases rely on &lt;a href="https://en.wikipedia.org/wiki/Gossip_protocol" rel="noopener noreferrer"&gt;Gossip protocols&lt;/a&gt; to discover cluster nodes, detect node failures and share partitioning assignments. This is in contrast with databases that rely on external services (like &lt;a href="https://zookeeper.apache.org/doc/r3.8.4/zookeeperUseCases.html" rel="noopener noreferrer"&gt;ZooKeeper&lt;/a&gt;) for this.&lt;/p&gt;
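
&lt;p&gt;One common gossip building block is a heartbeat table: each node periodically increments its own heartbeat counter and merges tables received from peers, and a node whose heartbeat stops advancing becomes a failure suspect. A toy sketch of the merge step (node names illustrative; the real protocol comes in a later post):&lt;/p&gt;

```rust
use std::collections::HashMap;

// Merge a heartbeat table received from a peer into our local view:
// for each node, keep the highest heartbeat we've seen.
fn merge(local: &mut HashMap<&'static str, u64>, remote: &HashMap<&'static str, u64>) {
    for (node, hb) in remote {
        let entry = local.entry(*node).or_insert(0);
        if *hb > *entry {
            *entry = *hb;
        }
    }
}

fn main() {
    let mut local = HashMap::from([("node_a", 5u64), ("node_b", 2)]);
    let remote = HashMap::from([("node_b", 7u64), ("node_c", 1)]);
    merge(&mut local, &remote);
    assert_eq!(local["node_b"], 7); // newer heartbeat wins
    assert_eq!(local["node_c"], 1); // node_c discovered via gossip
    assert_eq!(local["node_a"], 5); // our own view is untouched
}
```

&lt;p&gt;Note how node discovery falls out for free: we learn about &lt;code&gt;node_c&lt;/code&gt; simply by merging a peer's table, with no central registry involved.&lt;/p&gt;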

&lt;h2&gt;
  
  
  2.3. Dynamo-like database summary
&lt;/h2&gt;

&lt;p&gt;The dynamo-like database key characteristics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventually consistent storage system (but with tunable consistency guarantees via &lt;code&gt;Quorum&lt;/code&gt; configuration)&lt;/li&gt;
&lt;li&gt;Relies on leaderless-replication + sloppy quorums and hinted handoffs to maximize PUT availability&lt;/li&gt;
&lt;li&gt;Relies on Vector clocks for conflict detection&lt;/li&gt;
&lt;li&gt;Conflict resolution is handled by the client during reads&lt;/li&gt;
&lt;li&gt;Data is partitioned using Consistent-Hashing&lt;/li&gt;
&lt;li&gt;Durability is guaranteed by multiple techniques with anti-entropy and read-repair being the most relevant ones&lt;/li&gt;
&lt;li&gt;Node discovery and failure detection are implemented via Gossip Protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Next steps
&lt;/h1&gt;

&lt;p&gt;Based on the concepts introduced in this post and the use case of the dynamo paper, the next posts in this series will walk through each component of the dynamo architecture, explain how it fits into the overall design, discuss alternative solutions and trade-offs, and then dive into specific implementations of the chosen solutions.&lt;/p&gt;

&lt;p&gt;Below I include a (non-comprehensive) list of topics/components that will be covered in the next posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1 - &lt;a href="https://dev.to/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop"&gt;Handling requests - a minimal TCP server&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2 - Introducing PUT and GET for single node&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 3 - Bootstrapping our cluster: Node discovery and failure detection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 4 - Partitioning with consistent-hashing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 5 - Replication - the leaderless approach&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 6 - Versioning - How can we detect conflicts in a distributed system?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 7 - Quorum based PUTs and GETs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 8 - Sloppy quorums and hinted handoffs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 9 - Re-balancing/re-sharding after cluster changes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 10 - Guaranteeing integrity - the usage of checksums&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 11 - Read repair&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 12 - Active anti-entropy&lt;/strong&gt; (will likely have to be broken down into multiple posts since we will have to discuss merkle trees)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order for me to focus on what you actually care about, please leave comments, complaints, and whatever else comes to mind while going through these posts. The series will definitely be more useful the more people engage.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

</description>
      <category>rust</category>
      <category>database</category>
      <category>distributedsystems</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
