<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rafael Camargo Leite</title>
    <description>The latest articles on Forem by Rafael Camargo Leite (@rcmgleite).</description>
    <link>https://forem.com/rcmgleite</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1373846%2F74e0512b-154f-4b2a-8b42-01a6bcd1d9d7.jpeg</url>
      <title>Forem: Rafael Camargo Leite</title>
      <link>https://forem.com/rcmgleite</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rcmgleite"/>
    <language>en</language>
    <item>
      <title>Build your own Dynamo-like key/value database - Part 1 - TCP Server</title>
      <dc:creator>Rafael Camargo Leite</dc:creator>
      <pubDate>Tue, 12 Nov 2024 20:13:44 +0000</pubDate>
      <link>https://forem.com/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop</link>
      <guid>https://forem.com/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;This is the second post in the &lt;code&gt;Build your own Dynamo-like key/value database&lt;/code&gt; series. You can find Part 0 &lt;a href="https://dev.to/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1 - Exposing APIs to clients
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Background
&lt;/h3&gt;

&lt;p&gt;Every database needs to expose its APIs for clients to consume somehow.&lt;br&gt;
This is not strictly relevant to our goal of studying the internals of a database system, but we nevertheless need an entry point for clients to interact with, even if only to let us write proper end-to-end tests.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Requirements
&lt;/h3&gt;
&lt;h4&gt;
  
  
  2.1. Non-functional
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;readability&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;This is going to be a requirement for every component, but since this is our first one, better to state it clearly: if, at any point, going through the code base yields too many "wtf"s, we messed up.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal amount of dependencies&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;This is another cornerstone of every component - our goal is to learn how all of these data structures and algorithms are implemented. If we just pull in dependencies for all of them, what's the point of even building this database!?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity and extensibility&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Similar to requirement 1: making components as simple as possible definitely helps with readability.&lt;/li&gt;
&lt;li&gt;Let's make sure adding new commands doesn't require lots of changes/boilerplate.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  2.2. Functional
&lt;/h4&gt;

&lt;p&gt;The only functional requirement I'll add is the ability to trace requests based on request_ids - i.e., for every request our database receives, we have to be able to trace it for debugging purposes.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Design
&lt;/h3&gt;

&lt;p&gt;Different databases provide different interfaces for clients to interact with. A few examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt; - a SQL database that runs embedded in the process that consumes it. For this reason, there's a C client library that exposes all the APIs needed for database management and query execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; - an in-memory database that provides a wide variety of APIs to access different data structures over TCP. It uses a custom protocol on top of TCP called &lt;a href="https://redis.io/docs/latest/develop/reference/protocol-spec/" rel="noopener noreferrer"&gt;RESP&lt;/a&gt;, so clients that want to integrate with Redis have to implement RESP on top of TCP to be able to issue commands against it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CouchDB&lt;/strong&gt; - exposes a RESTful HTTP API for clients to consume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As previously stated, in our case the actual interface is less relevant - our goal is to study the internals of storage systems, not to write a production-ready, performant database for broad consumption. For that reason, I'll lean on non-functional requirements 2 (as few dependencies as possible) and 3 (simplicity) and choose to expose our APIs via TCP using a thin serialization protocol based on JSON.&lt;/p&gt;

&lt;p&gt;Our serialization protocol will be based on &lt;code&gt;Message&lt;/code&gt; - a &lt;code&gt;struct&lt;/code&gt; composed of a small binary header followed by an optional JSON payload that represents the &lt;code&gt;Command&lt;/code&gt; we want to execute.&lt;br&gt;
After parsing a &lt;code&gt;Message&lt;/code&gt; from the network, we will then build a &lt;code&gt;Command&lt;/code&gt; which, in turn, can be executed.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Implementation
&lt;/h3&gt;

&lt;p&gt;We will start by defining what a &lt;code&gt;Command&lt;/code&gt; is.&lt;br&gt;
Think of a &lt;code&gt;Command&lt;/code&gt; as a &lt;code&gt;Controller&lt;/code&gt; in an &lt;a href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller" rel="noopener noreferrer"&gt;MVC&lt;/a&gt; application. A command interprets the request a client issued and generates the appropriate response by calling the internal methods where the business logic lives.&lt;br&gt;
Let's get a bit more concrete by implementing &lt;code&gt;Ping&lt;/code&gt;.&lt;br&gt;
From a client's perspective, &lt;code&gt;Ping&lt;/code&gt; is as simple as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends a Ping request over TCP,&lt;/li&gt;
&lt;li&gt;Cluster node parses the received bytes into a &lt;code&gt;Message&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;Cluster node constructs a &lt;code&gt;Ping&lt;/code&gt; command from the &lt;code&gt;Message&lt;/code&gt; parsed in step 2,&lt;/li&gt;
&lt;li&gt;Cluster node replies to the client with a &lt;em&gt;"PONG"&lt;/em&gt; message.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;Ping&lt;/code&gt; command can be found &lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/ping.rs" rel="noopener noreferrer"&gt;here&lt;/a&gt; in &lt;code&gt;rldb&lt;/code&gt;. The core of its implementation is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
#[derive(Debug)]
pub struct Ping;

#[derive(Serialize, Deserialize)]
pub struct PingResponse {
    pub message: String,
}

impl Ping {
    pub async fn execute(self) -&amp;gt; Result&amp;lt;PingResponse&amp;gt; {
        Ok(PingResponse {
            message: "PONG".to_string(),
        })
    }
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, how do we construct a &lt;code&gt;Ping&lt;/code&gt; &lt;code&gt;Command&lt;/code&gt; from a TCP connection?&lt;br&gt;
That's where the &lt;code&gt;Message&lt;/code&gt; definition comes into play. &lt;code&gt;Message&lt;/code&gt; is our serialization protocol on top of TCP.&lt;br&gt;
It defines the format/frame of the bytes a client has to send in order for our server to be able to interpret them.&lt;br&gt;
Think of it as an intermediate abstraction that enables our database server to properly route requests to the specific &lt;code&gt;Command&lt;/code&gt; based on the arguments provided.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Message&lt;/code&gt; is defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub struct Message {
    /// Used as a way of identifying the format of the payload for deserialization
    pub cmd_id: CommandId,
    /// A unique request identifier - used for request tracing and debugging
    /// Note that this has to be encoded as utf8 otherwise parsing the message will fail
    pub request_id: String,
    /// the Request payload
    pub payload: Option&amp;lt;Bytes&amp;gt;,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every field prior to the payload can be thought of as the header of the message, while the payload is what actually encodes the &lt;code&gt;Command&lt;/code&gt; we are trying to execute. Many other fields could be included as part of the header. Two fields will most likely be added in the future:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a checksum of the payload (maybe one including the overall message as well?)&lt;/li&gt;
&lt;li&gt;the timestamp of the message (just for debugging purposes)&lt;/li&gt;
&lt;/ol&gt;
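
&lt;p&gt;Putting it together, the frame a client writes on the wire can be visualized like this (length fields are big-endian, matching how they are read below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| cmd_id (u8) | request_id_len (u32) | request_id (utf-8) | payload_len (u32) | payload (JSON, optional) |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;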

&lt;p&gt;Now let's construct a &lt;code&gt;Message&lt;/code&gt; from a TCP connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;impl Message {
    pub async fn try_from_async_read(tcp_connection: &amp;amp;mut TcpStream) -&amp;gt; Result&amp;lt;Self&amp;gt; {
        let cmd_id = CommandId::try_from(tcp_connection.read_u8().await?)?;
        let request_id_length = tcp_connection.read_u32().await?;
        let request_id = {
            let mut buf = vec![0u8; request_id_length as usize];
            tcp_connection.read_exact(&amp;amp;mut buf).await?;
            String::from_utf8(buf).map_err(|_| {
                Error::InvalidRequest(InvalidRequest::MessageRequestIdMustBeUtf8Encoded)
            })?
        };

        let payload_length = tcp_connection.read_u32().await?;

        let payload = if payload_length &amp;gt; 0 {
            let mut buf = vec![0u8; payload_length as usize];
            tcp_connection.read_exact(&amp;amp;mut buf).await?;
            Some(buf.into())
        } else {
            None
        };

        Ok(Self {
            cmd_id,
            request_id,
            payload,
        })
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I removed most of the error handling from this snippet to make it more readable. Read the actual implementation &lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/message.rs#L68" rel="noopener noreferrer"&gt;here&lt;/a&gt; if curious.&lt;/li&gt;
&lt;li&gt;If you check the real implementation, you will see that the signature of the function is slightly different. Instead of
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub async fn try_from_async_read(tcp_connection: &amp;amp;mut TcpStream) -&amp;gt; Result&amp;lt;Message&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub async fn try_from_async_read&amp;lt;R: AsyncRead + Unpin&amp;gt;(reader: &amp;amp;mut R) -&amp;gt; Result&amp;lt;Self&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an important distinction that is worth discussing. If we took a &lt;code&gt;TcpStream&lt;/code&gt; as the argument, every test would require that we set up a real TCP server, create TCP connections, and send/receive bytes via localhost.&lt;br&gt;
Not only would this make tests slow, it would also make writing them much more complicated than they have to be.&lt;/p&gt;

&lt;p&gt;Here is an example of a test that we can write without setting up any TCP connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#[tokio::test]
    async fn test_max_message_size_exceeded() {
        let mut reader = MaxMessageSizeExceededAsyncRead::default();
        let err = Message::try_from_async_read(&amp;amp;mut reader)
            .await
            .err()
            .unwrap();

        match err {
            Error::InvalidRequest(InvalidRequest::MaxMessageSizeExceeded { max, got }) =&amp;gt; {
                assert_eq!(max, MAX_MESSAGE_SIZE);
                assert_eq!(got, MAX_MESSAGE_SIZE + 1);
            }
            _ =&amp;gt; {
                panic!("Unexpected error: {}", err);
            }
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, I chose to implement &lt;a href="https://docs.rs/tokio/latest/tokio/io/trait.AsyncRead.html" rel="noopener noreferrer"&gt;AsyncRead&lt;/a&gt; for my custom struct &lt;code&gt;MaxMessageSizeExceededAsyncRead&lt;/code&gt;. But we could've used &lt;code&gt;UnixStream&lt;/code&gt;s or any of the many other types that impl &lt;code&gt;AsyncRead&lt;/code&gt; to inject the bytes we are interested in.&lt;/p&gt;
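
&lt;p&gt;To illustrate the flexibility this buys us, here is a small, self-contained sketch (not taken from rldb) that performs the same kind of reads against an in-memory &lt;code&gt;std::io::Cursor&lt;/code&gt;, which tokio also treats as an &lt;code&gt;AsyncRead&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::io::Cursor;
use tokio::io::AsyncReadExt;

#[tokio::main]
async fn main() {
    // A hand-crafted frame prefix: cmd_id = 1, request_id_len = 3 (big-endian u32)
    let mut reader = Cursor::new(vec![1u8, 0, 0, 0, 3]);

    // The same reads try_from_async_read performs, but against memory
    let cmd_id = reader.read_u8().await.unwrap();
    let request_id_len = reader.read_u32().await.unwrap();

    assert_eq!(cmd_id, 1);
    assert_eq!(request_id_len, 3);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;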

&lt;ol start="3"&gt;
&lt;li&gt;You will likely notice that no timeouts are set on any of the TCP connection interactions. This is something that would have to be included if we were building a production-ready service, but I chose to skip it at this point as it is not the focus of our work. That said, if you plan on deploying anything that interacts with the network to a production environment, handling timeouts/slow requests is mandatory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we can build a &lt;code&gt;Message&lt;/code&gt; out of a &lt;code&gt;TcpStream&lt;/code&gt;, we must also be able to serialize a &lt;code&gt;Message&lt;/code&gt; into bytes so that it can be sent over the network. The snippet below depicts how this can be done (again without error handling):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;impl Message {
    pub fn serialize(self) -&amp;gt; Bytes {
        let payload_len = self.payload.as_ref().map_or(0, |payload| payload.len());
        let mut buf = BytesMut::with_capacity(
            self.request_id.len() + payload_len + 2 * size_of::&amp;lt;u32&amp;gt;() + size_of::&amp;lt;u8&amp;gt;(),
        );

        buf.put_u8(self.cmd_id as u8);
        buf.put_u32(self.request_id.len() as u32);
        buf.put(self.request_id.as_bytes());
        buf.put_u32(payload_len as u32);
        if let Some(payload) = self.payload {
            buf.put(payload);
        }

        assert_eq!(buf.capacity(), buf.len());
        buf.freeze()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's go over some notes here as well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You will quickly see that I rely on the &lt;a href="https://crates.io/crates/bytes" rel="noopener noreferrer"&gt;bytes&lt;/a&gt; crate heavily throughout this project. Bytes is a very useful tool for writing network-related applications: it lets you work with contiguous byte buffers while avoiding memory copies almost entirely. It also provides some neat traits like &lt;a href="https://docs.rs/bytes/1.6.1/bytes/buf/trait.BufMut.html" rel="noopener noreferrer"&gt;BufMut&lt;/a&gt;, which gives us methods like &lt;code&gt;put_u32&lt;/code&gt;. Please refer to its documentation for more information.&lt;/li&gt;
&lt;li&gt;I included an assertion on &lt;code&gt;buf&lt;/code&gt;'s len and capacity to make sure that an important invariant holds: if we allocate a buffer of size &lt;code&gt;X&lt;/code&gt;, we must completely fill it without ever resizing it. This guarantees that we are properly computing the buffer size prior to filling it (which is important if we care about performance, for example).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, given that we have a &lt;code&gt;cmd_id&lt;/code&gt; field in &lt;code&gt;Message&lt;/code&gt;, we can easily choose how to serialize the payload field for each specific Command and vice versa.&lt;/p&gt;
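
&lt;p&gt;To make that routing concrete, here is a minimal, hypothetical sketch of dispatching on &lt;code&gt;cmd_id&lt;/code&gt; - the type names mirror rldb's but the code is illustrative, not its actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative stand-ins for rldb's real types, which differ in detail
enum CommandId {
    Ping,
    Get,
}

#[derive(serde::Deserialize)]
struct Get {
    key: String,
}

enum Command {
    Ping,
    Get(Get),
}

fn command_from(cmd_id: CommandId, payload: Option&amp;lt;&amp;amp;[u8]&amp;gt;) -&amp;gt; Result&amp;lt;Command, String&amp;gt; {
    match cmd_id {
        // Ping carries no payload - nothing to deserialize
        CommandId::Ping =&amp;gt; Ok(Command::Ping),
        // cmd_id tells us the payload should parse as Get's JSON format
        CommandId::Get =&amp;gt; {
            let bytes = payload.ok_or("Get requires a payload")?;
            let get: Get = serde_json::from_slice(bytes).map_err(|e| e.to_string())?;
            Ok(Command::Get(get))
        }
    }
}

fn main() {
    let cmd = command_from(CommandId::Get, Some(br#"{"key":"foo"}"#.as_slice())).unwrap();
    if let Command::Get(get) = cmd {
        println!("GET {}", get.key);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding a new command then means adding a variant and one match arm - exactly the low-boilerplate extensibility that non-functional requirement 3 asked for.&lt;/p&gt;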

&lt;p&gt;For Ping, a Request &lt;code&gt;Message&lt;/code&gt; would look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message {
    cmd_id: 1,
    request_id: &amp;lt;some string&amp;gt;,
    payload: None 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a response message would look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message {
    cmd_id: 1,
    request_id: &amp;lt;some string&amp;gt;,
    payload: Some(Bytes::from(serde_json::to_string(&amp;amp;PingResponse { message: "PONG".to_string() }).unwrap()))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it: a client only needs to know about &lt;code&gt;Message&lt;/code&gt; in order to interact with our database.&lt;/p&gt;

&lt;p&gt;If you want to walk through the code yourself, here are pointers to the four larger components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/mod.rs#L38" rel="noopener noreferrer"&gt;The TcpListener&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/server/message.rs" rel="noopener noreferrer"&gt;Message&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/mod.rs" rel="noopener noreferrer"&gt;Command&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rcmgleite/rldb/blob/b96e5a9be7ef72bce98e1f3359a0c8792ae9a59c/src/cmd/ping.rs" rel="noopener noreferrer"&gt;Ping&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  The IntoMessage trait and request_id (covers the functional requirement around tracing requests)
&lt;/h4&gt;

&lt;p&gt;For any type that we want to be convertible into a &lt;code&gt;Message&lt;/code&gt;, we can implement the &lt;code&gt;IntoMessage&lt;/code&gt; trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub trait IntoMessage {
    /// Same as [`Message::cmd_id`]
    fn cmd_id(&amp;amp;self) -&amp;gt; CommandId;
    /// Same as [`Message::payload`]
    fn payload(&amp;amp;self) -&amp;gt; Option&amp;lt;Bytes&amp;gt; {
        None
    }
    fn request_id(&amp;amp;self) -&amp;gt; String {
        REQUEST_ID
            .try_with(|rid| rid.clone())
            .unwrap_or("NOT_SET".to_string())
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that this trait has two default method implementations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;payload -&amp;gt; Commands like Ping don't have a payload, so Ping doesn't have to implement this specific function and can just rely on the default behavior.&lt;/li&gt;
&lt;li&gt;request_id -&amp;gt; This is the more interesting one: every message has to have a request_id associated with it. This is going to be extremely important once we have to analyze logs/traces for requests received by the database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The way request_id is handled by &lt;em&gt;rldb&lt;/em&gt; at the time of this writing is: the request_id is injected either by the client or by the server into a &lt;a href="https://docs.rs/tokio/latest/tokio/macro.task_local.html" rel="noopener noreferrer"&gt;tokio::task_local&lt;/a&gt; as soon as the &lt;code&gt;Message&lt;/code&gt; is parsed from the network.&lt;br&gt;
If you think about it, it's a bit strange that we let the client set a request_id that is internal to the database. We allow this (at least for now) because rldb nodes &lt;strong&gt;talk to each other&lt;/strong&gt;. For requests like &lt;code&gt;Get&lt;/code&gt;, an rldb cluster node will have to talk to at least 2 other nodes to issue internal GET requests. In order for us to be able to connect all of the logs generated for a single client request, we have to provide a mechanism for an rldb node to inject the request_id being executed when issuing requests to other nodes in the cluster.&lt;/p&gt;
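
&lt;p&gt;The mechanism can be sketched with a minimal, self-contained example of &lt;code&gt;tokio::task_local&lt;/code&gt; (the names here are illustrative and simplified, not rldb's exact definitions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokio::task_local! {
    // One request_id per task handling a client connection
    static REQUEST_ID: String;
}

async fn do_work() -&amp;gt; String {
    // Deep inside the request, read the id without threading it through arguments
    REQUEST_ID
        .try_with(|rid| rid.clone())
        .unwrap_or_else(|_| "NOT_SET".to_string())
}

#[tokio::main]
async fn main() {
    // Set right after the Message is parsed; visible to everything awaited inside
    let traced = REQUEST_ID.scope("req-42".to_string(), do_work()).await;
    assert_eq!(traced, "req-42");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;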

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;This post covered how &lt;code&gt;rldb&lt;/code&gt; handles incoming client requests. It described how requests are serialized/deserialized and how request ids are propagated throughout the request lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next chapter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 3 - Bootstrapping our cluster: Node discovery and failure detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will cover how gossip protocols work and walk through the &lt;code&gt;rldb&lt;/code&gt; gossip implementation.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>distributedsystems</category>
      <category>database</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build your own Dynamo-like key/value database - Part 0 - Intro</title>
      <dc:creator>Rafael Camargo Leite</dc:creator>
      <pubDate>Sat, 20 Jul 2024 16:11:22 +0000</pubDate>
      <link>https://forem.com/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh</link>
      <guid>https://forem.com/rcmgleite/dynamo-like-keyvalue-databases-a-deep-dive-part-0-intro-fhh</guid>
      <description>&lt;h1&gt;
  
  
  1. Background
&lt;/h1&gt;

&lt;p&gt;Most developers have, at some point in time, interacted with storage systems. Databases like &lt;a href="https://github.com/redis/redis" rel="noopener noreferrer"&gt;redis&lt;/a&gt; are almost guaranteed to be part of every tech stack in active use today.&lt;/p&gt;

&lt;p&gt;When using storage systems like this, understanding their nitty-gritty is really key to properly integrating them with your application and, most importantly, operating them correctly. One of the greatest resources on how storage systems work is the book &lt;a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann. This book is a comprehensive compilation of most of the algorithms and data structures that power modern storage systems.&lt;/p&gt;

&lt;p&gt;Now you may ask: why write a deep dive series if the book covers all the relevant topics already? Well, books like this are great but they lack concrete implementations. It's hard to tell if one actually understood all the concepts and how they are applied just by reading about them.&lt;/p&gt;

&lt;p&gt;To close this gap between reading and building, I decided to write my own little storage system - &lt;a href="https://github.com/rcmgleite/rldb" rel="noopener noreferrer"&gt;rldb&lt;/a&gt; - a dynamo-like key/value database - i.e., a key/value database that implements &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Amazon's Dynamo paper&lt;/a&gt; from 2007.&lt;/p&gt;

&lt;p&gt;This is the first part of a blog post series that will go through every component described in the Dynamo paper, discuss the rationale behind their design, analyze trade-offs, list possible solutions, and then walk through concrete implementations in &lt;a href="https://github.com/rcmgleite/rldb" rel="noopener noreferrer"&gt;rldb&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.1. Reader requirement
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The code will be written in Rust&lt;/strong&gt;, so the reader is expected to understand the basics of the Rust language and have familiarity with asynchronous programming. Other topics, like networking and any other specific algorithm/data structure, will be at least briefly introduced when required (and relevant links will be included for further reading).&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2. What is &lt;em&gt;rldb&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;rldb&lt;/em&gt; is a dynamo-like distributed key/value database that provides PUT, GET and DELETE APIs over TCP. Let's break that apart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dynamo-like&lt;/strong&gt; - Our database will be based on the &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Dynamo paper&lt;/a&gt;, so almost every requirement listed in the paper is a requirement of our implementation as well (aside from efficiency/SLAs, which will be ignored for now).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;distributed&lt;/strong&gt; - Our database will be comprised of multiple processes/nodes connected to each other via network. This means that the data we store will be spread across multiple nodes instead of a single one. This is what creates most of the complexity around our implementation. In later posts we will have to understand trade-offs related to strong vs eventual consistency, conflicts and versioning, partitioning strategies, quorum vs sloppy quorum, etc.. All of these things will be explained in detail in due time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;key/value&lt;/strong&gt; - Our database will only know how to store and retrieve data based on its associated key. There won't be any secondary indexes, schemas etc...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APIs over TCP&lt;/strong&gt; - The way our clients will be able to interact with our database is through &lt;a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol" rel="noopener noreferrer"&gt;TCP&lt;/a&gt; messages. We will have a very thin framing/codec layer on top of TCP to help us interpret the requests and responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  2. Dissecting the dynamo paper
&lt;/h1&gt;

&lt;p&gt;Let's go through dynamo's requirements and architecture and make sure we can answer the following question: &lt;code&gt;what is a dynamo-like database&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;I have to start by saying: the AWS DynamoDB offering IS &lt;strong&gt;NOT&lt;/strong&gt; based on the Amazon Dynamo paper. This can make things extra confusing once we start looking at the APIs we are going to create, so I want to state it clearly here at the beginning to avoid confusion later.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1. The dynamo use-case
&lt;/h2&gt;

&lt;p&gt;One of the most relevant use-cases that led to the development of Dynamo was the Amazon shopping cart.&lt;br&gt;
The most important aspect of the shopping cart is: whenever a customer tries to add an item to the cart (mutate it), the operation &lt;strong&gt;HAS&lt;/strong&gt; to succeed. This means maximizing write availability is a key aspect of a dynamo database.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2. Dynamo requirements and design
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.2.1. Write availability
&lt;/h3&gt;

&lt;p&gt;As explained in the previous section, one of the key aspects of a dynamo database is how important write availability is.&lt;br&gt;
According to the &lt;a href="https://en.wikipedia.org/wiki/CAP_theorem" rel="noopener noreferrer"&gt;CAP theorem&lt;/a&gt;, when a system is in the presence of a network partition, it has to choose between Consistency (C) and Availability (A).&lt;/p&gt;

&lt;p&gt;Availability is the ability of a system to respond to requests successfully (&lt;code&gt;availability = ((n_requests - n_errors) / n_requests) * 100&lt;/code&gt;)&lt;/p&gt;
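
&lt;p&gt;As a quick sanity check, availability as a success percentage can be computed like so (a sketch, not part of rldb):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/// Percentage of requests that completed successfully
fn availability_pct(n_requests: u64, n_errors: u64) -&amp;gt; f64 {
    ((n_requests - n_errors) as f64 / n_requests as f64) * 100.0
}

fn main() {
    // 1_000_000 requests with 100 failures yields 99.99% availability
    let pct = availability_pct(1_000_000, 100);
    assert!((pct - 99.99).abs() &amp;lt; 1e-9);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;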

&lt;p&gt;Consistency is related to the following question: &lt;code&gt;Is a client guaranteed to see the most recent value for a given key when it issues a GET?&lt;/code&gt; (also known as read-after-write consistency).&lt;/p&gt;

&lt;p&gt;In a dynamo-like database, Availability is always prioritized over consistency, making it an eventually-consistent database.&lt;/p&gt;

&lt;p&gt;The way dynamo increases write (and get) availability is by using a technique called &lt;code&gt;leaderless replication&lt;/code&gt; in conjunction with &lt;code&gt;sloppy quorums&lt;/code&gt; and &lt;code&gt;hinted hand-offs&lt;/code&gt; (&lt;code&gt;sloppy quorum&lt;/code&gt; and &lt;code&gt;hinted hand-off&lt;/code&gt; will be explained in future posts).&lt;/p&gt;

&lt;p&gt;In leaderless replicated systems, multiple nodes in the cluster can accept writes (as opposed to leader-based replication), and consistency guarantees can be configured (to some extent) by leveraging &lt;code&gt;Quorum&lt;/code&gt;. To describe &lt;code&gt;Quorum&lt;/code&gt;, let's go through an example in which the quorum configuration is &lt;code&gt;replicas: 3, reads: 2, writes: 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F950m4txgjfezr56s0gcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F950m4txgjfezr56s0gcd.png" alt="leaderless_replication_quorum" width="784" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, a client sends a PUT to &lt;code&gt;Node A&lt;/code&gt;. In order for the PUT to be considered successful under the stated quorum configuration, 2 out of 3 replicas need to acknowledge it &lt;strong&gt;synchronously&lt;/strong&gt;. So &lt;code&gt;Node A&lt;/code&gt; stores the data locally (the first acknowledged put) and then forwards the request to another 2 nodes (for a total of 3 replicas, as per the quorum configuration). If either of the 2 other nodes acknowledges the replicated PUT, &lt;code&gt;Node A&lt;/code&gt; can respond with success to the client.&lt;/p&gt;

&lt;p&gt;A similar algorithm is used for reads. In the configuration from our example, a read is only successful if 2 nodes respond to the read request successfully.&lt;/p&gt;

&lt;p&gt;When deciding what your quorum configuration should look like, the trade-off being evaluated is consistency guarantees vs performance (request latency).&lt;/p&gt;

&lt;p&gt;If you want stronger consistency guarantees, the inequality you have to satisfy is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reads + writes &amp;gt; replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our example, &lt;code&gt;reads = 2, writes = 2, replicas = 3&lt;/code&gt; -&amp;gt; we satisfy this inequality and are therefore opting for stronger consistency guarantees while sacrificing performance - every read and write requires at least 2 nodes to respond. When we discuss &lt;code&gt;sloppy quorums&lt;/code&gt; and &lt;code&gt;hinted hand-offs&lt;/code&gt; our consistency analysis will become much more nuanced, but I'll table that discussion for brevity's sake.&lt;/p&gt;
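
&lt;p&gt;The inequality is easy to encode as a quick check (a sketch, not rldb code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/// True when the read and write sets are guaranteed to overlap
/// in at least one node, i.e. reads + writes &amp;gt; replicas
fn is_read_after_write_consistent(replicas: u32, reads: u32, writes: u32) -&amp;gt; bool {
    reads + writes &amp;gt; replicas
}

fn main() {
    // Our example configuration: replicas: 3, reads: 2, writes: 2
    assert!(is_read_after_write_consistent(3, 2, 2));
    // replicas: 2, reads: 1, writes: 1 offers no such guarantee
    assert!(!is_read_after_write_consistent(2, 1, 1));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;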

&lt;p&gt;The problem with leaderless replication is that we are now open to version conflicts due to concurrent writes.&lt;br&gt;
The following image depicts a possible scenario in which this would happen. Assume the quorum configuration to be &lt;code&gt;replicas: 2, reads: 1, writes: 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw6g359ibxvx1j8uyw5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw6g359ibxvx1j8uyw5g.png" alt="leaderless_replication_conflict" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To detect conflicts, dynamo databases use techniques like vector clocks (or version vectors). Vector clocks will be explained in great detail in future posts.&lt;br&gt;
If a conflict is detected, both conflicting values are stored and a conflict resolution process needs to happen at a later stage. In the dynamo case, conflicts are handled by clients during &lt;code&gt;read&lt;/code&gt;: when a client issues a GET for a key that has conflicting values, both values are sent as part of the response and the client has to issue a subsequent PUT to resolve the conflict with whatever value it wants to store. Again, more details on conflict resolution will be part of future posts on the Get and Put API implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2.2. System Scalability
&lt;/h3&gt;

&lt;p&gt;Quoting &lt;em&gt;Designing Data-Intensive Applications - chapter 1&lt;/em&gt;:&lt;br&gt;
"Scalability is the term used to describe a system's ability to cope with increased load."&lt;br&gt;
For the dynamo paper, this means that when the number of read/write operations increases, the database should&lt;br&gt;
still be able to operate at the same availability and performance levels.&lt;/p&gt;

&lt;p&gt;To achieve scalability, dynamo-like databases rely on several different techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replication

&lt;ul&gt;
&lt;li&gt;scales reads&lt;/li&gt;
&lt;li&gt;as mentioned in the previous section, replication is implemented via leaderless-replication with version vectors for conflict detection and resolution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Partitioning - spreading the dataset amongst multiple nodes

&lt;ul&gt;
&lt;li&gt;scales writes&lt;/li&gt;
&lt;li&gt;Dynamo relies on a technique called consistent-hashing to decide which nodes should own which keys. Consistent hashing and its pros and cons will be explained in future posts.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
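
&lt;p&gt;Since consistent hashing gets its own post later, here is only a deliberately simplified ring to illustrate the partitioning idea (no virtual nodes, illustrative names): each key hashes to a position, and the first node at or after that position owns it.&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash anything hashable to a position on a u64 ring.
fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// A key is owned by the first node whose ring position is >= the key's
// hash; if none exists, we wrap around to the first node on the ring.
fn owner<'a>(ring: &'a [(u64, &'a str)], key: &str) -> &'a str {
    let k = hash_of(&key);
    ring.iter()
        .find(|(pos, _)| *pos >= k)
        .map(|(_, node)| *node)
        .unwrap_or(ring[0].1) // wrap around
}

fn main() {
    let mut ring: Vec<(u64, &str)> = ["node_a", "node_b", "node_c"]
        .iter()
        .map(|n| (hash_of(n), *n))
        .collect();
    ring.sort(); // ring positions must be ordered for the lookup to work

    // Ownership is deterministic: the same key always maps to one node.
    let o = owner(&ring, "my-key");
    assert_eq!(o, owner(&ring, "my-key"));
    assert!(["node_a", "node_b", "node_c"].contains(&o));
}
```

&lt;p&gt;The property that makes this "consistent" - only a small fraction of keys move when a node joins or leaves - is exactly what the future partitioning post will dig into.&lt;/p&gt;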

&lt;h3&gt;
  
  
  2.2.3. Data Durability
&lt;/h3&gt;

&lt;p&gt;"Durability is the ability of stored data to remain intact, complete, and uncorrupted over time, ensuring long-term accessibility." (&lt;a href="https://www.purestorage.com/knowledge/what-is-data-durability.html" rel="noopener noreferrer"&gt;ref&lt;/a&gt;).&lt;br&gt;
Storage systems like GCP object storage describe durability in terms of &lt;a href="https://cloud.google.com/storage/docs/availability-durability" rel="noopener noreferrer"&gt;how many nines of durability they guarantee over a year&lt;/a&gt;.&lt;br&gt;
For GCP, durability is guaranteed at 11 nines - ie: GCP won't lose any more than 0.000000001 percent of your data in a year. We won't be calculating how many nines of durability our database will be able to provide, but we will apply many different techniques that increase data durability in multiple different components.&lt;/p&gt;

&lt;p&gt;Most relevant durability techniques that we will go over are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replication -&amp;gt; Adds redundancy to the data stored&lt;/li&gt;
&lt;li&gt;Checksums and checksum bracketing -&amp;gt; guarantee that corruptions (whether on the network or in memory) don't silently lead to data loss&lt;/li&gt;
&lt;li&gt;Anti-entropy / read repair -&amp;gt; whenever a node is missing data that it should have (eg: it was offline for a deployment while writes were happening), our system has to be able to back-fill it.&lt;/li&gt;
&lt;/ol&gt;
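
&lt;p&gt;The checksum technique (item 2) can be sketched quickly: store a checksum next to the value on write, recompute and compare on read, and treat a mismatch as corruption to be repaired rather than data to be served. FNV-1a is used here purely as a simple stand-in for whatever checksum a real implementation would pick (eg: a CRC); all names are illustrative.&lt;/p&gt;

```rust
// FNV-1a: a tiny non-cryptographic hash, standing in for a real checksum.
fn fnv1a(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, b| {
        (h ^ *b as u64).wrapping_mul(0x100000001b3)
    })
}

struct StoredValue {
    payload: Vec<u8>,
    checksum: u64,
}

// On write: compute the checksum and persist it alongside the payload.
fn put(payload: &[u8]) -> StoredValue {
    StoredValue { payload: payload.to_vec(), checksum: fnv1a(payload) }
}

// On read: recompute and compare. A mismatch means the bytes changed
// after the checksum was taken - don't serve them, repair them.
fn get(v: &StoredValue) -> Result<&[u8], &'static str> {
    if fnv1a(&v.payload) == v.checksum {
        Ok(&v.payload)
    } else {
        Err("corruption detected: trigger read repair / anti-entropy")
    }
}

fn main() {
    let mut v = put(b"hello");
    assert!(get(&v).is_ok());
    v.payload[0] ^= 0xff; // simulate a bit flip on disk or in memory
    assert!(get(&v).is_err());
}
```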

&lt;h3&gt;
  
  
  2.2.4. Node discovery and failure detection
&lt;/h3&gt;

&lt;p&gt;Dynamo databases rely on &lt;a href="https://en.wikipedia.org/wiki/Gossip_protocol" rel="noopener noreferrer"&gt;Gossip protocols&lt;/a&gt; to discover cluster nodes, detect node failures and share partitioning assignments. This is in contrast with databases that rely on external services (like &lt;a href="https://zookeeper.apache.org/doc/r3.8.4/zookeeperUseCases.html" rel="noopener noreferrer"&gt;ZooKeeper&lt;/a&gt;) for this.&lt;/p&gt;
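
&lt;p&gt;One common gossip building block is a heartbeat table: each node periodically increments its own heartbeat counter and merges tables received from peers, and a node whose heartbeat stops advancing becomes a failure suspect. A toy sketch of the merge step (node names illustrative; the real protocol comes in a later post):&lt;/p&gt;

```rust
use std::collections::HashMap;

// Merge a heartbeat table received from a peer into our local view:
// for each node, keep the highest heartbeat we've seen.
fn merge(local: &mut HashMap<&'static str, u64>, remote: &HashMap<&'static str, u64>) {
    for (node, hb) in remote {
        let entry = local.entry(*node).or_insert(0);
        if *hb > *entry {
            *entry = *hb;
        }
    }
}

fn main() {
    let mut local = HashMap::from([("node_a", 5u64), ("node_b", 2)]);
    let remote = HashMap::from([("node_b", 7u64), ("node_c", 1)]);
    merge(&mut local, &remote);
    assert_eq!(local["node_b"], 7); // newer heartbeat wins
    assert_eq!(local["node_c"], 1); // node_c discovered via gossip
    assert_eq!(local["node_a"], 5); // our own view is untouched
}
```

&lt;p&gt;Note how node discovery falls out for free: we learn about &lt;code&gt;node_c&lt;/code&gt; simply by merging a peer's table, with no central registry involved.&lt;/p&gt;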

&lt;h2&gt;
  
  
  2.3. Dynamo-like database summary
&lt;/h2&gt;

&lt;p&gt;The dynamo-like database key characteristics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventually consistent storage system (but with tunable consistency guarantees via &lt;code&gt;Quorum&lt;/code&gt; configuration)&lt;/li&gt;
&lt;li&gt;Relies on leaderless-replication + sloppy quorums and hinted handoffs to maximize PUT availability&lt;/li&gt;
&lt;li&gt;Relies on Vector clocks for conflict detection&lt;/li&gt;
&lt;li&gt;Conflict resolution is handled by the client during reads&lt;/li&gt;
&lt;li&gt;Data is partitioned using Consistent-Hashing&lt;/li&gt;
&lt;li&gt;Durability is guaranteed by multiple techniques with anti-entropy and read-repair being the most relevant ones&lt;/li&gt;
&lt;li&gt;Node discovery and failure detection are implemented via Gossip Protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Next steps
&lt;/h1&gt;

&lt;p&gt;Based on the concepts introduced in this post and the use case of the dynamo paper, the next posts in this series will walk through each component of the dynamo architecture, explain how it fits into the overall design, discuss alternative solutions and trade-offs, and then dive into specific implementations of the chosen solutions.&lt;/p&gt;

&lt;p&gt;Below I include a (non-comprehensive) list of topics/components that will be covered in the next posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1 - &lt;a href="https://dev.to/rcmgleite/build-your-own-dynamo-like-keyvalue-database-part-1-tcp-server-oop"&gt;Handling requests - a minimal TCP server&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2 - Introducing PUT and GET for single node&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 3 - Bootstrapping our cluster: Node discovery and failure detection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 4 - Partitioning with consistent-hashing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 5 - Replication - the leaderless approach&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 6 - Versioning - How can we detect conflicts in a distributed system?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 7 - Quorum based PUTs and GETs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 8 - Sloppy quorums and hinted handoffs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 9 - Re-balancing/re-sharding after cluster changes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 10 - Guaranteeing integrity - the usage of checksums&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 11 - Read repair&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 12 - Active anti-entropy&lt;/strong&gt; (will likely have to be broken down into multiple posts since we will have to discuss merkle trees)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order for me to focus on what you actually care about, please leave comments, complaints, and whatever else comes to mind while going through these posts. The series will definitely be more useful the more people engage.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

</description>
      <category>rust</category>
      <category>database</category>
      <category>distributedsystems</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
