<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hayden Cordeiro</title>
    <description>The latest articles on Forem by Hayden Cordeiro (@haydencordeiro).</description>
    <link>https://forem.com/haydencordeiro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F729206%2Fef4bc733-9882-4dfd-ac52-d03c393fd60f.png</url>
      <title>Forem: Hayden Cordeiro</title>
      <link>https://forem.com/haydencordeiro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/haydencordeiro"/>
    <language>en</language>
    <item>
      <title>ELI25: Apache Kafka Quick Notes for Interviews</title>
      <dc:creator>Hayden Cordeiro</dc:creator>
      <pubDate>Sun, 15 Feb 2026 20:23:35 +0000</pubDate>
      <link>https://forem.com/haydencordeiro/eli25-apache-kafka-quick-notes-for-interviews-oph</link>
      <guid>https://forem.com/haydencordeiro/eli25-apache-kafka-quick-notes-for-interviews-oph</guid>
      <description>&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; was originally built by &lt;strong&gt;LinkedIn&lt;/strong&gt; and later made open source under the &lt;strong&gt;Apache Software Foundation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Goals&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High throughput&lt;/li&gt;
&lt;li&gt;Scalable&lt;/li&gt;
&lt;li&gt;Reliable&lt;/li&gt;
&lt;li&gt;Fault-tolerant&lt;/li&gt;
&lt;li&gt;Pub/Sub architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use Cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Messaging&lt;/li&gt;
&lt;li&gt;Data replication&lt;/li&gt;
&lt;li&gt;Middleware logic&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At a high level, the flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt; -----&amp;gt; &lt;strong&gt;Brokers&lt;/strong&gt; (Topics/Partitions) -----&amp;gt; &lt;strong&gt;Consumer Groups&lt;/strong&gt; (Consumers)&lt;/p&gt;
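&lt;p&gt;A minimal sketch of this flow, with the broker reduced to an in-memory, append-only log per topic (a toy model for intuition, not the real Kafka protocol):&lt;/p&gt;

```python
# Toy sketch of the flow above (plain Python, no real Kafka client).
# The "broker" keeps an append-only log per topic; offsets are list indices.

broker = {"orders": []}

def produce(topic, message):
    """Producer side: append a message; the broker assigns the next offset."""
    log = broker[topic]
    log.append(message)
    return len(log) - 1  # offset of the stored message

def consume(topic, offset):
    """Consumer side: read every message starting at a given offset."""
    return broker[topic][offset:]

produce("orders", "order-1")
produce("orders", "order-2")
print(consume("orders", 0))  # ['order-1', 'order-2']
```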

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Producers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These are client applications that produce (write) data to the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Brokers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Brokers are daemons (background processes) that run on hardware or virtual machines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; Multiple brokers running together form a Kafka Cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Their primary job is to take messages from producers and store them on disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention:&lt;/strong&gt; Brokers have a defined retention policy (time-based or size-based). Once the limit is reached, old messages are deleted to make room for new ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Topics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Topics are logical collections of messages (e.g., &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;clicks&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can have as many topics as you want.&lt;/li&gt;
&lt;li&gt;Topics are split into &lt;strong&gt;Partitions&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Partitions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A topic can have one or more partitions. These partitions are distributed across the brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scenario A:&lt;/strong&gt; 1 Topic (&lt;code&gt;Orders&lt;/code&gt;), 2 Brokers, 2 Partitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Broker 1:&lt;/em&gt; Holds &lt;code&gt;Orders-Partition-0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Broker 2:&lt;/em&gt; Holds &lt;code&gt;Orders-Partition-1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Scenario B:&lt;/strong&gt; 1 Topic, 3 Brokers, 2 Partitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Broker 1:&lt;/em&gt; Holds &lt;code&gt;Orders-Partition-0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Broker 2:&lt;/em&gt; Holds &lt;code&gt;Orders-Partition-1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Broker 3:&lt;/em&gt; Unused for this topic.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Scenario C:&lt;/strong&gt; 1 Topic, 1 Broker, 2 Partitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Broker 1:&lt;/em&gt; Holds both &lt;code&gt;Orders-Partition-0&lt;/code&gt; and &lt;code&gt;Orders-Partition-1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; Every message is written to a specific partition of a TOPIC. Each message gets a unique &lt;strong&gt;Offset ID&lt;/strong&gt;. Kafka guarantees ordering &lt;strong&gt;only within a partition&lt;/strong&gt;, not across the entire topic.&lt;/p&gt;
&lt;/blockquote&gt;
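&lt;p&gt;The per-partition ordering guarantee can be illustrated with a toy partitioner. Kafka's default partitioner hashes the message key (murmur2); the byte-sum hash below is just a stand-in:&lt;/p&gt;

```python
# Toy keyed partitioner: the same key always maps to the same partition, so
# messages for one key stay ordered -- across partitions there is no ordering.

NUM_PARTITIONS = 2
partitions = [[] for _ in range(NUM_PARTITIONS)]

def send(key, value):
    # Real Kafka hashes the key (murmur2); any stable hash shows the idea.
    p = sum(key.encode()) % NUM_PARTITIONS
    partitions[p].append(value)
    return p

for i in range(3):
    send("customer-A", f"A-msg-{i}")
    send("customer-B", f"B-msg-{i}")

# customer-A's messages all sit in one partition, in write order.
print(partitions[send("customer-A", "probe")][:3])
```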




&lt;h2&gt;
  
  
  &lt;strong&gt;Consumer Groups&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consumers&lt;/strong&gt; are the applications reading data from Kafka.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They sit in logical groupings called &lt;strong&gt;Consumer Groups&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Consumers can read from one or more partitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Scaling Rule:&lt;/strong&gt;&lt;br&gt;
If you have &lt;strong&gt;more Consumers than Partitions&lt;/strong&gt;, the extra consumers will sit &lt;strong&gt;idle&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Example:&lt;/em&gt; 2 Partitions, 3 Consumers ---&amp;gt; Consumer 1 reads Partition 1, Consumer 2 reads Partition 2, &lt;strong&gt;Consumer 3 sits idle.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
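&lt;p&gt;The scaling rule falls out of how partitions are assigned. Here is a simplified stand-in for the group coordinator's assignment logic:&lt;/p&gt;

```python
# Simplified consumer-group assignment: each partition goes to exactly one
# consumer in the group; surplus consumers get nothing and sit idle.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

print(assign(["P0", "P1"], ["C1", "C2", "C3"]))
# {'C1': ['P0'], 'C2': ['P1'], 'C3': []}  -- C3 is idle
```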

&lt;p&gt;&lt;strong&gt;Offset Management:&lt;/strong&gt;&lt;br&gt;
Consumers track the last message they read. If a consumer crashes, the group rebalances. A new consumer picks up the partition and resumes from the last committed &lt;strong&gt;Offset&lt;/strong&gt; (the unique ID mentioned earlier).&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Handling Failure (Fault Tolerance)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Understanding the "happy path" is great, but interviews often focus on failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Failure:&lt;/strong&gt;&lt;br&gt;
Straightforward. The new consumer looks up the last committed offset and continues reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broker Failure:&lt;/strong&gt;&lt;br&gt;
Since messages are stored on disk inside the broker, what happens if the broker dies? Do we lose data?&lt;br&gt;
&lt;strong&gt;No.&lt;/strong&gt; This is where the &lt;strong&gt;Replication Factor&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Kafka replicates partitions across different brokers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; &lt;code&gt;Orders&lt;/code&gt; Topic, 2 Brokers, Replication Factor of 2.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Broker 1:&lt;/strong&gt; Holds &lt;code&gt;Partition 0 (Leader)&lt;/code&gt;, &lt;code&gt;Partition 1 (Replica)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker 2:&lt;/strong&gt; Holds &lt;code&gt;Partition 1 (Leader)&lt;/code&gt;, &lt;code&gt;Partition 0 (Replica)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;If Broker 2 crashes, Broker 1 typically takes over as the Leader for Partition 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When does replication happen?&lt;/strong&gt;&lt;br&gt;
When a message is written to the Leader partition, the follower replicas copy it almost immediately (see the "Replication is PULL, not PUSH" callout below for the mechanics).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Producer Acknowledgement (ACKS)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The producer sends a message to the leader partition. The leader writes it, replicates it, and sends an acknowledgment (ACK) back. You can configure how strict this is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acks=0&lt;/code&gt; &lt;strong&gt;(Fire and Forget):&lt;/strong&gt; Producer sends data and doesn't wait for a response. Fastest, but risk of data loss.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=1&lt;/code&gt; &lt;strong&gt;(Leader Ack):&lt;/strong&gt; Producer waits for the Leader to confirm it wrote the message.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=all&lt;/code&gt; &lt;strong&gt;(Strongest):&lt;/strong&gt; Producer waits for the Leader &lt;strong&gt;AND&lt;/strong&gt; all in-sync replicas to confirm. Safest, but highest latency.&lt;/li&gt;
&lt;/ul&gt;
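&lt;p&gt;A toy model of the three levels (the real setting is the producer's &lt;code&gt;acks&lt;/code&gt; config; the leader and follower lists here are just stand-ins for replicas of one partition):&lt;/p&gt;

```python
# Toy model of producer acks; leader and followers are lists standing in
# for the replicas of a single partition.

leader, followers = [], [[], []]

def send(message, acks):
    leader.append(message)        # the leader always writes the message
    if acks == 0:
        return "sent"             # fire and forget: no waiting at all
    if acks == 1:
        return "leader-acked"     # confirmed once the leader has it
    for follower in followers:    # acks="all": wait for every in-sync replica
        follower.append(message)
    return "all-acked"

print(send("m1", 0), send("m2", 1), send("m3", "all"))
# sent leader-acked all-acked
```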

&lt;h2&gt;
  
  
  &lt;strong&gt;Callouts&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Retention is NOT "One-in-One-out"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A common misconception is that Kafka deletes messages individually (e.g., as soon as a new message arrives, an old one is deleted).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality:&lt;/strong&gt; Kafka deletes entire &lt;strong&gt;Log Segments&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; Kafka writes messages to a file called a "segment." When that file gets full (e.g., 1GB), it closes it and starts a new one. A background process checks the &lt;em&gt;closed&lt;/em&gt; segments. If a segment is older than the retention period (e.g., 7 days), the &lt;strong&gt;entire file&lt;/strong&gt; is deleted.&lt;/li&gt;
&lt;/ul&gt;
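&lt;p&gt;A toy sketch of that cleanup pass, with timestamps as plain seconds (the segment layout is simplified; only the whole-segment deletion behavior is the point):&lt;/p&gt;

```python
# Toy sketch of segment-based retention: whole closed segments are deleted,
# never individual messages. Timestamps are plain seconds for clarity.

RETENTION = 7 * 24 * 3600  # e.g. a 7-day retention policy

def expire(closed_segments, now):
    """Keep only the closed segments younger than the retention window."""
    return [(created, msgs) for created, msgs in closed_segments
            if now - created < RETENTION]

segments = [
    (0, ["old-1", "old-2"]),     # closed 8 days before "now": expired whole
    (2 * 24 * 3600, ["new-1"]),  # closed 6 days before "now": kept
]
print(expire(segments, now=8 * 24 * 3600))  # [(172800, ['new-1'])]
```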

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Replication is PULL, not PUSH&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;How do followers stay in sync with the leader?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality:&lt;/strong&gt; The Leader does not "push" data to followers. Followers &lt;strong&gt;PULL&lt;/strong&gt; (fetch) data from the Leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Magic" of the Fetch Request:&lt;/strong&gt; When a follower sends a &lt;code&gt;FetchRequest&lt;/code&gt;, it does two things:

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Asks for new data:&lt;/strong&gt; "Give me everything starting from Offset 101."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Confirms old data:&lt;/strong&gt; By asking for 101, it implicitly tells the Leader, &lt;strong&gt;"I have successfully written everything up to Offset 100."&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ISR List:&lt;/strong&gt; The Leader uses this fetch request to update its &lt;strong&gt;In-Sync Replica (ISR)&lt;/strong&gt; list. Once all replicas in the ISR list have fetched the message, the Leader advances the "High Watermark" and sends the ACK to the producer (if &lt;code&gt;acks=all&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
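&lt;p&gt;The High Watermark bookkeeping can be sketched like this (a toy model; the leader's log and the followers' fetch positions are made-up numbers):&lt;/p&gt;

```python
# Toy high-watermark calculation: a fetch request for offset N implicitly
# confirms that the follower has everything up to N - 1.

leader_log = ["m0", "m1", "m2", "m3"]            # leader end offset is 4
fetch_positions = {"follower-1": 4, "follower-2": 3}

def high_watermark():
    # Offsets below the HW exist on the leader AND on every in-sync replica.
    return min([len(leader_log)] + list(fetch_positions.values()))

print(high_watermark())  # 3 -- follower-2 has only confirmed offsets 0..2
fetch_positions["follower-2"] = 4                # its next fetch asks for 4
print(high_watermark())  # 4 -- the HW advances, so the ACK can go out
```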

&lt;h2&gt;
  
  
  Kafka Scalability: Pull vs. Push
&lt;/h2&gt;

&lt;p&gt;Scalability was a major driving factor in the design of Kafka: you can add a large number of consumers without affecting performance or requiring downtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  High Throughput and Producer Performance
&lt;/h3&gt;

&lt;p&gt;Kafka can handle a massive influx of data, often processing &lt;strong&gt;100k+ events per second&lt;/strong&gt; from producers. &lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer Flexibility
&lt;/h3&gt;

&lt;p&gt;Because Kafka consumers &lt;strong&gt;pull&lt;/strong&gt; data from a topic, the system offers several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variable Pacing:&lt;/strong&gt; Different consumers can process messages at their own pace without affecting the broker or other consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diverse Consumption Models:&lt;/strong&gt; Kafka supports multiple workflows simultaneously. For example, one consumer can process messages in &lt;strong&gt;real-time&lt;/strong&gt; for immediate analytics, while another processes the same data in &lt;strong&gt;batch mode&lt;/strong&gt; for long-term storage or reporting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Comparison: Pull vs. Push at Scale
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Push Model&lt;/th&gt;
&lt;th&gt;Pull Model (Kafka)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumer Addition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broker must manage a new connection and push state for every user.&lt;/td&gt;
&lt;td&gt;Consumers simply point to an offset; minimal overhead for the broker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Processing Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broker might overwhelm a slow consumer.&lt;/td&gt;
&lt;td&gt;Consumer requests data only when it has the resources to process it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;System Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High risk of "backpressure" issues if consumers lag.&lt;/td&gt;
&lt;td&gt;Inherently stable as consumers manage their own flow control.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
      <category>interview</category>
    </item>
    <item>
      <title>Learn Neural Networks: Build an XOR Gate From Scratch with Python Step by Step Walkthrough</title>
      <dc:creator>Hayden Cordeiro</dc:creator>
      <pubDate>Mon, 02 Jun 2025 00:01:21 +0000</pubDate>
      <link>https://forem.com/haydencordeiro/learn-neural-networks-build-an-xor-gate-from-scratch-with-python-step-by-step-walkthrough-3gfm</link>
      <guid>https://forem.com/haydencordeiro/learn-neural-networks-build-an-xor-gate-from-scratch-with-python-step-by-step-walkthrough-3gfm</guid>
      <description>&lt;p&gt;For the purpose of this blog we will be building a neural network from scratch using python. &lt;br&gt;
The goal will be for the neural network to learn the XOR gate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1reku0v4owf67lwo62ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1reku0v4owf67lwo62ha.png" alt="Image description" width="500" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 1: Create the neural network
&lt;/h1&gt;

&lt;p&gt;The neural network will have 3 layers: an input layer, a hidden layer, and an output layer.&lt;br&gt;
The image below showcases the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F344t3bfvg7ej4ooqkxnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F344t3bfvg7ej4ooqkxnk.png" alt="Image description" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 2: Randomly Assign Weights and biases to the network
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k0x8d02zkn0p50c00tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k0x8d02zkn0p50c00tk.png" alt="Image description" width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 3: Forward Pass
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Some background information you need to know before the forward pass.
&lt;/h2&gt;

&lt;p&gt;1) Activation functions (we will be using sigmoid)&lt;br&gt;
2) Linear algebra (honestly, if you do not know this, you may want to brush up first!)&lt;/p&gt;

&lt;h2&gt;
  
  
  Sigmoid Activation Function:
&lt;/h2&gt;

&lt;p&gt;You do not have to understand the formula for now; just take note of the formula and its derivative.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswd7dr4nso4yoo810zsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswd7dr4nso4yoo810zsg.png" alt="Image description" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the forward pass, the outputs of each neuron are calculated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forward Pass Algorithm
&lt;/h2&gt;

&lt;p&gt;The algorithm is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the weighted sum of the incoming inputs and the weights&lt;/li&gt;
&lt;li&gt;Add the bias of the current neuron to the weighted sum&lt;/li&gt;
&lt;li&gt;Apply the activation function.
And voilà! You have the output of one neuron.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's do it for one neuron together
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs79dj4l5ske56wf6yljy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs79dj4l5ske56wf6yljy.png" alt="Image description" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1) Weighted Sum = 0.46 * 1 + 0.03 * 1 = 0.49&lt;br&gt;
2) Adding the bias to the weighted sum = 0.49 + 0.12 = 0.61&lt;br&gt;
3) Applying the activation function (sigmoid) = sigmoid(0.61) = 0.6479 =~ 0.65&lt;br&gt;
(I simply substituted x with 0.61 in 1/(1 + e^-x))&lt;/p&gt;
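&lt;p&gt;The same three steps in code, using the numbers from the figure:&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Values from the figure: two inputs of 1, weights 0.46 and 0.03, bias 0.12.
inputs, weights, bias = [1, 1], [0.46, 0.03], 0.12

weighted_sum = sum(i * w for i, w in zip(inputs, weights))  # step 1: 0.49
output = sigmoid(weighted_sum + bias)                       # steps 2 and 3
print(round(output, 2))  # 0.65
```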

&lt;p&gt;Now you just need to do this 7B more times if you want to train a large network, or in our case, 4 more times. &lt;/p&gt;

&lt;h3&gt;
  
  
  Completed Forward Pass
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mcjyfdh2iycl64am5m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mcjyfdh2iycl64am5m4.png" alt="Image description" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 4: The Dreaded Backpropagation
&lt;/h1&gt;

&lt;p&gt;There are 2 parts to backpropagation:&lt;br&gt;
1) Calculate the error and attribute how much each neuron contributed to it&lt;br&gt;
2) Update the weights to reduce the error&lt;/p&gt;

&lt;h2&gt;
  
  
  Calculating error
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkf7o9dazehsgm508hie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkf7o9dazehsgm508hie.png" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the last layer (the output layer), think about it intuitively.&lt;br&gt;
To improve, we need to know two things:&lt;br&gt;
1) By how much we are wrong&lt;br&gt;
2) Whether our output value should increase or decrease to match the expected output&lt;/p&gt;

&lt;p&gt;Achieving the first part is straightforward; we can calculate&lt;br&gt;
(output - expected_output).&lt;/p&gt;

&lt;p&gt;For the second part, you need a little knowledge of calculus (i.e., derivatives).&lt;/p&gt;

&lt;p&gt;Don't be scared; if you don't know it, I'll help you understand.&lt;/p&gt;

&lt;h4&gt;
  
  
  A derivative can be defined as:
&lt;/h4&gt;

&lt;p&gt;The derivative of a function with respect to x at a point is the slope of the tangent to the curve at that point. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5jxnaavar5xswda06eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5jxnaavar5xswda06eo.png" alt="Image description" width="247" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to minimize the error (i.e., drive the slope toward zero).&lt;br&gt;
If the slope is positive we must reduce our value, and vice versa for a negative slope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Calculation Final Layer
&lt;/h3&gt;

&lt;p&gt;= (output - expected_output) * derivative(output)&lt;/p&gt;

&lt;p&gt;The derivative of the output value can be easily found by putting the value of output in place of x in this formula&lt;br&gt;
(output) * (1 - output) (See figure 2 -&amp;gt; Sigmoid and its derivative)&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's solve it together for the first neuron in the last layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk78pvqlmi9n7mqo7v4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk78pvqlmi9n7mqo7v4u.png" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;= (0.77 - 1) * derivative(0.77)&lt;br&gt;
= -0.23 * derivative(0.77)&lt;br&gt;
= -0.23 * [(0.77) * (1-0.77)]&lt;br&gt;
= - 0.040733&lt;br&gt;
=~ -0.04&lt;/p&gt;
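&lt;p&gt;The same calculation in code:&lt;/p&gt;

```python
# Output-layer error from the worked example: (output - expected) multiplied
# by the sigmoid derivative expressed via the output, output * (1 - output).

output, expected = 0.77, 1.0
error = (output - expected) * output * (1.0 - output)
print(round(error, 6))  # -0.040733
```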

&lt;h4&gt;
  
  
  After doing it for both the neurons in the output layer the network will look something like this
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmqonx3rhmk80ixyyug8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmqonx3rhmk80ixyyug8.png" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  For neurons in the hidden layers
&lt;/h3&gt;

&lt;p&gt;For the final layer it's straightforward: you know the expected output and the output you got.&lt;br&gt;
For the hidden layers you have to determine how much each neuron contributed to the error of the next layer.&lt;br&gt;
Key word: how much?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr1skkr87q79dknerxpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr1skkr87q79dknerxpk.png" alt="Image description" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every neuron may play a part in the error, so you want to change a neuron's weight in proportion to the impact it has on the next layer.&lt;/p&gt;

&lt;p&gt;Therefore the error can be calculated as&lt;/p&gt;

&lt;p&gt;the weighted sum of&lt;br&gt;
(each outgoing weight from the neuron * the error of the corresponding next-layer neuron), and of course let's not forget to multiply by the derivative. &lt;/p&gt;

&lt;p&gt;Let's take the neuron with the output 0.65 (the first neuron from the top in the hidden layer).&lt;/p&gt;

&lt;p&gt;(I had to bump up the precision, since the error deltas are so small)&lt;br&gt;
= (0.37 * -0.040) + (0.13 * 0.146)&lt;br&gt;
= -0.0148 + 0.01898&lt;br&gt;
= 0.00418&lt;/p&gt;

&lt;p&gt;To get the error for this neuron&lt;br&gt;
= weighted sum * derivative(output)&lt;br&gt;
= 0.00418 * [(0.65) * (1-0.65)]&lt;br&gt;
= 0.00095&lt;br&gt;
=~ 0.001&lt;/p&gt;
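&lt;p&gt;And the hidden-neuron version in code:&lt;/p&gt;

```python
# Hidden-layer error from the worked example: weight each next-layer error by
# its outgoing connection, sum, then multiply by the sigmoid derivative.

output = 0.65                               # this hidden neuron's output
outgoing = [(0.37, -0.040), (0.13, 0.146)]  # (outgoing weight, next error)

weighted = sum(w * err for w, err in outgoing)  # 0.00418
error = weighted * output * (1.0 - output)
print(round(error, 5))  # 0.00095
```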

&lt;h3&gt;
  
  
  Completed backpropagation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdfz0dm5vve85u6zhi90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdfz0dm5vve85u6zhi90.png" alt="Image description" width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 10000, maybe? Finally, the last step: updating the weights
&lt;/h2&gt;

&lt;p&gt;Gradient descent is a topic I consider out of scope for this blog, but to give you a quick summary:&lt;/p&gt;

&lt;p&gt;You don't want to update weights too quickly: making large changes to your weights will swing your network's output by a large amount, preventing you from getting a model with high accuracy.&lt;br&gt;
The parameter that controls how much a weight is tweaked is called the learning rate.&lt;/p&gt;

&lt;p&gt;However, too small a learning rate will lead to spending hours on training; it's a tradeoff that you have to play around with.&lt;/p&gt;

&lt;p&gt;Let's assume a learning rate of 0.1:&lt;br&gt;
learning_rate = 0.1&lt;/p&gt;

&lt;h4&gt;
  
  
  Tuning weights and biases
&lt;/h4&gt;

&lt;p&gt;weight = weight - (error * learning_rate * input)&lt;br&gt;
bias = bias - (error * learning_rate)&lt;/p&gt;
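&lt;p&gt;The update rule as a small helper, plugged with the hidden-neuron numbers (weight 0.46, bias 0.12, error 0.001, input 1):&lt;/p&gt;

```python
learning_rate = 0.1

def update(weight, bias, error, value):
    """Nudge the weight and bias a small step against the error."""
    new_weight = weight - (error * learning_rate * value)
    new_bias = bias - (error * learning_rate)
    return new_weight, new_bias

w, b = update(weight=0.46, bias=0.12, error=0.001, value=1)
print(round(w, 4), round(b, 4))  # 0.4599 0.1199
```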

&lt;h5&gt;
  
  
  Updating weight 1 and the bias for the first neuron in the hidden layer
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dnuxgxvsg4zz83sf21n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dnuxgxvsg4zz83sf21n.png" alt="Image description" width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;weight = 0.46 - (0.001 * 0.1 * 1)&lt;br&gt;
weight = 0.4599 =~ 0.46&lt;/p&gt;

&lt;p&gt;bias = 0.12 - (0.001 * 0.1)&lt;br&gt;
bias = 0.1199 =~ 0.12&lt;/p&gt;

&lt;p&gt;(Note: since we are rounding off numbers, it looks like the weight is not updating, but these micro-adjustments can accumulate into large changes across the network.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Weights and biases updated.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49kby7vya6wqg8ac7h7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49kby7vya6wqg8ac7h7n.png" alt="Image description" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although we rounded off the values, some weights and biases did change (marked in green).&lt;/p&gt;

&lt;h3&gt;
  
  
  PHEW!! We are finally done with one pass
&lt;/h3&gt;

&lt;p&gt;The same process takes place hundreds of times, depending on the number of epochs, the batch size, and the input dataset.&lt;br&gt;
But lucky for us, we have computers to do that job.&lt;/p&gt;

&lt;p&gt;Here is my implementation of a neural network from scratch.&lt;br&gt;
&lt;a href="https://github.com/haydencordeiro/NNFromScratch" rel="noopener noreferrer"&gt;https://github.com/haydencordeiro/NNFromScratch&lt;/a&gt;&lt;br&gt;
Although it might not be a 100% replica, it’s good enough.&lt;/p&gt;

&lt;p&gt;Here is a video of the training of this network (the same example)&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=PkKaKe0_xCA" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=PkKaKe0_xCA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congrats, you now understand backpropagation (hopefully!)&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer:
&lt;/h3&gt;

&lt;p&gt;The purpose of this blog is to help you understand how a neural network learns, not to build a production-ready model. For simplicity, we use Mean Squared Error (MSE) as the loss function.&lt;/p&gt;

&lt;p&gt;While MSE is mathematically easier to work with and helps explain the backpropagation process, it isn't ideal for classification problems like XOR. In real-world scenarios, especially for binary classification, Cross-Entropy Loss is generally preferred because it’s better at guiding the model when outputs are close to 0 or 1.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ensuring Data Consistency Across Microservices: Herding Cats with Saga &amp; Transactional Outbox</title>
      <dc:creator>Hayden Cordeiro</dc:creator>
      <pubDate>Wed, 28 May 2025 02:04:54 +0000</pubDate>
      <link>https://forem.com/haydencordeiro/ensuring-data-consistency-across-microservices-herding-cats-with-saga-outbox-3mhe</link>
      <guid>https://forem.com/haydencordeiro/ensuring-data-consistency-across-microservices-herding-cats-with-saga-outbox-3mhe</guid>
      <description>&lt;p&gt;If you're diving into the world of microservices, you've likely heard of scalability, independent deployments, and tech-stack freedom. &lt;br&gt;
That flexibility is great, but keeping their data in sync can feel like herding cats. This article hands you the super‑power to keep every “cat” in line (ie ensure &lt;strong&gt;Data Consistency&lt;/strong&gt;.)&lt;/p&gt;

&lt;p&gt;Remember the good old days of monoliths? You had one big, cozy database. You wrapped everything in a beautiful, all-or-nothing &lt;strong&gt;ACID&lt;/strong&gt; transaction (Atomicity, Consistency, Isolation, Durability), and life was... well, simpler in &lt;em&gt;that&lt;/em&gt; aspect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiedynd7vcr2w8gsb93e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiedynd7vcr2w8gsb93e.png" alt="Image description" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But in Microservice-Land, each service proudly has its own database. Our &lt;strong&gt;Order Service&lt;/strong&gt; has its database, and our &lt;strong&gt;Inventory Service&lt;/strong&gt; has its own. They might even be different &lt;em&gt;types&lt;/em&gt; of databases (SQL! NoSQL! Ooh, fancy!). This is great for loose coupling, but how do you make sure that when an order is placed, inventory &lt;em&gt;actually&lt;/em&gt; gets reduced, and what if one of them fails? You can't just stretch an ACID transaction across the network and different databases; that way lies madness (and often, something called &lt;strong&gt;Two-Phase Commit (2PC)&lt;/strong&gt;, which comes with its own box of horrors like blocking locks and lower availability).&lt;/p&gt;

&lt;p&gt;So, how do we keep our data from becoming a chaotic mess without tying our services back together? Enter our heroes: The &lt;strong&gt;Saga Pattern&lt;/strong&gt; and the &lt;strong&gt;Transactional Outbox&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Saga Pattern: Your Business Workflow's Storyteller&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of a &lt;strong&gt;Saga&lt;/strong&gt; as a story, i.e., a sequence of events. In our world, it's a &lt;strong&gt;sequence of local transactions&lt;/strong&gt;. Each microservice performs its own piece of the work in its own local transaction and then signals the next service to pick up the baton.&lt;/p&gt;

&lt;p&gt;Let's say a customer orders 'Product 1'. The Order Service starts the Saga:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order Service:&lt;/strong&gt; Creates an order, marks it 'PENDING' (Local Transaction 1). Publishes an 'OrderPlaced' event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service:&lt;/strong&gt; Hears 'OrderPlaced', reserves inventory (Local Transaction 2). Publishes 'InventoryReserved'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service:&lt;/strong&gt; Hears 'InventoryReserved', charges the card (Local Transaction 3). Publishes 'PaymentProcessed'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service:&lt;/strong&gt; Hears 'PaymentProcessed', updates order to 'CONFIRMED' (Local Transaction 4).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"But Hayden," you ask, "what if the Payment Service fails? We've already reserved inventory!"&lt;/p&gt;

&lt;p&gt;Aha! That's where the magic of Sagas lies: &lt;strong&gt;Compensating Transactions&lt;/strong&gt;. If any step fails, the Saga triggers 'undo' operations for all the preceding steps that succeeded. So, if payment fails, the Saga would tell the Inventory Service to 'ReleaseInventory' and the Order Service to 'CancelOrder'. Neat, huh? We aim for &lt;strong&gt;eventual consistency&lt;/strong&gt; (the system &lt;em&gt;will&lt;/em&gt; become consistent, just maybe not instantaneously).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choreography vs. Orchestration: Who's Calling the Shots?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sagas come in two main flavors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choreography (The Dance Floor):&lt;/strong&gt; Services talk to each other by publishing and listening to events. The Order Service shouts "Order Placed!", Inventory hears it and shouts "Inventory Reserved!", Payment hears &lt;em&gt;that&lt;/em&gt; and shouts "Paid!".

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Simple for small workflows, no central bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Can get messy fast, hard to track, risk of cyclic event chains.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmny4b06k5bitktc21d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmny4b06k5bitktc21d.png" alt="Image description" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration (The Conductor):&lt;/strong&gt; A central &lt;strong&gt;Orchestrator&lt;/strong&gt; acts like a conductor, telling each service what to do and when. "Order Service, create order! Inventory, reserve! Payment, charge!".

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Easier for complex flows, avoids cycles, better visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Introduces a central component (potential bottleneck/failure point).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tgmy8r07pj264n2x2e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tgmy8r07pj264n2x2e5.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
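&lt;p&gt;A central orchestrator can be sketched in a few lines: it runs each step in order and, on failure, runs the compensations of the completed steps in reverse. The step names below are illustrative stand-ins for calls to real services:&lt;/p&gt;

```python
# Toy saga orchestrator: steps is a list of (action, compensation) pairs;
# a compensation of None marks a step that cannot be undone.
def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):       # roll back what already succeeded
                if comp:
                    comp()
            return "ROLLED_BACK"
    return "COMPLETED"

log = []
ok = lambda name: lambda: log.append(name)    # a step that just records itself

def fail():
    raise RuntimeError("payment declined")

result = run_saga([
    (ok("create_order"),      ok("cancel_order")),
    (ok("reserve_inventory"), ok("release_inventory")),
    (fail,                    None),          # payment step fails mid-saga
])
```

&lt;p&gt;The log ends up as create_order, reserve_inventory, release_inventory, cancel_order: the two successful steps were compensated in reverse order, exactly what the conductor is for.&lt;/p&gt;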

&lt;p&gt;&lt;strong&gt;Know Your Steps: Compensable, Pivot, Retryable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To build a good Saga, you need to know your transaction types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compensable:&lt;/strong&gt; These can be undone. Think 'Reserve Inventory'; you can always 'Release Inventory'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pivot:&lt;/strong&gt; The &lt;strong&gt;point of no return&lt;/strong&gt;. Once this step succeeds, the Saga &lt;em&gt;must&lt;/em&gt; finish; no going back. Often, this is something irreversible, like charging a credit card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retryable:&lt;/strong&gt; These come &lt;em&gt;after&lt;/em&gt; the pivot. Since we can't go back, these steps &lt;em&gt;must&lt;/em&gt; succeed eventually, even if we have to retry them a bunch of times (making them &lt;strong&gt;idempotent&lt;/strong&gt; is key!). Think 'Send Confirmation Email'.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these helps you design robust Sagas that know when to roll back and when to push forward.&lt;/p&gt;
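&lt;p&gt;Here's a small sketch of what a retryable, idempotent post-pivot step can look like. The mail-sending function, failure counts, and de-duplication set are all made up for illustration; the point is that retries are harmless because the idempotency check guarantees at most one real side effect:&lt;/p&gt;

```python
import time

sent_emails = set()           # de-duplication store (a table or cache in real life)
attempts = {"count": 0}

def send_confirmation_email(order_id):
    attempts["count"] += 1
    if attempts["count"] in (1, 2):           # simulate two transient failures
        raise ConnectionError("mail server busy")
    if order_id in sent_emails:               # idempotency: skip duplicates
        return "already-sent"
    sent_emails.add(order_id)                 # the real side effect, applied once
    return "sent"

def retry(fn, *args, tries=5, delay=0.0):
    """Keep retrying a post-pivot step; it must succeed eventually."""
    for _ in range(tries):
        try:
            return fn(*args)
        except ConnectionError:
            time.sleep(delay)                 # back off before the next attempt
    raise RuntimeError("gave up")

result = retry(send_confirmation_email, "o1")   # succeeds on the third attempt
```

&lt;p&gt;Calling the step again after success returns 'already-sent' instead of emailing twice, which is what makes blind retries safe.&lt;/p&gt;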

&lt;p&gt;&lt;strong&gt;Transactional Outbox: The Unshakeable Mailman&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dv2qljn75kunnk5hdo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dv2qljn75kunnk5hdo9.png" alt="Image description" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, Sagas are cool. They coordinate the &lt;em&gt;workflow&lt;/em&gt;. But how does the Order Service reliably tell the Inventory Service that an order was placed? What if it updates its &lt;em&gt;own&lt;/em&gt; database but then crashes &lt;em&gt;before&lt;/em&gt; sending the message to Kafka (or RabbitMQ, or whatever you're using)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boom!&lt;/strong&gt; Inconsistency. The Order Service thinks an order exists, but nobody else knows. This is the dreaded &lt;strong&gt;dual-write problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;"But Kafka is fault-tolerant!" you cry. Yes, it is... &lt;em&gt;once the message gets there&lt;/em&gt;. The gap between your local DB commit and the message hitting the broker is the danger zone.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;Transactional Outbox&lt;/strong&gt; pattern saves the day. It's surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When the Order Service creates an order, it does &lt;strong&gt;two things&lt;/strong&gt; in the &lt;strong&gt;same, single, local ACID transaction&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Inserts the 'Order' row into its Orders table.&lt;/li&gt;
&lt;li&gt;Inserts an 'OrderPlaced' event message into a special Outbox table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Because it's one transaction, either &lt;em&gt;both&lt;/em&gt; writes succeed, or &lt;em&gt;neither&lt;/em&gt; does. No more dual-write gap!&lt;/li&gt;
&lt;li&gt;A separate, reliable &lt;strong&gt;Message Relay&lt;/strong&gt; process monitors the Outbox table.&lt;/li&gt;
&lt;li&gt;This relay picks up new events and &lt;em&gt;reliably&lt;/em&gt; publishes them to your message broker.&lt;/li&gt;
&lt;li&gt;Once &lt;em&gt;successfully&lt;/em&gt; published, the relay marks the event as 'sent' or deletes it from the Outbox.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The relay can use &lt;strong&gt;Polling&lt;/strong&gt; (periodically checking the table) or &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; (tailing the database logs, often preferred for lower latency and load).&lt;/p&gt;

&lt;p&gt;The Outbox acts like a durable, guaranteed-to-send mailbox &lt;em&gt;inside&lt;/em&gt; your service's database.&lt;/p&gt;
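&lt;p&gt;The whole pattern fits in a short SQLite sketch: the business row and the outbox event commit in one local transaction, and a polling-style relay later publishes and marks rows as sent. Table names, columns, and the in-memory "broker" list are all illustrative assumptions:&lt;/p&gt;

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, payload TEXT, sent INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # One ACID transaction: both inserts commit, or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        db.execute("INSERT INTO outbox (event, payload) VALUES (?, ?)",
                   ("OrderPlaced", json.dumps({"order": order_id})))

published = []   # stands in for the message broker

def relay_once():
    # Polling flavor: read unsent rows, publish, then mark them as sent.
    rows = db.execute(
        "SELECT id, event, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, event, payload in rows:
        published.append((event, payload))            # broker confirmed delivery
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("o1")
relay_once()
```

&lt;p&gt;If the process crashed right after place_order, the event would still be sitting safely in the outbox table, waiting for the relay; that's the dual-write gap closed.&lt;/p&gt;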

&lt;p&gt;&lt;strong&gt;Better Together: Saga + Outbox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, is it Saga &lt;em&gt;or&lt;/em&gt; Outbox? Nope! It's usually Saga &lt;em&gt;and&lt;/em&gt; Outbox.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saga&lt;/strong&gt; defines the &lt;em&gt;business logic&lt;/em&gt; and &lt;em&gt;workflow&lt;/em&gt; across services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbox&lt;/strong&gt; provides the &lt;em&gt;reliable messaging infrastructure&lt;/em&gt; that allows each step in the Saga to communicate without losing messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of Outbox as the &lt;strong&gt;durable pipe&lt;/strong&gt; and Saga as the &lt;strong&gt;traffic controller&lt;/strong&gt;. Every time a Saga step (whether choreographed or orchestrated) needs to send a message, it uses its local Outbox to ensure that message &lt;em&gt;gets out&lt;/em&gt; if, and only if, its local work was successfully committed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Do I Need This Stuff?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Should &lt;em&gt;every&lt;/em&gt; microservice use Sagas and Outboxes? Probably not! Always aim for simplicity.&lt;/p&gt;

&lt;p&gt;Here’s a quick-and-dirty decision tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0omyp9lgortzegy5iuzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0omyp9lgortzegy5iuzd.png" alt="Image description" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does your action touch &amp;gt; 1 service's DB?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Hallelujah! Use a simple local ACID transaction. You're done!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; Keep going...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need &lt;em&gt;instantaneous&lt;/em&gt;, strict global consistency?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; Oof. Maybe microservices aren't the best fit here? Consider consolidating or facing the 2PC giant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No (Eventual is OK):&lt;/strong&gt; Keep going...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need &lt;em&gt;guaranteed&lt;/em&gt; messaging (no lost events)?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; You need the &lt;strong&gt;Transactional Outbox&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Maybe simple fire-and-forget messaging is enough (but be careful!).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there a multi-step flow that might fail midway and needs coordination/rollback?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; You need a &lt;strong&gt;Saga&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Outbox alone might be sufficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In All Seriousness: Practices &amp;amp; Recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saga Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ensure idempotent steps&lt;/strong&gt; – each service can safely retry without side-effects.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clearly label transaction types&lt;/strong&gt; – identify &lt;em&gt;Compensable&lt;/em&gt;, &lt;em&gt;Pivot&lt;/em&gt;, and &lt;em&gt;Retryable&lt;/em&gt; steps to steer the saga correctly.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency &amp;amp; isolation&lt;/strong&gt; – sagas don’t give strict isolation like ACID; design for eventual consistency.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator fault tolerance&lt;/strong&gt; – persist saga state in durable storage so a new instance can resume mid-flow.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry &amp;amp; recovery policies&lt;/strong&gt; – set sensible back-off and timeout values to avoid deadlocks or message storms.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loose coupling &amp;amp; contracts&lt;/strong&gt; – publish well-defined event schemas and keep services independently deployable.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transactional Outbox&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic outbox writes&lt;/strong&gt; – record the event in the &lt;em&gt;same&lt;/em&gt; database transaction as the business change.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency &amp;amp; retries&lt;/strong&gt; – embed de-duplication keys; consumers must handle duplicates gracefully.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only log schema&lt;/strong&gt; – treat the outbox table as an immutable log; avoid in-place updates.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserve event ordering&lt;/strong&gt; – maintain ordering &lt;em&gt;per aggregate&lt;/em&gt; when relaying events.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mark items as processed&lt;/strong&gt; – flag or move rows once the broker confirms delivery.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup policy&lt;/strong&gt; – archive or delete processed rows regularly to keep the table small.
&lt;/li&gt;
&lt;/ul&gt;
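&lt;p&gt;Since an outbox relay gives at-least-once delivery, the consumer-side de-duplication mentioned above matters; here's a tiny hedged sketch (the event IDs and in-memory "processed" set are hypothetical stand-ins for a durable store):&lt;/p&gt;

```python
# Consumers track processed event IDs so redelivered events are ignored.
processed_ids = set()        # a durable table or cache in a real system
reserved = {"count": 0}

def handle_order_placed(event_id, payload):
    if event_id in processed_ids:     # duplicate delivery: skip gracefully
        return False
    reserved["count"] += 1            # the real side effect, applied once
    processed_ids.add(event_id)
    return True

first = handle_order_placed("evt-1", {"order": "o1"})
dup = handle_order_placed("evt-1", {"order": "o1"})   # redelivery of same event
```

&lt;p&gt;The first delivery does the work; the redelivery is a no-op, which is exactly the idempotency the outbox relay relies on.&lt;/p&gt;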

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed systems are hard, and maintaining data consistency across microservices is one of the trickier puzzles. But by understanding the &lt;strong&gt;Saga pattern&lt;/strong&gt; (for workflow orchestration) and the &lt;strong&gt;Transactional Outbox pattern&lt;/strong&gt; (for reliable messaging), you have powerful tools to build resilient, scalable, and &lt;em&gt;eventually&lt;/em&gt; consistent systems without falling back into the monolithic abyss or getting burned by 2PC.&lt;/p&gt;

&lt;p&gt;Now go forth and build amazing things, and may the force be with you!&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>microservices</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Database Consistency in Microservices!</title>
      <dc:creator>Hayden Cordeiro</dc:creator>
      <pubDate>Wed, 28 May 2025 01:55:50 +0000</pubDate>
      <link>https://forem.com/haydencordeiro/database-consistency-in-microservices-3859</link>
      <guid>https://forem.com/haydencordeiro/database-consistency-in-microservices-3859</guid>
      <description>&lt;p&gt;If you're diving into the world of microservices, you've likely heard the of scalability, independent deployments, and tech-stack freedom. But beneath these shimmering waters is a beast, one that no man can tame (PS: this article will unlock a superpower which you can use to tame it), known as &lt;strong&gt;Data Consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Remember the good old days of monoliths? You had one big, cozy database. You wrapped everything in a beautiful, all-or-nothing &lt;strong&gt;ACID&lt;/strong&gt; transaction (Atomicity, Consistency, Isolation, Durability), and life was... well, simpler in &lt;em&gt;that&lt;/em&gt; respect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiedynd7vcr2w8gsb93e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiedynd7vcr2w8gsb93e.png" alt="Image description" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But in Microservice-Land, each service proudly has its own database. Our &lt;strong&gt;Order Service&lt;/strong&gt; has its database, and our &lt;strong&gt;Inventory Service&lt;/strong&gt; has its own. They might even be different &lt;em&gt;types&lt;/em&gt; of databases (SQL! NoSQL! Ooh, fancy!). This is great for loose coupling, but how do you make sure that when an order is placed, inventory &lt;em&gt;actually&lt;/em&gt; gets reduced, and what if one of them fails? You can't just stretch an ACID transaction across the network and different databases – that way lies madness (and often, something called &lt;strong&gt;Two-Phase Commit (2PC)&lt;/strong&gt;, which comes with its own box of horrors like blocking locks and lower availability).&lt;/p&gt;

&lt;p&gt;So, how do we keep our data from becoming a chaotic mess without tying our services back together? Enter our heroes: The &lt;strong&gt;Saga Pattern&lt;/strong&gt; and the &lt;strong&gt;Transactional Outbox&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Saga Pattern: Your Business Workflow's Storyteller&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of a &lt;strong&gt;Saga&lt;/strong&gt; as a story – a sequence of events. In our world, it's a &lt;strong&gt;sequence of local transactions&lt;/strong&gt;. Each microservice performs its own piece of the work in its own local transaction and then signals the next service to pick up the baton.&lt;/p&gt;

&lt;p&gt;Let's say a customer orders 'Product 1'. The Order Service starts the Saga:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order Service:&lt;/strong&gt; Creates an order, marks it 'PENDING' (Local Transaction 1). Publishes an 'OrderPlaced' event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service:&lt;/strong&gt; Hears 'OrderPlaced', reserves inventory (Local Transaction 2). Publishes 'InventoryReserved'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service:&lt;/strong&gt; Hears 'InventoryReserved', charges the card (Local Transaction 3). Publishes 'PaymentProcessed'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service:&lt;/strong&gt; Hears 'PaymentProcessed', updates order to 'CONFIRMED' (Local Transaction 4).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"But Hayden," you ask, "what if the Payment Service fails? We've already reserved inventory!"&lt;/p&gt;

&lt;p&gt;Aha! That's where the magic of Sagas lies: &lt;strong&gt;Compensating Transactions&lt;/strong&gt;. If any step fails, the Saga triggers 'undo' operations for all the preceding steps that succeeded. So, if payment fails, the Saga would tell the Inventory Service to 'ReleaseInventory' and the Order Service to 'CancelOrder'. Neat, huh? We aim for &lt;strong&gt;eventual consistency&lt;/strong&gt; – the system &lt;em&gt;will&lt;/em&gt; become consistent, just maybe not instantaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choreography vs. Orchestration: Who's Calling the Shots?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sagas come in two main flavors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choreography (The Dance Floor):&lt;/strong&gt; Services talk to each other by publishing and listening to events. The Order Service shouts "Order Placed!", Inventory hears it and shouts "Inventory Reserved!", Payment hears &lt;em&gt;that&lt;/em&gt; and shouts "Paid!".

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Simple for small workflows, no central bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Can get messy fast, hard to track, risk of cyclic event chains.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmny4b06k5bitktc21d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmny4b06k5bitktc21d.png" alt="Image description" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration (The Conductor):&lt;/strong&gt; A central &lt;strong&gt;Orchestrator&lt;/strong&gt; acts like a conductor, telling each service what to do and when. "Order Service, create order! Inventory, reserve! Payment, charge!".

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Easier for complex flows, avoids cycles, better visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Introduces a central component (potential bottleneck/failure point).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tgmy8r07pj264n2x2e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tgmy8r07pj264n2x2e5.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know Your Steps: Compensable, Pivot, Retryable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To build a good Saga, you need to know your transaction types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compensable:&lt;/strong&gt; These can be undone. Think 'Reserve Inventory'; you can always 'Release Inventory'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pivot:&lt;/strong&gt; The &lt;strong&gt;point of no return&lt;/strong&gt;. Once this step succeeds, the Saga &lt;em&gt;must&lt;/em&gt; finish; no going back. Often, this is something irreversible, like charging a credit card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retryable:&lt;/strong&gt; These come &lt;em&gt;after&lt;/em&gt; the pivot. Since we can't go back, these steps &lt;em&gt;must&lt;/em&gt; succeed eventually, even if we have to retry them a bunch of times (making them &lt;strong&gt;idempotent&lt;/strong&gt; is key!). Think 'Send Confirmation Email'.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these helps you design robust Sagas that know when to roll back and when to push forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transactional Outbox: The Unshakeable Mailman&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dv2qljn75kunnk5hdo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dv2qljn75kunnk5hdo9.png" alt="Image description" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, Sagas are cool. They coordinate the &lt;em&gt;workflow&lt;/em&gt;. But how does the Order Service reliably tell the Inventory Service that an order was placed? What if it updates its &lt;em&gt;own&lt;/em&gt; database but then crashes &lt;em&gt;before&lt;/em&gt; sending the message to Kafka (or RabbitMQ, or whatever you're using)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boom!&lt;/strong&gt; Inconsistency. The Order Service thinks an order exists, but nobody else knows. This is the dreaded &lt;strong&gt;dual-write problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;"But Kafka is fault-tolerant!" you cry. Yes, it is... &lt;em&gt;once the message gets there&lt;/em&gt;. The gap between your local DB commit and the message hitting the broker is the danger zone.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;Transactional Outbox&lt;/strong&gt; pattern saves the day. It's surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When the Order Service creates an order, it does &lt;strong&gt;two things&lt;/strong&gt; in the &lt;strong&gt;same, single, local ACID transaction&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Inserts the 'Order' row into its Orders table.&lt;/li&gt;
&lt;li&gt;Inserts an 'OrderPlaced' event message into a special Outbox table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Because it's one transaction, either &lt;em&gt;both&lt;/em&gt; writes succeed, or &lt;em&gt;neither&lt;/em&gt; does. No more dual-write gap!&lt;/li&gt;
&lt;li&gt;A separate, reliable &lt;strong&gt;Message Relay&lt;/strong&gt; process monitors the Outbox table.&lt;/li&gt;
&lt;li&gt;This relay picks up new events and &lt;em&gt;reliably&lt;/em&gt; publishes them to your message broker.&lt;/li&gt;
&lt;li&gt;Once &lt;em&gt;successfully&lt;/em&gt; published, the relay marks the event as 'sent' or deletes it from the Outbox.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The relay can use &lt;strong&gt;Polling&lt;/strong&gt; (periodically checking the table) or &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; (tailing the database logs, often preferred for lower latency and load).&lt;/p&gt;

&lt;p&gt;The Outbox acts like a durable, guaranteed-to-send mailbox &lt;em&gt;inside&lt;/em&gt; your service's database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better Together: Saga + Outbox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, is it Saga &lt;em&gt;or&lt;/em&gt; Outbox? Nope! It's usually Saga &lt;em&gt;and&lt;/em&gt; Outbox.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saga&lt;/strong&gt; defines the &lt;em&gt;business logic&lt;/em&gt; and &lt;em&gt;workflow&lt;/em&gt; across services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbox&lt;/strong&gt; provides the &lt;em&gt;reliable messaging infrastructure&lt;/em&gt; that allows each step in the Saga to communicate without losing messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of Outbox as the &lt;strong&gt;durable pipe&lt;/strong&gt; and Saga as the &lt;strong&gt;traffic controller&lt;/strong&gt;. Every time a Saga step (whether choreographed or orchestrated) needs to send a message, it uses its local Outbox to ensure that message &lt;em&gt;gets out&lt;/em&gt; if, and only if, its local work was successfully committed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Do I Need This Stuff?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Should &lt;em&gt;every&lt;/em&gt; microservice use Sagas and Outboxes? Probably not! Always aim for simplicity.&lt;/p&gt;

&lt;p&gt;Here’s a quick-and-dirty decision tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0omyp9lgortzegy5iuzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0omyp9lgortzegy5iuzd.png" alt="Image description" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does your action touch &amp;gt; 1 service's DB?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Hallelujah! Use a simple local ACID transaction. You're done!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; Keep going...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need &lt;em&gt;instantaneous&lt;/em&gt;, strict global consistency?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; Oof. Maybe microservices aren't the best fit here? Consider consolidating or facing the 2PC hydra.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No (Eventual is OK):&lt;/strong&gt; Keep going...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need &lt;em&gt;guaranteed&lt;/em&gt; messaging (no lost events)?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; You need the &lt;strong&gt;Transactional Outbox&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Maybe simple fire-and-forget messaging is enough (but be careful!).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there a multi-step flow that might fail midway and needs coordination/rollback?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes:&lt;/strong&gt; You need a &lt;strong&gt;Saga&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No:&lt;/strong&gt; Outbox alone might be sufficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway 🎉&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed systems are hard, and maintaining data consistency across microservices is one of the trickier puzzles. But by understanding the &lt;strong&gt;Saga pattern&lt;/strong&gt; (for workflow orchestration) and the &lt;strong&gt;Transactional Outbox pattern&lt;/strong&gt; (for reliable messaging), you have powerful tools to build resilient, scalable, and &lt;em&gt;eventually&lt;/em&gt; consistent systems without falling back into the monolithic abyss or getting burned by 2PC.&lt;/p&gt;

&lt;p&gt;Now go forth and build amazing things!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dealing with Mountains of Data: Why Just Buying a Bigger Hard Drive Won't Cut It</title>
      <dc:creator>Hayden Cordeiro</dc:creator>
      <pubDate>Fri, 09 May 2025 01:57:39 +0000</pubDate>
      <link>https://forem.com/haydencordeiro/dealing-with-mountains-of-data-why-just-buying-a-bigger-hard-drive-wont-cut-it-26ec</link>
      <guid>https://forem.com/haydencordeiro/dealing-with-mountains-of-data-why-just-buying-a-bigger-hard-drive-wont-cut-it-26ec</guid>
      <description>&lt;p&gt;Large scale applications have large amounts of data. I'm not talking about a few gigabytes or terabytes, rather &lt;em&gt;petabytes&lt;/em&gt; of data. Although 30TB hard drives might be a thing now (Thanks Seagate! [&lt;a href="https://www.techradar.com/pro/seagate-confirms-that-30tb-hard-drives-are-coming-in-early-2024-but-you-probably-wont-be-able-to-use-it-in-your-pc%5D" rel="noopener noreferrer"&gt;https://www.techradar.com/pro/seagate-confirms-that-30tb-hard-drives-are-coming-in-early-2024-but-you-probably-wont-be-able-to-use-it-in-your-pc]&lt;/a&gt;), they simply don't suffice for these truly large-scale applications.&lt;/p&gt;

&lt;p&gt;Let's take Instagram for example:&lt;/p&gt;

&lt;p&gt;They probably have a user table that resembles something like this (If you are a dev from Meta reading this and this is &lt;em&gt;exactly&lt;/em&gt; how your table looks, I'm a Genius! I leaked your user data!!!! &lt;em&gt;Evil laugh&lt;/em&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;user_name&lt;/th&gt;
&lt;th&gt;phone_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;user1&lt;/td&gt;
&lt;td&gt;12269765432&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;user2&lt;/td&gt;
&lt;td&gt;12269765432&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;user3&lt;/td&gt;
&lt;td&gt;12269765432&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Coming back to the topic, this user data of 3 rows seems pretty harmless, right? RIGHT! But now imagine 2 billion of these records (Yes, that's how many of you are spending time scrolling, including you reading this article, I see you!). If each row takes around 10KB to store (Just an arbitrary value, don't sue me!), the 3 rows would barely take 30KB. However, for 2 billion, that's a whole other story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Size per row: 10 KB
Number of rows: 2 billion (2,000,000,000)

Total size = Size per row × Number of rows
Total size = 10 KB/row × 2,000,000,000 rows
Total size = 20,000,000,000 KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah, I'm not reading that number out loud, but I think we both can agree that it's a lot! And remember, that's just one table of basic profile fields; add posts, photos, and videos, and you're far beyond what a single hard disk can store!&lt;/p&gt;

&lt;p&gt;Oh, you don't believe me? Fine, take more numbers, that might help:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 MB = 1024 KB
1 GB = 1024 MB
1 TB = 1024 GB
1 PB = 1024 TB

Total size in MB = 20,000,000,000 KB / 1024 ≈ 19,531,250 MB
Total size in GB = 19,531,250 MB / 1024 ≈ 19,073 GB
Total size in TB = 19,073 GB / 1024 ≈ 18.63 TB (Terabytes)
Total size in PB = 18.63 TB / 1024 ≈ 0.018 PB (Petabytes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I hope we are both on the same page (both figuratively and literally).&lt;/p&gt;

&lt;p&gt;Okay, but hypothetically, what if some genie or a magician waves a magic wand and we now have a disk that can store such large amounts of data?&lt;/p&gt;

&lt;p&gt;It would still not solve all our problems. We would now have a SPOF (you read that right, not SPF you skincare freaks!) – A SINGLE POINT OF FAILURE. If that one giant disk fails, your entire application is down. Game over.&lt;/p&gt;

&lt;p&gt;At this point, you might be thinking, "Hey Hayden, why are you telling me this? I already have enough problems in my life!"&lt;/p&gt;

&lt;p&gt;I'll let my buddy Dan answer that:&lt;/p&gt;

&lt;p&gt;As Dan Martell puts it: "A problem well defined is a problem half solved."&lt;/p&gt;

&lt;p&gt;See, he's the best!&lt;/p&gt;

&lt;p&gt;Now, back to our topic! We have two main problems:&lt;br&gt;
1) No single disk large enough to hold the large amount of data.&lt;br&gt;
2) We do not want a single point of failure.&lt;/p&gt;

&lt;p&gt;If you're jumping up right now with the answer, give yourself a pat on the back! Yes, the solution is: why don't we have multiple copies of the database? That would fix the issue with a single point of failure, right? But wait, we can't just &lt;em&gt;duplicate&lt;/em&gt; the &lt;em&gt;entire&lt;/em&gt; dataset, right? Since it doesn't fit on one disk to begin with!&lt;/p&gt;

&lt;p&gt;So, we have to &lt;strong&gt;divide (partition)&lt;/strong&gt; the data across multiple databases or servers.&lt;/p&gt;

&lt;p&gt;And yes, that's our answer: &lt;strong&gt;Partitioning / Sharding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You've probably heard this before, but I hope that you now understand the crucial use case as to &lt;em&gt;why&lt;/em&gt; it's required for large-scale applications.&lt;/p&gt;

&lt;p&gt;Now, let's come back to the topic of &lt;em&gt;how&lt;/em&gt; we divide this data. We could shard or partition the database using multiple strategies.&lt;/p&gt;

&lt;p&gt;One approach would be partitioning users based on their &lt;strong&gt;region&lt;/strong&gt;. For example, users from the US could be written to a database server located in the US, the same for EU or APAC (Asia Pacific for those of you who do not know).&lt;/p&gt;

&lt;p&gt;The merits of this would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Latency:&lt;/strong&gt; Since the servers are located closer to the users, query time would generally be faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier Compliance:&lt;/strong&gt; Data can be stored within specific geographical boundaries, helping with data residency regulations.&lt;/li&gt;
&lt;/ul&gt;
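&lt;p&gt;In code, region-based routing can be as simple as a lookup table. A minimal sketch (the shard names below are made-up placeholders, not anything Instagram actually uses):&lt;/p&gt;

```python
# Toy region-based router: each write goes to the shard that owns
# the user's region. Shard names are made-up placeholders.
REGION_SHARDS = {
    "US": "db-us-east",
    "EU": "db-eu-west",
    "APAC": "db-apac-south",
}

def shard_for_region(region):
    """Return the database shard that owns a given region's users."""
    return REGION_SHARDS[region]

print(shard_for_region("APAC"))  # → db-apac-south
```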

&lt;p&gt;However, nothing is always a bed of roses, right? What happens if a user moves from one region to another? Or how about regions where traffic is significantly higher than others? There are a lot of other issues that we could go into with regional partitioning (like complex cross-region queries).&lt;/p&gt;

&lt;p&gt;But coming to the point that got me writing this blog in the first place: &lt;strong&gt;Hashing, Consistent Hashing&lt;/strong&gt; to be more precise.&lt;/p&gt;

&lt;p&gt;Yes, we &lt;em&gt;could&lt;/em&gt; divide and partition the databases according to the user's region, but what if we had something smarter decide which partition a user's record should be written to? It takes the username, user ID, phone number and, &lt;em&gt;voila&lt;/em&gt;, tells you that you belong to partition 2. All hail our AI overlords! Just kidding (riding on the AI hype, lol!), we do not need AI for this. We have something very cool: &lt;strong&gt;Hash Functions!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are a CS major or know a thing or two about crypto, you've probably heard phrases like these associated with hash functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic&lt;/li&gt;
&lt;li&gt;Efficient to Compute&lt;/li&gt;
&lt;li&gt;Uniform Distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You know who could &lt;em&gt;really&lt;/em&gt; benefit from these attributes? Yes, Databases! (You are so smart, light pat on your head!)&lt;/p&gt;

&lt;p&gt;Okay, let's say we give the hash function a few attributes from our user table. If you're wondering why more than one, don't worry, we'll get back to that later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hashing:
&lt;/h2&gt;

&lt;p&gt;Hash(user_id, user_name, phone_number) -&amp;gt; SomeUniqueOrSemiUniqueHashValue&lt;/p&gt;

&lt;p&gt;Now the question is, how do we take this &lt;code&gt;SomeUniqueOrSemiUniqueHashValue&lt;/code&gt; and map that to a database partition? Do you remember how we might make a circular array? Yes, that's the &lt;strong&gt;modulo (%) operator&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;So, if we had 10 servers (or partitions), we could simply take &lt;code&gt;SomeUniqueOrSemiUniqueHashValue % 10&lt;/code&gt;, which would give us an output between 0 and 9 (a remainder after dividing by 10 can never reach 10).&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;SomeUniqueOrSemiUniqueHashValue % 10&lt;/code&gt; returned 2, we would simply insert the user record in partition 2. Easy peasy!&lt;/p&gt;
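&lt;p&gt;Here's a tiny sketch of that scheme in Python. MD5 is used purely as a convenient deterministic hash (not a recommendation), and the attribute values are the made-up ones from the table above:&lt;/p&gt;

```python
import hashlib

NUM_PARTITIONS = 10  # assumption: 10 partitions, as in the example above

def pick_partition(user_id, user_name, phone_number):
    """Hash the combined attributes and map the digest onto a partition."""
    key = f"{user_id}:{user_name}:{phone_number}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)  # deterministic hash value
    return digest % NUM_PARTITIONS  # modulo maps it into 0..9

# Same input always lands on the same partition:
print(pick_partition(1, "user1", "12269765432"))
```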

&lt;p&gt;Coming back to the point on why we are giving 3 attributes instead of just 1 to the hash function:&lt;/p&gt;

&lt;p&gt;Using a combination of attributes like &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;user_name&lt;/code&gt;, and &lt;code&gt;phone_number&lt;/code&gt; can contribute to a more uniform distribution of hash values. If we only used &lt;code&gt;user_id&lt;/code&gt; and they were assigned sequentially, we might still end up with hot spots as new users are constantly being added to a specific range of IDs. By including other attributes, we increase the entropy and randomness of the hash output, which helps in distributing the data more evenly across the partitions. It's like mixing colors to get a more complex and evenly spread shade.&lt;/p&gt;

&lt;p&gt;Now, let me throw the curveball! Are you ready?&lt;/p&gt;

&lt;p&gt;Let's say our user base grows 10x tomorrow. That would mean we would have to add more database servers, right? And that would also mean that N in our calculation &lt;code&gt;SomeUniqueOrSemiUniqueHashValue % N&lt;/code&gt;, initially 10, would change to, say, 20.&lt;/p&gt;

&lt;p&gt;So, if &lt;code&gt;SomeUniqueOrSemiUniqueHashValue % 10&lt;/code&gt; returned 2 earlier, &lt;code&gt;SomeUniqueOrSemiUniqueHashValue % 20&lt;/code&gt; might &lt;em&gt;not necessarily&lt;/em&gt; return 2. It could return 3, 4, 5, or any other number. You get my point?&lt;/p&gt;

&lt;p&gt;Let's maybe run through an example:&lt;/p&gt;

&lt;p&gt;When you change &lt;code&gt;N&lt;/code&gt; to a new number (&lt;code&gt;N_new&lt;/code&gt;), the result of the modulo operation (&lt;code&gt;hash() % N_new&lt;/code&gt;) will be different for &lt;strong&gt;a large fraction of your existing data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine you have 4 partitions (&lt;code&gt;N=4&lt;/code&gt;) and a user's data hashes to 10. &lt;code&gt;10 % 4 = 2&lt;/code&gt;, so their data is in Partition 2.&lt;br&gt;
Now, you add 4 more partitions, so you have 8 (N_new=8). The same user's data with the hash 10 would now be assigned to &lt;code&gt;10 % 8 = 2&lt;/code&gt;. In this specific case, the partition number is the same, but this is purely coincidental.&lt;/p&gt;

&lt;p&gt;Consider another user whose data hashes to 11. With 4 partitions, &lt;code&gt;11 % 4 = 3&lt;/code&gt;, so their data is in Partition 3. With 8 partitions, &lt;code&gt;11 % 8 = 3&lt;/code&gt;. Still the same.&lt;/p&gt;

&lt;p&gt;Okay, let's try a hash of 13. With 4 partitions, &lt;code&gt;13 % 4 = 1&lt;/code&gt;. With 8 partitions, &lt;code&gt;13 % 8 = 5&lt;/code&gt;. This user's data needs to move from Partition 1 to Partition 5.&lt;/p&gt;

&lt;p&gt;How about a hash of 14? With 4 partitions, &lt;code&gt;14 % 4 = 2&lt;/code&gt;. With 8 partitions, &lt;code&gt;14 % 8 = 6&lt;/code&gt;. This user's data needs to move from Partition 2 to Partition 6.&lt;/p&gt;

&lt;p&gt;As you can see, a significant portion of your data would need to be re-calculated and moved to a different partition whenever you change the number of nodes. This process, known as &lt;strong&gt;rehashing&lt;/strong&gt;, is very disruptive, resource-intensive, and can lead to significant downtime for your application while the data is being migrated. Not ideal for a system that needs to be available 24/7!&lt;/p&gt;
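&lt;p&gt;You don't have to take my word for it. A quick back-of-the-envelope experiment (using raw integers as stand-ins for hash values) shows how much the modulo scheme reshuffles when we double from 4 to 8 partitions:&lt;/p&gt;

```python
# Count how many of 100,000 hash values land in a different partition
# when the server count changes from 4 to 8 under plain modulo placement.
moved = sum(1 for h in range(100_000) if h % 4 != h % 8)
print(f"{moved:,} of 100,000 keys must move")  # → 50,000 of 100,000 keys must move
```

&lt;p&gt;Half the keys have to be relocated, just for doubling the cluster. And doubling is the friendly case: going from 10 to 11 servers moves roughly 10 out of every 11 keys.&lt;/p&gt;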

&lt;h3&gt;
  
  
  Consistent Hashing (Our Superhero!)
&lt;/h3&gt;

&lt;p&gt;So, you see the problem, right? Simple hashing with that &lt;code&gt;%&lt;/code&gt; operator is like trying to organize a growing party where every time a new guest arrives or someone leaves, you have to ask &lt;em&gt;everyone&lt;/em&gt; to move to a completely new spot. Chaos!&lt;/p&gt;

&lt;p&gt;Consistent Hashing walks in, puts on its cape, and says, "Hold my perfectly distributed data." It tackles this rehashing headache with a fundamentally different approach.&lt;/p&gt;

&lt;p&gt;Instead of thinking about servers numbered 0 to N, imagine a ring. A giant, conceptual ring representing the space of all possible hash values. We take our hash function – that cool, deterministic one that takes your &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;user_name&lt;/code&gt;, &lt;code&gt;phone_number&lt;/code&gt;, and spits out a value – and we use that same function to map &lt;strong&gt;both our users (data)&lt;/strong&gt; and our &lt;strong&gt;database servers (nodes)&lt;/strong&gt; onto this ring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funjr3sihwpwxfli7rooy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funjr3sihwpwxfli7rooy.png" alt="Image description" width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, your user data hashes to a point on the ring, and each of your database servers also hashes to one or more points on this same ring.&lt;/p&gt;

&lt;p&gt;Now, when we want to store or find a user's data, we take their hash value, find its spot on the ring, and then we move &lt;strong&gt;clockwise&lt;/strong&gt; around the ring until we hit the &lt;em&gt;first&lt;/em&gt; server node. &lt;em&gt;That&lt;/em&gt; server node is where that user's data lives.&lt;/p&gt;

&lt;p&gt;Think of the ring like a sorted, circular list of all possible hash values. Each server is responsible for the segment that runs from the previous server's position (counter-clockwise) up to its own.&lt;/p&gt;

&lt;p&gt;Okay, superhero time! What happens when we need to add a new database server? We hash our new server, and place it onto the ring at its hash position. This new server now takes responsibility for a section of the ring that was previously handled by the server immediately clockwise to it.&lt;/p&gt;

&lt;p&gt;And here's the magic: &lt;strong&gt;Only the data keys that fall into that specific segment of the ring that the new server just claimed need to move!&lt;/strong&gt; All the other users, whose hash values land in segments of the ring &lt;em&gt;not&lt;/em&gt; affected by the new server's arrival, stay exactly where they are. No mass migration needed!&lt;/p&gt;

&lt;p&gt;Similarly, if a server has to leave the ring (sad face, maybe it retired), the server immediately clockwise to the departing server on the ring simply takes over the range of hash values that the old server was responsible for. Again, only the data from the &lt;em&gt;leaving&lt;/em&gt; server needs to be moved to its new neighbor.&lt;/p&gt;

&lt;p&gt;This is why consistent hashing is crucial in building massively scalable and resilient distributed systems. It allows us to grow or shrink our database infrastructure without causing an application-wide meltdown. It's the unsung hero keeping your favorite large-scale apps running smoothly as their data (and user base!) explodes.&lt;/p&gt;
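&lt;p&gt;Here's a minimal sketch of such a ring in Python, using MD5 purely for its deterministic spread. The node names and key format are made up for illustration:&lt;/p&gt;

```python
import bisect
import hashlib

def ring_hash(key):
    """Deterministic position on the ring (a 32-bit integer space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Each node is placed on the ring at its own hash position.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise from the key's position to the first node;
        # wrap around to the start of the ring if we fall off the end.
        i = bisect.bisect(self.ring, (ring_hash(key), ""))
        if i == len(self.ring):
            i = 0
        return self.ring[i][1]

    def add_node(self, node):
        bisect.insort(self.ring, (ring_hash(node), node))

ring = ConsistentHashRing(["db1", "db2", "db3"])
before = {f"user{i}": ring.node_for(f"user{i}") for i in range(1000)}
ring.add_node("db4")
after = {k: ring.node_for(k) for k in before}
# Only the keys in the segment db4 claimed get reassigned; all of
# them move to db4, and everything else stays put.
moved = sum(1 for k in before if before[k] != after[k])
print(f"{moved} of 1000 keys moved after adding db4")
```

&lt;p&gt;Production systems usually place each server at many points on the ring (&lt;em&gt;virtual nodes&lt;/em&gt;) so the load evens out, but the clockwise-walk idea is exactly the same.&lt;/p&gt;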

</description>
    </item>
  </channel>
</rss>
