Forem: Sriram R

Vector Clocks

Sriram R — Mon, 27 Feb 2023 05:35:12 +0000

The main limitation of Lamport Clocks was that they were unable to assist us in determining the causal relationship between two events.

If you're not sure what a causal relationship is, read my article on Message Ordering first.

The reason for this is that a Lamport timestamp is a single number. It folds everything a node has seen into one counter, which does not give us a complete picture of the sequence of events that occurred.

Vector Clocks

Instead of maintaining a single timestamp, each node maintains a list of timestamps, with one entry for every node in the system.

For example, if we have a three-node system, the vector timestamp would look like [ T1, T2, T3 ], where T1 is the logical time of Node 1, T2 is the logical time of Node 2, and T3 is the logical time of Node 3.

The entire list forms a single logical clock, known as a vector clock.

How do Vector Clocks work?

When a node boots up, it determines the number of nodes in the cluster (N) and fills an array T of size N with 0's.
Before executing an event, node i increments its own entry in the list: T[i] = T[i] + 1
When a message is sent, the sender's own entry is incremented, and the whole vector clock is attached to the message.
When a message is received, the receiver
1. Updates each element of its list to the maximum of its own value and the value in the received list.
2. Increments its own entry, since receiving a message is itself an event.
3. Performs the action.

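We say Event A happened before Event B (and so may have caused it) if A's vector timestamp is strictly less than B's: every element of A's timestamp is less than or equal to the matching element of B's, and at least one element is strictly smaller.
We say Event A and Event B are concurrent if neither timestamp is less than the other, i.e. their timestamps cannot be compared.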

Pseudocode

on initialisation at node Ni do
    T := [0,0,..0] //One element for each node
end on

on any event occurring at node Ni do
    T[i] := T[i] + 1
end on

on request to send message m from node Ni do
    T[i] := T[i] + 1
    send_message(m, T)
end on

on receiving message (m, T') at node Ni do
    T[j] := max(T[j], T'[j]) for j in (1..n)
    T[i] := T[i] + 1
    process_message(m)
end on
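
The pseudocode above can be sketched in Python. This is a minimal illustration, not a production implementation: the names `VectorClock`, `happened_before` and `concurrent` are hypothetical, nodes are assumed to be numbered from 0, and the cluster size is assumed to be fixed and known up front.

```python
from typing import List


class VectorClock:
    """Vector clock for node `node_id` in a cluster of `num_nodes` nodes (0-indexed)."""

    def __init__(self, node_id: int, num_nodes: int) -> None:
        self.node_id = node_id
        self.clock: List[int] = [0] * num_nodes  # one counter per node

    def local_event(self) -> List[int]:
        # Tick only this node's own entry.
        self.clock[self.node_id] += 1
        return list(self.clock)

    def send(self) -> List[int]:
        # Sending is an event: tick, then attach a copy of the vector to the message.
        return self.local_event()

    def receive(self, received: List[int]) -> List[int]:
        # Element-wise maximum with the received vector, then tick our own entry.
        self.clock = [max(a, b) for a, b in zip(self.clock, received)]
        return self.local_event()


def happened_before(a: List[int], b: List[int]) -> bool:
    # a -> b when every entry of a is <= the matching entry of b and the vectors differ.
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: List[int], b: List[int]) -> bool:
    # Neither vector is less than the other.
    return a != b and not happened_before(a, b) and not happened_before(b, a)


n0, n1, n2 = (VectorClock(i, 3) for i in range(3))
a = n0.send()          # [1, 0, 0]
b = n1.receive(a)      # [1, 1, 0]
c = n2.local_event()   # [0, 0, 1]
print(happened_before(a, b), concurrent(b, c))  # True True
```

Here the send on node 0 happened before the receive on node 1, while the unrelated event on node 2 is concurrent with both.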

Issues with Vector Clocks

The main problem with vector clocks is that they require a large amount of storage because each node must store N timestamps based on the number of nodes.

Furthermore, as the number of nodes increases, we will be sending a significant amount of data because the entire vector clock with all node timestamps must be sent even if only two nodes are interacting.

Lamport Clocks

Sriram R — Fri, 24 Feb 2023 12:40:07 +0000

We looked at Logical Time in the previous article and how it can help us reason about time on a single machine. Let's now see how to apply it in a distributed system.

Consider sending an email to a friend and having him read it. Any action you took before sending the email, such as writing it, checking for grammatical errors, and attaching a file, must have occurred before your friend read the email. Likewise, anything your friend does after reading it, such as forwarding the mail to someone else or printing it, must have occurred after you sent it.

Logical Time is simply a counter that is incremented from an arbitrary point in time for a specific node. It means that the logical timestamps of two distinct machines cannot be compared.

If this is the case, how do we determine the order of these events when logical timestamps cannot be compared?

Lamport Clocks

The Lamport Clock, invented by Leslie Lamport, was one of the first and simplest types of Logical Clocks.

This is how Lamport Clocks function.

When a node is first started, it keeps a local logical time that starts at zero.
Prior to performing an event, the node increases its local logical time counter by one ('Ci = Ci + 1').
The incremented logical timestamp is sent along with the message when it is sent over the network.
The message's receiver does the following:
1. It updates its own Logical Clock to the maximum of its own time and the time of the received message, plus one.
2. It performs the required action

This allows us to directly compare and order the logical timestamps of multiple nodes.

The timestamps relate to the order of events as follows

If Event A happened before Event B, then the timestamp of Event A is less than the timestamp of Event B.
If the timestamps are the same, then the events are concurrent.

Lamport Clocks are designed for a Crash Stop model, but they are simple to adapt to a Crash Recovery model, where the timestamp must be persisted on disk.

Pseudocode for Lamport Clocks

on initialisation do
    t := 0
end on

on any event occurring in the node do
    t := t + 1
end on

on request to send message m do
    t := t + 1
    send_message(m, t)
end on

on receiving (m, t1) do
    t := max(t, t1) + 1
    process_message(m)
end on
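
The same steps can be sketched in Python. This is a minimal illustration under assumptions: the class name `LamportClock` and its method names are hypothetical, and the message transport is left out entirely, so a "message" here is just the timestamp handed from one object to the other.

```python
class LamportClock:
    """A single logical counter, following the pseudocode above."""

    def __init__(self) -> None:
        self.t = 0

    def local_event(self) -> int:
        # Any local event ticks the counter.
        self.t += 1
        return self.t

    def send(self) -> int:
        # Sending is an event too; the new timestamp travels with the message.
        self.t += 1
        return self.t

    def receive(self, t_received: int) -> int:
        # Jump ahead of everything we have heard about, then tick.
        self.t = max(self.t, t_received) + 1
        return self.t


node_a, node_b = LamportClock(), LamportClock()
node_b.local_event()        # node_b.t == 1
node_b.local_event()        # node_b.t == 2
sent = node_a.send()        # 1
got = node_b.receive(sent)  # max(2, 1) + 1 == 3
print(sent, got)  # 1 3
```

The receive always ends up with a larger timestamp than the send, which is exactly the happens-before guarantee.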

Limitations of Lamport Clocks

Lamport Clocks are easy to implement and assist us in reasoning about the happens-before relationship in a distributed system.

According to Lamport Clocks, if Event A caused Event B, then Event A's Lamport Time will be less than Event B's. However, we cannot reverse it and say that if Event A's timestamp is less than Event B's, then they are causally related.
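
A tiny worked example makes this concrete. It is only a sketch with made-up events: two nodes that never exchange a message each keep their own Lamport counter, so none of their events are causally related.

```python
# Two nodes that never exchange a message; each keeps its own Lamport counter.
node_1_time = 0
node_2_time = 0

node_1_time += 1            # event A on node 1 gets timestamp 1
event_a = node_1_time

for _ in range(3):          # three unrelated events on node 2
    node_2_time += 1
event_b = node_2_time       # event B on node 2 gets timestamp 3

# A's timestamp is smaller, yet A and B are concurrent: no message links them.
print(event_a < event_b)  # True
```

The comparison says "A before B", but nothing in the timestamps tells us whether A could have influenced B.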

Because Lamport Timestamps cannot be used to infer causal relationships, we require a new type of Logical Clock that can assist us in determining causal relationships.

In the next article, we'll look at Vector Clocks, which will help us detect causal relationships.

The Notion of Logical Time

Sriram R — Thu, 23 Feb 2023 12:52:40 +0000

In the previous two articles, we learned about Physical Time and the nuances of time.

We concluded the thought by attempting to synchronise two clocks, but this presents a new challenge.

The very idea of synchronising time means that when a drift is found, time can jump forward or backward depending on the drift.

Consider the following scenario:

We wanted to calculate the time it takes to execute a function, but NTP intervenes and pushes time back.
Now, when we figure out how much time has passed, startTime is greater than endTime, so the amount of time that has passed is negative.

This is a significant issue because many critical use cases rely on calculating the time elapsed between two events.

Use of timeouts
TTL expiration
Retry Timers
Performance Evaluation, and so on.

Logical Time

Looking at the use case, all we need to know is how long something took. When we think about it, it has nothing to do with the exact date and time. We only need something like 'This function took 100ms to execute'.

As a result, we created Logical Time. The time measured from an arbitrary point is known as logical time. In a computer, for example, we can consider booting up to be an initial action with time 'T=0' and increment this by one every second.

This time does not indicate the date or time. It only knows that X time has passed since something happened.

This concept, it turns out, is very useful in detecting time differences between two points because there is no concept of time drift or skew here. This time can only go up and never down.
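
Most languages expose a clock of this kind. As a small sketch in Python, `time.monotonic()` from the standard library returns a value that cannot go backwards, so it is a safer choice than the wall clock for measuring elapsed time. The helper name `timed` is hypothetical.

```python
import time


def timed(fn):
    """Run fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start  # never negative, even if NTP adjusts the wall clock
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))
print(result, elapsed >= 0)  # 4999950000 True
```

The value returned by `time.monotonic()` has no meaning on its own; only the difference between two readings does.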

We also discussed how to order events in Message Ordering and how Physical Time is ineffective for determining the order of two messages.

Because logical time is only incremental, we can deduce that Event A occurs before Event B if Event A's logical timestamp is less than Event B's.

Logical Time can also be used to detect Causality between two messages.

There are two major algorithms for Logical Time

Lamport Clocks
Vector Clocks

Both clocks will be examined in depth in subsequent articles.

Message Ordering

Sriram R — Wed, 22 Feb 2023 05:42:38 +0000

The order in which messages are processed determines the final outcome of the actions in any distributed system. This is actually more difficult than it appears to be.

Why is the order of messages important?

Consider the following scenario: a user dislikes a post but quickly realises they want to like it and does so almost immediately.

Because network delays can vary, suppose the request for like arrives before the request for dislike. We would dislike the post if we processed the messages in the order they were received, but the user's intent was to like the post.

Thus, determining the order of two messages is critical for our applications.

Types of Ordering

There are two ways to order.

Total Order
Partial Order

Total Order

Given a set of events 'A,B,C,D', a total order arranges all of them into a single sequence, so any two events can be compared.

For instance, if we have a set of numbers 7,4,5,9,1 and want to order them using the < relation, there is only one order 1,4,5,7,9.

Total order is used in single machine systems where it is easy to determine the order of events as they occur sequentially, and a global clock (the machine's clock) can be used for any parallel events.

Total ordering, on the other hand, cannot be read off physical clocks in distributed systems because different machines have different clocks and concurrent events cannot be ordered.

Partial Order

Only some events can be ordered in a partial order, while others cannot. As a result, multiple orders are generated for a given set of values.

For example, if we have a list of events 'A,B,C,D' and we know that events A and B are ordered, but Event C occurred at the exact same millisecond as Event A and Event D occurred at the exact same millisecond as Event B, there is no way to order these four events in a single order because they occurred concurrently.

Partial Order is commonly used in Distributed Systems because some events occur in order while others do not, which is consistent with the theme of Partial Ordering.

Happens Before Relation

Given two events A and B, we say that A happens before B ('A -> B') if

Event A occurs before Event B in the Node Execution Order
Event A is the transmission of Message M, and Event B is the reception of the same message, because a message cannot be received before it is transmitted.
There is an Event C such that A occurs before C and C occurs before B.

If we are unable to establish a happens-before relationship between two events in either direction, it indicates that they occurred concurrently (A || B).
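
The three rules can be sketched as a reachability check. This is only an illustration with hypothetical events: each pair in `edges` is a direct happens-before link (either consecutive events on one node, or a send and its matching receive), and the third rule, transitivity, falls out of following those links.

```python
from collections import defaultdict


def happens_before(edges, a, b):
    """True if b can be reached from a by following direct happens-before links."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def concurrent(edges, a, b):
    # Concurrent means no happens-before relationship in either direction.
    return a != b and not happens_before(edges, a, b) and not happens_before(edges, b, a)


# Hypothetical history: A then B on node 1, B sends a message that C receives on node 2,
# and D is an unrelated event on node 3.
edges = [("A", "B"), ("B", "C")]
print(happens_before(edges, "A", "C"))  # True
print(concurrent(edges, "A", "D"))      # True
```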

Causality

Take, for example, social media. Do we really care about the order in which we see two unrelated posts? Most likely not.
As a result, the system could use Partial Ordering here, where posts that cannot be ordered are displayed in a random order.

However, there are some events that must be shown in chronological order. For example, if User A responds to Comment C1 by User B, User A's comment must appear after User B's comment. The conversation would be difficult to follow otherwise.

What we described in the preceding example is the concept of Causality, which states that one event contributes to the occurrence of another, i.e. Event B occurred solely because Event A occurred.

Based on what we've seen so far, we can also conclude that if Event A happened before Event B, then Event A may have caused Event B. It is critical to emphasise that this is a possibility, not a guarantee.
However, if Events A and B occur concurrently, we can be certain that Event A did not cause Event B and vice versa.

Establishing Causal Ordering

Assume we want to determine the causal ordering of two events. Is it accurate to rely on Physical Time, especially since two nodes can drift at any point in time?

This gives rise to the concept of Logical Time, in which we don't need the date and time to order events causally. In the following article, we'll go over the specifics of Logical Time.

Physical Time

Sriram R — Tue, 21 Feb 2023 05:37:52 +0000

Physical Time

Have you ever thought about how time works? How do your wall clocks and phone clocks know when a second has passed?

Why do different timezones around the world have different times for the same event?

Why does a clock in India show 10:30 AM on February 19th, but a clock in the United States shows 11:00 PM on February 18th?

Even if we are convinced that there are differences in how the sun rises and sets and have adjusted the time accordingly, how do you justify your watch and wall clock showing different times, even if the difference is only in minutes?

This is strange. Time was supposed to be something we could rely on, but it turns out to be more complicated than we thought.

Understanding how time works and its nuances is critical when developing distributed systems.

History of Time

Egyptian Standard Time

The first recorded time measurements were made in Egypt around 1500 B.C. They divided the time between sunrise and sunset into 12 equal parts. However, this meant that the length of each part changed from day to day, because the time between sunrise and sunset is not the same every day.

This was the first definition of time based on the motion of the earth and the sun.

Quartz Clock

As time passed, scientists began to notice an unusual behaviour in a material known as Quartz. Quartz is a piezoelectric material, which means that when electricity is passed through it, it generates mechanical stress, causing it to vibrate.

Scientists discovered that under static conditions, the vibration frequency remained constant no matter where you went. Quartz’s frequency can be controlled by the crystal’s size, shape, and temperature.

This made the material ideal for creating a new kind of clock because you could adjust its vibration and define time based on it.

This was a revolution in timekeeping. Quartz crystals are inexpensive to produce, and minor changes in environmental conditions have little effect on frequency.

Even today, Quartz Clocks are used in computers, watches, clocks, microwaves, and pretty much any other device that keeps time.

Cons of Quartz Clocks

This, however, was not a perfect solution. The size, shape, and temperature of a Quartz Crystal determine its frequency. It is impossible to make identical crystals because they must be cut. There will always be minor differences and manufacturing defects, no matter how hard we try.

These differences cause two clocks to show different times, a gap also known as Skew.

This is why we often notice a small difference in time between two devices.

Atomic Clocks

As science progressed, scientists discovered atoms and discovered that each atom has its own frequency. Because these frequencies are inherently stable over space and time, the idea of using atomic frequency to build clocks was proposed.

Many atoms were tested, but Caesium emerged as the winner in accurately measuring time, and clocks using Caesium to tell time were built.

These timepieces were known as Atomic Clocks.

The frequency was so stable that it was formally adopted as the definition of the second, the international unit of time.

1 second = 9,192,631,770 Caesium-133 atom frequency oscillations

This is so precise that Atomic Clocks have an error of one second over three million years.

Problems with Atomic Clock

You should have realised by now that nothing is perfect, including atomic clocks. The issue with atomic clocks is not their margin of error, but rather their high cost. Atomic clocks cost around $1500, making them prohibitively expensive for use in every device. Cheap atomic clocks are available, but they are not as accurate.

This is why, even today, we use Quartz Clocks, which are less expensive and more convenient.

Satellites and GPS systems use atomic clocks.

So, what’s up with time zones?

As we built highly accurate atomic clocks, we recognised that atomic time still does not line up with the motion of the earth.

When you define AM as morning and PM as evening, you must be consistent.

However, because of the earth’s motion, when one half of the world experiences day, the other half experiences night.

By this point, we had two definitions of time:

GMT: Time defined by the rotation of the Earth
TAI: International Atomic Time, based on Atomic Clocks

We needed a means to strike a balance between these two formats of time, and thus the UTC was born.

Coordinated Universal Time — UTC

UTC is essentially TAI with corrections that account for the difference between the atomic definition of a second and the rotation of the Earth. These corrections are known as Leap Seconds.

The difference is measured on a regular basis, and a Leap Second is added to or removed from UTC whenever the two drift too far apart.

Understanding how it’s calculated and how these differences are applied isn’t critical for us, but it is important to know that something like this occurs in the real world.

Time Synchronization

We saw how time can differ between devices, causing Skew. However, as you may have noticed, this drift is rarely visible on networked devices. This is because they regularly correct their clocks using NTP (Network Time Protocol).

Network Time Protocol (NTP)

This protocol was designed to synchronise time in devices all over the world.

As previously stated, Atomic Clocks provide the most accurate time measurement, but they are too expensive to include in every device.

So, instead of having it in every machine, why not have it in thousands of machines all over the world, with normal machines communicating with these special servers to synchronise time?

On a high level, this is how NTP operates.

Skew Adjustment

Assume your computer communicated with an NTP server and discovered a 200ms skew from the standard time.

Your computer’s NTP client adjusts the time without your knowledge. This is accomplished through the use of Time Smearing.

Because software relies on time and instructions on a computer run in nanoseconds, a sudden time correction in the order of milliseconds can be disastrous. As a result, when a drift is detected, the difference is applied incrementally over several minutes.

Consider adding 200ms over the course of 10 minutes. The time difference will be insignificant.
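
The arithmetic is easy to check. As a back-of-the-envelope sketch (the function name `slew_rate` is hypothetical, and real NTP clients choose their own rates), spreading the correction evenly gives:

```python
def slew_rate(offset_ms: float, window_s: float) -> float:
    """Milliseconds of correction applied per second of real time."""
    return offset_ms / window_s


rate = slew_rate(200, 10 * 60)      # 200 ms spread over 10 minutes
print(f"{rate:.3f} ms per second")  # 0.333 ms per second
```

So each second on the clock is stretched or shrunk by about a third of a millisecond, which most software will never notice.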

Failure Detectors

Sriram R — Mon, 20 Feb 2023 02:16:44 +0000

In distributed systems, it is crucial to understand failure and how to mitigate it in order to maintain high availability.
As we saw in System Models, the characteristics of Nodes and Network in distributed systems make it difficult to detect failure, particularly given the asynchronous nature of networks.

Why is it so difficult?

Most of the time, we find out about failures by sending a request to a node and waiting for a response for a certain amount of time. If we try X times and don't get a response, we say that the node has failed.

But it's not that easy. We only know that our requests to the node went unanswered X times. We don't know if the node has crashed, if there's a problem with the network between nodes, or if the node is just slow to respond because it's busy with other work.

A node might not respond for more than one reason. If that's the case, how do we tell the difference between these scenarios to know for sure if a node has crashed?

When we rely on responses, we can also wrongly conclude that a node is alive. Think about this situation: Node A sends a request to Node B, and Node B crashes as soon as it has replied.

Since Node B answered before the crash, Node A now thinks that Node B is still alive, even though it isn't.

The timeout you choose is also a very important part of how accurate the results are.

If you choose a short timeout, there will be a lot more false positives because the node might still be alive, but it took longer than the set timeout to process the message. Choosing a longer timeout is also not a good idea because the node could have crashed and you could be waiting for a response that will never come.

Failure Detector

A failure detector is an algorithm that checks to see if other nodes in the network have failed.

How Failure Detectors Work

We can tell the different types of failure detectors apart by looking at two basic features that show the different trade-offs.

Completeness is the percentage of crashed nodes that a "failure detector" is able to find in a certain amount of time.
Accuracy is the number of mistakes that a failure detector makes in a certain amount of time.

Perfect Failure Detector

A perfect failure detector is an algorithm that is both Complete and Accurate to the highest degree. This detector detects every faulty process without assuming a node has crashed before it actually does and has no margin of error.

Unfortunately, as we saw earlier, building a Perfect Failure Detector in purely Asynchronous Systems is impossible.

Eventually Perfect Failure Detectors

Perfect Failure Detectors are impossible in Asynchronous Distributed Systems, so we use Eventually Perfect Failure Detectors instead.

These detectors have the following qualities

A node can be marked as "crashed" even if it's still alive
A node may be temporarily labelled as alive even if it has crashed.
Eventually, a node will be marked as crashed only if it has really crashed.

A perfect example of an Eventually Perfect Failure Detector is a timeout. When a timeout ends, we might not know right away if the node has crashed, but if we keep trying and the node doesn't respond for more than X times, we know it has crashed and mark it as crashed.
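
A timeout-based detector of this kind can be sketched as follows. This is a simplified illustration, not a real library: the class name `TimeoutFailureDetector`, its parameters, and the idea of feeding it measured response times by hand are all assumptions made for the example.

```python
from typing import Dict, Optional


class TimeoutFailureDetector:
    """Suspects a node after `max_misses` consecutive probes with no timely reply."""

    def __init__(self, timeout_s: float, max_misses: int) -> None:
        self.timeout_s = timeout_s
        self.max_misses = max_misses
        self.misses: Dict[str, int] = {}  # node -> consecutive missed probes

    def record_probe(self, node: str, response_time_s: Optional[float]) -> None:
        # response_time_s is None when no reply arrived at all.
        if response_time_s is not None and response_time_s <= self.timeout_s:
            self.misses[node] = 0  # a timely reply clears earlier suspicion
        else:
            self.misses[node] = self.misses.get(node, 0) + 1

    def is_suspected(self, node: str) -> bool:
        return self.misses.get(node, 0) >= self.max_misses


detector = TimeoutFailureDetector(timeout_s=1.0, max_misses=3)
detector.record_probe("node-b", 1.5)    # slow reply counts as a miss
detector.record_probe("node-b", None)   # no reply
suspected_early = detector.is_suspected("node-b")  # only 2 misses so far
detector.record_probe("node-b", 0.2)    # timely reply resets the count
for _ in range(3):
    detector.record_probe("node-b", None)
suspected_late = detector.is_suspected("node-b")   # 3 misses in a row
print(suspected_early, suspected_late)  # False True
```

Note that a suspicion can still be wrong: a node that is merely slow for three probes in a row is marked as crashed until it answers again.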

Exactly Once Processing

Sriram R — Sun, 19 Feb 2023 04:58:44 +0000

Multiple Deliveries of a Message

Nodes in a distributed system talk to each other over a network.
As we saw in System Models, most networks are Fair Loss Links, where data can be lost while in transit.

We use retries to deal with this. When a node finds out that the message it tried to send might not have reached its destination, it tries again in the hopes that the message will eventually get through.

This seems like a good way to solve the problem, but there's a catch.
When a node sends a message, it waits for the receiving node to confirm that it got the message. There is a chance, though, that the message was received at its destination but that the acknowledgment message was lost on the way.

Imagine this happening!

In this case, the message was received by Node B, but the acknowledgement got lost along the way. Not knowing this, Node A did a retry, which means Node B got the same message twice and processed it twice.

Real World Example

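Imagine that you're sending money to someone. The bank processes the transfer, but the acknowledgement is lost, so your app retries and the bank processes the same transfer again.

This is a bad situation because you're transferring money twice when you only meant to do it once.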

Difference between Delivery and Processing

Before we can figure out how to avoid these kinds of situations, we need to know how Delivery and Processing are different.

Delivery is the hardware-level event of a message arriving in a specific node while processing is taking action on the received message on the Software Layer.

For example, if you send a request to a server, it is called "Delivery" when the server gets the request, and "Processing" when the server does something with the request.

This is important to know because we can't stop duplicate messages from being delivered. Networks aren't reliable, so retries, and with them duplicates, are unavoidable.

But we can use algorithms that act on duplicate messages to make sure that duplicate actions are only processed once.

So it's important to know that "exactly-once delivery" is not possible, but "exactly-once processing" is.

How do we avoid this?

To deal with these kinds of situations, we take precautions to make sure that a node only processes a message once, even if it is sent more than once.

Idempotency

An operation is idempotent if it can be repeated multiple times and the result is the same as if it had been performed once.

Example

Let's say you work for a social media site and your job is to build the "likes" feature.

There is a chance that you could like a post more than once because of retries, which we talked about above.

Let's look at two ways to do this and see which one is better.

Method 1

We keep track of how many people like each post, and when we get a new like, we add one.

This operation is not idempotent: if the increment is applied more than once because of a retry, you will end up with duplicate likes.

newLikeCount = oldLikeCount + 1

Method 2

We keep a set of users who have liked a specific post, and for each new like, we add the 'userId' to that set.
The property of a set is that it will not contain any duplicates.

This operation is idempotent because even if you try it more than once, the result won't change. It will still show that the user has liked the post once, since a set doesn't have any duplicates.

usersLiked = post.getExistingLikedUsers();
usersLiked.add(userId); // no effect if userId is already in the set
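
To see the difference between the two methods, here is a small Python sketch that replays the same like twice, as a retry would. The function names and the user id are made up for the example.

```python
like_count = 0       # Method 1: a plain counter
liked_users = set()  # Method 2: a set of user ids


def like_with_counter() -> None:
    global like_count
    like_count += 1  # every call changes the result


def like_with_set(user_id: str) -> None:
    liked_users.add(user_id)  # adding the same id again changes nothing


for _ in range(2):   # the original request plus one retry
    like_with_counter()
    like_with_set("user-42")

print(like_count, len(liked_users))  # 2 1
```

The counter has recorded two likes for a single user action, while the set still holds exactly one entry.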

Remove duplicates

There are times when it's hard to get idempotency. In these cases, the caller module will send a unique ID to the processing module, which will keep track of every action it takes. If it gets a duplicate request with the same ID, the processing module throws away the message.

Example

Sending an email isn't an idempotent action, and you can't make it one. Instead, when a new email is sent, you can send a UniqueID for that particular email. Because it tracks the process using the Unique ID, the application sending the mail won't send it again if it has already sent it.
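
A rough sketch of this idea in Python is shown below. The `EmailSender` class is hypothetical and keeps the processed IDs in memory; a real system would need to store them somewhere durable and shared, and the "sending" here is just an append to a list.

```python
import uuid


class EmailSender:
    """Drops any request whose ID has already been processed."""

    def __init__(self) -> None:
        self.processed_ids = set()
        self.outbox = []  # stands in for actually sending mail

    def send(self, request_id: str, to: str, body: str) -> bool:
        if request_id in self.processed_ids:
            return False  # duplicate request: throw it away
        self.processed_ids.add(request_id)
        self.outbox.append((to, body))
        return True


sender = EmailSender()
request_id = str(uuid.uuid4())  # generated once by the caller, reused on every retry
first = sender.send(request_id, "friend@example.com", "hello")
retry = sender.send(request_id, "friend@example.com", "hello")
print(first, retry, len(sender.outbox))  # True False 1
```

The important detail is that the caller generates the ID once and reuses it for retries; a fresh ID per attempt would defeat the check.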

Idempotency and Deduplication: Pros and Cons

Idempotency is easier to set up and manage because nodes don't have to work together. Deduplication, on the other hand, needs all the nodes in your system to work together. This is because the original request and the retry request could end up in different nodes.

Since idempotency is built into the processing module, it can be used even if you don't have control over the caller module. But for deduplication, you need to be in charge of the caller module because it generates IDs for every request.

Byzantine Generals Problem

Sriram R — Sat, 18 Feb 2023 12:49:25 +0000

Up until now, we thought that none of the nodes could be malicious and try to do something other than what they were supposed to do. Unfortunately, because there is inherent evil in the real world, any node has the potential to turn evil and attempt to upset the peace among other nodes.

In this example, we will look at what could go wrong if a node turned out to be bad.

You might be asking: what do you mean when you say a node is malicious? How does this affect building systems? Here's a good example:

Problem Statement

Similar to the Two Generals Problem we saw in the previous article, let's imagine a theoretical situation where multiple generals want to attack a fort.

In the Two Generals Problem, we saw that the messages between two generals could be intercepted. Let's assume, for the sake of simplicity, that in this situation, all messages are delivered to the other generals and none are lost.

But it's possible that one of the generals is a traitor who really works for the fort and is trying to stop the plan of the other generals to attack. Unluckily for the honest generals, there's no way to know which one or possibly more than one general is a traitor.

Let's say a successful attack is when all honest generals attack, and a successful retreat is when all honest generals retreat. Even if one honest general attacks when others retreat or retreats when others attack, our solution is flawed.

Given these restrictions, the honest generals will have to agree on whether to attack or retreat.

Scenario

Let's think about these two situations.

General 1 tells both General 2 and General 3 to attack, but General 2 tells General 3 that General 1 told him to retreat. In this situation, General 2 is the traitor.

General 1 told General 3 to attack, but told General 2 to retreat. Since General 2 has to relay his instruction to General 3, General 3 again ends up with two conflicting versions of General 1's instruction. In this situation, General 1 is the traitor.

Since we are the central entity monitoring all communications, it is clear to us who the traitor is. But when there isn't a central entity keeping an eye on all messages, it's impossible for General 3 to tell who the traitor is and, by extension, what the valid message is.

This is called "Byzantine Behaviour," and we saw a little bit of it in System Models.

Real Life Scenario

The most well-known Byzantine System is Blockchain. Because it is distributed and decentralised by design, it is bound to have nodes that behave in a Byzantine way.

User A claims to have given User B $200 while User B claims to have received $100 from User A. Which one of these is the Byzantine Node?

Byzantine systems are also used when you add a third party to your system, like a payment gateway.
Let's say you're building an online wallet solution and the user says he added Rs 200 to the wallet but the payment gateway says he only added Rs 100.
Who is the Byzantine one here?

To handle these behaviours, Quorum is used, where the truth is what the majority of nodes agree to. Blockchain uses Proof of Work to deal with these kinds of mistakes.
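
The quorum idea can be sketched as a simple majority vote. This is a simplification with a hypothetical function name: it only shows the "truth is what the majority says" step, and real Byzantine fault-tolerant protocols need stronger conditions than a bare majority.

```python
from collections import Counter


def quorum_value(reports):
    """Return the value reported by a strict majority of nodes, or None if there is none."""
    if not reports:
        return None
    value, votes = Counter(reports).most_common(1)[0]
    return value if votes > len(reports) // 2 else None


print(quorum_value(["attack", "attack", "retreat"]))  # attack
print(quorum_value(["attack", "retreat"]))            # None
```

With three reports, one dissenting node is outvoted; with an even split there is no majority, so no decision can be made.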

Two Generals Problem

Sriram R — Fri, 17 Feb 2023 17:12:24 +0000

This is a thought experiment designed to help us understand a very basic problem in distributed systems.

Scenario

There are three cities named A, B, and C. Both City B and City C's generals want to attack City A, but City A is well-defended enough that neither City B nor City C can win an invasion against City A on their own.

The smart generals in Cities B and C come up with a plan. They think that if they both attack City A at the same time, they will be able to take it down.

Both generals agree to do their own reconnaissance and send a messenger when they find the best time to attack.

The forest between City B and City C is also connected to City A. When City A hears about this plan, it sends people to stop any messengers going through the forest to stop them from carrying out their plan.

This means that it's possible that City C will never get a message sent by City B, and the same is true for City B.

Our challenge is to come up with a way for both cities to agree on when to attack City A.

Solutions

Solution 1

City B knows that City A might catch its messenger, so it sends more than one, hoping that one of them will get through. Then, City B attacks City A without hearing back from City C, hoping that at least one of its messengers would have reached City C.

But there's a good chance that City A will catch all the messengers, and City B will lose because City C never got the message.

Solution 2

What if City B waited until it heard back from City C that it got the message before attacking? So, City B won't have to fight a battle it can't win, right?

City B might not be fighting a losing battle, but City C might be, since the messenger sent by City C to confirm the invasion date could be caught by City A, and City C has no way of knowing if their confirmation got through.

Is this practical?

Sure, and it happens more often than you think.

Imagine we are building an online shop and want to connect it to a Payment Service.

Customer, Online Shop, and Payment Service must all agree on a common message ( Money Charged ).

It's possible that the money was charged to the customer, but when the payment service tried to tell you, the message got lost. This is a problem because money is taken from the customer but our Online Shop doesn't know it.

If this really happens, how do online shops work?

In real life, there are reconciliation systems that track these kinds of stray money movements and issue refunds later, unlike our thought experiment, where there was no scope for fixing things later.

If you've ever had a problem with Amazon or Swiggy where your payment went through but your order wasn't placed, this is what happened. Swiggy didn't know about the payment right away and couldn't place your order, but it eventually learned about it and processed your refund.

There's nothing we can do to keep messages from getting lost, but we can figure out what went wrong and fix it later.
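
At its core, reconciliation is a comparison between two records of the same events. The sketch below is a toy version with made-up payment IDs and a hypothetical function name; it assumes the shop can periodically fetch the list of charged payments from the payment service.

```python
def find_refunds(charged_payment_ids, placed_order_payment_ids):
    """Payments the payment service charged that have no matching order in the shop."""
    return sorted(set(charged_payment_ids) - set(placed_order_payment_ids))


charged = ["p1", "p2", "p3"]  # what the payment service says it charged
placed = ["p1", "p3"]         # payments the shop knows about
print(find_refunds(charged, placed))  # ['p2']
```

Payment `p2` is the one whose confirmation got lost, so it is the one to refund or to turn into an order.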

Distributed System Models

Sriram R — Fri, 17 Feb 2023 03:18:44 +0000

System Models

As we saw in the first part of this series, a distributed system is made up of two parts: the Node and the Network. Based on how these two parts work, we can create different kinds of behaviour that should be taken into account when building distributed systems. We call these System Models.

The behaviours we use to create variations are usually based on two things:

How the different parts of a distributed system work together.
How the parts of a distributed system fail.

Network Behavior

Reliable Link

A reliable link says that if you send a message across a network, it will be delivered to its destination every time.
Most of the time, these links are used in single-machine systems where components can talk to each other reliably.

Fair Loss Link

A fair loss link means that a message sent across a network might get lost, duplicated, or reordered. It can be turned into a Reliable Link if we keep retrying whenever a message is lost and discard duplicates on the receiving side. If we keep retrying, the message will eventually get to its destination, but there is no upper bound on how long that will take. In theory, it could take up to 100 years.
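Here is a minimal sketch of retrying over a lossy link. The losses are scripted rather than random so the example is repeatable, and `Receiver`, `reliable_send`, and `loss_script` are hypothetical names, not part of any library.

```python
class Receiver:
    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def on_message(self, msg_id, payload):
        # Retries can deliver the same message twice, so process each id once.
        if msg_id not in self.seen_ids:
            self.seen_ids.add(msg_id)
            self.processed.append(payload)
        return True  # acknowledgement


def reliable_send(receiver, msg_id, payload, loss_script):
    """Retry until an acknowledgement comes back.

    loss_script: list of (message_lost, ack_lost) booleans, one pair per attempt.
    Returns the attempt number that succeeded, or None if the script ran out.
    """
    for attempt, (message_lost, ack_lost) in enumerate(loss_script, start=1):
        if message_lost:
            continue  # the message never arrived, so try again
        ack = receiver.on_message(msg_id, payload)
        if ack and not ack_lost:
            return attempt
    return None


receiver = Receiver()
# Attempt 1: message lost. Attempt 2: ack lost. Attempt 3: both get through.
attempts = reliable_send(receiver, 1, "hello",
                         [(True, False), (False, True), (False, False)])
print(attempts, receiver.processed)
```

The message reaches the receiver twice here, because the sender cannot tell a lost message from a lost acknowledgement. That is why the receiver remembers which ids it has already processed.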

Arbitrary Link

This link says that any party can intercept messages sent between nodes and change, spoof, eavesdrop on, block, or replay messages.

This model is a very good representation of what happens when we use the internet in places like Starbucks or a Coffee Shop that aren't very reliable. The owner of the Coffee Shop can easily exploit the network packets and use them in a bad way.

With the arrival of TLS, a third party can no longer read or tamper with the packets unnoticed, but that doesn't stop it from blocking the communication.

Node Behaviour

Crash Stop

This model says that when a node becomes faulty, it never recovers and stops functioning permanently.
For example, if you drop your phone under a train, it won't work again.

Crash Recovery

In this model, a node that stops working properly can get back to normal after any amount of time.
For example, if the operating system in a virtual machine (VM) crashes, a machine restart can fix it and make the node healthy again.

Byzantine

A node is deemed faulty if it departs from what it is supposed to do. A Byzantine node can misbehave in any way, for no clear reason or for no reason at all.

This typically occurs when a hostile actor or a flaw compromises the node's algorithm.

There's a famous thought experiment called the Byzantine Generals Problem that explains Byzantine behaviour in detail.

Blockchains are a great example of a Byzantine system, and all of the algorithms built for them assume that they will behave in a Byzantine way.

Timing Behaviour

Synchronous System

Every synchronous system sets a maximum time limit for a message to reach its destination and a maximum expected duration for a message to be processed.

For example, if you want to write something into RAM, the RAM has guarantees about how long it could take in the worst case.

Creating a synchronous distributed system is nearly impossible, and presuming Synchrony can be devastating.

Let's say you think a node can process a message in 5ms, but in the middle of the process, the Operating System does a Context Switch or a [long GC pause](https://docs.datastax.com/en/dse-trblshoot/doc/t
In this case, the assumption is wrong, and your system goes down with it.
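To make the danger concrete, here is a small sketch of a timeout check built on that 5ms assumption. The function name and the 50ms pause are assumptions for illustration only.

```python
def is_suspected(last_heartbeat_ms: float, now_ms: float, timeout_ms: float) -> bool:
    """Suspect a node if nothing was heard from it within the assumed time bound."""
    return (now_ms - last_heartbeat_ms) > timeout_ms


ASSUMED_BOUND_MS = 5  # the synchrony assumption from the example
gc_pause_ms = 50      # hypothetical pause on an otherwise healthy node

# The node last reported at t=0, then paused for 50 ms before replying.
falsely_suspected = is_suspected(last_heartbeat_ms=0, now_ms=gc_pause_ms,
                                 timeout_ms=ASSUMED_BOUND_MS)
print(falsely_suspected)
```

The node is healthy, it is just slow, yet the check declares it dead. Any logic that relies on the 5ms bound being true is now acting on wrong information.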

Asynchronous System

We make no assumptions about processing time or message delivery time across a network in this model. Any message can be delayed at random and without warning.

Algorithms built for asynchronous systems are very robust because their correctness does not depend on network delays or latency. However, designing such algorithms is hard.

Partially Synchronous System

In this model, we assume that the system behaves synchronously most of the time, but that it can become asynchronous at unpredictable moments. This provides a decent compromise between synchronous and asynchronous systems.

Why should you care about System Models?

Your Distributed System is only as strong as the assumptions you use to develop it. These models show how different systems can work, and if you make a wrong assumption, your system could break.

For instance, blockchain algorithms assume Byzantine Node Behaviour, meaning any node might be malicious, because they must stay correct even if someone tries to tamper with the ledger. Had they assumed a Crash-Stop model instead, they would have been flawed: they would not handle a misbehaving node, and malicious people would have exploited that weakness.

As a result, understanding System Models and selecting an accurate System Model based on your use case is critical when designing large-scale systems.

Fallacies of Distributed System

Sriram R — Thu, 16 Feb 2023 09:18:15 +0000

The fallacies of distributed computing are a list of false assumptions that architects and developers of distributed systems may make. In this post, we'll look at what these fallacies are, how they came to be, and how to avoid them to build reliable distributed systems.

The network is reliable

When we build systems with only one machine, we usually assume that the components can talk to each other in a reliable way. All the parts are in one computer, so talking to RAM or HDD is easy and doesn't cause a lot of communication errors.

When we build distributed systems, our computers are spread out over a network, which is not very reliable.

There have been times when Google Cloud's network went down, and when it was looked into, it was found that sharks were nibbling on the underwater optic cable that powers their network.

There are many things that could cause a network to go down. Because of this, whenever we build distributed systems, we should always assume that the network is not reliable.

Latency is Zero

When we build systems with just one computer, we assume there is no latency because reading from RAM, HDD, or SSD takes almost no time.

But when it comes to distributed systems, networks do have latency because the machines can be in different places and the signals have to travel from one machine to the next.

There's also a chance that your computers are on different continents, which makes the difference in latency even more obvious.

Always think of latency as a cost, and try to keep that cost as low as possible.
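A rough calculation shows that distance alone puts a floor under latency. This sketch assumes a signal travels through fibre at about 200,000 km/s (roughly two thirds of the speed of light) and ignores routing and processing delays. The 10,000 km distance is just an example.

```python
SPEED_IN_FIBRE_KM_PER_S = 200_000  # assumption: about two thirds of the speed of light


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a fibre of the given length."""
    one_way_s = distance_km / SPEED_IN_FIBRE_KM_PER_S
    return 2 * one_way_s * 1000


# Two machines roughly 10,000 km apart.
print(min_round_trip_ms(10_000))
```

That works out to about 100 ms for a single round trip before any real work is done, and no amount of faster hardware removes it.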

Bandwidth is Infinite

It's a mistake to think that you can add more data to a network channel without taking its limits into account.

Let's say you make a website that pulls 5MB of data from your servers every time it loads. As your business grows, your systems will soon run into the bandwidth limit of your network.

This problem can't be fixed by adding more nodes, so be careful with every bit you send or receive.
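A quick back-of-the-envelope sketch of that 5MB page. The 1 Gbit/s link is an assumption, and 1 MB is taken as 10**6 bytes.

```python
def max_page_loads_per_second(link_bits_per_s: float, page_bytes: float) -> float:
    """Upper bound on page loads per second that a single link can carry."""
    return link_bits_per_s / (page_bytes * 8)


LINK_BITS_PER_S = 1_000_000_000  # assumption: a 1 Gbit/s uplink
PAGE_BYTES = 5_000_000           # the 5 MB page from the example

print(max_page_loads_per_second(LINK_BITS_PER_S, PAGE_BYTES))
```

Under these assumptions the link tops out at 25 page loads per second, no matter how fast the servers behind it are.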

The Network is Secure

Every system with many parts needs a network. Without a network, nodes can't talk to each other. But thinking that the network is safe and that no one can get into it is false and often disastrous. If you don't think about security in your system, hackers will be very happy to hear that from you.

Every system needs to take security seriously and keep looking for security vulnerabilities.

Topology doesn't change

We think that once a system is set up and working perfectly, it will continue to work the same way in the future, but nothing could be further from the truth. Any node can break, and any network between nodes can temporarily stop working. The topology of your system can change.

Always ask yourself, "Does my system have a single point of failure?"
For each part of your system, ask, "If this part fails, will my whole system fail?"

This problem is so common that tools like Zookeeper and Consul were made to detect topological changes and respond to them.

It can be hard, but building systems that can adapt to changes in their topology makes for a strong system.

There is one administrator

This fallacy is the belief that one administrator is in charge of everything. In reality, you can't be in charge of everything.

In a single-machine system, there will be one administrator who knows everything and can control every part of the system.

But in a distributed system, as your system grows, it will start relying on systems that you don't control.

For example, if you are building an online store, you will add a payment gateway that will be run by another company. If that goes down for a short time, you can't do much. The payment gateway will have to fix it, and all you can do is wait until their system is back up and running.

Transport cost is Zero

Cloud service providers will charge you based on how much data your node sends and receives. Each bit you send between your nodes costs you money.

As your systems grow, it makes sense to optimise them for these costs.
For example, if you have two systems that talk to each other using JSON, you might want to look into transfer-optimized formats like Protobuf. These transfer-optimized formats can help you save a lot of money and time.
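To get a feel for the difference, this sketch compares a JSON encoding with a fixed binary layout built with Python's `struct` module. The binary layout is a hand-rolled stand-in, not Protobuf's actual wire format, and the message fields are made up.

```python
import json
import struct

# Hypothetical message exchanged between two services.
message = {"user_id": 12345, "amount_cents": 1999, "success": True}

as_json = json.dumps(message).encode("utf-8")

# Fixed layout: two unsigned 32-bit integers and one boolean byte.
# The field names are not sent; both sides must agree on the layout up front.
as_binary = struct.pack("<II?", message["user_id"],
                        message["amount_cents"], message["success"])

print(len(as_json), len(as_binary))
```

The binary form is several times smaller because it doesn't repeat the field names in every message. The trade-off is that both sides need a shared schema, which is the kind of thing Protobuf manages for you.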

Global Clock Fallacy

In a single machine, any part that reads the date or time at the same point will always get the same date or time.

But because of how clocks work, there can be a difference in time between two or more machines. This is often called a "clock skew."

In distributed systems, you cannot simply assume that time on one machine will be the same as time on other machines.
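Here is a small sketch of how clock skew can flip the order of two events. The offsets and times are invented for the example.

```python
def local_timestamp(true_time_ms: int, clock_offset_ms: int) -> int:
    """What a machine's clock reads, given its offset from true time."""
    return true_time_ms + clock_offset_ms


# Event A really happens first, on a machine whose clock is accurate.
event_a = ("A", local_timestamp(true_time_ms=1000, clock_offset_ms=0))
# Event B happens 2 ms later, on a machine whose clock runs 5 ms behind.
event_b = ("B", local_timestamp(true_time_ms=1002, clock_offset_ms=-5))

ordered_by_wall_clock = sorted([event_a, event_b], key=lambda event: event[1])
print([name for name, _ in ordered_by_wall_clock])
```

Sorting by the local timestamps puts B before A, even though A happened first.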

Introduction to Distributed Systems

Sriram R — Wed, 15 Feb 2023 16:54:25 +0000

What's a Distributed System?

A distributed system is a system whose components are on different computers connected by a network. These computers send messages to each other to talk to each other and coordinate their actions.

Components of a Distributed System

Nodes - The various components that comprise a distributed system
Network - The way that the nodes of a distributed system talk to each other

Why do we need Distributed Systems?

Performance

There are limits to what a single node can do. Each machine has hardware-based limits.
We can scale up the hardware of a machine by adding more RAM and CPU, but it gets very expensive after a certain point to improve the performance of a single computer.

Instead, we can get the same results with several less expensive machines.

Scalability

Most computer systems deal with information. They are in charge of storing and processing data.
Since a single machine's performance can only be scaled to a certain point, we need more than one machine to handle the amount of data we get today.
One computer won't be able to handle all of your requests.

With multiple machines, we'll be able to store and process data more efficiently by splitting them up.

Availability

Most services need to be available 24 hours a day, 7 days a week, which is a big challenge. A single machine can break down at any time.
If your service goes down, you'll lose money right away.
If you store all of your data on a single machine, and that machine crashes, you lose all of your data.

To be highly available, we need multiple machines so that if one fails, we can quickly switch to another.
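As a rough illustration of why extra machines help, this sketch computes the chance that at least one machine is up. It assumes machines fail independently, which real deployments only approximate, and the 99% figure is just an example.

```python
def availability_with_replicas(single_availability: float, replicas: int) -> float:
    """Probability that at least one replica is up.

    Assumption: replicas fail independently of each other.
    """
    return 1.0 - (1.0 - single_availability) ** replicas


for n in (1, 2, 3):
    print(n, availability_with_replicas(0.99, n))
```

Under that assumption, two machines that are each up 99% of the time give roughly 99.99% combined, as long as we can switch between them quickly.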

Difficulties Designing Distributed Systems

Network Asynchrony

Communication networks have a property called asynchrony, which says that there is no way to know how long it will take to deliver a message from one machine to another. Messages can also arrive out of order.
This makes it hard to build distributed systems.

To understand better, let's take an example.
Let's say a user disliked a post on a social media site, but then realised they meant to like it and changed their vote.
Since the network is asynchronous, it's possible that the like was received and processed first.
The real goal was for the post to be liked, but since the messages arrived out of order and the dislike was the last message received, the system marked the post as disliked.
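The scenario can be sketched as follows. The second function shows one possible fix, assuming the client attaches an increasing sequence number to each vote. That detail is an assumption of this sketch, not something the example specifies.

```python
def apply_in_arrival_order(messages):
    """Naive server: the last message to arrive wins."""
    state = None
    for _seq, vote in messages:
        state = vote
    return state


def apply_with_sequence_numbers(messages):
    """Server that ignores any message older than the newest one it has applied."""
    state, latest_seq = None, 0
    for seq, vote in messages:
        if seq > latest_seq:
            state, latest_seq = vote, seq
    return state


# The user sent (1, "dislike") and then (2, "like"); the network swapped them.
arrived = [(2, "like"), (1, "dislike")]

print(apply_in_arrival_order(arrived))
print(apply_with_sequence_numbers(arrived))
```

The naive server ends up with a dislike, while the server that checks sequence numbers keeps the like the user actually wanted.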

Partial Failures

Failure of some components of your system is called "partial failure." If the application doesn't take this into account, it could lead to bad results.

For example, let's say you have multiple machines where your users' data is spread out, and you lose connection with one of them. Users whose data was stored on that machine will have to wait for it to come back up.

It also makes things much more complicated when we need to do atomic transactions while some nodes are down.

Concurrency

Concurrency is the ability to do more than one computation at a time, possibly on a single data set. This makes things more complicated because the two computations can mess with each other and cause bad results.
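A classic case is the lost update. To keep the example repeatable, this sketch writes the interleaving out by hand instead of using real threads. The function names are made up.

```python
def interleaved_increments(counter: int) -> int:
    """Two workers each try to add 1, with their steps interleaved."""
    read_by_a = counter      # worker A reads the current value
    read_by_b = counter      # worker B reads before A has written
    counter = read_by_a + 1  # A writes its result
    counter = read_by_b + 1  # B overwrites it using its stale read
    return counter


def sequential_increments(counter: int) -> int:
    """The same two increments, one after the other."""
    counter = counter + 1
    counter = counter + 1
    return counter


print(interleaved_increments(0), sequential_increments(0))
```

Both workers did their job, yet one increment vanished: the interleaved version ends at 1 instead of 2.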

Measuring Correctness

How do we know that a system is correct or working as it should?
There are two main factors that determine whether a system is right or wrong:

Safety

A safety property says that something in the system must never happen.

If we think of a bicycle as a system, for example, the safety property says that the wheel must always be attached to the bike when it is running. If the wheel is taken off the cycle while it is running, bad things can happen.

Liveness

A liveness property defines something that must eventually occur in a system.

In the case of a bicycle system, a liveness property might be that the bike should eventually move when pedalled, and that it should eventually stop when the brakes are applied.
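The two kinds of property can be sketched as checks over a recorded trace of system states. The trace format and function names are assumptions for illustration. A finite trace can only approximate an "eventually" property, since the good thing might still happen after the recording ends.

```python
def never_happens(trace, is_bad):
    """First kind of property: no state in the trace is bad."""
    return not any(is_bad(state) for state in trace)


def eventually_happens(trace, is_good):
    """Second kind of property: some state in the trace is good."""
    return any(is_good(state) for state in trace)


# Hypothetical recorded states of the bicycle from the example.
trace = [
    {"wheel_attached": True, "pedalling": True, "moving": False},
    {"wheel_attached": True, "pedalling": True, "moving": True},
]

safe = never_happens(trace, lambda s: not s["wheel_attached"])
live = eventually_happens(trace, lambda s: s["pedalling"] and s["moving"])
print(safe, live)
```

In this trace the wheel never comes off and the bike does start moving while being pedalled, so both checks pass.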