<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Elegberun</title>
    <description>The latest articles on Forem by Daniel Elegberun (@gbengelebs).</description>
    <link>https://forem.com/gbengelebs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F399117%2Ff37aacbe-05de-41ad-902f-a4ae9cb82fe0.jpeg</url>
      <title>Forem: Daniel Elegberun</title>
      <link>https://forem.com/gbengelebs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gbengelebs"/>
    <language>en</language>
    <item>
      <title>Building a personalized workout recommender using content-based filtering with ML.NET</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Wed, 19 Feb 2025 20:23:27 +0000</pubDate>
      <link>https://forem.com/gbengelebs/building-a-personalized-workout-recommender-with-mlnet-a-step-by-step-guide-1nji</link>
      <guid>https://forem.com/gbengelebs/building-a-personalized-workout-recommender-with-mlnet-a-step-by-step-guide-1nji</guid>
      <description>&lt;p&gt;We are currently experiencing the Artificial Intelligence(AI) revolution. The explosion of large language models (LLMs) and machine learning has transformed multiple industries, including the health and fitness space.&lt;/p&gt;

&lt;p&gt;One of the most exciting applications of AI in fitness is the creation of personalized workout plans. With libraries like &lt;a href="https://dotnet.microsoft.com/en-us/apps/ai/ml-dotnet" rel="noopener noreferrer"&gt;ML.NET&lt;/a&gt;, developers can leverage the power of machine learning in the C# and .NET ecosystem. In this article, we’ll explore how to build a simple workout recommender application using content filtering in ML.NET.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is content filtering?
&lt;/h2&gt;

&lt;p&gt;Content-based filtering is a recommendation technique that suggests items (e.g., workouts, movies, products) based on the similarity between each item’s features and the user’s stated preferences or query. It is widely used in recommendation systems and information retrieval (e.g., searching for similar documents).&lt;/p&gt;

&lt;p&gt;In our workout recommender, the user enters a query such as &lt;strong&gt;“beginner chest exercises with dumbbells”&lt;/strong&gt; and gets back the exercise recommendations that are most similar to that query.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does content-based filtering work?
&lt;/h2&gt;

&lt;p&gt;Content-based filtering works by analyzing the features of the items in a dataset, comparing them to the user’s query, and then recommending the items that are most similar.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Exercise representation
&lt;/h3&gt;

&lt;p&gt;Each exercise in our dataset is represented by its features, such as the body part(s) it targets &lt;strong&gt;(chest, legs, arms)&lt;/strong&gt;, equipment needed &lt;strong&gt;(dumbbells, resistance bands)&lt;/strong&gt;, or difficulty level &lt;strong&gt;(beginner, intermediate, advanced)&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  2. User query as input
&lt;/h3&gt;

&lt;p&gt;The user enters a query, such as &lt;strong&gt;"beginner exercises for the chest with dumbbells"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This query is processed and transformed into a feature vector that represents the user’s intent. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query: "Exercises for the chest with dumbbells"&lt;/li&gt;
&lt;li&gt;Features: {"bodyPart": "chest", "equipment": "dumbbell", "difficulty": "beginner"}&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Similarity measurement
&lt;/h3&gt;

&lt;p&gt;The system compares the &lt;strong&gt;user’s query vector&lt;/strong&gt; with the &lt;strong&gt;feature vectors&lt;/strong&gt; of all exercises in the database.&lt;/p&gt;

&lt;p&gt;For this project, we’ll be making use of cosine similarity. Cosine similarity is a common metric used to measure how similar two vectors are. It calculates the cosine of the angle between the two vectors. For non-negative text vectors like ours, this produces a score between 0 and 1, where 1 indicates a perfect match and 0 indicates no match (in general, cosine similarity can range from -1 to 1).&lt;/p&gt;
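
&lt;p&gt;As a rough sketch, the metric can be computed directly. The &lt;code&gt;CosineSimilarity&lt;/code&gt; helper below is illustrative, not an ML.NET API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative helper: cosine similarity of two equal-length vectors.
public static double CosineSimilarity(float[] x, float[] y)
{
    double dot = 0, magX = 0, magY = 0;
    for (int i = 0; i &lt; x.Length; i++)
    {
        dot  += x[i] * y[i];   // numerator: X · Y
        magX += x[i] * x[i];
        magY += y[i] * y[i];
    }
    if (magX == 0 || magY == 0) return 0; // guard against zero vectors
    return dot / (Math.Sqrt(magX) * Math.Sqrt(magY));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, &lt;code&gt;CosineSimilarity(new float[] { 1, 0 }, new float[] { 1, 1 })&lt;/code&gt; returns approximately 0.707, matching the 45° example below.&lt;/p&gt;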

&lt;h3&gt;
  
  
  4. Recommendation generation
&lt;/h3&gt;

&lt;p&gt;Based on the similarity scores, the system ranks the exercises and recommends the top-N exercises that best match the user’s query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cosine similarity in depth
&lt;/h2&gt;

&lt;p&gt;Cosine similarity is particularly useful for comparing the user’s query with the exercise features because it focuses on the direction of the vectors rather than their magnitude. This makes it ideal for comparing sparse or unevenly weighted features.&lt;/p&gt;

&lt;p&gt;To compute cosine similarity, we first need to represent the text as numerical vectors. One common way is to use term frequency or TF-IDF (Term Frequency–Inverse Document Frequency).&lt;/p&gt;

&lt;p&gt;Let's consider a simple user query looking for exercises targeting the chest. The original query is &lt;strong&gt;“recommend chest exercises”&lt;/strong&gt;. From the query, we can extract the word: &lt;strong&gt;chest&lt;/strong&gt;. We want to compare this to existing chest workouts in our database for similarity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User query: "Chest"&lt;/li&gt;
&lt;li&gt;Database entry: "Chest press"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can represent these exercises as vectors based on their features. For simplicity, let’s use "Chest" and "Press" as the key features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Words&lt;/th&gt;
&lt;th&gt;Chest&lt;/th&gt;
&lt;th&gt;Press&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chest&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chest press&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The resulting vectors are:&lt;br&gt;
Vector A (Chest): [1, 0]&lt;br&gt;
Vector B (Chest press): [1, 1]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ke8iuhdqwgg3d2vs7tx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ke8iuhdqwgg3d2vs7tx.png" alt="graph showing the angle between 2 vectors" width="283" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The angle between the two vectors determines the cosine similarity. In this case, the angle is 45°, and the cosine of 45° is approximately &lt;strong&gt;0.707.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Cosine similarity ignores the frequency of the words
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Words&lt;/th&gt;
&lt;th&gt;Chest&lt;/th&gt;
&lt;th&gt;Press&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chest chest chest&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chest press&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn33o4nzcpxx5aqjybt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn33o4nzcpxx5aqjybt0.png" alt="graph showing the angle between 2 vectors" width="396" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The point for &lt;strong&gt;Chest chest chest&lt;/strong&gt; on the graph will be further out along the x-axis, but the angle between the vectors remains the same, and thus so does the cosine similarity. Cosine similarity is determined by the angle between the vectors and ignores their magnitude.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cosine similarity when the words are the same
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Words&lt;/th&gt;
&lt;th&gt;Chest&lt;/th&gt;
&lt;th&gt;Press&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chest press&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chest press&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdohahmgas7dj1lrpvi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdohahmgas7dj1lrpvi3.png" alt="graph showing the angle between 2 vectors" width="253" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the two phrases are identical, the angle between the vectors is 0° and the cosine similarity is cos 0° = 1.&lt;/p&gt;

&lt;p&gt;While these examples work well in a 2-dimensional space, real-world scenarios often involve higher-dimensional data (e.g., 5 or more features). In such cases, we use the cosine similarity formula:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Cosine Similarity=X⋅Y∣X∣∣Y∣\text{Cosine Similarity} = \frac{\mathbf{X} \cdot \mathbf{Y}}{|\mathbf{X}| |\mathbf{Y}|}

&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Cosine Similarity&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathbf"&gt;X&lt;/span&gt;&lt;span class="mord"&gt;∣∣&lt;/span&gt;&lt;span class="mord mathbf"&gt;Y&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathbf"&gt;X&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;Y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;X&lt;/strong&gt; → the user’s query vector.&lt;br&gt;
&lt;strong&gt;Y&lt;/strong&gt; → the feature vector of a candidate item (e.g., an exercise to recommend).&lt;br&gt;
&lt;strong&gt;X · Y&lt;/strong&gt; (dot product) → measures how similar the two vectors are in direction.&lt;br&gt;
&lt;strong&gt;|X| |Y|&lt;/strong&gt; (product of magnitudes) → normalizes the similarity score so it falls between -1 and 1 (0 and 1 for our non-negative text vectors).&lt;/p&gt;
&lt;h2&gt;
  
  
  How content-based filtering works in our fitness app
&lt;/h2&gt;

&lt;p&gt;Let’s break it down step by step:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Define item features
&lt;/h3&gt;

&lt;p&gt;The first step in the process is determining what the item’s features are. In our dataset, our item features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bodypart:&lt;/strong&gt; chest, legs, abs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level:&lt;/strong&gt; beginner, intermediate, expert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equipment:&lt;/strong&gt; dumbbell, barbell, bodyweight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Dumbbell Bench Press → Features: BodyPart=chest, Level=beginner, Equipment=dumbbell.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Create a user profile
&lt;/h3&gt;

&lt;p&gt;When the user types a query like “beginner chest exercises with dumbbells”, the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extracts keywords:&lt;/strong&gt; beginner, chest, dumbbells.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expands synonyms:&lt;/strong&gt; Maps chest to ["chest", "pectoral"] and legs to [“quads”, “hamstrings”]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encodes preferences:&lt;/strong&gt; Converts these terms into a numerical vector (a list of numbers representing the user’s preferences). User vector: [0.6, 0.3, 0.1, 1, 0, 0] {chest, beginner, dumbbell}&lt;/li&gt;
&lt;/ul&gt;
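
&lt;p&gt;The extraction and synonym-expansion steps above can be sketched as follows. &lt;code&gt;ExpandQueryTerms&lt;/code&gt; and its &lt;code&gt;Synonyms&lt;/code&gt; dictionary are simplified illustrations; the full implementation is in the linked GitHub repository:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Simplified sketch of keyword extraction and synonym expansion.
static readonly Dictionary&lt;string, List&lt;string&gt;&gt; Synonyms = new()
{
    { "chest", new List&lt;string&gt; { "chest", "pectoral" } },
    { "legs",  new List&lt;string&gt; { "legs", "glutes", "quads", "hamstrings" } }
};

public static List&lt;string&gt; ExpandQueryTerms(string query)
{
    var terms = new List&lt;string&gt;();
    foreach (var word in query.ToLowerInvariant()
                 .Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        terms.Add(word); // keep the original keyword
        if (Synonyms.TryGetValue(word, out var synonyms))
            terms.AddRange(synonyms); // e.g. "chest" also adds "pectoral"
    }
    return terms.Distinct().ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;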
&lt;h3&gt;
  
  
  3. Choose a similarity metric
&lt;/h3&gt;

&lt;p&gt;As discussed previously, we use cosine similarity to compare the user’s preferences to exercises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User vector:&lt;/strong&gt; [0.6, 0.3, 0.1, 1, 0, 0] {chest, beginner, dumbbell}&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exercise vector:&lt;/strong&gt; [0.5, 0.3, 0.2, 1, 0, 0] {chest, beginner, dumbbell}&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity Score:&lt;/strong&gt; 0.98 (nearly identical!)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Score and rank exercises
&lt;/h3&gt;

&lt;p&gt;The system calculates the similarity between the user’s vector and every exercise’s vector, then ranks them from most to least similar.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Exercise&lt;/th&gt;
&lt;th&gt;BodyPart&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Equipment&lt;/th&gt;
&lt;th&gt;Similarity Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dumbbell Bench Press&lt;/td&gt;
&lt;td&gt;chest&lt;/td&gt;
&lt;td&gt;beginner&lt;/td&gt;
&lt;td&gt;dumbbell&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push-ups&lt;/td&gt;
&lt;td&gt;chest&lt;/td&gt;
&lt;td&gt;beginner&lt;/td&gt;
&lt;td&gt;bodyweight&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barbell Squats&lt;/td&gt;
&lt;td&gt;legs&lt;/td&gt;
&lt;td&gt;intermediate&lt;/td&gt;
&lt;td&gt;barbell&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The user gets the top recommendations: Dumbbell Bench Press and Push-ups.&lt;/p&gt;
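
&lt;p&gt;The scoring-and-ranking step can be sketched as follows. &lt;code&gt;RankExercises&lt;/code&gt; is an illustrative helper, not an ML.NET API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative sketch: score each exercise vector against the user vector
// and return the top-N matches, ranked from most to least similar.
public static List&lt;(string Title, double Score)&gt; RankExercises(
    float[] user, Dictionary&lt;string, float[]&gt; exercises, int topN = 2)
{
    static double Cosine(float[] x, float[] y)
    {
        double dot = 0, mx = 0, my = 0;
        for (int i = 0; i &lt; x.Length; i++)
        {
            dot += x[i] * y[i];
            mx  += x[i] * x[i];
            my  += y[i] * y[i];
        }
        return (mx == 0 || my == 0) ? 0 : dot / (Math.Sqrt(mx) * Math.Sqrt(my));
    }

    return exercises
        .Select(kv =&gt; (Title: kv.Key, Score: Cosine(user, kv.Value)))
        .OrderByDescending(r =&gt; r.Score) // most similar first
        .Take(topN)
        .ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;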
&lt;h2&gt;
  
  
  The dataset
&lt;/h2&gt;

&lt;p&gt;We’ll be using the &lt;a href="https://www.kaggle.com/datasets/niharika41298/gym-exercise-data" rel="noopener noreferrer"&gt;Gym Exercise Dataset&lt;/a&gt; from Kaggle. This dataset contains the following columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Title:&lt;/strong&gt; The name of the exercise (e.g., Barbell Squats).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BodyPart:&lt;/strong&gt; The muscle group targeted (e.g., legs, chest).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equipment:&lt;/strong&gt; The equipment required (e.g., dumbbell, barbell).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level:&lt;/strong&gt; The difficulty level (e.g., beginner, intermediate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ExerciseType:&lt;/strong&gt; The type of exercise (e.g., strength, cardio).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Building the code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;Vscode&lt;/a&gt; or any code editor of your choice&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dotnet.microsoft.com/en-us/download" rel="noopener noreferrer"&gt;Download .NET &lt;/a&gt;(I'm using .NET 8)&lt;/li&gt;
&lt;li&gt;Full GitHub code &lt;a href="https://github.com/GbengaElebs/WorkoutRecommender" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 1: Setting up the project
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a .NET console app:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;dotnet&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;console&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;WorkoutRecommender&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;WorkoutRecommender&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install the ML.NET package:&lt;/strong&gt;
&lt;code&gt;dotnet add package Microsoft.ML&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download the dataset:&lt;/strong&gt; Place the &lt;strong&gt;megaGymDataset.csv&lt;/strong&gt; file in a &lt;strong&gt;Data&lt;/strong&gt; folder within your project (this is the file name the loading code below expects).&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Step 2: Loading and preprocessing the data
&lt;/h3&gt;

&lt;p&gt;The first step is to load the dataset and preprocess it for machine learning.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create classes to represent the exercises:&lt;/strong&gt; I added a &lt;code&gt;ProcessedExercise&lt;/code&gt; class, which contains a property for the body part synonyms.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Microsoft.ML.Data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Exercise&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Desc&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;BodyPart&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Equipment&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LoadColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="c1"&gt;// Used for synonym mapping&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProcessedExercise&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Exercise&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExerciseVector&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VectorType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;// Indicates this is a numerical vector&lt;/span&gt;
   &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;Features&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a body part synonym dictionary:&lt;/strong&gt; Add a recommendation class file and add a dictionary to it. Since users might enter terms like “legs” instead of “glutes”, we’ll create a synonym dictionary to map related terms.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"chest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"chest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"pectoral"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"legs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"legs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"glutes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"quads"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"hamstrings"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"abs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"abs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"core"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"abdominals"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"arms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"arms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"biceps"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"triceps"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load the CSV file:&lt;/strong&gt; Use ML.NET’s LoadFromTextFile to read the dataset, then convert the rows to a list of Exercise objects and map each one to a ProcessedExercise containing the body part synonyms.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Method to load exercises from CSV&lt;/span&gt;
   &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ProcessedExercise&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;LoadExercises&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;// Step 1: Load raw CSV data using ML.NET&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;mlContext&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MLContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataPath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetCurrentDirectory&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"Data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"megaGymDataset.csv"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataView&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadFromTextFile&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Exercise&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;
           &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dataPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;separatorChar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sc"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;hasHeader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;// If your CSV has headers&lt;/span&gt;
       &lt;span class="p"&gt;);&lt;/span&gt;

       &lt;span class="c1"&gt;// Convert to list of Exercise objects&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;exercises&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Exercise&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;dataView&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reuseRowObject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;// Convert to ProcessedExercise and add synonyms&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;processedExercises&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exercises&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ProcessedExercise&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;Desc&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;BodyPart&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BodyPart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;Equipment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equipment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;GetSynonymsForBodyPart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BodyPart&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;// Join synonyms into a single string&lt;/span&gt;
       &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;


       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;processedExercises&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;


   &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetSynonymsForBodyPart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;bodyPart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;// Normalize body part to lowercase&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;normalizedBodyPart&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bodyPart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;


       &lt;span class="c1"&gt;// Find the synonym group that contains this body part&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;matchingGroup&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FirstOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kvp&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizedBodyPart&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matchingGroup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;normalizedBodyPart&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Building the ML pipeline
&lt;/h3&gt;

&lt;p&gt;The ML pipeline is the heart of our application. It transforms raw data into numerical features the model can work with. Here’s how we build the ML pipeline in the Program.cs file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FeaturizeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;outputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"BodyPartFeatures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;inputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;nameof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ProcessedExercise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OneHotEncoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;outputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"LevelFeatures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;inputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;nameof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exercise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OneHotEncoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;outputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"EquipmentFeatures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;inputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;nameof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exercise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equipment&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;outputColumnName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s"&gt;"BodyPartFeatures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s"&gt;"LevelFeatures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s"&gt;"EquipmentFeatures"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FeaturizeText:&lt;/strong&gt; Converts body part synonyms into numerical vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OneHotEncoding:&lt;/strong&gt; Converts categorical features like Level and Equipment into binary vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concatenate:&lt;/strong&gt; Combines all features into a single vector for each exercise.&lt;/li&gt;
&lt;/ol&gt;
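
&lt;p&gt;To sanity-check the pipeline, you can fit it and preview the column layout of the transformed data. This is an optional sketch, not part of the recommender itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Optional sketch: fit the pipeline and confirm the concatenated "Features" column exists
var fitted = pipeline.Fit(dataView);
var transformed = fitted.Transform(dataView);

// Preview the first row and list the output columns
var preview = transformed.Preview(maxRows: 1);
foreach (var column in preview.ColumnView)
{
    Console.WriteLine(column.Column.Name);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;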

&lt;h3&gt;
  
  
  Step 4: Preprocessing and prediction engine
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;preprocessedData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataView&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;predictionEngine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreatePredictionEngine&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ProcessedExercise&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExerciseVector&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocessedData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fit:&lt;/strong&gt; Trains the pipeline on the exercise data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CreatePredictionEngine:&lt;/strong&gt; Creates a prediction engine that generates a feature vector for each new input, such as a user query.&lt;/li&gt;
&lt;/ol&gt;
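
&lt;p&gt;As a quick check, you can project a single exercise through the fitted pipeline. This sketch assumes ExerciseVector maps the "Features" column to a float array, which is how it is used when computing similarity later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Sketch: generate the feature vector for one exercise
var sample = processedExercises.First();
var vector = predictionEngine.Predict(sample);
Console.WriteLine($"Feature vector length: {vector.Features.Length}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;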

&lt;h3&gt;
  
  
  Step 5: Handling user queries
&lt;/h3&gt;

&lt;p&gt;To make the system user-friendly, we’ll parse natural language queries and extract keywords.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get user input&lt;/span&gt;
&lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enter your query:"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;


&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;userQuery&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;helper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParseInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// "Recommend leg workouts for intermediates"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ParseInput:&lt;/strong&gt; Extracts key components (e.g., BodyParts, Level, Equipment) from the user’s query using Regex.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;UserQuery&lt;/span&gt; &lt;span class="nf"&gt;ParseInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;userQuery&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;UserQuery&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;// Case-insensitive regex patterns&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;bodyPartPattern&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;@"(?i)\b(chest|legs|abs|arms|core|glutes|back|traps|neck|shoulders)\b"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;levelPattern&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;@"(?i)\b(beginner|intermediate|expert|advanced)\b"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;equipmentPattern&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;@"(?i)\b(dumbbell|barbell|kettlebells|bodyweight|bands|cable|machine|body)\b"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="c1"&gt;// Extract body parts&lt;/span&gt;
       &lt;span class="n"&gt;userQuery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BodyParts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bodyPartPattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;// Extract fitness level (default to "beginner" if unspecified)&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;levelMatch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;levelPattern&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="n"&gt;userQuery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;levelMatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Success&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;levelMatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"beginner"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

       &lt;span class="c1"&gt;// Extract equipment (optional)&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;equipmentMatch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipmentPattern&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="n"&gt;userQuery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equipment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equipmentMatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Success&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;equipmentMatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;userQuery&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Expand Synonyms:&lt;/strong&gt; Map user-friendly terms like "legs" to dataset-specific terms like "glutes".
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nf"&gt;ExpandQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;userBodyParts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;expandedTerms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
       &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;userBodyParts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;normalizedTerm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
           &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ContainsKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizedTerm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="p"&gt;{&lt;/span&gt;
               &lt;span class="n"&gt;expandedTerms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;normalizedTerm&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
           &lt;span class="p"&gt;}&lt;/span&gt;
           &lt;span class="k"&gt;else&lt;/span&gt;
           &lt;span class="p"&gt;{&lt;/span&gt;
               &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;matchingGroup&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BodyPartSynonyms&lt;/span&gt;
                   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FirstOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kvp&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizedTerm&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
               &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matchingGroup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="p"&gt;{&lt;/span&gt;
                   &lt;span class="n"&gt;expandedTerms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matchingGroup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
               &lt;span class="p"&gt;}&lt;/span&gt;
               &lt;span class="k"&gt;else&lt;/span&gt;
               &lt;span class="p"&gt;{&lt;/span&gt;
                   &lt;span class="n"&gt;expandedTerms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizedTerm&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
               &lt;span class="p"&gt;}&lt;/span&gt;
           &lt;span class="p"&gt;}&lt;/span&gt;
       &lt;span class="p"&gt;}&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expandedTerms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Distinct&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// Join into a single string&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Generating recommendations
&lt;/h3&gt;

&lt;p&gt;Finally, we’ll compare the user’s input to the dataset and recommend exercises.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exercises&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="n"&gt;Exercise&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;Similarity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;helper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ComputeSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userVector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictionEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OrderByDescending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ComputeSimilarity:&lt;/strong&gt; Calculates cosine similarity between the user’s vector and each exercise’s vector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OrderByDescending:&lt;/strong&gt; Ranks exercises by similarity score.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take(5):&lt;/strong&gt; Returns the top 5 recommendations.&lt;/li&gt;
&lt;/ol&gt;
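
&lt;p&gt;ComputeSimilarity can be implemented as a standard cosine-similarity calculation. Here is a minimal sketch (the exact implementation in the repository may differ); it assumes both vectors come from the same pipeline and therefore have the same length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Cosine similarity: dot product divided by the product of the vector magnitudes
public float ComputeSimilarity(float[] a, float[] b)
{
    float dot = 0f, magA = 0f, magB = 0f;
    for (int i = 0; i &amp;lt; a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }

    if (magA == 0 || magB == 0) return 0f; // avoid dividing by zero for empty vectors

    return dot / (float)(Math.Sqrt(magA) * Math.Sqrt(magB));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;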

&lt;p&gt;The full code is available on GitHub &lt;a href="https://github.com/GbengaElebs/WorkoutRecommender" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxx2jxohz3v7yiw0z5bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxx2jxohz3v7yiw0z5bx.png" alt="Image description" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we built a personalized workout recommendation engine by leveraging ML.NET. We created a system that understands user queries like "leg workouts for intermediates" and ranks exercises based on cosine similarity.&lt;/p&gt;

&lt;p&gt;If you have questions, please leave them in the comment section below. I would be happy to discuss further.&lt;/p&gt;

&lt;p&gt;Also, please share if you found it helpful.&lt;/p&gt;

&lt;p&gt;Follow me on Dev.to, &lt;a href="https://medium.com/@danielelebs" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="https://instagram.com/elebs_d" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt; for more AI, .NET, and fitness-related content.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Introduction to algorithms: Big O notation, time complexity</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Tue, 27 Sep 2022 17:21:04 +0000</pubDate>
      <link>https://forem.com/gbengelebs/introduction-to-algorithms-big-o-notation-time-complexity-5do8</link>
      <guid>https://forem.com/gbengelebs/introduction-to-algorithms-big-o-notation-time-complexity-5do8</guid>
      <description>&lt;p&gt;We live in a world of algorithms. From the posts we see on Twitter to the people we swipe on Tinder. Algorithms surround us, controlling machines, computers, and robots, but what exactly are they, how do we analyze them, and why do we need them to get a job?&lt;/p&gt;

&lt;p&gt;This article discusses algorithms, how to represent them using asymptotic notations like Big O, and how to analyze their time complexities.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an algorithm?
&lt;/h2&gt;

&lt;p&gt;An algorithm is a step-by-step procedure that details how to solve a problem. A more formal definition is "an unambiguous specification of how to solve a class of problems." Going by these definitions, the recipe for baking a cake, the method for solving long division problems, and the process of planning your route for a trip are all algorithms.&lt;/p&gt;

&lt;p&gt;Algorithmic programming is all about writing a set of rules that instruct the computer on how to perform a task. Algorithms are written using a particular syntax, depending on the programming language being used.&lt;/p&gt;

&lt;p&gt;Suppose we have this array of numbers in ascending order, and we want to find out whether it contains the number 9.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flelx2xmjsqdzpzpdgn0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flelx2xmjsqdzpzpdgn0f.png" alt="array" width="684" height="87"&gt;&lt;/a&gt;Input array&lt;/p&gt;

&lt;p&gt;There are several ways we can tackle this problem. The first that comes to mind is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Linear Search&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from the beginning and iterate through the entire array&lt;/li&gt;
&lt;li&gt;Compare the value at the current index of the array to the target value. &lt;/li&gt;
&lt;li&gt;Repeat the process until the target value is found&lt;/li&gt;
&lt;li&gt;Return the target value&lt;/li&gt;
&lt;/ol&gt;
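
&lt;p&gt;The steps above can be sketched in C# (an illustrative implementation, not tied to any particular codebase; it returns the index of the target rather than the value itself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Linear search: check each element in turn until the target is found
int LinearSearch(int[] array, int target)
{
    for (int i = 0; i &amp;lt; array.Length; i++)
    {
        if (array[i] == target) return i; // found: return the index
    }
    return -1; // target not in the array
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;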

&lt;p&gt;Even though this approach has solved our problem, we must consider if it is the most optimal solution. Let's consider another method.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 2: Binary Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a. Find the midpoint of the array and split it into two halves.&lt;/p&gt;

&lt;p&gt;b. Compare the value at the midpoint of the array to the target value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkp344deqzgqkkhs093gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkp344deqzgqkkhs093gk.png" alt="approach2array comparison" width="706" height="120"&gt;&lt;/a&gt;Input array halved&lt;/p&gt;

&lt;p&gt;c. Since the target value is greater than our midpoint value, the value we are searching for must be on the right side of the array division. So, we can discard the left side of the array.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn53ijvonybd8nk1rapp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn53ijvonybd8nk1rapp.png" alt="approach2array comparison step2" width="743" height="118"&gt;&lt;/a&gt;Array halved &lt;/p&gt;

&lt;p&gt;d. We repeat the process of splitting the remaining half and comparing the value at its midpoint to our target value until the target value is found.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwheq1dftjhlx4hreaaby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwheq1dftjhlx4hreaaby.png" alt="approach2array comparison step3" width="800" height="129"&gt;&lt;/a&gt;Target found &lt;/p&gt;

&lt;p&gt;e. Return the target value&lt;/p&gt;

&lt;p&gt;In this algorithm, we reduce the search area of our array by half each time we iterate through it. Since we are searching only half of the array in each iteration, we can find the target element faster. &lt;/p&gt;
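
&lt;p&gt;The same procedure can be sketched in C# as an illustrative implementation (again returning the index of the target):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Binary search: repeatedly halve the search range (requires a sorted array)
int BinarySearch(int[] array, int target)
{
    int low = 0, high = array.Length - 1;
    while (low &amp;lt;= high)
    {
        int mid = low + (high - low) / 2; // midpoint, written to avoid integer overflow
        if (array[mid] == target) return mid; // found: return the index
        if (array[mid] &amp;lt; target) low = mid + 1; // discard the left half
        else high = mid - 1; // discard the right half
    }
    return -1; // target not in the array
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;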

&lt;p&gt;If we compare the two approaches, while it took nine comparisons to find the target element in the first approach, the second took only three. This approach of solving problems by dividing them into smaller subsets is called divide and conquer. The algorithm used in approach B powers many search engines and databases today and is referred to as binary search.&lt;/p&gt;

&lt;p&gt;We cannot use the same algorithm for every problem; knowing when to use a particular algorithmic approach is essential. For example, even though binary search was more efficient for solving our example above, it only works if the input array is sorted. &lt;/p&gt;

&lt;p&gt;How do we solve this same problem if we are given an unsorted array of numbers and asked to find a target element? If we insist on binary search, we must sort the array first, which makes the overall process less efficient than the simple linear search. Therefore, we must constantly analyze our algorithmic approach to ensure we are choosing the optimal solution.&lt;/p&gt;
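As a quick sketch (in Python, purely for illustration since the article itself is language-agnostic), the two approaches might look like this:

```python
def linear_search(arr, target):
    """Approach A: scan left to right; up to n comparisons."""
    for i, value in enumerate(arr):
        if value == target:
            return i
    return -1

def binary_search(arr, target):
    """Approach B: repeatedly halve the search area; requires a sorted array."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1   # target is in the right half; discard the left
        else:
            high = mid - 1  # target is in the left half; discard the right
    return -1

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(linear_search(numbers, 9))  # 8 (found after 9 comparisons)
print(binary_search(numbers, 9))  # 8 (found after 3 comparisons)
```

Both return the index of the target, but binary search needs far fewer comparisons on the same sorted input.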

&lt;h2&gt;
  
  
  Why do we analyze algorithms?
&lt;/h2&gt;

&lt;p&gt;We analyze algorithms to evaluate their performance and suitability for the problem we want to solve. When we design algorithms, we are not just interested in the algorithm's correctness but also its efficiency and scalability as the input size grows. &lt;/p&gt;

&lt;p&gt;In the previous example, even though the linear search approach was slower, the speed difference was negligible because the input size was small. However, think about this problem from the perspective of companies like Facebook and Google, with billions of users. Imagine how long it will take to search for data in a dataset containing billions of users using the linear search approach vs. the binary search approach. We can see how the binary search approach is much more efficient in that case. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to analyze algorithms
&lt;/h2&gt;

&lt;p&gt;We analyze an algorithm's complexity along two dimensions: time and space.&lt;/p&gt;

&lt;p&gt;Time complexity is crucial because we need our programs to deliver results as quickly as possible. Space complexity is essential because machines have only a limited amount of memory to spare.&lt;/p&gt;

&lt;p&gt;A good algorithm executes quickly and uses memory sparingly. Ideally, we want to optimize for both time and space, but sometimes that is not possible, and we settle for a middle ground. The method we use to analyze algorithms must:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Be independent of the machine and its configuration on which the algorithm is running:&lt;/strong&gt; This is important because we do not know the actual specifications of the computer that our algorithm will run on, so we cannot factor in hardware or CPU specs when analyzing algorithms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Show a direct correlation between the algorithm and the input size:&lt;/strong&gt; How does the algorithm perform as our input size approaches infinity (an asymptote)?&lt;/p&gt;

&lt;p&gt;When analyzing the complexity of any algorithm in terms of time and space, we can never provide an exact number to define the time required and the space needed by the algorithm. So instead, we express them using standard notations, known as Asymptotic Notations. &lt;/p&gt;

&lt;h2&gt;
  
  
  What are asymptotic notations?
&lt;/h2&gt;

&lt;p&gt;Asymptotic notations are the mathematical notations used to describe the running time of an algorithm when the input tends towards a particular or infinite value. &lt;/p&gt;

&lt;p&gt;The complexity of algorithms can be represented using three asymptotic notations: Big O, Big Theta (Θ), and Big Omega (Ω).&lt;/p&gt;

&lt;h3&gt;
  
  
  Big O notation
&lt;/h3&gt;

&lt;p&gt;Big O notation represents the upper bound running time complexity of an algorithm and "can" be used to describe the worst-case scenario of an algorithm. A common misconception about Big O notation is that it represents the worst-case scenario of an algorithm. This is somewhat incorrect as it doesn't mean the worst case, but "can" be used to describe the worst case. We can also use it to represent the best and average cases. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz92pse4lnoj1gfuveip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz92pse4lnoj1gfuveip.png" alt="BigO notation" width="405" height="284"&gt;&lt;/a&gt;Big O Graph&lt;/p&gt;

&lt;p&gt;Look at it this way: Big O notation answers the question, "In the &lt;strong&gt;worst&lt;/strong&gt; case, what's the &lt;strong&gt;longest&lt;/strong&gt; time this algorithm will take to run?" &lt;/p&gt;

&lt;p&gt;It can also answer the question, "In the &lt;strong&gt;best&lt;/strong&gt; case, what's the &lt;strong&gt;longest&lt;/strong&gt; time our algorithm will take to run?" And for the average case, "In the &lt;strong&gt;average&lt;/strong&gt; case, what's the &lt;strong&gt;longest&lt;/strong&gt; time our algorithm will take to run?" We can see the common factor here is the &lt;strong&gt;longest time (upper bound)&lt;/strong&gt; our algorithm will take to run. Best, worst, and average are scenarios, and we are upper-bounding those scenarios using Big O.&lt;/p&gt;

&lt;p&gt;Because we're always interested in the longest running time of the worst-case scenario, we mainly represent our algorithms using Big O.&lt;/p&gt;

&lt;h3&gt;
  
  
  Big Omega (Ω)
&lt;/h3&gt;

&lt;p&gt;Big Omega (Ω) notation represents the lower-bound running time complexity of an algorithm and "can" be used to describe the best-case scenario of an algorithm. Similar to Big O, Big Ω can be used to represent the best, average, and worst-case scenarios. Big Omega (Ω) answers the question, "In the &lt;strong&gt;best/worst/average&lt;/strong&gt; case, what's the &lt;strong&gt;fastest&lt;/strong&gt; time this algorithm will take to run?" We can use Big Ω to lower-bound our best, worst, and average running time scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadwmpco8bgwwyf6ggdbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadwmpco8bgwwyf6ggdbe.png" alt="Big Omega" width="392" height="280"&gt;&lt;/a&gt;Big Ω Graph&lt;/p&gt;

&lt;h3&gt;
  
  
  Big Theta (Θ)
&lt;/h3&gt;

&lt;p&gt;Big Theta (Θ) represents both the upper and the lower bound of an algorithm's running time complexity, giving a tight bound. Like the first two, it "can" represent the best, average, and worst cases. For example, "In the &lt;strong&gt;average/best/worst case&lt;/strong&gt; scenario, what's the &lt;strong&gt;tight bound&lt;/strong&gt; on the time this algorithm will take to run?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftku1vqk1uc5dgf7ntw98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftku1vqk1uc5dgf7ntw98.png" alt="Big Theta" width="397" height="280"&gt;&lt;/a&gt;Big Θ Graph&lt;/p&gt;

&lt;p&gt;As stated earlier, we mainly use Big O notation because we're always interested in measuring the longest time an algorithm runs in the worst-case scenario. For the rest of the series, we will measure our algorithms' complexity in Big O. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to find the time complexity of an algorithm
&lt;/h2&gt;

&lt;p&gt;We typically consult Big O because we must always plan for the worst case. In our linear search example, the best case is that the number we are searching for is the first in the list, and the worst case is that it is at the end of the array. So, for an array of size 10, we iterate 10 times, and for size 1000, we iterate 1000 times. We can say that the running time of our linear search algorithm grows linearly as the input increases. We represent this as O(n). &lt;/p&gt;

&lt;p&gt;In the binary search approach, the best case is when the target element is at the center of the list, and the worst case is when the target element is at either extremity of the list. However, since we halve the array at every iteration, we reduce the size of the problem by half each time: n, n/2, n/4, n/8, ...&lt;/p&gt;

&lt;p&gt;We call this type of growth logarithmic (O(log n)). O(log n) means that as the input increases exponentially, operation time increases linearly. So if it takes binary search 1 second to process 10 elements, it will take 2 seconds for 100 elements, 3 seconds for 1000 elements, and so on. &lt;/p&gt;
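We can check the halving pattern empirically with a short, illustrative Python sketch that counts how many halvings an input of a given size allows:

```python
import math

def halvings(n):
    """Count how many times n can be halved before one element remains."""
    count = 0
    while n > 1:
        n //= 2      # discard half of the remaining elements
        count += 1
    return count

# The halving count matches floor(log2(n)): logarithmic growth.
for size in [10, 1000, 1_000_000]:
    print(size, halvings(size), math.floor(math.log2(size)))
```

Growing the input from 10 to 1,000,000 elements only raises the count from 3 to 19, which is why binary search scales so well.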

&lt;p&gt;The common algorithmic runtimes from fastest to slowest are:&lt;/p&gt;

&lt;p&gt;O(1) &amp;lt; O(log n) &amp;lt; O(n) &amp;lt; O(n log n) &amp;lt; O(n²) &amp;lt; O(n² log n) &amp;lt; O(n³) &amp;lt; O(2ⁿ) &amp;lt; O(n!)&lt;/p&gt;

&lt;p&gt;Plotting them on a graph of operation time vs. input size, we can see that the rate of growth of the running times of the best algorithms is much slower than the input size growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5duzw5dvwdxl04xg4q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5duzw5dvwdxl04xg4q8.png" alt="big o comparison chart" width="800" height="450"&gt;&lt;/a&gt;Big O comparison chart &lt;a href="https://hackr.io/blog/big-o-notation-cheat-sheet" rel="noopener noreferrer"&gt;Hackr&lt;/a&gt; &lt;br&gt;
Let's look at some Big O notation examples.&lt;/p&gt;
&lt;h2&gt;
  
  
  Big O Notation examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;O(1)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Print the first element in an array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[apple, orange, cherry, mango, grapes]

apple
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This always runs in constant time, irrespective of whether the input has 100 or 10,000 elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;O(n)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Iterate through an array and print all items.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[apple, orange, cherry, mango, grapes] 

apple
orange
cherry
mango
grapes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time complexity of this algorithm grows linearly with the input size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;O(log n)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using binary search to find a given element in a sorted array, as in our example above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;O(n²)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Print all possible pairs of elements in an array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[a,b,c,d,e] =&amp;gt; [abcde, bacde, cabde, dabce, eabcd]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time complexity of this algorithm grows quadratically as the input size grows. This is because, for each item in the array, you need to iterate through the array again. The number of operations your program has to do is the number of inputs (or n) times itself (n squared). &lt;/p&gt;
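The quadratic cost comes from the nested loop; a tiny Python sketch, just for illustration:

```python
def all_pairs(items):
    """Pair every element with every element: n * n = n² operations."""
    pairs = []
    for a in items:          # n iterations...
        for b in items:      # ...each performing n more
            pairs.append(a + b)
    return pairs

print(len(all_pairs(["a", "b", "c"])))  # 9, i.e. 3²
```

Doubling the input to 6 elements would produce 36 pairs, four times as many: the hallmark of O(n²).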

&lt;p&gt;&lt;strong&gt;General tips for asymptotic analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When an entire array gets iterated over, the algorithm is most likely in O(n) time.&lt;/li&gt;
&lt;li&gt;When you see an algorithm where the number of elements in the problem space gets halved after each iteration, it will probably run in O(log n) time.&lt;/li&gt;
&lt;li&gt;Whenever you have a nested loop, the problem is most likely in quadratic time, O(n²).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Improve your algorithmic thinking with practice
&lt;/h2&gt;

&lt;p&gt;Algorithms are behind most of the impressive applications we use every day. Learning common algorithms is helpful, but what's even better is improving our algorithmic thinking. Algorithmic thinking allows us to break down problems, analyze them, and implement solutions. Like all skills, algorithmic thinking is learnable, and with enough practice, we can train our brains to become better at it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Helpful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=8hly31xKli0&amp;amp;t=9259s" rel="noopener noreferrer"&gt;Algorithms and Data Structures Tutorial - Full Course for Beginners&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=rL8X2mlNHPM" rel="noopener noreferrer"&gt;Intro to Algorithms: Crash Course Computer Science #13&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=HtSuA80QTyo" rel="noopener noreferrer"&gt;Lecture 1: Algorithmic Thinking, Peak Finding&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>algorithms</category>
      <category>programming</category>
    </item>
    <item>
      <title>Introduction to data structures</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Tue, 20 Sep 2022 16:38:16 +0000</pubDate>
      <link>https://forem.com/gbengelebs/introduction-to-data-structures-ok3</link>
      <guid>https://forem.com/gbengelebs/introduction-to-data-structures-ok3</guid>
      <description>&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@alvaropinot?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Alvaro Pinot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data by itself is not very valuable; what gives data value is when it is organized in a way that can help us solve problems. We see this every day in our lives. For example, the words in a dictionary are organized alphabetically to help us search faster. Likewise, products on shopping sites are arranged by price and category so we can navigate them quickly. Similarly, computers need to organize data to analyze and process it efficiently. The format used to manage data in a computer's memory is called a data structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data structure?
&lt;/h2&gt;

&lt;p&gt;A data structure is a format for storing and organizing data in a computer's memory so that it can be accessed and modified efficiently. Knowing how data is structured in memory helps us design and develop efficient software systems that scale effectively. It also helps in creating algorithms that solve complex problems. For instance, GPS systems and Google Maps use a graph data structure to find the shortest path between two locations. &lt;/p&gt;

&lt;h3&gt;
  
  
  Classification of data structures
&lt;/h3&gt;

&lt;p&gt;There are two general types of data structures: Linear and non-linear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear data structures:&lt;/strong&gt; These are data structures where all elements are arranged linearly/sequentially. Every element in the structure is attached to its previous and next elements. Examples of linear data structures are arrays, linked lists, stacks, and queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv921ko7m3xbfsgztyei.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv921ko7m3xbfsgztyei.jpg" alt="Linear" width="300" height="180"&gt;&lt;/a&gt;&lt;br&gt;Linear data structure
 &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-linear data structures:&lt;/strong&gt; These are data structures where the elements are not arranged in a linear format. Mainly, data elements are arranged in hierarchical order without forming a linear sequence. Examples of non-linear data structures are trees, heaps, tries, and graphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkguehtngk49bnx52n6xu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkguehtngk49bnx52n6xu.png" alt="Non-linear" width="597" height="267"&gt;&lt;/a&gt;&lt;br&gt;Non-linear data structure
 &lt;/p&gt;

&lt;p&gt;When we study data structures, we usually define them in two ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mathematical/logical models:&lt;/strong&gt; Here, we look at data structures from a high level. We only define the logic or the behavior of the data types, not any implementation. This representation is referred to as an Abstract Data Type (ADT). ADTs do not specify how the data structure should be implemented or laid out in memory but provide a minimal expected interface and set of behaviors. For example, we define a car as an ADT by defining the properties a car should have.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Four wheels&lt;/li&gt;
&lt;li&gt;Steering&lt;/li&gt;
&lt;li&gt;Brake&lt;/li&gt;
&lt;li&gt;Seats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly, a List ADT contains methods to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get elements&lt;/li&gt;
&lt;li&gt;Insert elements &lt;/li&gt;
&lt;li&gt;Modify elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Concrete models:&lt;/strong&gt; This is a direct implementation of an ADT. In the example above, where we defined a Car ADT, the concrete implementation could be a Tesla, an Audi, or a Ford. The concrete implementation of a List ADT can be an array or a linked list because these contain properties that satisfy the conditions of the List ADT.&lt;/p&gt;
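To make the ADT-versus-implementation distinction concrete, here is a minimal Python sketch; the class names are invented purely for illustration:

```python
from abc import ABC, abstractmethod

class ListADT(ABC):
    """The abstract model: behavior only, nothing about memory layout."""

    @abstractmethod
    def get(self, index): ...

    @abstractmethod
    def insert(self, index, value): ...

    @abstractmethod
    def set(self, index, value): ...

class ArrayList(ListADT):
    """One concrete implementation, backed by a dynamic array."""

    def __init__(self):
        self._items = []

    def get(self, index):
        return self._items[index]

    def insert(self, index, value):
        self._items.insert(index, value)

    def set(self, index, value):
        self._items[index] = value
```

A linked-list-backed class could implement the same `ListADT` interface with a completely different memory layout, and callers would not need to change.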

&lt;p&gt;Unless you have some specific requirements, you generally don't create these concrete implementations of basic data structures yourself. Instead, you'd use one provided by the programming language's standard library. These data structures tend to be well-tested and complete, so using them saves you time compared to rolling your own. However, knowing how they operate under the hood helps you know when to use what.&lt;/p&gt;

&lt;p&gt;Let's now take a look at a few of the more common data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of common data structures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Arrays
&lt;/h3&gt;

&lt;p&gt;An array is a linear data structure that holds an ordered collection of elements stored at contiguous memory locations (next to each other). Example use cases of arrays are to store collections of elements of the same type, like a collection of integers or the letters of an alphabet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjmwfomjkqhji9rckkop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjmwfomjkqhji9rckkop.png" alt="Array data structure" width="429" height="195"&gt;&lt;/a&gt;&lt;br&gt;Array data structure (Source: &lt;a href="https://www.geeksforgeeks.org/array-data-structure/" rel="noopener noreferrer"&gt;GeeksforGeeks&lt;/a&gt;)
 &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Arrays&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Useful when storing elements of the same data type. A collection of countries, alphabets, products, etc. &lt;/li&gt;
&lt;li&gt;Serves as a fundamental building block for implementing other data structures: Stacks, Queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages of arrays&lt;/strong&gt;&lt;br&gt;
They provide fast access to any element in the array. Because elements are stored next to each other in memory, the computer can calculate the address of any element directly, making random access fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of arrays&lt;/strong&gt;&lt;br&gt;
Arrays are declared with a fixed size and cannot grow dynamically. &lt;/p&gt;

&lt;h3&gt;
  
  
  Linked List
&lt;/h3&gt;

&lt;p&gt;Similar to arrays, linked lists are linear data structures that hold a collection of elements. However, unlike arrays, they do not occupy contiguous blocks of memory. Instead, each element (called a node) in a linked list consists of a value (data) and a pointer/link to the address of the next node in the list.  &lt;/p&gt;
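A minimal node-based linked list can be sketched in Python like this (illustrative only):

```python
class Node:
    """A node holds a value and a pointer to the next node (None at the end)."""
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    def append(self, value):
        node = Node(value)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next is not None:  # follow the links to the last node
            current = current.next
        current.next = node

    def to_list(self):
        """Traverse the links, collecting each node's value."""
        values, current = [], self.head
        while current is not None:
            values.append(current.value)
            current = current.next
        return values
```

Notice that reaching the last element requires walking every link from the head, which is why random access is slower than in an array.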

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faco2s204fwykrf4181cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faco2s204fwykrf4181cw.png" alt="Linked List" width="771" height="307"&gt;&lt;/a&gt;&lt;br&gt;Linked List data structure (Source: &lt;a href="https://www.javatpoint.com/singly-linked-list-vs-doubly-linked-list" rel="noopener noreferrer"&gt;Javatpoint&lt;/a&gt;) 
  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Linked Lists&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image viewer software uses a linked list to view the previous and the next images&lt;/li&gt;
&lt;li&gt;Web pages can be accessed using the previous and the next button in a browser&lt;/li&gt;
&lt;li&gt;We can use linked Lists to implement other data structures: Stacks, Queues, Graphs, and Trees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Linked Lists&lt;/strong&gt;&lt;br&gt;
Unlike Arrays, Linked Lists can grow dynamically because the elements are not stored in contiguous memory blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Linked Lists&lt;/strong&gt;&lt;br&gt;
Linked lists use extra memory to store the links to the next node. Additionally, because the nodes are not stored next to each other (contiguously), random access is not as easy as it is with arrays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stacks
&lt;/h3&gt;

&lt;p&gt;Stacks are linear data structures used for storing a collection of elements with the constraint that the last element in must be the first out (LIFO). Think of a stack of plates placed on top of each other. Stacks can be implemented using arrays or linked lists as long as the implementation meets the condition that the last element in is the first out.&lt;/p&gt;
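A small stack sketch in Python, using a built-in list as the underlying array (illustrative only):

```python
class Stack:
    """LIFO stack backed by a Python list; the end of the list is the top."""
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()   # removes the most recently added item

    def peek(self):
        return self._items[-1]     # look at the top without removing it

plates = Stack()
plates.push("bottom plate")
plates.push("top plate")
print(plates.pop())  # top plate (last in, first out)
```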

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2q18i4m7mfilli5c9z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy2q18i4m7mfilli5c9z.jpeg" alt="Stack data structure" width="500" height="345"&gt;&lt;/a&gt;&lt;br&gt;Stack data structure (Source: &lt;a href="https://www.tutorialspoint.com/data_structures_algorithms/stack_algorithm.htm" rel="noopener noreferrer"&gt;Tutorialspoint&lt;/a&gt;)
 &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Stacks&lt;/strong&gt;&lt;br&gt;
Suitable for applications where the most recently added element must come out first (LIFO): website history, call history, undo/redo operations in word processors, and recursion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Stacks&lt;/strong&gt;&lt;br&gt;
Useful in managing data problems that meet the requirements of the LIFO format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Stacks&lt;/strong&gt;&lt;br&gt;
Stacks implemented using arrays have a fixed size. An attempt to add an element to a full stack results in the famous "stack overflow" error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Queues
&lt;/h3&gt;

&lt;p&gt;Queues are linear data structures used for storing a collection of elements with the constraint that the first element added to the queue must be the first one out (FIFO), just like a queue in the real world. In software systems, queues are very effective for managing systems that involve scheduling, for example, when you want to execute a series of tasks sequentially. Like stacks, queues can also be implemented using arrays and linked lists.&lt;/p&gt;
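Python's standard library already provides an efficient FIFO queue in `collections.deque`; a short scheduling sketch (the job names are made up):

```python
from collections import deque

queue = deque()
queue.append("job-1")   # enqueue at the back
queue.append("job-2")
queue.append("job-3")

print(queue.popleft())  # job-1: the first task added is processed first
```

Using a plain list here would work but `list.pop(0)` shifts every remaining element, whereas `deque.popleft()` is constant time.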

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfmqt1g1doel26i6dg2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfmqt1g1doel26i6dg2l.png" alt="Queue data structure" width="800" height="283"&gt;&lt;/a&gt;&lt;br&gt;Queue data structure (Source: &lt;a href="https://www.geeksforgeeks.org/queue-data-structure/" rel="noopener noreferrer"&gt;GeeksforGeeks&lt;/a&gt;)
 &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Queues&lt;/strong&gt;&lt;br&gt;
Suitable for applications following the FIFO principle. For example, job scheduling, escalators, networked printers, internet requests, and processes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Queues&lt;/strong&gt;&lt;br&gt;
Useful for services that follow the FIFO principle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Queues&lt;/strong&gt;&lt;br&gt;
Like stacks, queues implemented using arrays have a fixed size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trees
&lt;/h3&gt;

&lt;p&gt;A tree is a non-linear data structure that stores a collection of elements called nodes linked together to simulate a hierarchy. Trees usually have a root or parent node referencing one or more child nodes. Mainly, tree data structures represent hierarchical data. An example is the folder and file system on your computer. A folder (as a parent node) can contain files and sub-folders, which are its children. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta2hipcly0psv6wsmlpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta2hipcly0psv6wsmlpo.png" alt="Trees Data structure" width="800" height="346"&gt;&lt;/a&gt;&lt;br&gt;Trees data structure (Source: &lt;a href="https://www.geeksforgeeks.org/introduction-to-tree-data-structure-and-algorithm-tutorials/" rel="noopener noreferrer"&gt;GeeksforGeeks&lt;/a&gt;)&lt;br&gt;

 &lt;/p&gt;

&lt;p&gt;The most common type of tree is the one with the constraint that any node can contain at most two child nodes. This type of tree is called a binary tree. &lt;/p&gt;
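A minimal binary tree sketch in Python, reusing the folder analogy above (the folder names are invented):

```python
class TreeNode:
    """A binary tree node: a value plus at most two children."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

root = TreeNode("Documents")     # the root/parent node
root.left = TreeNode("Photos")   # child nodes
root.right = TreeNode("Reports")

def preorder(node):
    """Visit a node before its children, like listing a folder before its contents."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)

print(preorder(root))  # ['Documents', 'Photos', 'Reports']
```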

&lt;p&gt;&lt;strong&gt;Applications of Trees&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Useful in hierarchical parent/child data representations. For example, folders and subfolders systems in computers and genealogical information in biological species. &lt;/li&gt;
&lt;li&gt;Databases use B-Tree data structures for indexing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Trees&lt;/strong&gt;&lt;br&gt;
An efficient way of storing data that is naturally hierarchical. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Trees&lt;/strong&gt;&lt;br&gt;
Trees use extra memory to store nodes and the addresses of their child nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graphs
&lt;/h3&gt;

&lt;p&gt;A graph is a non-linear data structure consisting of a collection of elements called vertices, connected through links called edges. By this definition, a tree is a special type of graph. However, unlike trees, graphs have no rules about how the nodes are connected. &lt;/p&gt;
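One common way to represent a graph is an adjacency list, where each vertex maps to the vertices it shares an edge with; a small Python sketch (the names are invented for illustration):

```python
# Adjacency list: each vertex maps to its directly connected vertices.
friendships = {
    "Ada": ["Grace", "Alan"],
    "Grace": ["Ada"],
    "Alan": ["Ada", "Edsger"],
    "Edsger": ["Alan"],
}

def neighbors(graph, vertex):
    """Return the vertices directly connected to the given vertex."""
    return graph.get(vertex, [])

print(neighbors(friendships, "Ada"))  # ['Grace', 'Alan']
```

This is the representation that shortest-path algorithms like BFS or Dijkstra's typically traverse.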

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo40hjr08ny0zs17ch4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo40hjr08ny0zs17ch4d.png" alt="Graphs data structure" width="491" height="212"&gt;&lt;/a&gt;&lt;/p&gt;
Graphs data structure (Source: GeeksforGeeks)



&lt;p&gt;&lt;strong&gt;Applications of Graphs&lt;/strong&gt;&lt;br&gt;
Systems that use a relationship (graph-like) structure: Social media applications like Facebook and LinkedIn to show user relationships, Maps and GPS systems, telecom and flight networks &lt;br&gt;
&lt;strong&gt;Advantages of Graphs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Useful in representing relationships between data and in solving problems like finding the shortest path between several points.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Graphs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Graphs can be complex to handle due to the different nodes and pointers&lt;/li&gt;
&lt;li&gt;Graphs use a lot of memory allocation because you need memory to store the addresses of the nodes and their pointers to the nodes connected to them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Heaps
&lt;/h3&gt;

&lt;p&gt;A heap is a special tree-based data structure that satisfies the heap property. Heaps come in two variants: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Max-Heap:&lt;/strong&gt; The root node must have a higher value than all its children (sub-trees)&lt;br&gt;
&lt;strong&gt;Min-Heap:&lt;/strong&gt; The root node must have a lower value than all its children (sub-trees)&lt;/p&gt;
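Python's built-in `heapq` module implements a min-heap on top of a plain list; a short sketch:

```python
import heapq  # Python's built-in min-heap

tasks = [5, 1, 8, 3]
heapq.heapify(tasks)         # rearranges the list to satisfy the min-heap property
print(tasks[0])              # 1: the smallest element is always at the root

heapq.heappush(tasks, 0)
print(heapq.heappop(tasks))  # 0: popping always returns the current minimum
```

A max-heap can be simulated with the same module by storing negated values.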

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zgx8ds6ywe6lj1hizql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zgx8ds6ywe6lj1hizql.png" alt="Heap data structure" width="800" height="479"&gt;&lt;/a&gt;&lt;br&gt;Heaps data structure - Source: &lt;a href="https://www.geeksforgeeks.org/heap-data-structure/" rel="noopener noreferrer"&gt;GeeksforGeeks&lt;/a&gt;&lt;br&gt;

 &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Heaps&lt;/strong&gt;&lt;br&gt;
Useful in scenarios where fast access to the highest or lowest element is needed. For example, in operating systems, to assign resources to specific tasks and Priority queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Heaps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient for scenarios where you need quick access to the largest (or smallest) element.&lt;/li&gt;
&lt;li&gt;Heaps typically use an array-based data structure; as such, they do not require extra memory for pointers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Heaps&lt;/strong&gt;&lt;br&gt;
Searching for elements in a heap requires traversing the entire heap.&lt;/p&gt;
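&lt;p&gt;As a quick illustration, Python's built-in &lt;code&gt;heapq&lt;/code&gt; module maintains a plain list as a min-heap, which gives exactly the fast access to the smallest element described above (the task names here are made up for the example):&lt;/p&gt;

```python
import heapq

# heapq keeps a plain Python list ordered as a min-heap:
# the smallest element is always at index 0.
tasks = []
heapq.heappush(tasks, (3, "send report"))
heapq.heappush(tasks, (1, "fix outage"))
heapq.heappush(tasks, (2, "review PR"))

smallest = tasks[0]  # peek at the minimum without removing it

# Popping repeatedly yields the tasks in priority order.
order = [heapq.heappop(tasks)[1] for _ in range(len(tasks))]
print(order)
```

&lt;p&gt;Pushing (priority, item) tuples like this is the usual way to build a simple priority queue on top of a heap.&lt;/p&gt;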

&lt;h3&gt;
  
  
  Tries
&lt;/h3&gt;

&lt;p&gt;A Trie is a special kind of tree-based data structure that stores a set of strings. It is known by many names, including prefix tree, digital search tree, and retrieval tree. Every node (except the root node) in the trie stores a single character of a string, and strings or words can be retrieved by traversing the trie. For example, if we have a set of strings {cat, bat, ball, rat, cap, be}, our trie will look like this. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zcpt4dfen1w69i8f09i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zcpt4dfen1w69i8f09i.png" alt="trie data structure" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;
Tries data structure (Source: Btechsmartclass)




&lt;p&gt;Any traversal from the root to the end of a path represents one of the words in our set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of Tries&lt;/strong&gt;&lt;br&gt;
Tries are very beneficial for solving problems related to strings, especially searching within them. Good examples are the autocomplete and spell-check features in applications. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Tries&lt;/strong&gt;&lt;br&gt;
Very efficient for searching elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Tries&lt;/strong&gt;&lt;br&gt;
Tries require a lot of memory for storing each of the strings. &lt;/p&gt;
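&lt;p&gt;To make the structure concrete, here is a minimal sketch of the {cat, bat, ball, rat, cap, be} trie from above, using nested Python dictionaries. The &lt;code&gt;"_end"&lt;/code&gt; marker is an assumption of this sketch; any sentinel that cannot be a character would do:&lt;/p&gt;

```python
def build_trie(words):
    # Each node is a dict mapping a character to its child node.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_end"] = True  # marks the end of a complete word
    return root

def search(trie, word):
    # Walk down one character at a time; fail if a link is missing.
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "_end" in node

trie = build_trie({"cat", "bat", "ball", "rat", "cap", "be"})
print(search(trie, "cap"))  # True
print(search(trie, "ca"))   # False: "ca" is only a prefix
```

&lt;p&gt;A lookup costs one step per character of the query, regardless of how many words are stored, which is why tries suit autocomplete so well.&lt;/p&gt;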

&lt;h3&gt;
  
  
  Hashtable
&lt;/h3&gt;

&lt;p&gt;Hashtable is a data structure that stores elements in an associative or dictionary-like manner (key/value pairs). A hash table uses a function, known as a hash function, acting on the key to compute an index location where the computer will store the value. To look a value up given a key, you hash the key and get back the location of the corresponding value. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhtferob7uyh92boc7gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhtferob7uyh92boc7gs.png" alt="Hash table" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;
Hash Table data structure (Source: Wikipedia)




&lt;p&gt;&lt;strong&gt;Applications of Hash Tables&lt;/strong&gt;&lt;br&gt;
Useful in scenarios where associative data is required. Key-value stores and caches, for example, are commonly implemented with hash tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Hashtable&lt;/strong&gt;&lt;br&gt;
They are efficient for fast look-ups and in cases where you have associative data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages of Hashtable&lt;/strong&gt;&lt;br&gt;
The ordering of elements in Hash tables is not guaranteed. &lt;/p&gt;
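&lt;p&gt;The key-to-index mechanics can be sketched in a few lines of Python. This toy table (the class name and bucket count are illustrative) hashes the key, takes the result modulo the number of buckets, and chains colliding pairs in a list:&lt;/p&gt;

```python
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # The hash function maps a key to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # collisions chain in the list

    def get(self, key):
        # Only one bucket is scanned, never the whole table.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("name", "Abhishek")
print(table.get("name"))  # Abhishek
```

&lt;p&gt;Python's own &lt;code&gt;dict&lt;/code&gt; works on the same principle, with a far more sophisticated collision strategy.&lt;/p&gt;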

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Data structures are essential to many computer algorithms because they allow programmers to manage data efficiently. The right data structure can significantly increase the performance of a computer program or algorithm. Ultimately, the data structures you choose will depend on how much data you have, what that data looks like, and what operations you want to perform.&lt;/p&gt;

&lt;p&gt;In the rest of the series, I will be going into depth about these data structures' concrete implementations, use cases, and the different operations we can perform on them. &lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>Unboxing a Database-How Databases Work Internally</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Fri, 30 Jul 2021 15:23:53 +0000</pubDate>
      <link>https://forem.com/gbengelebs/unboxing-a-database-how-databases-work-internally-155h</link>
      <guid>https://forem.com/gbengelebs/unboxing-a-database-how-databases-work-internally-155h</guid>
      <description>&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@nuvaproductions?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Javier Miranda&lt;/a&gt; on &lt;a href="https://unsplash.com/@nuvaproductions?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databases are one of those abstract, mysterious things that "just work". When you run an insert statement, where is the data stored? How is it stored? Why are queries so fast? What's underneath the black box of a database? Sometimes it all just feels like magic. &lt;/p&gt;

&lt;p&gt;It's 1 am in Lagos and I can't sleep. I pick up my phone and head to Google to help me demystify this black box. The next words you read are my attempt to unbox a database.&lt;/p&gt;

&lt;p&gt;My focus in this article will be on SQL databases, but I believe the underlying concepts carry over to other types of databases. Before we go on, let us define some terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;A database is a set of physical files (data) on a hard disk, stored and accessed electronically from a computer system. It is usually created with the &lt;strong&gt;CREATE DATABASE&lt;/strong&gt; statement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database management system
&lt;/h2&gt;

&lt;p&gt;A database management system is software that handles the storage, retrieval, and updating of data in a computer system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86sxwxawaoacafgf470t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86sxwxawaoacafgf470t.png" alt="dbms" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;
Popular database management systems



&lt;h2&gt;
  
  
  Database engine
&lt;/h2&gt;

&lt;p&gt;A database engine is the underlying software component that a database management system uses to create, read, update and delete data from a database.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the difference between a database management system and a database engine?
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;database management&lt;/strong&gt; system is the software with its functions that allow us to connect to a &lt;strong&gt;database engine&lt;/strong&gt;. The database engines are the internal tools that allow or facilitate a certain number of operations on the tables and their data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does a database management system store data?
&lt;/h2&gt;

&lt;p&gt;Most database management systems store data in files. MySQL, for example, stores data in files under a specific directory identified by the system variable "datadir". Opening a MySQL console and running the following command will tell you exactly where that directory is located.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt;  SHOW VARIABLES LIKE 'datadir';
+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| datadir       | /var/lib/mysql/ |
+---------------+-----------------+
1 row in set (0.01 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;a href="https://stackoverflow.com/questions/10378693/how-does-mysql-store-data" rel="noopener noreferrer"&gt;stack overflow answer&lt;/a&gt; explains it really well.&lt;/p&gt;

&lt;p&gt;As you can see from the above command, my "datadir" was located in /var/lib/mysql/. The location of the "datadir" may vary between systems. The directory contains folders and some configuration files. Each folder represents a MySQL database and contains files with data for that specific database; below is a screenshot of the "datadir" directory on my system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotqaz8lnxwoz2dajmdr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotqaz8lnxwoz2dajmdr7.png" alt="mysql" width="675" height="463"&gt;&lt;/a&gt;&lt;/p&gt;
a data dir folder in a system



&lt;p&gt;Each folder in the directory represents a MySQL database. Each database folder contains files that represent the tables in that database. There are two files for each table, one with a .frm extension and the other with a .ibd extension. See the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4m1q0u99tt8rezjjcq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4m1q0u99tt8rezjjcq9.png" alt="mysql2" width="681" height="403"&gt;&lt;/a&gt;&lt;/p&gt;
Files in a database folder



&lt;ul&gt;
&lt;li&gt;The .frm table file stores the table's format. &lt;a href="https://dev.mysql.com/doc/internals/en/frm-file-format.html" rel="noopener noreferrer"&gt;Details: MySQL .frm File Format&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The .ibd file stores the table's data. &lt;a href="https://dev.mysql.com/doc/refman/5.7/en/innodb-multiple-tablespaces.html" rel="noopener noreferrer"&gt;Details: InnoDB File-Per-Table Tablespaces&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we insert a record into a table, we are actually inserting into a datafile. A page (representing the rows of the table) is created in that datafile. By default, all datafiles have a page size of 16KB; you can reduce or increase the page size depending on the database engine you are using.&lt;/p&gt;

&lt;p&gt;As more and more records are inserted into the table (datafile), several data pages are created.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Pages Relate to Table Rows
&lt;/h2&gt;

&lt;p&gt;The maximum row length is slightly less than half a database page. For example, the maximum row length is slightly less than 8KB for the default 16KB InnoDB page size. For 64KB pages, the maximum row length is slightly less than 16KB.&lt;/p&gt;

&lt;p&gt;If a row does not exceed the maximum row length, all of its data is stored locally within the page. If a row exceeds the maximum row length, the database engine stores a 20-byte pointer to the overflow page locally in the row and stores the remaining data externally in overflow pages.&lt;/p&gt;

&lt;p&gt;These two articles do a wonderful job of describing how data pages look in sql server. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.c-sharpcorner.com/UploadFile/ff0d0f/how-sql-server-stores-data-in-data-pages-part-1/" rel="noopener noreferrer"&gt;how-sql-server-stores-data-in-data-pages-part1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.c-sharpcorner.com/UploadFile/ff0d0f/how-sql-server-stores-data-in-data-pages-part-2/" rel="noopener noreferrer"&gt;how-sql-server-stores-data-in-data-pages-part2&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let us assume we have a table (tblEmployees) and we insert a single record into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;INSERT&lt;/span&gt; &lt;span class="n"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;tblEmployees&lt;/span&gt; &lt;span class="nf"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;Abhishek&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a sample data page for that insertion into the datafile. It is divided into three main sections:&lt;/p&gt;

&lt;h3&gt;
  
  
  Section 1: Page Header
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hmh4p65rcqnj85epihc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hmh4p65rcqnj85epihc.jpeg" alt="page-header" width="590" height="497"&gt;&lt;/a&gt;&lt;/p&gt;
Page Header



&lt;ul&gt;
&lt;li&gt;m_type = 1 indicates that it is a data page.&lt;/li&gt;
&lt;li&gt;m_nextpage: This is the link to the memory location of the next data page that will be created; in this case, we have a single data page, so it is (0:0).&lt;/li&gt;
&lt;li&gt;m_Prevpage: This is the link to the memory location of the previous data page. Since we have a single data page, the value is (0:0). &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Section 2: Actual Data
&lt;/h3&gt;

&lt;p&gt;The actual data that we insert into our table is stored in this section. If you remember, we inserted 1 record with an employee named "Abhishek". That record will be saved in this section, as shown below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnmrbzwtu5xun058tv07.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnmrbzwtu5xun058tv07.jpeg" alt="actual-data" width="519" height="493"&gt;&lt;/a&gt;&lt;/p&gt;
Actual Data



&lt;ul&gt;
&lt;li&gt;Record Type = PRIMARY_RECORD, which means it's our actual data.&lt;/li&gt;
&lt;li&gt;Memory Dump = This points to the Actual data's location in memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Section 3: Offset Table
&lt;/h3&gt;

&lt;p&gt;Offset Table: This section of the data file tells you exactly where the record "Abhishek" is saved in memory.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxp7ska6onsxbztdwvw2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxp7ska6onsxbztdwvw2.jpeg" alt="offset-table" width="577" height="162"&gt;&lt;/a&gt;&lt;/p&gt;
Offset Table



&lt;p&gt;If you look at the row offset, you will see it points to the actual data's location.&lt;/p&gt;

&lt;p&gt;These diagrams show how rows are stored in a datafile.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does indexing work?
&lt;/h2&gt;

&lt;p&gt;A database index is a data structure that improves the speed of data retrieval operations on a database table.&lt;/p&gt;

&lt;p&gt;Indexing is a way of putting an unordered table into an order that maximizes query efficiency. A Clustered Index is a special type of index that reorders the way records in the table are physically stored on the disk. So how does it work?&lt;/p&gt;

&lt;p&gt;In reality, the database table does not reorder itself every time the query conditions change to optimize query performance. What happens is that creating an index causes the database to build a data structure, which in most cases is a B+Tree. The main advantage of this data structure is that it keeps its keys sorted, and this makes our search more efficient.&lt;/p&gt;

&lt;p&gt;A B+Tree is a type of dictionary, no more and no less. If you think about a linguistic dictionary, it's ordered by "words", and associated with each word is a definition. You look up a word and get a definition.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yrqlqco1psk4ex01zpi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yrqlqco1psk4ex01zpi.jpg" alt="Indexed Dictionary" width="740" height="416"&gt;&lt;/a&gt;&lt;/p&gt;
An Indexed Dictionary



&lt;p&gt;So in the context of a map data structure, you have keys ("words") and you want to map them to values ("definitions").&lt;/p&gt;

&lt;p&gt;B+Trees have an advantage for certain types of queries. For example, they support range queries: finding all entries where the key is between two values (e.g. all words in the dictionary starting with "q").&lt;/p&gt;

&lt;p&gt;B+Trees are page-structured (meaning they can be implemented on top of fixed-size disk pages), which minimizes the number of disk accesses needed to perform a query.&lt;/p&gt;
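&lt;p&gt;The range-query advantage is easy to demonstrate. The sketch below stands in for a B+Tree with a plain sorted list: Python's &lt;code&gt;bisect&lt;/code&gt; finds the boundaries of the range in logarithmic time, much as a B+Tree descends to the first matching leaf and then scans forward (the word list is invented for the example):&lt;/p&gt;

```python
import bisect

# Keys kept in sorted order, as a B+Tree keeps its leaf entries.
words = sorted(["quart", "queen", "quiet", "rat", "pint", "oak"])

# All entries with keys in the half-open range ["q", "r"),
# i.e. every word starting with "q".
lo = bisect.bisect_left(words, "q")
hi = bisect.bisect_left(words, "r")
print(words[lo:hi])  # ['quart', 'queen', 'quiet']
```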
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Let us assume we have a table called Employee_Detail. We can create a clustered index on the Emp_Iid column with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create Clustered Index My_ClusteredIndex  
on Employee_Detail(Emp_Iid) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's insert some records&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head over to this &lt;a href="https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html" rel="noopener noreferrer"&gt;site&lt;/a&gt; and insert records from 1 to 6 simulating how records will be inserted in a database. You will see how the tree automatically adjusts as records are being inserted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Another thing to note: the data value locations never change, but the pointers to those values are constantly shifting.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The B+Tree will be formed like this: the center point of the records, which in our case is 3, will be the head node. All the Ids that are lower than 3 will be moved to the left and the Ids greater than 3 to the right, as shown in this diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ahdte53ai5d1vshxwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ahdte53ai5d1vshxwv.png" alt="btree" width="688" height="407"&gt;&lt;/a&gt;&lt;/p&gt;
BTree Visualized



&lt;p&gt;The left-side value of each node is always less than the node itself, and the right-side value is always greater than the node. The last set of values are called leaf nodes, and they contain the actual data values, while the intermediate rows hold pointers to the actual data value locations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Think of it like a dictionary that contains a name tag. All the words with "c" are labeled under the "c" tag. Words higher than "c" are shifted to the right and words lower than "c" to the left. The tag "c" does not contain the value but a pointer to the actual words&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;From the earlier explanation on how SQL stores data in data pages we can infer that the leaf nodes represent data pages containing the table rows. &lt;/p&gt;

&lt;p&gt;If we want to get the employees where Emp_Iid is 4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from employee_Detail where Emp_Iid=4  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a normal case, the system will perform 4 comparisons: the first for 1, the second for 2, the third for 3, and on the fourth comparison it will find the desired result.&lt;/p&gt;

&lt;p&gt;Using an index, the system does far fewer comparisons: because 3 is the head node of the B+Tree, it knows that 4 is greater than 3, so the record will be on the right. Once it checks the next key, it will find a pointer to the data value 4, which is the value being requested.&lt;/p&gt;

&lt;p&gt;From this example, we can say that by using an index we can increase the speed of data retrieval.&lt;/p&gt;
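&lt;p&gt;We can make the comparison counts concrete with a rough Python simulation. It models the index as binary search over sorted keys, which halves the candidates per comparison the way a balanced tree does at each level it descends; this is an illustration, not the engine's actual code:&lt;/p&gt;

```python
def linear_comparisons(keys, target):
    # Without an index: scan row by row, counting comparisons.
    for count, key in enumerate(keys, start=1):
        if key == target:
            return count
    return len(keys)

def tree_comparisons(keys, target):
    # With an index: halve the sorted key range on each comparison.
    lo, hi, count = 0, len(keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if keys[mid] == target:
            return count
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return count

ids = [1, 2, 3, 4, 5, 6]
print(linear_comparisons(ids, 4))  # 4
print(tree_comparisons(ids, 4))    # 1
```

&lt;p&gt;The gap widens quickly: for a million rows, the scan averages hundreds of thousands of comparisons, while the halving strategy needs about twenty.&lt;/p&gt;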

&lt;h2&gt;
  
  
  Components of a Database Engine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0wqk2vpxfvi7br0dvce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0wqk2vpxfvi7br0dvce.png" alt="database-engine" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;
Components of a database engine



&lt;p&gt;Most SQL database engines have a compiler to translate the SQL statement into byte code and a virtual machine to evaluate that byte code. &lt;/p&gt;

&lt;p&gt;The RDBMS processes the SQL statement by:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Parsing&lt;/strong&gt;: Validates the statement by checking it against the system’s catalog to see whether the databases, tables, and columns the user references exist, and whether the user has privileges to execute the SQL query.&lt;br&gt;
The parsing stage consists of a syntax check, a semantic check, and a shared pool check.&lt;/p&gt;
&lt;h3&gt;
  
  
  Syntax check
&lt;/h3&gt;

&lt;p&gt;A statement that breaks a rule for well-formed SQL syntax fails the check. For example, the following statement fails because the keyword FROM is misspelled as FORM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT * FORM employees;
SELECT * FORM employees
         *
ERROR at line 1:
ORA-00923: FROM keyword not found where expected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Semantic Check
&lt;/h3&gt;

&lt;p&gt;The semantics of a statement is its meaning. A semantic check determines whether a statement is meaningful, for example, whether the objects and columns in the statement exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT * FROM nonexistent_table;
SELECT * FROM nonexistent_table
              *
ERROR at line 1:
ORA-00942: table or view does not exist

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shared Pool Check
&lt;/h3&gt;

&lt;p&gt;During the parse, the database performs a shared pool check to determine whether it can skip resource-intensive steps of statement processing.&lt;/p&gt;

&lt;p&gt;To this end, the database uses a hashing algorithm to generate a hash value for every SQL statement.&lt;/p&gt;
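&lt;p&gt;The idea can be sketched in Python: hash the statement text, and two byte-for-byte identical statements produce the same value, so the cached parse work can be reused. SHA-256 here is purely illustrative; each RDBMS uses its own internal hashing scheme:&lt;/p&gt;

```python
import hashlib

def statement_hash(sql):
    # Hash the raw statement text, as the shared pool check does.
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()

a = statement_hash("SELECT * FROM employees")
b = statement_hash("SELECT * FROM employees")
c = statement_hash("select * from employees")  # text differs, so no match

print(a == b)  # True: identical text, so cached work can be reused
print(a == c)  # False: even a case change produces a new hash
```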

&lt;p&gt;&lt;strong&gt;2. Compiling (Binding)&lt;/strong&gt;: Generates a query plan for the statement, which is a binary representation of the steps required to carry out the statement. In almost all SQL engines, this will be byte code. The compiled statement is then handed off for optimization and execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Optimizing&lt;/strong&gt;: Optimizes the query plan and chooses the best algorithms for operations such as searching and sorting. The component responsible is called the Query Optimizer. The Query Optimizer devises several possible ways to execute the query, i.e. several possible execution plans. An execution plan is, in essence, a set of physical operations (an index seek, a nested loop join, and so on) to be performed.&lt;br&gt;
Once this is done, we now have a prepared SQL statement.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;This example shows the execution plan of a SELECT statement when AUTOTRACE is enabled. The statement selects the last name, job title, and department name for all employees whose last names begin with the letter A.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT e.last_name, j.job_title, d.department_name 
FROM   hr.employees e, hr.departments d, hr.jobs j
WHERE  e.department_id = d.department_id
AND    e.job_id = j.job_id
AND    e.last_name LIKE 'A%';

Execution Plan
----------------------------------------------------------
Plan hash value: 975837011

--------------------------------------------------------------------------------
| Id| Operation                     | Name        |Rows|Bytes|Cost(%CPU)|Time  |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT              |             |  3 | 189 | 7(15)| 00:00:01 |
|*1 |  HASH JOIN                    |             |  3 | 189 | 7(15)| 00:00:01 |
|*2 |   HASH JOIN                   |             |  3 | 141 | 5(20)| 00:00:01 |
| 3 |    TABLE ACCESS BY INDEX ROWID| EMPLOYEES   |  3 |  60 | 2 (0)| 00:00:01 |
|*4 |     INDEX RANGE SCAN          | EMP_NAME_IX |  3 |     | 1 (0)| 00:00:01 |
| 5 |    TABLE ACCESS FULL          | JOBS        | 19 | 513 | 2 (0)| 00:00:01 |
| 6 |   TABLE ACCESS FULL           | DEPARTMENTS | 27 | 432 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
   2 - access("E"."JOB_ID"="J"."JOB_ID")
   4 - access("E"."LAST_NAME" LIKE 'A%')
       filter("E"."LAST_NAME" LIKE 'A%')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Executing&lt;/strong&gt;: The RDBMS executes the SQL statement by running the query plan.&lt;/p&gt;

&lt;p&gt;For an in-depth view, check out this &lt;a href="https://docs.oracle.com/database/121/TGSQL/tgsql_sqlproc.htm#TGSQL186" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article has covered a lot of ground, but by now you should have an understanding (or at least an appreciation) of the components and processes that form the databases we use every day.&lt;/p&gt;

&lt;p&gt;Thank you for reading.&lt;/p&gt;

&lt;p&gt;Follow me here and across my social media for more content like this &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;. &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  REFERENCES AND MORE
&lt;/h2&gt;

&lt;p&gt;1.&lt;a href="https://blog.devgenius.io/how-a-sql-database-engine-works-c319200889d7" rel="noopener noreferrer"&gt;How a sql database engine works by Andres reyes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.&lt;a href="https://medium.com/@grepdennis/how-a-sql-database-engine-works-c67364e5cdfd" rel="noopener noreferrer"&gt;How a sql database engine works by Dennis Pham&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.&lt;a href="https://www.red-gate.com/simple-talk/databases/sql-server/performance-sql-server/the-sql-server-query-optimizer/" rel="noopener noreferrer"&gt;The sql server query optimizer by Benjamin Nevarez&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.&lt;a href="https://www.freecodecamp.org/news/database-indexing-at-a-glance-bb50809d48bd/" rel="noopener noreferrer"&gt;An in-depth look at Database Indexing by Kousik Nath&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5.&lt;a href="https://dzone.com/articles/database-btree-indexing-in-sqlite" rel="noopener noreferrer"&gt;Database btree indexing in sqlite by Dhanushka Madushan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6.&lt;a href="https://www.sqlskills.com/blogs/paul/inside-the-storage-engine-using-dbcc-page-and-dbcc-ind-to-find-out-if-page-splits-ever-roll-back/" rel="noopener noreferrer"&gt;Inside the storage engine by Paul Randal&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7.&lt;a href="https://cs.stackexchange.com/questions/27985/b-tree-and-how-it-is-used-in-practice" rel="noopener noreferrer"&gt;B-tree and how it is used in practice answered by Pseudonym&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;8.&lt;a href="https://www.c-sharpcorner.com/UploadFile/f0b2ed/index-in-sql/" rel="noopener noreferrer"&gt;Index in sql by Pankaj Kumar Choudhary&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;9.&lt;a href="https://www.c-sharpcorner.com/UploadFile/ff0d0f/how-sql-server-stores-data-in-data-pages-part-1/" rel="noopener noreferrer"&gt;How sql server stores data in data pages part 1 by Abhishek Yadav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;10.&lt;a href="https://www.c-sharpcorner.com/UploadFile/ff0d0f/how-sql-server-stores-data-in-data-pages-part-2/" rel="noopener noreferrer"&gt;How sql server stores data in data pages part 2 by Abhishek Yadav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;11.&lt;a href="https://docs.oracle.com/database/121/TGSQL/tgsql_sqlproc.htm#TGSQL186" rel="noopener noreferrer"&gt;SQL Processing by Oracle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;12.&lt;a href="https://stackoverflow.com/questions/2468202/how-does-a-sql-query-work" rel="noopener noreferrer"&gt;How does a sql query work&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;13.&lt;a href="https://hackernoon.com/how-sql-database-engine-work-483e32o7" rel="noopener noreferrer"&gt;How sql database engine works by Vijay Singh Khatri&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How Search Engines Work: Finding a Needle in a Haystack</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Fri, 16 Jul 2021 15:10:51 +0000</pubDate>
      <link>https://forem.com/gbengelebs/how-search-engines-work-finding-a-needle-in-a-haystack-4lnp</link>
      <guid>https://forem.com/gbengelebs/how-search-engines-work-finding-a-needle-in-a-haystack-4lnp</guid>
      <description>&lt;p&gt;Every time I go to bed I consider myself lucky to have been born at a time like this. In one-tenth of a second I can find out information about anything. It's incredible!.&lt;/p&gt;

&lt;p&gt;As of June 18, 2021, there are over 1.86 billion websites online. A report by &lt;a href="https://sitefy.co/" rel="noopener noreferrer"&gt;sitefy&lt;/a&gt; estimates that 547,200 new websites are created every day. It's estimated that Google processes approximately 63,000 search queries every second, translating to 5.6 billion searches per day and approximately 2 trillion global searches per year. These are outrageous numbers by any standard, yet to the end-user it's as easy as the click of a button: the true beauty of distributed systems. &lt;/p&gt;

&lt;p&gt;If you have ever wondered how search results get delivered to you in a split second, then this article is for you. I will be explaining how search engines work and the different components that work together to deliver results to you - FAST.&lt;/p&gt;

&lt;h1&gt;
  
  
  How do search engines work?
&lt;/h1&gt;

&lt;p&gt;On a basic level, search engines do three main things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Organize the entire web in their database&lt;/li&gt;
&lt;li&gt;Instantly match your search query&lt;/li&gt;
&lt;li&gt;Present the results in the way you want them&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Organize the web on their database
&lt;/h2&gt;

&lt;p&gt;Contrary to what you might first think, search engines do not actually search the live internet when you type in a query. That would take years. A lot goes on behind the scenes to make sense of the internet before you ever type in your search. So when you hit that little magnifying glass, the engine only searches its own organized version of the internet, which is a lot faster.&lt;/p&gt;

&lt;p&gt;Search Engines organize the web through three fundamental actions: crawling, indexing, and ranking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3es2kzhx5ixyelr2cvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3es2kzhx5ixyelr2cvu.png" alt="how-search-engines-work" width="600" height="95"&gt;&lt;/a&gt;&lt;/p&gt;
Search Engines Actions



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CRAWLING&lt;/strong&gt;
A search engine navigates the web by downloading web pages and following anchor links on these pages to discover new pages that have been made available. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These search engine bots, popularly called spiders, start from a seed, a list of known URLs, and crawl the web pages at those URLs first. As they crawl those web pages, they find hyperlinks to other pages and add those to the list of pages to crawl next. They keep going until they have crawled as much of the web as they can reach. &lt;/p&gt;
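
&lt;p&gt;As a rough sketch, the crawl loop above is a breadth-first traversal of the link graph. Here the link graph is a small in-memory stand-in for real page fetches, and the URLs are invented for illustration:&lt;/p&gt;

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches:
# each page URL maps to the hyperlinks found on that page.
LINK_GRAPH = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}

def crawl(seeds):
    """Breadth-first crawl starting from a list of seed URLs."""
    frontier = deque(seeds)   # pages waiting to be crawled
    seen = set(seeds)         # avoid crawling the same page twice
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl(["a.com"]))  # ['a.com', 'b.com', 'c.com', 'd.com']
```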

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk764vq96xiv5x9x37xz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk764vq96xiv5x9x37xz.PNG" alt="web-crawler" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;
web crawler: Image from code.org



&lt;p&gt;Interestingly enough, only about 70% of the internet has been crawled. This is because there isn't a central registry of all web pages, so Google must constantly search for new pages and add them to its list of known pages. Other pages are discovered when a website owner submits a list of pages (a sitemap) for Google to crawl.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the web crawler know what to crawl?
&lt;/h3&gt;

&lt;p&gt;A web crawler will follow certain policies that make it more selective about which pages to crawl, in what order to crawl them, and how often to crawl them again to check for content updates. Some of these policies are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Web page importance - Crawlers consider the number of active visitors to a site and the number of other pages that link to it. If people are visiting the site and referencing its content, then it likely provides reliable information. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The robots.txt file - A robots.txt file is a set of instructions for bots, included in the source files of most websites, that guides web bots on what to crawl. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
Check out Netflix's own &lt;a href="https://www.netflix.com/robots.txt" rel="noopener noreferrer"&gt;robots.txt&lt;/a&gt; file. &lt;/p&gt;
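
&lt;p&gt;Python's standard library can interpret these files directly. A minimal sketch, parsing the directives from an in-memory string rather than fetching a live file (the rules and URLs are invented):&lt;/p&gt;

```python
from urllib import robotparser

# A small robots.txt, supplied as text instead of fetched over HTTP.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)  # read the directives line by line

print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))   # False
```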

&lt;p&gt;Some examples of popular bots are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Google: Googlebot (actually two crawlers, Googlebot Desktop and Googlebot Mobile, for desktop and mobile searches)&lt;/li&gt;
&lt;li&gt;Bing: Bingbot&lt;/li&gt;
&lt;li&gt;Yandex (Russian search engine): Yandex Bot&lt;/li&gt;
&lt;li&gt;Baidu (Chinese search engine): Baidu Spider.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;INDEXING&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine having to look up a word in a dictionary whose entries are not arranged alphabetically, or one without a tab marking each letter. Now compare that to a dictionary that has both. Searching the second dictionary will be considerably faster than searching the first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yrqlqco1psk4ex01zpi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yrqlqco1psk4ex01zpi.jpg" alt="Indexed Dictionary" width="740" height="416"&gt;&lt;/a&gt;&lt;/p&gt;
An Indexed Dictionary



&lt;p&gt;Indexing arranges web pages in a way that information can be easily retrieved from them.&lt;/p&gt;

&lt;p&gt;After a page has been crawled, its URL, along with the full HTML document of the page, is sent to the store server, which compresses and stores it in the database. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.&lt;/p&gt;

&lt;p&gt;An indexer function reads the pages from the database, uncompresses the documents, and parses them to form an inverted index.&lt;/p&gt;

&lt;p&gt;An inverted index is a database index which stores a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwjsdzlsxk9xubpga0ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwjsdzlsxk9xubpga0ob.png" alt="inverted index" width="588" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
A sample Inverted Index



&lt;p&gt;A sentence is broken into words, stripped of stop words, and stored in a row alongside the frequency of occurrence of each word and its position in the HTML document (stop words are words that occur regularly in any sentence, such as "the", "is", "at", "which").&lt;/p&gt;

&lt;p&gt;Take a query, for example &lt;em&gt;How long does it take to travel to mars&lt;/em&gt;: the meaningful query words ("travel" and "mars") are searched independently in the inverted index. If "travel" occurs in documents (2, 4, 6, 8) and "mars" occurs in documents (3, 7, 8, 11, 19), you can take the intersection of both lists, (8) in this case, which gives the documents in which both query words occur. &lt;/p&gt;
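
&lt;p&gt;As a rough sketch, the lookup-and-intersect step looks like this, with the document IDs taken from the example above:&lt;/p&gt;

```python
# Toy inverted index: each word maps to the set of document IDs containing it.
INDEX = {
    "travel": {2, 4, 6, 8},
    "mars":   {3, 7, 8, 11, 19},
}

def search(query):
    """Look up each query word, then intersect the posting lists."""
    postings = [INDEX.get(word, set()) for word in query.split()]
    if not postings:
        return set()
    result = postings[0]
    for p in postings[1:]:
        result = result.intersection(p)  # keep documents containing every word
    return result

print(search("travel mars"))  # {8}
```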

&lt;p&gt;Storing the documents this way makes search considerably faster. Once a search query comes in, the search engine just needs to look up the query words, intersect their document lists, and pass the result through its own custom ranking algorithm to deliver the results.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2zrm74dd9bi2ts8hmdj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2zrm74dd9bi2ts8hmdj.PNG" alt="inverted index" width="729" height="578"&gt;&lt;/a&gt;&lt;/p&gt;
Inverted Index processing



&lt;p&gt;&lt;a href="https://community.hitachivantara.com/s/article/search-the-inverted-index" rel="noopener noreferrer"&gt;Search the Inverted Index&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RANKING&lt;/strong&gt;
The algorithms used to rank the most relevant results differ for each search engine. For example, a page that ranks highly for a search query in Google may not rank highly for the same query in Bing. The ranking is accomplished by assessing a number of different factors based on an end user’s query for quality and relevancy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact algorithms of these search engines are kept secret by the companies, but educated guesses can be made based on certain factors. &lt;br&gt;
In addition to the search query, search engines use other relevant data to return results, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page quality - Several factors contribute to page quality, such as how many links reference the page, how many visits the site receives, and how long users spend on the page. These factors determine whether the page is high quality and whether search engines should rank it highly.&lt;/li&gt;
&lt;li&gt;Location – Some search queries are location-dependent e.g. ‘restaurants near me’ or ‘movie times’.&lt;/li&gt;
&lt;li&gt;Language detected – Search engines will return results in the language of the user if they can be detected.&lt;/li&gt;
&lt;li&gt;Previous search history – Search engines will return different results for a query dependent on what the user has previously searched for.&lt;/li&gt;
&lt;li&gt;Device – A different set of results may be returned based on the device from which the query was made.&lt;/li&gt;
&lt;li&gt;Freshness of data - How recently the content was updated, for queries where the user needs up-to-date information.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What database do search engines use to store data?
&lt;/h3&gt;

&lt;p&gt;Google uses Bigtable, a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. &lt;a href="http://research.google.com/archive/bigtable.html" rel="noopener noreferrer"&gt;BigTable&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bigtable is not a relational database. Tables consist of rows and columns, and each cell has a timestamp. There can be multiple versions of a cell with different time stamps. The timestamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."&lt;a href="https://stackoverflow.com/questions/362956/what-database-does-google-use" rel="noopener noreferrer"&gt;What Database does Google Use?&lt;/a&gt;.&lt;/p&gt;
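
&lt;p&gt;A toy sketch of timestamped cell versions, supporting the two operations mentioned (select the n newest versions, delete versions older than a cutoff). The class and its layout are illustrative only, not Bigtable's actual storage format:&lt;/p&gt;

```python
class Cell:
    """One Bigtable-style cell: multiple timestamped versions of a value."""
    def __init__(self):
        self.versions = {}  # timestamp maps to the value stored at that time

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def latest(self, n):
        """Select the n newest versions, newest first."""
        stamps = sorted(self.versions, reverse=True)[:n]
        return [(t, self.versions[t]) for t in stamps]

    def delete_older_than(self, cutoff):
        """Delete versions older than a specific timestamp."""
        # max(t, cutoff) == t keeps only timestamps at or after the cutoff
        self.versions = {t: v for t, v in self.versions.items() if max(t, cutoff) == t}

# The contents column from the paper's example has versions at t3, t5, t6.
page = Cell()
page.put(3, "contents v1")
page.put(5, "contents v2")
page.put(6, "contents v3")
print(page.latest(2))  # [(6, 'contents v3'), (5, 'contents v2')]
```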

&lt;p&gt;Here is a statement from Google's Big table research paper.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2k6q2ow1h3lrmwxo127.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2k6q2ow1h3lrmwxo127.PNG" alt="Big Table" width="692" height="157"&gt;&lt;/a&gt;&lt;/p&gt;
A slice of Big Table Row



&lt;blockquote&gt;A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
&lt;/blockquote&gt;

&lt;p&gt;To manage these huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. A tablet is around 200 MB, and each machine stores about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing: if one tablet is receiving many queries, its machine can shed other tablets or move the busy tablet to another machine that is not so busy. Also, if a machine goes down, its tablets may be spread across many other servers so that the performance impact on any given machine is minimal.&lt;/p&gt;
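
&lt;p&gt;A minimal sketch of that split-and-spread idea, with invented row keys (reversed URLs, as in the paper) and a simple round-robin placement instead of Bigtable's real assignment logic:&lt;/p&gt;

```python
def split_into_tablets(rows, max_rows_per_tablet):
    """Split a sorted list of row keys into contiguous tablets."""
    tablets = []
    for i in range(0, len(rows), max_rows_per_tablet):
        chunk = rows[i:i + max_rows_per_tablet]
        tablets.append((chunk[0], chunk[-1]))  # tablet = first and last row key
    return tablets

def assign_to_servers(tablets, servers):
    """Spread tablets across servers round-robin, so one table's load is shared."""
    placement = {s: [] for s in servers}
    for n, tablet in enumerate(tablets):
        placement[servers[n % len(servers)]].append(tablet)
    return placement

# Reversed-URL row keys, invented for illustration.
rows = sorted(["com.cnn.www", "com.cnnsi", "ca.look.my",
               "com.sportsillustrated", "org.example", "net.demo"])
tablets = split_into_tablets(rows, 2)
placement = assign_to_servers(tablets, ["server-a", "server-b"])
print(placement)
```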

&lt;h2&gt;
  
  
  2. How are search engines so fast? (Instantly matching your search)
&lt;/h2&gt;

&lt;p&gt;If you have N computers and want to search a large document collection, you can split the collection into N ranges and let each computer search one range. With this parallelization, you can speed up search queries significantly. Your query is split into individual words and sent to a group of computers that each handle part of the work. If you add an already-inverted index like the one we discussed earlier, speed improves further. And if you store this indexed information in the RAM of the database servers, you take speed to another level (information in RAM is about 30 times faster to access than the real-world performance of an SSD). &lt;/p&gt;
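
&lt;p&gt;A minimal sketch of that fan-out using a thread pool, where each in-memory shard stands in for an index range on a separate machine (the documents are invented):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document shards, standing in for ranges held on separate machines.
SHARDS = [
    {1: "travel to mars", 2: "cooking at home"},
    {3: "mars rover news", 4: "travel on mars"},
]

def search_shard(shard, word):
    """Each worker scans only its own shard of documents."""
    return [doc_id for doc_id, text in shard.items() if word in text.split()]

def parallel_search(word):
    """Fan the query out to all shards at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(search_shard, SHARDS, [word] * len(SHARDS)))
    hits = []
    for partial in partials:
        hits.extend(partial)
    return sorted(hits)

print(parallel_search("mars"))  # [1, 3, 4]
```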

&lt;p&gt;Another interesting point is that search engines do not have to return the absolute freshest, most complete results every time. They can cache results and simply show the ones already ranked highly for similar queries, or show cached results they have shown you before.&lt;/p&gt;

&lt;p&gt;Search engines also predict words for you. This speeds up the search process and reduces load on the server by preparing ready-made search results in advance. As soon as you start typing, Google has already begun looking up the first N results, so that once you hit enter, the results are already there.&lt;/p&gt;
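
&lt;p&gt;Prefix prediction can be sketched as matching what the user has typed so far against a log of previously popular queries (the query log here is invented):&lt;/p&gt;

```python
# Hypothetical log of popular past queries, best-ranked first.
POPULAR_QUERIES = ["how search engines work", "how to bake bread", "weather today"]

def suggest(prefix, limit=3):
    """Return popular queries that start with what the user has typed so far."""
    matches = [q for q in POPULAR_QUERIES if q.startswith(prefix)]
    return matches[:limit]

print(suggest("how "))  # ['how search engines work', 'how to bake bread']
```

A real engine would serve these from a precomputed prefix structure rather than a linear scan, but the contract is the same: typed prefix in, ranked completions out.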

&lt;p&gt;In summary, Search engines return relevant results in response to user queries quickly by using their index and taking advantage of parallelism to break up the large search problem into many small pieces that are solved in parallel.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Presenting the results in the way you want them
&lt;/h2&gt;

&lt;p&gt;Now let us walk through the process from end to end, starting when I type into my search bar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpih68klszffcx3h4b8g0.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpih68klszffcx3h4b8g0.PNG" alt="micheal-jordan-search-query" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;
Sample search Query



&lt;p&gt;My browser first completes a DNS lookup mapping &lt;a href="http://www.google.com" rel="noopener noreferrer"&gt;www.google.com&lt;/a&gt; to a specific IP address. At this stage, Google’s DNS load balancer determines which cluster of computers at which of Google’s data centers will process the query. Once a data center has been determined, the query is transmitted via HTTP to that data center and its clusters of servers. &lt;/p&gt;

&lt;p&gt;Upon arrival at the data center cluster, each query is greeted by Google’s second load balancer. The Google hardware load balancer consists of 10 to 15 machines and determines which machines are available to process the query.&lt;/p&gt;

&lt;p&gt;The query is then split into words and executed, simultaneously hitting 300 to 400 back-end machines representing Google’s verticals, advertising, and spell check among others. At this point, the best results are gathered and the query data returns to the Google Mixer. &lt;/p&gt;

&lt;p&gt;The mixer takes this data, blends universal elements with ads while ordering results based on relevancy criteria set by Google's ranking algorithm. The ordered results then go back to the Google web server for HTML coding. Once the HTML is completed and pages are formatted, the search engine results are marked “done” by the load balancer and returned to the user as search engine results pages (SERPs). The entire process takes about 3 centiseconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u8ye14zz34yofx3jr24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u8ye14zz34yofx3jr24.PNG" alt="micheal=jordan-search-query-result" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;
Search Query Result



&lt;p&gt;Follow me here and across my social media for more content like this: &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  SUMMARY
&lt;/h3&gt;

&lt;p&gt;Search engines have changed how we access information and understanding their anatomy helps us to truly appreciate the beauty behind the madness.&lt;/p&gt;

&lt;h2&gt;
  
  
  REFERENCES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/" rel="noopener noreferrer"&gt;What Is a Web Crawler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cloudflare.com/learning/bots/what-is-robots.txt/" rel="noopener noreferrer"&gt;What Is robots.txt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://research.google.com/archive/bigtable-osdi06.pdf" rel="noopener noreferrer"&gt;Bigtable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/search/howsearchworks/" rel="noopener noreferrer"&gt;How Search Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ahrefs.com/blog/how-do-search-engines-work/" rel="noopener noreferrer"&gt;How Do Search Engines Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.deepcrawl.com/knowledge/technical-seo-library/how-do-search-engines-work/" rel="noopener noreferrer"&gt;How Do Search Engines Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blogoscoped.com/archive/2008-07-08-n70.html" rel="noopener noreferrer"&gt;Behind the Scenes of a Google Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/47524311/how-does-google-perform-search-for-any-given-query-so-quickly-over-so-many-do" rel="noopener noreferrer"&gt;Stack Overflow: How does Google perform search so fast?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=LVV_93mBfSU&amp;amp;t=142s" rel="noopener noreferrer"&gt;How Google searches one document among billions of documents quickly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.hitachivantara.com/s/article/search-the-inverted-index" rel="noopener noreferrer"&gt;Search: The Inverted Index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/334.pdf" rel="noopener noreferrer"&gt;The Anatomy of a Large-Scale Hypertextual Web Search Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf" rel="noopener noreferrer"&gt;Bigtable: A Distributed Storage System for Structured Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=LVV_93mBfSU" rel="noopener noreferrer"&gt;The Internet: How Search Works&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>google</category>
      <category>searchengines</category>
      <category>cloudcomputing</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Netflix System Design- Backend Architecture</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Thu, 24 Jun 2021 08:09:38 +0000</pubDate>
      <link>https://forem.com/gbengelebs/netflix-system-design-backend-architecture-10i3</link>
      <guid>https://forem.com/gbengelebs/netflix-system-design-backend-architecture-10i3</guid>
      <description>&lt;p&gt;&lt;em&gt;Cover Photo by &lt;a href="https://unsplash.com/@alexbemore?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Alexander Shatov&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/netflix?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Netflix accounts for about 15% of the world's internet bandwidth traffic, serving over 6 billion hours of content per month to nearly every country in the world. Building a robust, highly scalable, reliable, and efficient backend system is no small engineering feat, but the ambitious team at Netflix has proven that problems exist to be solved.&lt;/p&gt;

&lt;p&gt;This article analyzes the Netflix system architecture as researched from online sources. Section 1 provides a simplified overview of the Netflix system. Section 2 provides an overview of the backend architecture, and section 3 provides a detailed look at the individual system components. For a complete guide on modern system design, you can try &lt;a href="https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers" rel="noopener noreferrer"&gt;grokking the modern system design for engineers course.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Overview
&lt;/h2&gt;

&lt;p&gt;Netflix operates in two clouds: Amazon Web Services and Open Connect (Netflix's content delivery network).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rq466utjqzmkvt5jjrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rq466utjqzmkvt5jjrz.png" alt="Netflix-High-Level-System-Architecture" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The overall Netflix system consists of three main parts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open Connect Appliances (OCAs)&lt;/strong&gt; - Open Connect is Netflix’s custom global content delivery network (CDN). These OCA servers are placed inside internet service provider (ISP) and internet exchange point (IXP) networks around the world to deliver Netflix content to users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt; — A client is any device from which you play Netflix videos. This consists of all the applications that interface with the Netflix servers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Netflix supports many different devices, including smart TVs, Android and iOS platforms, gaming consoles, etc. All these apps are written using platform-specific code. The Netflix web app is written using ReactJS, a choice influenced by several factors, including startup speed, runtime performance, and modularity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt; - This includes databases, servers, logging frameworks, application monitoring, the recommendation engine, background services, etc. When the user loads the Netflix app, all requests are handled by the backend servers in AWS: login, recommendations, the home page, user history, billing, and customer support. Some of the technologies behind these backend services include AWS EC2 instances, AWS S3, AWS DynamoDB, Cassandra, Hadoop, and Kafka.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Backend Architecture
&lt;/h2&gt;

&lt;p&gt;Netflix is one of the major drivers of microservices architecture. Every component of their system is a collection of loosely coupled services that collaborate. The microservice architecture enables the rapid, frequent, and reliable delivery of large, complex applications. The figure below is an overview of the backend architecture.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzh1rz679blx2rlhm9da.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzh1rz679blx2rlhm9da.jpeg" alt="backend" width="720" height="540"&gt;&lt;/a&gt;&lt;/p&gt;
Backend Architecture



&lt;ol&gt;
&lt;li&gt;The Client sends a Play request to a Backend running on AWS. Netflix routes traffic to its services using Amazon's Elastic Load Balancer (ELB) service.&lt;/li&gt;
&lt;li&gt;AWS ELB will forward that request to the API Gateway Service. Netflix uses Zuul as its API gateway, which is built to allow dynamic routing, traffic monitoring, security, and resilience to failures at the edge of cloud deployment.&lt;/li&gt;
&lt;li&gt;The application API component is the core business logic behind Netflix's operations. Several types of API correspond to different user activities, such as the Signup API and the Discovery/Recommendation API for retrieving video recommendations. In this scenario, the forwarded request from the API Gateway Service is handled by the Play API.&lt;/li&gt;
&lt;li&gt;Play API will call a microservice or a sequence of microservices to fulfill the request. &lt;/li&gt;
&lt;li&gt;Microservices are mostly small, stateless programs. Thousands of these services communicate with each other. &lt;/li&gt;
&lt;li&gt;Microservices can save or get data from a data store during this process.&lt;/li&gt;
&lt;li&gt;Microservices can send events to track user activities or other data to the Stream Processing Pipeline for either real-time processing of personalized recommendations or batch processing of business intelligence tasks.&lt;/li&gt;
&lt;li&gt;The Stream Processing Pipeline data can be persisted to other data stores such as AWS S3, Hadoop HDFS, Cassandra, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  3. Backend Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open Connect
&lt;/h3&gt;

&lt;p&gt;Open Connect handles everything that happens after you hit play on a video. This system is responsible for streaming video to your device. The following diagram illustrates how the playback process works.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdg8io3upt0ioogjvz6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdg8io3upt0ioogjvz6v.png" alt="PlayBlackProcess" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
Open Connect Design Image



&lt;ol&gt;
&lt;li&gt;OCAs ping AWS instances to report their health, the routes they have learned, and the files they have on them.&lt;/li&gt;
&lt;li&gt;A user on a client device requests playback of a title (TV show or movie) from the Netflix application in AWS.&lt;/li&gt;
&lt;li&gt;The Netflix playback service checks the user's authorization, permissions, and licensing, then chooses which files to serve the client, taking into account the current network speed and client resolution.&lt;/li&gt;
&lt;li&gt;The steering service picks the OCA from which the files should be served, generates URLs for these OCAs, and hands them back to the playback service.&lt;/li&gt;
&lt;li&gt;The playback service hands over the URLs of the OCA to the client, and the client requests the video files from that OCA.&lt;/li&gt;
&lt;/ol&gt;
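
&lt;p&gt;The steering step (pick an OCA, generate a URL, hand it back) can be sketched under simplified assumptions; the OCA names, health reports, distance-based ranking, and URL shape below are all invented for illustration:&lt;/p&gt;

```python
# Hypothetical OCA health reports, as pinged back to AWS in step 1.
OCAS = [
    {"name": "oca-lagos", "healthy": True,  "files": {"title-42"}, "distance_km": 12},
    {"name": "oca-paris", "healthy": True,  "files": {"title-42"}, "distance_km": 4700},
    {"name": "oca-accra", "healthy": False, "files": {"title-42"}, "distance_km": 450},
]

def steer(title_id):
    """Pick the closest healthy OCA that actually holds the requested file."""
    candidates = [o for o in OCAS if o["healthy"] and title_id in o["files"]]
    if not candidates:
        return None  # no OCA can serve this title right now
    best = min(candidates, key=lambda o: o["distance_km"])
    # Invented URL shape, standing in for the real generated OCA URLs.
    return "https://" + best["name"] + ".example.net/" + title_id

print(steer("title-42"))  # https://oca-lagos.example.net/title-42
```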

&lt;h2&gt;
  
  
  Zuul 2 - API Gateway
&lt;/h2&gt;

&lt;p&gt;Netflix uses Amazon's Elastic Load Balancer (ELB) service to route traffic to services. ELBs are set up such that the load is balanced across zones first, then instances.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb4u8c24qg9ftca8qgp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb4u8c24qg9ftca8qgp9.png" alt="AmazonELB" width="588" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
Amazon Elastic Load Balancer



&lt;p&gt;This load balancer routes requests to the API gateway service; Netflix uses Zuul as its API gateway; it handles all the requests and performs the dynamic routing of microservice applications. It works as a front door for all the requests.&lt;/p&gt;

&lt;p&gt;For example, /api/products is mapped to the product service, and /api/user is mapped to the user service. The Zuul server dynamically routes the requests to the respective backend applications. Zuul provides a range of filter types that allow engineers to quickly and nimbly apply functionality to the edge service. &lt;/p&gt;
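
&lt;p&gt;That path-to-service mapping can be sketched as a longest-prefix routing table (the service names are invented for illustration):&lt;/p&gt;

```python
# Route table: each path prefix maps to the backend service that handles it.
ROUTES = {
    "/api/products": "product-service",
    "/api/user": "user-service",
    "/api/play": "play-api",
}

def route(path):
    """Pick the longest registered prefix that matches the request path."""
    matches = [p for p in ROUTES if path.startswith(p)]
    if not matches:
        return "not-found"
    best = max(matches, key=len)  # longest prefix wins
    return ROUTES[best]

print(route("/api/products/123"))  # product-service
```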

&lt;blockquote&gt;The Cloud Gateway team at Netflix runs and operates more than 80 clusters of Zuul 2, sending traffic to about 100 (and growing) backend service clusters which amount to more than 1 million requests per second. 
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://netflixtechblog.com/open-sourcing-zuul-2-82ea476cb2b3" rel="noopener noreferrer"&gt;open-sourcing-zuul-2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkjl4fb94l51dx0qsbj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkjl4fb94l51dx0qsbj3.png" alt="zuul2" width="800" height="661"&gt;&lt;/a&gt;&lt;/p&gt;
Zuul Architecture



&lt;p&gt;The Netty handlers on the front and back of the filters are mainly responsible for handling the network protocol, web server, connection management, and proxying work. With those inner workings abstracted away, the filters do all the heavy lifting. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The inbound filters run before proxying the request and can be used for authentication, routing, or decorating the request. &lt;/li&gt;
&lt;li&gt;The endpoint filters can either return a static response or proxy the request to the backend service. &lt;/li&gt;
&lt;li&gt;The outbound filters run after a response has been returned and can be used to add or remove custom headers or metrics.&lt;/li&gt;
&lt;/ul&gt;
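
&lt;p&gt;The inbound / endpoint / outbound pipeline can be sketched as three ordered filter stages applied around a proxy call. The filters here are plain functions operating on a request dictionary, a heavy simplification of Zuul's real filter classes:&lt;/p&gt;

```python
def auth_filter(request):
    # inbound: reject requests that carry no token
    if "token" not in request:
        request["status"] = 401
    return request

def proxy_endpoint(request):
    # endpoint: forward to the backend unless an inbound filter already failed it
    if request.get("status") != 401:
        request["status"] = 200
        request["body"] = "backend response"
    return request

def metrics_filter(request):
    # outbound: decorate the response after it has been produced
    request["served-by"] = "zuul-sketch"
    return request

def handle(request, inbound, endpoint, outbound):
    """Run every filter stage in order over the request."""
    for f in inbound + [endpoint] + outbound:
        request = f(request)
    return request

resp = handle({"path": "/api/play", "token": "abc"},
              [auth_filter], proxy_endpoint, [metrics_filter])
print(resp["status"], resp["served-by"])  # 200 zuul-sketch
```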

&lt;p&gt;The Zuul 2 API gateway forwards the request to the appropriate Application API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application API
&lt;/h2&gt;

&lt;p&gt;Currently, the Application APIs are defined under three categories: the &lt;strong&gt;Signup API&lt;/strong&gt; for non-member requests such as sign-up, billing, and free trials; the &lt;strong&gt;Discovery API&lt;/strong&gt; for search and recommendation requests; and the &lt;strong&gt;Play API&lt;/strong&gt; for streaming, view-licensing, and similar requests. When a user clicks sign up, for example, Zuul routes the request to the Signup API. &lt;/p&gt;

&lt;p&gt;Consider an already subscribed user. Suppose the user clicks play on the latest episode of Peaky Blinders; the request is routed to the Play API. The API, in turn, calls several microservices under the hood. Some of these calls can be made in parallel because they don’t depend on each other; others have to be sequenced in a specific order. The API contains all the logic to sequence and parallelize the calls as necessary, so the device doesn’t need to know anything about the orchestration that goes on under the hood when the customer clicks “play.”&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk298u17sizjb8ib7xg65.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk298u17sizjb8ib7xg65.jpeg" alt="Api archi" width="474" height="510"&gt;&lt;/a&gt;&lt;/p&gt;
Netflix API Architecture
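&lt;p&gt;This fan-out, running independent calls in parallel while sequencing dependent ones, can be sketched with Python's asyncio. The service names and dependency order here are hypothetical:&lt;/p&gt;

```python
import asyncio

# Sketch of API-layer orchestration: independent calls run concurrently,
# dependent calls run in sequence. Service names are made up for illustration.

async def fetch(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a network call
    return name + "-data"

async def play(title_id: str) -> dict:
    # These two calls don't depend on each other, so run them in parallel.
    license_, subtitles = await asyncio.gather(
        fetch("license"), fetch("subtitles")
    )
    # The stream URL (hypothetically) depends on the license, so it runs after.
    stream = await fetch("stream")
    return {"title": title_id, "license": license_,
            "subtitles": subtitles, "stream": stream}

result = asyncio.run(play("peaky-blinders-s06e01"))
print(result["stream"])  # stream-data
```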



&lt;p&gt;Signup requests map to signup backend services, Playback requests, with some exceptions, map only to playback backend services, and similarly, discovery APIs map to discovery services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hystrix- Distributed API Services Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://netflix.github.io/titus/" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ry3q3t2vizu7o55ixg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ry3q3t2vizu7o55ixg3.png" alt="Hystrix" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;
Hystrix Architecture



&lt;p&gt;In any distributed environment with many dependencies, some service dependencies will inevitably fail. Monitoring the health and state of every service becomes unmanageable as more services are stood up and others are taken down or simply break. Hystrix helps by providing a user-friendly dashboard. The &lt;strong&gt;Hystrix library&lt;/strong&gt; controls the interaction between these distributed services by adding latency-tolerance and fault-tolerance logic. &lt;/p&gt;

&lt;p&gt;Consider this example from Netflix: a microservice provides a tailored list of movies to the user. If that service fails, traffic is rerouted around the failure to a vanilla microservice that simply returns the top 10 family-friendly movies. This safe failover is a classic example of circuit breaking.&lt;/p&gt;
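&lt;p&gt;The failover pattern above is circuit breaking in miniature. Here is a minimal sketch (not the Hystrix API) with a hypothetical flaky recommender and a top-10 fallback:&lt;/p&gt;

```python
# Minimal circuit breaker with a fallback, in the spirit of Hystrix
# (a sketch, not the Hystrix API). After `max_failures` consecutive
# failures, the circuit opens and the fallback is served directly.

class CircuitBreaker:
    def __init__(self, func, fallback, max_failures=3):
        self.func, self.fallback = func, fallback
        self.max_failures, self.failures = max_failures, 0

    def call(self, *args):
        if self.failures >= self.max_failures:   # circuit open
            return self.fallback(*args)
        try:
            result = self.func(*args)
            self.failures = 0                    # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

def personalized_titles(user):
    # Hypothetical flaky recommendation service.
    raise RuntimeError("recommender down")

def top_10_family_titles(user):
    # Simple, reliable fallback list.
    return ["title-%d" % i for i in range(10)]

breaker = CircuitBreaker(personalized_titles, top_10_family_titles)
print(len(breaker.call("alice")))  # 10 (fallback served)
```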

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;/p&gt;
&lt;blockquote&gt;
&lt;em&gt;Netflix Hystrix is no longer in active development and is currently in maintenance mode. Some internal projects are now being built with Resilience4j.&lt;/em&gt;&lt;br&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;a href="https://github.com/resilience4j/resilience4j" rel="noopener noreferrer"&gt;https://github.com/resilience4j/resilience4j&lt;/a&gt;
&lt;h3&gt;
  
  
  Titus- Container Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://netflix.github.io/titus/" rel="noopener noreferrer"&gt;Titus&lt;/a&gt;&lt;br&gt;
Titus is a container management platform that provides scalable and reliable container execution and cloud-native integration with Amazon AWS.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k22xkjv5ed0vbc69oay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k22xkjv5ed0vbc69oay.png" alt="titus-arch" width="351" height="334"&gt;&lt;/a&gt;&lt;/p&gt;
Titus Architecture



&lt;p&gt;It is a framework on top of Apache Mesos, a cluster management system that brokers available resources across a fleet of machines.&lt;br&gt;
Titus is run in production at Netflix, managing thousands of AWS EC2 instances and launching hundreds of thousands of containers daily for both batch and service workloads. Just think of it as the Netflix version of Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Titus runs about 3 million containers per week.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Datastores
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EVCache
&lt;/h3&gt;

&lt;p&gt;A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer. Trading off capacity for speed, a cache typically stores a subset of data transiently.&lt;br&gt;
&lt;a href="https://github.com/Netflix/EVCache" rel="noopener noreferrer"&gt;EVCache&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two common use cases for caching are to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide fast access to frequently used data.&lt;/li&gt;
&lt;li&gt;Provide fast access to computed (memoized) data. Netflix's microservices rely on caches for fast, reliable access to multiple types of data like a member’s viewing history, ratings, and personalized recommendations. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e65uku3fk6jvgpjdsfp.png" alt="EVCache" width="800" height="524"&gt;EVCache Diagram

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EVCache is a Memcached and spymemcached-based caching solution mainly used for caching frequently used data on AWS EC2 infrastructure.&lt;br&gt;
EVCache is an abbreviation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral - The data stored is for a short duration as specified by its TTL (Time To Live).&lt;/li&gt;
&lt;li&gt;Volatile - The data can disappear at any time (Evicted).&lt;/li&gt;
&lt;li&gt;Cache - An in-memory key-value store.&lt;/li&gt;
&lt;/ul&gt;
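&lt;p&gt;Those three properties can be illustrated with a tiny TTL-based key-value store. This is an illustrative sketch, not EVCache's actual implementation:&lt;/p&gt;

```python
import time

# Sketch of an ephemeral, volatile, in-memory key-value cache: every entry
# carries a TTL, and entries can be evicted at any time. Illustrative only,
# not how EVCache is implemented.

class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expiry_timestamp)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, expiry = item
        if time.monotonic() >= expiry:   # ephemeral: TTL expired
            del self._store[key]
            return default
        return value

    def evict(self, key):
        # Volatile: data may disappear at any time.
        self._store.pop(key, None)

cache = TTLCache()
cache.set("viewing-history:alice", ["ep1", "ep2"], ttl_seconds=0.05)
print(cache.get("viewing-history:alice"))  # ['ep1', 'ep2']
time.sleep(0.06)
print(cache.get("viewing-history:alice"))  # None (expired)
```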

&lt;h3&gt;
  
  
  SSDs for Caching
&lt;/h3&gt;

&lt;p&gt;Traditionally, caching is done in RAM. Storing large amounts of data in RAM is expensive, so Netflix decided to move some cached data to SSDs. &lt;/p&gt;

&lt;blockquote&gt; Modern disk technologies based on SSD are providing fast access to data but at a much lower cost when compared to RAM. The cost to store 1 TB of data on SSD is much lower than storing the same amount using RAM.
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://netflixtechblog.com/evolution-of-application-data-caching-from-ram-to-ssd-a33d6fa7a690" rel="noopener noreferrer"&gt;Evolution of application Data Caching&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL
&lt;/h3&gt;

&lt;p&gt;Netflix runs MySQL on AWS EC2 instances for its billing infrastructure. The billing infrastructure is responsible for managing members' billing state: keeping track of open/paid billing periods, the amount of credit on a member’s account, the member's payment status, initiating charge requests, and the date through which the member has paid.&lt;/p&gt;

&lt;p&gt;The payment processor needed the ACID capabilities of an RDBMS to process charge transactions.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cdd1qpqbl3ckw6ebyn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cdd1qpqbl3ckw6ebyn4.png" alt="Netflix Datastore" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;
Netflix Datastore



&lt;h3&gt;
  
  
  Apache Cassandra
&lt;/h3&gt;

&lt;p&gt;Cassandra is a free and open-source distributed wide-column store. The NoSQL database is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.&lt;/p&gt;

&lt;p&gt;Netflix uses Cassandra for its scalability, lack of single points of failure, and cross-regional deployments. In effect, a single global Cassandra cluster can simultaneously service applications and asynchronously replicate data across multiple geographic locations.&lt;/p&gt;

&lt;p&gt;Netflix stores all kinds of data across its Cassandra DB instances, including all user-collected event metrics. &lt;/p&gt;

&lt;p&gt;As user data began to increase, a more efficient way to manage data storage was needed. Netflix redesigned its data storage architecture with two main goals in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller Storage Footprint.&lt;/li&gt;
&lt;li&gt;Consistent Read/Write Performance as viewing per member grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution to the large-data problem was to compress the older rows. The data was divided into two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live Viewing History (LiveVH): Small number of recent viewing records with frequent updates. The data is stored in uncompressed form.&lt;/li&gt;
&lt;li&gt;Compressed Viewing History (CompressedVH): A large number of older viewing records with rare updates. The data is compressed to reduce the storage footprint. Compressed viewing history is stored in a single column per row key.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w132e8676mx2xzmti5x.png" alt="CompressedVH" width="720" height="407"&gt;Compressed Viewing History

&lt;/li&gt;
&lt;/ul&gt;
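&lt;p&gt;The LiveVH/CompressedVH split can be sketched as rolling older records into a single compressed blob. The record shape and roll-over policy below are assumptions for illustration:&lt;/p&gt;

```python
import json, zlib

# Sketch of the LiveVH / CompressedVH idea: recent records stay
# uncompressed for frequent updates; older records are packed into a
# single compressed blob to shrink the storage footprint.
# Record shape and the keep-recent policy are made up for illustration.

def roll_over(live_records, keep_recent=2):
    """Move all but the most recent records into a compressed blob."""
    old, live = live_records[:-keep_recent], live_records[-keep_recent:]
    blob = zlib.compress(json.dumps(old).encode())
    return live, blob

def read_compressed(blob):
    return json.loads(zlib.decompress(blob).decode())

history = [{"title": "t%d" % i, "sec": 60 * i} for i in range(1000)]
live, blob = roll_over(history)
print(len(live))                                    # 2
print(len(json.dumps(history[:-2])) > len(blob))    # True: blob is smaller
print(read_compressed(blob)[0])                     # {'title': 't0', 'sec': 0}
```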

&lt;h2&gt;
  
  
  Stream Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;Did you know that Netflix personalizes movie artwork just for you? You might be surprised to learn the image shown for each video is selected specifically for you. Not everyone sees the same image.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zbpyp23tozgvhsng18z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zbpyp23tozgvhsng18z.jpeg" alt="stranger-things" width="500" height="293"&gt;&lt;/a&gt;&lt;br&gt;
Netflix tries to select artwork that highlights the most relevant aspect of a video based on the data it has learned about you, such as your viewing history and interests.&lt;/p&gt;

&lt;p&gt;Stream Processing Data Pipeline has become Netflix’s data backbone of business analytics and personalized recommendation tasks. It is responsible for producing, collecting, processing, aggregating, and moving all microservice events to other data processors in near real-time.&lt;/p&gt;

&lt;blockquote&gt;
Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of Kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/streaming-data/" rel="noopener noreferrer"&gt;AWS- What is streaming Data?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows and used for a wide variety of analytics, including correlations, aggregations, filtering, and sampling. &lt;/p&gt;
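&lt;p&gt;Record-by-record processing over a sliding time window can be sketched as follows; counting plays per title over the last 60 seconds is a made-up example of such an aggregation:&lt;/p&gt;

```python
from collections import deque

# Sketch of record-by-record stream processing over a sliding time window:
# count events per title within the last `window_seconds`. The play-count
# example is illustrative, not Netflix's actual pipeline.

class SlidingWindowCounter:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, title), oldest first

    def add(self, timestamp, title):
        self.events.append((timestamp, title))
        # Evict events strictly older than the window start.
        while self.events and timestamp - self.window > self.events[0][0]:
            self.events.popleft()

    def counts(self):
        tally = {}
        for _, title in self.events:
            tally[title] = tally.get(title, 0) + 1
        return tally

w = SlidingWindowCounter(window_seconds=60)
for t, title in [(0, "a"), (10, "b"), (30, "a"), (70, "a")]:
    w.add(t, title)
print(w.counts())  # {'b': 1, 'a': 2}  (the event at t=0 fell out of the window)
```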

&lt;p&gt;Information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering/billing), server activity, website clicks, and the geo-location of devices, people, and physical goods, and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment on their brands and products by continuously analyzing social media streams and responding promptly as necessary.&lt;/p&gt;

&lt;p&gt;The stream processing platform processes trillions of events and petabytes of data per day.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfv8wddthnk9m37uf628.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfv8wddthnk9m37uf628.png" alt="streaming" width="562" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Viewing History Service captures all the videos played by members. Beacon is another service that captures all impression events and user activities within Netflix. All the data collected by the Viewing History and Beacon services is sent to Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Kafka- Analyzing Streaming Data
&lt;/h3&gt;

&lt;p&gt;Kafka is open-source software that provides a framework for storing, reading, and analyzing streaming data.&lt;/p&gt;

&lt;blockquote&gt;Netflix embraces Apache Kafka® as the de-facto standard for its eventing, messaging, and stream processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio-wide communications. 
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.confluent.io/blog/how-kafka-is-used-by-netflix/" rel="noopener noreferrer"&gt;How Netflix uses Kafka&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78gmjgardflv97rrwxyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78gmjgardflv97rrwxyj.png" alt="Kafka" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Chukwa- Analyzing Streaming Data
&lt;/h3&gt;

&lt;p&gt;Apache Chukwa is an open-source data collection system for collecting logs and events from distributed systems. It is built on top of HDFS and the MapReduce framework, inheriting Hadoop’s scalability and robustness, and it includes powerful, flexible toolkits for displaying, monitoring, and analyzing data. Chukwa collects events from different parts of the system; from there, you can run monitoring and analysis or use the dashboard to view the events. Chukwa writes events to S3 in the Hadoop sequence file format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Spark - Analyzing Streaming Data
&lt;/h3&gt;

&lt;p&gt;Netflix uses Apache Spark and Machine learning to recommend movies. Apache Spark is an open-source unified analytics engine for large-scale data processing. &lt;/p&gt;

&lt;p&gt;On a live user request, the aggregated play popularity (how many times a video is played) and take rate (the fraction of play events over impression events for a given video), along with other explicit signals such as members’ viewing history and past ratings, are used to compute personalized content for the user. The following figure shows the end-to-end infrastructure for building user movie recommendations.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3335yl4ojj3h2fyn95hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3335yl4ojj3h2fyn95hx.png" alt="data-processor-engine" width="619" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
Data Processing Engine
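&lt;p&gt;Play popularity and take rate are simple aggregates over the event stream. Here is a minimal sketch with made-up events:&lt;/p&gt;

```python
# Sketch of the two aggregate signals mentioned above:
# play popularity = number of plays per title, and
# take rate = plays / impressions per title.
# The event stream below is made up for illustration.

def aggregate(events):
    plays, impressions = {}, {}
    for event in events:
        bucket = plays if event["type"] == "play" else impressions
        bucket[event["title"]] = bucket.get(event["title"], 0) + 1
    take_rate = {
        title: plays.get(title, 0) / shown
        for title, shown in impressions.items()
    }
    return plays, take_rate

events = (
    [{"type": "impression", "title": "A"}] * 4
    + [{"type": "play", "title": "A"}] * 1
    + [{"type": "impression", "title": "B"}] * 2
    + [{"type": "play", "title": "B"}] * 2
)
popularity, take_rate = aggregate(events)
print(popularity)   # {'A': 1, 'B': 2}
print(take_rate)    # {'A': 0.25, 'B': 1.0}
```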



&lt;h3&gt;
  
  
  Elasticsearch - Error Logging and Monitoring
&lt;/h3&gt;

&lt;p&gt;Netflix uses Elasticsearch for data visualization, customer support, and error detection in the system.&lt;/p&gt;

&lt;p&gt;Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.&lt;/p&gt;

&lt;p&gt;With Elasticsearch, they can easily monitor the state of the system and troubleshoot error logs and failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article provides a detailed analysis of the Netflix Backend architecture. To test your knowledge of Netflix system design, try out this interactive &lt;a href="https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/quiz-on-netflixs-design" rel="noopener noreferrer"&gt;Quiz on Netflix's Design&lt;/a&gt;. For more information, refer to the references in the section below. &lt;/p&gt;

&lt;p&gt;If you're looking for a detailed guide to system design, check out this &lt;a href="https://www.educative.io/blog/how-to-prepare-system-design-interview" rel="noopener noreferrer"&gt;system design interview prep&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Follow me here and on &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; for more content like this. &lt;/p&gt;

&lt;h2&gt;
  
  
  REFERENCES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://netflix.github.io/titus/overview/" rel="noopener noreferrer"&gt;Titus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/open-sourcing-zuul-2-82ea476cb2b3" rel="noopener noreferrer"&gt;Open Sourcing Zuul&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openconnect.netflix.com/Open-Connect-Overview.pdf" rel="noopener noreferrer"&gt;Open Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nexsoftsys.com/articles/how-netflix-backend-system-operates.html" rel="noopener noreferrer"&gt;How Netflix Backend Operates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/swlh/a-design-analysis-of-cloud-based-microservices-architecture-at-netflix-98836b2da45f" rel="noopener noreferrer"&gt;A Design Analysis Of Cloud-Based Microservices Architecture At Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@narengowda/netflix-system-design-dbec30fede8d" rel="noopener noreferrer"&gt;Netflix System Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://riteeksrivastava.medium.com/hystrix-what-is-it-and-why-is-it-used-f84614c8df5e" rel="noopener noreferrer"&gt;Hystrix What is it and Why is it Used&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/engineering-trade-offs-and-the-netflix-api-re-architecture-64f122b277dd" rel="noopener noreferrer"&gt;Engineering Tradeoffs and the Netflix API Re-architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/application-data-caching-using-ssds-5bf25df851ef" rel="noopener noreferrer"&gt;Application Data Caching Using ssds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/evolution-of-application-data-caching-from-ram-to-ssd-a33d6fa7a690" rel="noopener noreferrer"&gt;Evolution Of Application Data Caching From Ram To ssd&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/netflix-billing-migration-to-aws-451fba085a4" rel="noopener noreferrer"&gt;Netflix Billing Migration To Aws&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/optimizing-the-netflix-api-5c9ac715cf19" rel="noopener noreferrer"&gt;Optimizing The Netflix Api&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/whats-trending-on-netflix-f00b4b037f61" rel="noopener noreferrer"&gt;What's Trending On Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/evolution-of-the-netflix-data-pipeline-da246ca36905" rel="noopener noreferrer"&gt;Evolution Of The Netflix Data Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.javatpoint.com/zuul-api-gateway" rel="noopener noreferrer"&gt;Zuul API Gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/CAP_theorem" rel="noopener noreferrer"&gt;CAP_theorem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://highscalability.com/blog/2017/12/11/netflix-what-happens-when-you-press-play.html?currentPage=2" rel="noopener noreferrer"&gt;What Happens When You Press Play&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/scaling-time-series-data-storage-part-i-ec2b6d44ba39" rel="noopener noreferrer"&gt;Scaling Time Series Data Storage Part1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/system-design-netflix-a-complete-architecture/" rel="noopener noreferrer"&gt;Netflix-High-Level-System-Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/blog/how-kafka-is-used-by-netflix/" rel="noopener noreferrer"&gt;How-Kafka-is-used-by-Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fortune.com/2018/10/02/netflix-consumes-15-percent-of-global-internet-bandwidth/#:~:text=When%20it%20comes%20to%20devouring,to%2019.1%25%20of%20total%20traffic." rel="noopener noreferrer"&gt;Netflix Consumes 15% of the World's Internet Bandwidth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.educative.io/blog/top-10-system-design-interview-questions" rel="noopener noreferrer"&gt;Top 14 System Design interview questions for software engineers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/quiz-on-netflixs-design" rel="noopener noreferrer"&gt;Quiz on Netflix System Design&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>openconnect</category>
      <category>systemdesign</category>
      <category>netflixbackend</category>
      <category>aws</category>
    </item>
    <item>
      <title>Netflix System Design- How Netflix Onboards New Content</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Tue, 15 Jun 2021 16:49:43 +0000</pubDate>
      <link>https://forem.com/gbengelebs/netflix-system-design-how-netflix-onboards-new-content-2dlb</link>
      <guid>https://forem.com/gbengelebs/netflix-system-design-how-netflix-onboards-new-content-2dlb</guid>
      <description>&lt;blockquote&gt;In the battle for the Game of Attention, content is the two-edged sword and user experience is the horse that leads it to battle.
&lt;/blockquote&gt;

&lt;p&gt;The streaming wars are in full blast, with Netflix positioned for continued dominance: over 50,000 individual titles and 200 million subscribers in 180+ countries. Even more impressive is the underlying technology powering this growth. In this series, my goal is to scratch more than the surface of how Netflix operates, digging deeper into the technicalities of the end-to-end processes involved in delivering content at scale. In this maiden edition of the series, I focus on the Netflix content onboarding system and Open Connect.&lt;br&gt;
This diagram shows an overview of the system we will be discussing in this article.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2gmvjmu55vflk3a5wpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2gmvjmu55vflk3a5wpa.png" alt="Netflixsysdesign" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Content Content Content.
&lt;/h2&gt;

&lt;p&gt;Netflix has a combined library of over 50,000 titles and supports over 2,200 devices, each with its own resolution and network speed. To serve that many devices at different network speeds, Netflix needs the original video in different formats. Netflix receives video from the production houses in the best possible format, and those files are large, very large: for a commercial Blu-ray two-hour movie, you're looking at 15-25 GB. Serving users files of that size would consume enormous data and bandwidth, so Netflix performs a series of preprocessing steps on the original videos to convert them into different file formats. This preprocessing is referred to as encoding and transcoding. &lt;/p&gt;

&lt;p&gt;Encoding is the process of compressing video and audio files to be compatible with a single target device. Transcoding, on the other hand, converts already-encoded data to another encoding format (MP4, WMV, MOV, MPEG-4). This process is particularly useful when users have multiple target devices, such as different mobile phones and web browsers, that do not all support the same native formats or have limited storage capacity.&lt;br&gt;
The reasons for this preprocessing are fairly simple.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce file size.&lt;/li&gt;
&lt;li&gt;Reduce buffering for streaming video.&lt;/li&gt;
&lt;li&gt;Change resolution or aspect ratio.&lt;/li&gt;
&lt;li&gt;Change audio format or quality.&lt;/li&gt;
&lt;li&gt;Convert obsolete files to modern formats.&lt;/li&gt;
&lt;li&gt;Make a video compatible with a certain device (computer, tablet, smartphone, smartTV, legacy devices).&lt;/li&gt;
&lt;li&gt;Make a video compatible with certain software or service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compressing a 25 GB movie takes a lot of time. To solve this problem, Netflix breaks the original video into smaller chunks and, using parallel workers on AWS EC2, performs encoding and transcoding on these chunks, converting them into different formats (MP4, MOV, etc.) across different resolutions (4K, 1080p, and more).&lt;/p&gt;
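&lt;p&gt;The chunk-and-parallelize idea can be sketched as follows; the "encoding" step here is a stand-in for a real codec, and the chunk size is arbitrary:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the chunk-and-parallelize idea: split a (pretend) source file
# into fixed-size chunks and "encode" each chunk with parallel workers.
# The encode step is a stand-in, not a real video codec.

CHUNK_SIZE = 4  # bytes per chunk here; real chunks are seconds of video

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def encode_chunk(chunk: bytes, fmt: str) -> str:
    # Stand-in for encoding/transcoding one chunk into one target format.
    return fmt + ":" + chunk.hex()

source = b"raw-video-bytes!"
chunks = split_into_chunks(source)
with ThreadPoolExecutor(max_workers=4) as pool:
    encoded = list(pool.map(lambda c: encode_chunk(c, "mp4"), chunks))

print(len(chunks))   # 4
print(encoded[0])    # mp4:7261772d
```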

&lt;p&gt;Netflix also creates multiple replicas of the same video chunk, about 1,300 per chunk, to cater to different network speeds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High quality: 4K, 1080p, 720p, 360p&lt;/li&gt;
&lt;li&gt;Medium quality: 4K, 1080p, 720p, 360p&lt;/li&gt;
&lt;li&gt;Low quality: 4K, 1080p, 720p, 360p&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Netflix stores all this processed video data on Amazon S3, a highly scalable and available storage platform for static data. Each video file is stored in &lt;em&gt;chunks of scenes&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;Netflix uses AWS for nearly all its computing and storage needs, including databases, analytics, recommendation engines, video transcoding, and more—hundreds of functions that in total use more than 100,000 server instances on AWS.
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/solutions/case-studies/netflix/#:~:text=Netflix%20on%20AWS&amp;amp;text=Netflix%20uses%20AWS%20for%20nearly,100%2C000%20server%20instances%20on%20AWS" rel="noopener noreferrer"&gt;https://aws.amazon.com/solutions/case-studies/netflix&lt;/a&gt;&lt;br&gt;
By default, most chunks are 10 seconds of video. Each time you skip to a different part of a movie, you are querying Netflix's Playback API for different video chunks. You usually have one chunk actively playing, one chunk ready to play when the current one is done, and one chunk downloaded. This delivers a seamless watch experience to the user at the best possible speed at that moment.&lt;/p&gt;

&lt;p&gt;The speed at which these chunks arrive determines the bitrate (the number of bits used per second) of the following chunks.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5q1gfcu33do0bogozcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5q1gfcu33do0bogozcx.png" alt="Bitrate" width="227" height="222"&gt;&lt;/a&gt;&lt;br&gt;
If the first one takes longer than the chunk’s playback duration, it grabs the next lower bitrate chunk for the following one. If it comes faster, it will go for a higher resolution chunk if one is available. That's why sometimes your video resolution can be grainy at first and then adjust to a better definition.&lt;/p&gt;
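&lt;p&gt;This step-up/step-down rule can be sketched as a function of how long the last chunk took to download versus its playback duration. The quality ladder below is illustrative, not Netflix's actual ladder:&lt;/p&gt;

```python
# Sketch of the adaptive-bitrate rule described above: if a chunk took
# longer than its playback duration to arrive, step down one quality level;
# if it arrived faster, step up when possible. Ladder values are illustrative.

LADDER = ["360p", "720p", "1080p", "4k"]  # lowest to highest quality

def next_quality(current: str, download_secs: float, playback_secs: float) -> str:
    i = LADDER.index(current)
    if download_secs > playback_secs and i > 0:
        return LADDER[i - 1]   # too slow: step down
    if playback_secs > download_secs and i != len(LADDER) - 1:
        return LADDER[i + 1]   # headroom: step up
    return current

print(next_quality("1080p", download_secs=12.0, playback_secs=10.0))  # 720p
print(next_quality("720p",  download_secs=4.0,  playback_secs=10.0))  # 1080p
```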

&lt;p&gt;Netflix has users in over 200 countries. If a user in Nigeria wants to watch a movie hosted on an Amazon instance in America, the request has to travel to America to reach those servers, which takes time and bandwidth. In an era where fractional delays in serving content can lead to declines in revenue, it is paramount that users get access to content fast. Netflix solved this problem by placing mini servers, known as Open Connect, inside ISPs and IXPs (internet exchange points). These boxes are capable of storing 280 terabytes of data. After the videos have been compressed and transcoded, they are transferred from Amazon S3 to these Open Connect boxes during an off-peak period (say, 4 am). When the user requests a video, instead of hitting Netflix servers in the US directly, the request hits the Open Connect boxes. This means less bandwidth, faster playtime, and ultimately a better user experience. Content can also be localized by region: the Open Connect appliances in Nigeria and Brazil could store different movies. About 90% of Netflix content is served this way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Connect
&lt;/h2&gt;

&lt;p&gt;A CDN is a geographically distributed group of servers that work together to deliver internet content quickly. Open Connect is Netflix’s custom global content delivery network (CDN).&lt;br&gt;
Everything that happens after you hit play on a video is handled by Open Connect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Connect stores Netflix video in different locations throughout the world. When you press play, the video streams from Open Connect to your device.&lt;/li&gt;
&lt;li&gt;Netflix has been in partnership with Internet Service Providers (ISPs) and Internet Exchange Points (IXs or IXPs) around the world to deploy specialized devices called Open Connect Appliances (OCAs) inside their network.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rgl0964x0q8bt8lyq7o.png" alt="Open Connect" width="800" height="427"&gt;Open Connect Design Image from &lt;a href="https://openconnect.netflix.com/Open-Connect-Overview.pdf" rel="noopener noreferrer"&gt;Netflix&lt;/a&gt;

&lt;/li&gt;
&lt;li&gt;These servers periodically report health metrics, the optimal routes they have learned from IXP/ISP networks, and which videos they store on their SSD disks to the Open Connect control plane services on AWS.&lt;/li&gt;
&lt;li&gt;When new video files have been transcoded successfully and stored on AWS S3, the control plane services on AWS transfer these files to OCA servers at IXP sites. These OCA servers then apply cache fill to transfer the files to OCA servers at ISP sites within their sub-networks.&lt;/li&gt;
&lt;li&gt;Once an OCA server has successfully stored the video files, it can start a peer fill to copy those files to other OCA servers within the same site if needed.&lt;/li&gt;
&lt;li&gt;Between two different sites that can see each other's IP addresses, the OCAs can apply the tier fill process instead of a regular cache fill.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7izxd85vxw5dyiumvwtm.png" alt="Open Connect transferring content" width="800" height="608"&gt;Open Connect transferring content, from &lt;a href="https://openconnect.netflix.com/Open-Connect-Overview.pdf" rel="noopener noreferrer"&gt;Netflix&lt;/a&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netflix performs compression and transcoding of the original video file.&lt;/li&gt;
&lt;li&gt;The file is split into chunks using parallel workers in AWS for faster processing.&lt;/li&gt;
&lt;li&gt;Each chunk is encoded at several resolutions and bitrates to suit different internet speeds.&lt;/li&gt;
&lt;li&gt;These chunks are stored on Amazon S3.&lt;/li&gt;
&lt;li&gt;During off-peak periods, the files are transferred to Open Connect boxes, Netflix's custom content delivery network spread out across the world.&lt;/li&gt;
&lt;li&gt;These boxes are capable of communicating and sharing content between themselves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next part of the series, I will be explaining Netflix's core backend architecture. &lt;br&gt;
Follow me here and across my social media for more content like this: &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>netflix</category>
      <category>systemdesign</category>
      <category>openconnect</category>
      <category>aws</category>
    </item>
    <item>
      <title>Introduction To Solidity</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Mon, 07 Jun 2021 16:18:18 +0000</pubDate>
      <link>https://forem.com/gbengelebs/introduction-to-solidity-228c</link>
      <guid>https://forem.com/gbengelebs/introduction-to-solidity-228c</guid>
<description>&lt;p&gt;In the previous article we discussed smart contracts and tokens. In this article I will be taking a look at Solidity, the language for programming smart contracts. Solidity is a high-level programming language designed for implementing smart contracts. It is a statically typed, object-oriented (contract-oriented) language. Solidity is highly influenced by Python, C++, and JavaScript, and it runs on the Ethereum Virtual Machine (EVM). In this article we will provide an introduction to the Solidity language.&lt;/p&gt;

&lt;p&gt;The first thing we need is an IDE to write our Solidity code. One of the most popular development environments for Solidity is the Remix IDE, and it is what we will be using in this tutorial. Luckily, we can access it online &lt;a href="https://remix.ethereum.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  .SOL
&lt;/h2&gt;

&lt;p&gt;Solidity files are saved with the .sol extension to indicate that they are Solidity files.&lt;/p&gt;

&lt;h2&gt;
  
  
  PRAGMA
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The first line of a Solidity file is the pragma statement. It indicates the Solidity version being used and helps ensure the file is compiled with a compatible compiler version.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pragma solidity ^0.8.2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CONTRACT
&lt;/h2&gt;

&lt;p&gt;This keyword is used to create a smart contract. By convention, the name of the contract is usually the name of the Solidity file. Every function and variable declaration in the file will be encapsulated within the smart contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract Test{ 
 Functions and Data 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  VARIABLES
&lt;/h2&gt;

&lt;p&gt;Variables are reserved memory locations that store values.&lt;br&gt;
You may want to store information of various data types: character, integer, floating point, boolean, etc. Based on the data type of a variable, the compiler allocates memory and decides what can be stored in that reserved memory.&lt;br&gt;
Examples of variable types are integer, string, and bool.&lt;/p&gt;
&lt;h2&gt;
  
  
  ADDRESS
&lt;/h2&gt;

&lt;p&gt;This is a variable type that holds a 20-byte value, the size of an Ethereum address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; address x = 0x212;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MAPPING
&lt;/h2&gt;

&lt;p&gt;A mapping maps keys to values. Mappings act like hash tables consisting of key types and corresponding value type pairs. Below is the syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mapping(_KeyType =&amp;gt; _ValueType)

mapping(address =&amp;gt; uint) public balances;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maps the address variable as a key to an integer variable and assigns the mapping to a public variable called balances.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iu42j3qxdheb02lf1m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iu42j3qxdheb02lf1m8.png" alt="mapping" width="429" height="329"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Image source: &lt;a href="https://medium.com/upstate-interactive/mappings-in-solidity-explained-in-under-two-minutes-ecba88aff96e" rel="noopener noreferrer"&gt;Upstate Interactive&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;We can assign a value to a key address like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;balances[keyAddress] =  value;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
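&lt;p&gt;Putting the declaration and assignment together, here is a minimal sketch (the contract and function names are illustrative, not from the text above) of a contract that writes to and reads from a mapping:&lt;/p&gt;

```solidity
pragma solidity ^0.8.2;

contract BalanceStore {
    // Maps an address key to a uint value
    mapping(address =&amp;gt; uint) public balances;

    // Assign a value to the caller's own key address
    function setMyBalance(uint value) public {
        balances[msg.sender] = value;
    }

    // Read back the value stored under any key address
    function getBalance(address keyAddress) public view returns (uint) {
        return balances[keyAddress];
    }
}
```

&lt;p&gt;Because balances is declared public, Solidity also generates a getter for it automatically; the explicit getBalance function is only there for illustration.&lt;/p&gt;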



&lt;p&gt;Solidity supports state, local, and global variables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State Variables − Variables whose values are permanently stored in contract storage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract SolidityTest {
   uint storedData;      // State variable
   constructor() {
      storedData = 10;   // Using State variable
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we declared an integer variable called storedData and assigned a value to it in the constructor of the contract. This value will be available throughout the contract context.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Local Variables − Variables whose values are present only within a function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global Variables − Special variables that exist in the global namespace and are used to get information about the blockchain. A common example is:&lt;br&gt;
block.coinbase (address payable), which returns the current block miner's address. See the list of variables in the image below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o9s8rwyhpwzmuixp0rz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o9s8rwyhpwzmuixp0rz.png" alt="GlobalVariables" width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Image source: &lt;a href="https://www.tutorialspoint.com/solidity/solidity_variables.htm" rel="noopener noreferrer"&gt;TutorialsPoint&lt;/a&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  FUNCTION
&lt;/h2&gt;

&lt;p&gt;A function is a group of reusable code that can be used anywhere in your application. Functions perform a specific task. The most common way to define a function in Solidity is by using the function keyword, followed by a unique function name, a list of parameters (that might be empty), and a statement block surrounded by curly braces.&lt;/p&gt;

&lt;p&gt;Functions can be specified as being external, public, internal or private. (In older compiler versions the default was public; since Solidity 0.5.0 the visibility must be stated explicitly.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public: Public functions are part of the contract interface and can be either called internally or via messages.&lt;/li&gt;
&lt;li&gt;Internal: Those functions and state variables can only be accessed internally (i.e. from within the current contract or contracts deriving from it).&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Private: Private functions and state variables are only visible for the contract they are defined in and not in derived contracts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Functions can be declared &lt;em&gt;view&lt;/em&gt;, in which case they promise not to modify the state: &lt;strong&gt;read-only functions&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function function-name(parameter-list) scope returns() {
   //statements
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
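&lt;p&gt;As a sketch of the visibility levels described above (the contract and function names are illustrative):&lt;/p&gt;

```solidity
pragma solidity ^0.8.2;

contract VisibilityDemo {
    // public: part of the contract interface; callable internally and externally
    function publicFn() public view returns (uint) {
        return internalFn();
    }

    // external: callable only from outside the contract
    function externalFn() external view returns (uint) {
        return 1;
    }

    // internal: callable from this contract and contracts deriving from it
    function internalFn() internal view returns (uint) {
        return privateFn();
    }

    // private: visible only inside this exact contract, not in derived contracts
    function privateFn() private view returns (uint) {
        return 2;
    }
}
```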



&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract BlogDemo {
   function addNumbers() public view returns(uint){
      uint a = 1; // local variable
      uint b = 2;
      uint result = a + b;
      return result;
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example we named our function addNumbers. It is declared as public view, which means it does not modify any contract state; it just adds two numbers together. It returns an integer and does not take any parameters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functions with multiple return parameters.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract BlogDemo {
   function addNumbers() public view returns(uint sum, uint product){
      uint a = 1; // local variable
      uint b = 2;
      sum = a + b;
      product = a * b;    
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function will return both the product and sum.&lt;/p&gt;
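&lt;p&gt;A caller can capture both return values at once with a destructuring assignment; a small sketch (the useBoth function is an illustrative addition):&lt;/p&gt;

```solidity
contract BlogDemo {
    function addNumbers() public view returns (uint sum, uint product) {
        uint a = 1;
        uint b = 2;
        sum = a + b;
        product = a * b;
    }

    // Capture both named return values in local variables
    function useBoth() public view returns (uint) {
        (uint sum, uint product) = addNumbers();
        return sum + product;
    }
}
```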

&lt;ul&gt;
&lt;li&gt;The require keyword.
The require keyword in a Solidity function guarantees the validity of conditions that cannot be detected before execution. It checks inputs, contract state variables, and return values from calls to external contracts. If I want to execute a function only when a particular condition is met, I add the require keyword.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract BlogDemo {
uint value1 = 5;
uint value2 = 4;

function addNumbers() public view returns(uint sum, uint product){
    require(value1 &amp;gt; value2, '5 is not greater than 4');
      uint a = 1; // local variable
      uint b = 2;
      sum = a + b;
      product = a * b;    
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function will only execute if value1 is greater than value2. If the condition is not met, the call reverts with the error message ('5 is not greater than 4'). &lt;/p&gt;

&lt;h2&gt;
  
  
  Modifiers
&lt;/h2&gt;

&lt;p&gt;Modifiers allow you to control the behaviour of a function. They can be used in a variety of scenarios, for example checking who has access to a function before executing it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract Test {
  address testAddress;
  constructor() {
    testAddress = msg.sender;
  }

  // Check if the function is called by the owner of the contract
  modifier onlyOwner() {
      require(msg.sender == testAddress);
      _;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function body is inserted where the special symbol "_;" appears in the definition of a modifier. So if the condition of the modifier is satisfied when the function is called, the function is executed; otherwise, an exception is thrown.&lt;/p&gt;

&lt;p&gt;We can then use this modifier as a condition checker in other functions, for example to only execute a function if it is called by the owner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  // Can only be called by the owner cause I am using the onlyOwner modifier
  function test() public onlyOwner {
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Constructors
&lt;/h2&gt;

&lt;p&gt;A constructor is an optional function declared with the constructor keyword which is executed only upon contract creation. If there is no constructor, the contract assumes a default constructor. (In older Solidity versions constructors could be declared public or internal; since 0.7.0 the visibility specifier is omitted.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contructor() public {}
contract SolidityTest {
   uint storedData;      // State variable
   constructor() {
      storedData = 10;   // Using State variable
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Events
&lt;/h2&gt;

&lt;p&gt;An event stores the arguments passed to it in the transaction logs of the blockchain. If you want to record something like transfer information, you can do so using an event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Syntax
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;event Transfer(address indexed from, address indexed to, uint _value);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To write to an event, you &lt;em&gt;emit&lt;/em&gt; that event. To write to the Transfer event, I emit it using the following syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//Emit an event
emit Transfer(msg.sender, receiverAddress, msg.value);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
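&lt;p&gt;Putting the declaration and the emit together, here is a minimal sketch (the contract and function names are illustrative) of a payment function that logs a Transfer event:&lt;/p&gt;

```solidity
pragma solidity ^0.8.2;

contract PaymentDemo {
    event Transfer(address indexed from, address indexed to, uint _value);

    // Forward the attached ether to the receiver and record it in the event log
    function pay(address payable receiverAddress) public payable {
        receiverAddress.transfer(msg.value);
        emit Transfer(msg.sender, receiverAddress, msg.value);
    }
}
```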



&lt;p&gt;In this article we have explained some common syntax and terms in the Solidity language. In the next article in the series we will build our own smart contract using Solidity and deploy it to the Binance Smart Chain. &lt;/p&gt;

&lt;p&gt;Follow me here and across my social media for more content like this: &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;/p&gt;

</description>
      <category>solidity</category>
      <category>etherum</category>
      <category>blockchain</category>
    </item>
    <item>
      <title>An Introduction to Cryptocurrency Tokens</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Mon, 31 May 2021 17:06:02 +0000</pubDate>
      <link>https://forem.com/gbengelebs/an-introduction-to-cryptocurrency-tokens-2chf</link>
      <guid>https://forem.com/gbengelebs/an-introduction-to-cryptocurrency-tokens-2chf</guid>
<description>&lt;p&gt;Hey👀! Ever thought of creating your own currency? With your name, identity and all? Well, you can (kinda🤫). One of the best ways to learn about the inner workings of the cryptocurrency universe is to build your very own token. If you are new to cryptocurrency, you can check out my previous article where I demystify blockchain and cryptocurrency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part 1 of a 3-part series on building your own token on the Binance Smart Chain with Solidity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before we proceed we must differentiate between a cryptocurrency and a token. The main difference is that a cryptocurrency is the native currency of a blockchain network, whereas tokens are built on top of an existing blockchain.&lt;/p&gt;

&lt;p&gt;Ethereum and Bitcoin, like other cryptocurrencies, allow you to transfer and trade digital currencies. Some cryptocurrencies, like Ethereum, took it one step further. Ethereum introduced the Ethereum Virtual Machine (EVM), which enables developers to create and launch code that runs on the Ethereum decentralized network. This code is often referred to as a smart contract.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ethereum is not the only blockchain that allows smart contracts, just one of the most popular; others exist, such as the Binance Smart Chain.&lt;/em&gt;&lt;br&gt;
The Ethereum Virtual Machine allows nodes to store and process data in exchange for payment. This payment is usually in the form of ether, the native currency of the Ethereum blockchain. Smart contract integration opened blockchain up to a myriad of opportunities and industries. Popular use cases are decentralized finance, logistics and supply chain, and real-time IoT operating systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Let's say I have a voting app where users can vote for their favourite footballer using a token I created; let's call it vTokens. The more vTokens a player has, the higher the rating. Users of my app can transfer vTokens to each other and also use vTokens to vote for their favourite player. &lt;/p&gt;

&lt;p&gt;Daniel, a user of my app, wants to send 5,000 vTokens to Lionel Messi. Daniel calls a function inside my voting app asking it to do so: "Please transfer 5,000 vTokens from address D to address LM".&lt;/p&gt;

&lt;p&gt;I want to use the Ethereum blockchain to store and process the voting information (who voted for whom, and token transfers between users). &lt;/p&gt;

&lt;p&gt;Even though Daniel isn't sending ether, he must still pay a fee denominated in ether to have his transaction included in the Ethereum blockchain and to use the blockchain's resources to send that token. I don't want Daniel to go through the stress of also buying ether, so I could peg my vTokens to ether, e.g. 20,000 vTokens = 1 ether. For every transaction that occurs in my app I could then take a percentage in vTokens and convert it to ether, which is used to pay the Ethereum network for doing the work. The more users want my token, the more my token's value increases. All this interoperability is made possible by smart contracts: self-executing contracts with the terms of the agreement between buyer and seller written directly into lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  TOKEN STANDARDS
&lt;/h2&gt;

&lt;p&gt;For two systems to work together they need a common agreement. To create your own token on an existing blockchain you need to follow a token standard. Popular token standards are ERC-20 &amp;amp; BEP-20.&lt;br&gt;
You can think of a standard as a blueprint for tokens that defines how they can be spent, who can spend them, and other rules for their usage. By following the outline, developers don’t need to reinvent the wheel. Instead, they can build off a foundation already used across the industry.&lt;/p&gt;

&lt;p&gt;For example, to be ERC-20-compliant, your contract needs to include six mandatory functions: totalSupply, balanceOf, transfer, transferFrom, approve, and allowance. In addition, you can specify optional functions, such as name, symbol, and decimals. From their names you might already have deduced what these functions do. Below are the functions as they appear in Ethereum's Solidity language. Don't worry if you do not understand the code syntax; the next article will be an introduction to the Solidity language.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;totalSupply
function totalSupply() public view returns (uint256)
Returns total coin supply in circulation&lt;/li&gt;
&lt;li&gt;balanceOf
function balanceOf(address _owner) public view returns (uint256 balance)
When called, it returns the balance of the specified address’s token holdings.&lt;/li&gt;
&lt;li&gt;transfer
function transfer(address _to, uint256 _value) public returns (bool success)
Transfers tokens from the address of the person calling the smart contract to another address.&lt;/li&gt;
&lt;li&gt;transferFrom
function transferFrom(address _from, address _to, uint256 _value) public returns (bool success)
Unlike the regular transfer, in transferFrom you specify the sender address; it doesn't necessarily have to be the address of the person calling the smart contract. A good use case is setting up recurring payments using a smart contract: you could transfer the tokens to an "admin" user that you created and then authorize that user to transfer the tokens at specific intervals.&lt;/li&gt;
&lt;li&gt;approve
function approve(address _spender, uint256 _value) public returns (bool success)
With this function, you can limit the number of tokens that a smart contract can withdraw from your balance.&lt;/li&gt;
&lt;li&gt;allowance
function allowance(address _owner, address _spender) public view returns (uint256 remaining)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is used together with the approve function. For example, if your smart contract is permitted to withdraw 30 tokens and it has withdrawn 20, calling allowance should return 10 tokens: basically, how many tokens are left for the smart contract to spend.&lt;/p&gt;
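&lt;p&gt;Collected together, the six mandatory functions form an interface. A sketch of how they are commonly declared in Solidity (note that in an interface the functions are marked external rather than public; the standard's Transfer and Approval events are omitted here for brevity):&lt;/p&gt;

```solidity
pragma solidity ^0.8.2;

// The six mandatory ERC-20 functions gathered into one interface
interface IERC20 {
    function totalSupply() external view returns (uint256);
    function balanceOf(address _owner) external view returns (uint256 balance);
    function transfer(address _to, uint256 _value) external returns (bool success);
    function transferFrom(address _from, address _to, uint256 _value) external returns (bool success);
    function approve(address _spender, uint256 _value) external returns (bool success);
    function allowance(address _owner, address _spender) external view returns (uint256 remaining);
}
```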

&lt;p&gt;To create your own token there is an easy way and a hard way. The easy way: using the CoinTool app (very basic functionality). The hard way: programming your smart contract in Solidity (endless possibilities). In this article we will explore the easy way, but later in this series we will create our token's smart contract with Solidity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating your own token on Binance Smart Chain
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Download the Trust Wallet app: &lt;a href="https://play.google.com/store/apps/details?id=com.wallet.crypto.trustapp&amp;amp;referrer=utm_source%3Dwebsite" rel="noopener noreferrer"&gt;Android&lt;/a&gt; / &lt;a href="https://apps.apple.com/app/apple-store/id1288339409?pt=1324988&amp;amp;ct=website&amp;amp;mt=8" rel="noopener noreferrer"&gt;iOS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create a token using the coin tool app &lt;a href="https://cointool.app/bnb/bsccreateToken" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqu9psc3bg78m4igsxcco.png" alt="Create Token" width="800" height="439"&gt;
-Specify the token name, symbol, initial supply (how many total tokens you want to create) and decimals.
-Token decimals are an interesting concept: they specify the lowest unit of your token. For example, 1 ETH = 1,000,000,000,000,000,000 wei and 1 Bitcoin = 100,000,000 satoshis. This means I can own 0.000000000000000001 ETH or 0.00000001 BTC. Decimals essentially make a token very divisible.
1. Can Burn: the total circulating supply of the token can be reduced.
2. Can Mint: an additional amount of this token can be "minted" (created).
3. Can Pause: the token and its associated functions can be paused, for example in the event of a hack.
-Click on Connect Wallet --&amp;gt; Trust Wallet --&amp;gt; BNB Network Chain.
-Log on to your Trust Wallet --&amp;gt; Settings --&amp;gt; WalletConnect&lt;/li&gt;
&lt;li&gt;Scan the QR code and Approve&lt;/li&gt;
&lt;li&gt;Create token
-The token will be created once you approve this transaction fee, usually about 0.01 BNB.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g6imz4rajyha9ec0nwt.png" alt="Token Confirmation" width="800" height="1410"&gt;
&lt;em&gt;Unfortunately I don't have sufficient BNB to complete this token creation.😑&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  To make your token show up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Go to trust wallet&lt;/li&gt;
&lt;li&gt;Click on the two sliders at the top right of the home page, then scroll down to --&amp;gt; Add Custom Token&lt;/li&gt;
&lt;li&gt;Set the network to Smart Chain&lt;/li&gt;
&lt;li&gt;Fill in your token's contract address&lt;/li&gt;
&lt;li&gt;Fill in the rest of the details&lt;/li&gt;
&lt;li&gt;Log out and log in again to see your token.
Congratulations, you now own your own token. Start collecting payments to your token address.🤝&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, we have learnt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A cryptocurrency is the native currency of a blockchain, whilst a token is built on top of an existing blockchain's network.&lt;/li&gt;
&lt;li&gt;Smart contracts are code used to execute functions for a given token.&lt;/li&gt;
&lt;li&gt;For a token to be created on a blockchain network it needs to follow the token standard by implementing certain mandatory functions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In part 2 of this series I will be expounding on the Solidity language, which is used to write smart contracts.&lt;/p&gt;

&lt;p&gt;Follow me here and across my social media for more content like this: &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;/p&gt;

</description>
      <category>blockchain</category>
    </item>
    <item>
      <title>Making Sense Of Blockchain And CryptoCurrency</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Mon, 31 May 2021 16:54:52 +0000</pubDate>
      <link>https://forem.com/gbengelebs/making-sense-of-blockchain-and-cryptocurrency-3kn9</link>
      <guid>https://forem.com/gbengelebs/making-sense-of-blockchain-and-cryptocurrency-3kn9</guid>
<description>&lt;p&gt;This article is my attempt to make sense of blockchain and cryptocurrency. I will be demystifying blockchain technology and explaining common cryptocurrency terminology.&lt;/p&gt;

&lt;h3&gt;
  
  
  WHAT IS BLOCKCHAIN?
&lt;/h3&gt;

&lt;p&gt;A blockchain is an auditable database: a database in which data can only be added, not removed or changed. Data is periodically added to the database in units called blocks. As the name implies, a series of these blocks chained together is called a blockchain.&lt;/p&gt;

&lt;h3&gt;
  
  
  WHAT IS CRYPTOCURRENCY?
&lt;/h3&gt;

&lt;p&gt;Cryptocurrency is a digital currency in which transactions are verified, and records maintained, by a decentralized system using cryptography. Combining blockchain and cryptocurrency: a blockchain is a network of computers (nodes) that run software to confirm the security and validity of the digital currency on the network. Blockchain is the network, and cryptocurrency is what is being spent on the network. Bitcoin is currently the most popular blockchain and cryptocurrency, but other blockchains exist, like Ethereum, with ether as the currency being spent on the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  DECENTRALIZED VS CENTRALIZED
&lt;/h3&gt;

&lt;p&gt;Most legacy financial institutions use a centralized system wherein user data is stored and managed by a private entity or group of entities; all users connect to a single source of data. This sort of relationship is called a centralized network. A major downside is that it provides a single point of failure: if the centralized database is wiped out, all the data is lost, and because a single entity (a central bank, say) has power over the system, it can make changes as it pleases. In a decentralized network, on the other hand, data is stored on different nodes (computers) in the network and no node is managed by a central authority; the nodes have to somewhat agree to trust each other, and each participating node has an exact copy of the database. If one node is down, other nodes can provide the data. This provides a redundant and resilient network that ensures high reliability.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr179k43zydzw54hlhlzw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr179k43zydzw54hlhlzw.jpeg" alt="Centralized Vs Decentralized" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Centralized vs Decentralized&lt;/em&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  HOW DOES THE BLOCKCHAIN WORK?
&lt;/h3&gt;

&lt;p&gt;Imagine a large hall with briefcases at one end and glass at the other. Each briefcase has a lock and a chain that connects it to the next briefcase. Everyone can see the briefcases through the glass, but only the person with the key to a briefcase can open it. Suppose I want to transfer money (currency) from my briefcase to yours: I need my key (crypto) to open it and sign the transaction. Everyone looking through the glass sees that the transaction belongs to me, along with its details; as soon as it gets to you, you broadcast to everyone that you have received your money, so everyone takes note of it. After a period of time, someone gathers all these mini transactions and collates them into blocks, which are then added to the chain. At a simplistic level, this is how a blockchain works: it consists essentially of two parts, a block and a chain. I refer to a block as a collection of transactions and a chain as the linking mechanism that checks whether the blocks' signatures are valid.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We define a bitcoin as a chain of digital signatures. Each owner transfers bitcoin to the next by digitally signing a hash of the previous transaction and the public key of the next owner and adding these to the end of the coin. A payee can verify the signatures to verify the chain of ownership.&lt;/em&gt;&lt;br&gt;
-Satoshi Nakamoto, Bitcoin Whitepaper&lt;br&gt;
The blockchain is downloaded by all the nodes in the network. When a new block is added to the chain, each node verifies that the block is valid; if it is, the node makes a copy of that block and forwards it to the other nodes in the network until all nodes have added that block to their chain.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl27k0ud0vphhv64fdnv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl27k0ud0vphhv64fdnv.jpeg" alt="How does the blockchain work" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  WHAT IS CRYPTO MINING?
&lt;/h3&gt;

&lt;p&gt;The process of verifying whether a block is valid is done by nodes called miners. Cryptocurrency mining refers to the process of earning cryptocurrency as a reward for work that you complete (when mining Bitcoin specifically, this is known as Bitcoin mining). Whenever transactions are made, all network nodes receive them and verify their validity. Miner nodes collect these transactions from the memory pool and attempt to organize them into blocks. A node performs a series of hashing functions (hard mathematical problems) according to protocols preset by that crypto network until it finds a valid hash. When it does, it broadcasts the block to the network. All other nodes check whether the hash is valid and, if so, add the block to their copy of the blockchain. Hashing transactions into blocks requires a lot of computational power, so the nodes that perform these hashes are rewarded for their effort. Nodes can also combine into pools, sharing resources and splitting the currency reward amongst themselves.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13nau2erpkxsu7o8walj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13nau2erpkxsu7o8walj.jpeg" alt="Mining Farm" width="258" height="195"&gt;&lt;/a&gt;&lt;/p&gt;


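&lt;p&gt;The "hard mathematical problem" miners solve can be illustrated with a toy version (a Python sketch of my own; the four-zero difficulty target is an illustrative stand-in for Bitcoin's real, vastly harder target): keep hashing the block's contents with a changing nonce until the hash starts with the required number of zeros.&lt;/p&gt;

```python
import hashlib

DIFFICULTY = "0000"  # toy target: a valid hash must start with four zeros

def mine(block_data: str):
    # Grind through nonces until the block's hash meets the difficulty target.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
        if digest.startswith(DIFFICULTY):
            return nonce, digest
        nonce += 1

nonce, digest = mine("tx1;tx2;tx3")
print(f"found nonce {nonce} giving hash {digest[:16]}...")
```

On average this toy search takes about 65,000 attempts (16^4); a real network's target requires astronomically more, which is why mining consumes so much computational power and why miners pool resources.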

&lt;h2&gt;
  
  
  CONSENSUS ALGORITHM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Proof Of Work
&lt;/h3&gt;

&lt;p&gt;One of the challenges with distributed systems is ensuring honesty. How do we ensure that each node in the network is honest, and that miners do not simply submit an invalid block without doing the proper work? The answer is a consensus algorithm: a set of rules that governs miners in a blockchain. All the nodes know this algorithm, so they can check a block's validity. It acts somewhat like a lie detector for the chain, verifying that the hashed block a miner proposes is valid. If a miner brings forth an invalid hashed block, it has wasted its computational time, resources, and reputation; if it brings a valid block, it gets rewarded in the native coin. This type of consensus algorithm is called proof of work because miners have to prove that they have done the work.&lt;/p&gt;
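&lt;p&gt;The "lie detector" works because of an asymmetry: finding a valid nonce takes enormous effort, but checking someone else's claimed nonce takes a single hash. A minimal Python sketch (toy difficulty and data, purely illustrative):&lt;/p&gt;

```python
import hashlib

def is_valid_block(block_data: str, nonce: int, difficulty: str = "0000") -> bool:
    # One SHA-256 call is enough to audit work that took thousands of calls to produce.
    digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
    return digest.startswith(difficulty)

# Doing the work: search for a nonce that satisfies the target.
honest_nonce = next(n for n in range(10**7) if is_valid_block("tx1;tx2", n))

# Checking the work: every node can audit the claim with a single hash.
print(is_valid_block("tx1;tx2", honest_nonce))   # True
```

Since every node runs the same cheap check, a miner proposing a block without having done the work is caught immediately and gains nothing.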

&lt;h3&gt;
  
  
  Proof Of Stake
&lt;/h3&gt;

&lt;p&gt;In proof of stake, instead of sacrificing computational resources, nodes sacrifice crypto coins in a process known as staking. A pseudo-random node is selected, based on certain conditions, to forge a new block. The node stakes a percentage of its coins as a down payment that it will forge the block correctly. When the node completes the block, it is rewarded with transaction fees for its work. If it tries to cheat by not forging the block correctly, it loses both its stake and its reputation.&lt;br&gt;
&lt;em&gt;HALVING&lt;/em&gt; is a process whereby the reward miners receive for mining is periodically cut in half. This is done to slow the rate at which new coins enter circulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  A PRACTICAL EXAMPLE
&lt;/h3&gt;

&lt;p&gt;Gbenga and Ada are on the ecoin network. Gbenga wants to send 3 ecoins to Ada. Let's walk through the process.&lt;br&gt;
Address: both Gbenga and Ada need an address. The first time Gbenga issues a transaction, a private and a public key are generated: the private key for Gbenga to sign the transaction, and a public key for other people to verify Gbenga's signature. Ada needs the same at her end. With the address problem solved, Gbenga issues a statement saying, "I, Gbenga, at this address (123), am sending 3 ecoins to Ada at this address (1234)." He signs this statement with his private key and attaches a hash of his public key. When the coins get to Ada, she can verify it is his signature using his public key. Ada then signals to the rest of the network that she has received her coins. As other people on the network hear that message, each adds it to a queue of pending transactions they have been told about but which haven't yet been approved by the network. David, a miner, checks his copy of the blockchain and can see that each transaction is valid. He would like to help out by broadcasting news of that validity to the entire network so it can be added as a block. However, before doing so, the validation protocol requires David to solve a hard computational puzzle, the proof-of-work. Without the solution to that puzzle, the rest of the network won't accept his validation of the transactions.&lt;br&gt;
Once David solves this problem, he is rewarded with a crypto coin and the block of transactions is added to the network.&lt;br&gt;
To see this in practice, head to a Bitcoin block explorer -&amp;gt; Latest Blocks -&amp;gt; View All -&amp;gt; select a block that has been mined. You can see all the transactions associated with that block, and hover over each column for the details of a transaction.&lt;/p&gt;
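&lt;p&gt;The sign-with-private, verify-with-public mechanic at the heart of this walkthrough can be demonstrated with toy RSA numbers. This is a Python sketch for intuition only: real Bitcoin signatures use ECDSA over the secp256k1 curve with roughly 256-bit keys, not small-prime RSA, but the asymmetry is the same idea.&lt;/p&gt;

```python
import hashlib

# Toy RSA keypair -- tiny primes chosen for readability, far too small to be secure.
p, q = 10007, 10009
n = p * q                       # public modulus
phi = (p - 1) * (q - 1)
e = 17                          # public exponent (shared with everyone)
d = pow(e, -1, phi)             # private exponent (kept secret by the signer)

def digest(message: str) -> int:
    # Hash the transaction statement, reduced into the signable range.
    return int(hashlib.sha256(message.encode()).hexdigest(), 16) % n

def sign(message: str, private_key: int) -> int:
    # Only the holder of the private key can produce this value.
    return pow(digest(message), private_key, n)

def verify(message: str, signature: int, public_key: int) -> bool:
    # Anyone holding the public key can check the signature.
    return pow(signature, public_key, n) == digest(message)

tx = "I, Gbenga (address 123), send 3 ecoins to Ada (address 1234)"
sig = sign(tx, d)
print(verify(tx, sig, e))        # True: Ada confirms the statement is Gbenga's
print(verify(tx + "5", sig, e))  # False: any tampering breaks the signature
```

This is why everyone behind the glass can audit a transaction without being able to forge one: verification needs only the public key, while signing needs the private key.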

&lt;h3&gt;
  
  
  WHAT IF I WANT TO EXCHANGE CRYPTOCURRENCIES?
&lt;/h3&gt;

&lt;p&gt;I understand how an exchange is done within a blockchain. But what if I want to exchange an ecoin for a gcoin? There are two main ways to do this: centralized exchanges and decentralized exchanges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized Exchanges
&lt;/h3&gt;

&lt;p&gt;In this type of exchange, a centralized body helps you swap coins. It manages your blockchain private keys and handles the responsibility involved in managing transactions. (It's funny having a centralized body manage a decentralized system 🤔.) Suppose I want to swap x amount of coin A for y amount of coin B. The exchange removes x amount of coin A's value from my coin A crypto wallet and records it on its ledger. It then creates a coin B wallet for me (if I have none), purchases coin B on my behalf, and offers my coin A to other people willing to buy it. The more people who want to buy coin A, the higher coin A's value goes. Popular examples of centralized exchanges are Binance and Coinbase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decentralized Exchanges
&lt;/h3&gt;

&lt;p&gt;In a way, these work similarly to centralized exchanges, except that you have the responsibility of storing your own private keys; they are not managed by a "central entity". This essentially allows peer-to-peer (P2P) trading, enabling you to transfer funds directly to an interested buyer or seller without having to go through middlemen.&lt;br&gt;
Decentralized exchanges work on agreements (smart contracts). The seller raises an order to sell and, after filling in the necessary transaction details, an advert is placed on a marketplace. Once a buyer agrees on terms with the seller, a smart contract is created which cannot be changed until the payment is confirmed and both parties are settled.&lt;/p&gt;

&lt;h3&gt;
  
  
  SMART CONTRACTS
&lt;/h3&gt;

&lt;p&gt;A contract is an agreement between two parties; what makes this one "smart" is that the third party (lawyer, bank) is replaced by a computer program that both parties understand. Why are smart contracts needed? One of several reasons is that they allow new systems to utilize an existing blockchain's network. Suppose I want to create an app that uses blockchain technology to help me transfer music across a network in a secure way. I can go two ways:&lt;br&gt;
Create my own blockchain network (very stressful)&lt;br&gt;
Utilize an existing blockchain's network (less stressful)&lt;/p&gt;

&lt;p&gt;The blockchain owners give me a smart contract guideline to follow in my app. With this smart contract, I can create my own token and peg it to the blockchain's native currency; I then give my users my own token to use on the platform. An example:&lt;br&gt;
Let's imagine I have a token called etoken, and I want to use an existing blockchain's network, in our case Binance Smart Chain, for my new project. To use it I need the blockchain's native coin (BNB). I decide that 5,000 etokens will be worth 1 BNB. I need 10,000 BNB, which will cost me 100,000 dollars. But there is a problem: I cannot afford the BNB. To raise money I go through a process known as an INITIAL COIN OFFERING.&lt;br&gt;
Basically, I ask people to give me money for my huge upcoming project and reward them with my etokens, which they can use on my app. People need to trust my project to give me their cash. Once I have the cash, I buy the BNB, which I can then use on the blockchain's network.&lt;br&gt;
Imagine Samuel, a user on my app, wants to buy music from Alice, another user. An agreement is formed on the blockchain using a smart contract between Alice and Samuel. In the simplest terms, the agreement looks like this: "WHEN Samuel pays Alice 20 etokens, THEN Samuel will receive the music." The blockchain checks that the smart contract is valid and then fulfills the transaction.&lt;/p&gt;
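&lt;p&gt;At its core, a smart contract is condition-checking code that moves value only when its terms are met. Here is the Samuel/Alice agreement as a minimal Python sketch (the balances, names, and function are illustrative; real contracts run on-chain, for example as Solidity on Ethereum-style networks):&lt;/p&gt;

```python
# Toy ledger state recorded on the blockchain.
balances = {"Samuel": 50, "Alice": 10}
music_owner = "Alice"

def music_sale_contract(buyer: str, seller: str, price: int) -> str:
    # "WHEN the buyer pays the seller `price` etokens,
    #  THEN the buyer receives the music."
    global music_owner
    if balances[buyer] - price >= 0 and music_owner == seller:
        balances[buyer] -= price
        balances[seller] += price
        music_owner = buyer
        return "fulfilled"
    return "rejected"

print(music_sale_contract("Samuel", "Alice", 20))   # fulfilled
print(balances, music_owner)                        # Samuel: 30, Alice: 30, owner: Samuel
```

Because the condition and the transfer live in the same program, neither party needs a lawyer or a bank to guarantee that payment and delivery happen together.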

&lt;h3&gt;
  
  
  AUTOMATED MARKET MAKERS
&lt;/h3&gt;

&lt;p&gt;Automated market makers (AMMs) are like robots that give you the price of a currency using a formula. With a traditional order book, if I have Ether and want to trade it for Bitcoin, I need to put out an advert that I am selling ETH for BTC; if I get a buyer, we then go through the hassle of negotiating until we reach a common agreement and the swap happens. AMMs cut out this hassle. They use a formula to calculate the price of each asset automatically, allowing a fast and seamless swap between assets. People often refer to this as P2C (peer-to-computer) as against P2P (peer-to-peer). The exact technicalities are beyond the scope of this reading, but in summary:&lt;br&gt;
-People deposit their cryptocurrencies into a pool called a liquidity pool. Whenever there is a transaction involving that cryptocurrency, the robots interact with the pool to get the currency to exchange.&lt;br&gt;
-The people who deposit their cryptocurrencies into the pool are rewarded with a percentage of each transaction that happens with that currency.&lt;br&gt;
-The price you get for an asset you want to buy or sell is determined by a formula, which can vary with each protocol.&lt;/p&gt;
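&lt;p&gt;The best-known pricing formula is the constant product rule used by Uniswap-style AMMs: the pool keeps x * y = k, so the price of a swap falls out of the math rather than a negotiation. A minimal Python sketch (the pool sizes are illustrative, and real AMMs also charge a liquidity-provider fee, which is omitted here):&lt;/p&gt;

```python
import math

# A liquidity pool holding two assets; the invariant x * y = k sets all prices.
pool_a, pool_b = 1000.0, 500.0
k = pool_a * pool_b

def swap_a_for_b(amount_in: float) -> float:
    # Deposit asset A; the pool pays out just enough B to keep x * y = k.
    global pool_a, pool_b
    pool_a += amount_in
    new_b = k / pool_a
    amount_out = pool_b - new_b
    pool_b = new_b
    return amount_out

received = swap_a_for_b(100.0)
print(f"swapped 100 A for {received:.2f} B")   # about 45.45 B
print(math.isclose(pool_a * pool_b, k))        # True: the invariant still holds
```

Note that larger swaps get progressively worse prices: as the pool's supply of B shrinks, each additional unit of A buys less B, which is how the formula encodes supply and demand.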

&lt;h3&gt;
  
  
  COINS VS TOKENS
&lt;/h3&gt;

&lt;p&gt;With respect to my example on smart contracts: coins are native to their own blockchain, whilst tokens are built on top of another blockchain.&lt;/p&gt;

&lt;h3&gt;
  
  
  FORKING
&lt;/h3&gt;

&lt;p&gt;In the cryptocurrency world, a fork happens when there is a change in the rules of the blockchain a coin operates on, or when the nodes disagree on one or more historic transactions. Suppose you and your group of friends have been turning left at every junction, and then at a particular junction someone turns right. If no one joins him, he is left alone and excluded from the network. But if some of the group go along, he has created a new rule; we can say he has forked out of the original group. In cryptocurrency, if enough nodes agree with each other, they can decide to fork out of the original network and create their own rules. A popular example is Bitcoin Cash, which was forked out of the original Bitcoin.&lt;/p&gt;

&lt;h3&gt;
  
  
  NON FUNGIBLE TOKENS
&lt;/h3&gt;

&lt;p&gt;NFTs are unique digital assets. They differ from other cryptocurrencies for one major reason: they are non-fungible. A fungible asset is one whose units are indistinguishable from one another; 1,000 naira in your hand is worth the same as 1,000 naira in mine. With NFTs that is not the case: each NFT's value is unique. I could digitize my painting into an NFT and issue it, and people who purchase my NFT would own a digital version of my painting. NFTs can be traded like other cryptocurrencies. But how do I assign value to my NFT? As with most things in life, the value of a thing is determined by how much people deem it valuable.&lt;/p&gt;

&lt;h3&gt;
  
  
  STORE OF VALUE
&lt;/h3&gt;

&lt;p&gt;Some cryptocurrencies like Bitcoin are often referred to as digital gold due to their anti-inflationary nature. A pound of gold today is worth far more than it was in 1970, while fiat currencies have steadily lost purchasing power to inflation. Bitcoin bears similarities to gold in this regard: the value of 1 BTC in 2021 was many times what it was in 2011. Some may argue that the price of Bitcoin tends to crash from time to time; as with gold, its value is determined by how much people are willing to pay for it at that time.&lt;/p&gt;

&lt;p&gt;Blockchain technology has the potential to impact several domains, from voting to healthcare, social media, and finance, and it comes with a lot of promises. These include transaction automation, as in the case of smart contracts, which can function without the need for middlemen; solving the trust problem, since the lack of trust is the reason many organizations spend so much on security and data protection, and blockchain helps increase trust between parties that do not currently trust one another; and transparency, due to the open and immutable state of the blockchain, which is publicly viewable and can be audited. Blockchain technology allows users to share data openly and securely, confident that the data is protected and that both parties can be trusted to deliver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, I have explained some concepts in the blockchain and cryptocurrency sphere and I do hope I have provided some useful information as you continue in your Blockchain journey.&lt;/p&gt;

&lt;h3&gt;
  
  
  REFERENCES AND FURTHER READING
&lt;/h3&gt;

&lt;p&gt;-&lt;a href="https://www.youtube.com/watch?v=0B3sccDYwuI&amp;amp;t=133s" rel="noopener noreferrer"&gt;How CryptoCurrency Works&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://lifehacker.com/what-is-blockchain-1822094625" rel="noopener noreferrer"&gt;What is Block Chain&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/articles/what-is-bitcoin#chapter-4-the-bitcoin-halving" rel="noopener noreferrer"&gt;What is Bitcoin ?&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/" rel="noopener noreferrer"&gt;how-the-bitcoin-protocol-actually-works&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/articles/what-is-ethereum" rel="noopener noreferrer"&gt;what-is-ethereum&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://www.bitdegree.org/crypto/tutorials/what-is-a-smart-contract" rel="noopener noreferrer"&gt;What Is a Smart Contract and How Does it Work?&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://www.quora.com/How-do-smart-contracts-work" rel="noopener noreferrer"&gt;How-do-smart-contracts-work&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/articles/what-is-cryptocurrency-mining" rel="noopener noreferrer"&gt;what-is-cryptocurrency-mining&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/glossary/erc-20" rel="noopener noreferrer"&gt;ERC-20 Token Standards&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/articles/what-is-cryptocurrency#centralized-exchanges-cex" rel="noopener noreferrer"&gt;a-guide-to-crypto-collectibles-and-non-fungible-tokens-nfts&lt;br&gt;
what-is-cryptocurrency#centralized-exchanges-cex&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://qr.ae/pGTYqR" rel="noopener noreferrer"&gt;How does a decentralized exchange work and what are the most promising decentralized exchanges?&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://www.investopedia.com/terms/b/binance-exchange.asp#:~:text=Binance%20is%20an%20exchange%20where,own%20token%20currency%2C%20Binance%20Coin." rel="noopener noreferrer"&gt;Binance Exchange&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://medium.com/coinmonks/so-what-problems-does-blockchain-actually-solve-dc4446a550f6" rel="noopener noreferrer"&gt;So what problems does block chain actually solve?&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://sectigostore.com/blog/what-is-crypto-mining-how-cryptocurrency-mining-works/" rel="noopener noreferrer"&gt;what-is-crypto-mining-how-does-cryptocurrency-mining-works&lt;/a&gt;&lt;br&gt;
-&lt;a href="https://academy.binance.com/en/articles/what-is-an-automated-market-maker-amm" rel="noopener noreferrer"&gt;What is an Automated Market Maker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow me here and across my social media for more content like this: &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;/p&gt;

</description>
      <category>blockchain</category>
    </item>
    <item>
      <title>Exploring Azure Application Insights with Asp.NetCore</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Tue, 11 May 2021 17:15:26 +0000</pubDate>
      <link>https://forem.com/gbengelebs/exploring-azure-application-insights-with-asp-netcore-3one</link>
      <guid>https://forem.com/gbengelebs/exploring-azure-application-insights-with-asp-netcore-3one</guid>
      <description>&lt;p&gt;Having a way to monitor and measure application performance is essential to building high-quality, reliable software. To continuously improve performance and usability for users it is paramount to have some insight into the state of the application at all times.&lt;/p&gt;

&lt;p&gt;Application Insights is an application performance management (APM) service that enables you to monitor your application's performance in Azure. With it, you can detect and diagnose application issues, track performance, and gain insight into what users do with your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does It Work?
&lt;/h2&gt;

&lt;p&gt;Without going too much into the specifics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You install the Application Insights SDK in your project.&lt;/li&gt;
&lt;li&gt;The SDK collects data from your project.&lt;/li&gt;
&lt;li&gt;It then transfers this data to your Application Insights portal on Azure for better analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I am going to be walking through the process of setting up application insight for a demo Asp.NetCore MVC application.&lt;/p&gt;

&lt;h3&gt;
  
  
  PREREQUISITES
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;VS Code &lt;a href="https://code.visualstudio.com/download" rel="noopener noreferrer"&gt;(download)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;An Azure account. You can create it for free &lt;a href="https://portal.azure.com/free" rel="noopener noreferrer"&gt;here&lt;/a&gt;. 
&lt;em&gt;P.S. You get 200 dollars in free Azure credits and 12 months of popular services free on the pay-as-you-go plan.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  CREATING THE PROJECT
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a new asp.net core MVC project.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;dotnet&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;mvc&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ApplicationInsightsDemo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enable Application Insights server-side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Log on to Azure Portal &lt;/li&gt;
&lt;li&gt;Create a resource -&amp;gt; search for -&amp;gt; Application Insights
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwngh2i1n22ugtia7gm6f.png" alt="Appinsight Home Page" width="800" height="600"&gt;
&lt;/li&gt;
&lt;li&gt;Fill in the required details.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0g69ko8pkzi45forb05.png" alt="Alt Text" width="800" height="715"&gt;
&lt;/li&gt;
&lt;li&gt;Browse to the resource and copy your instrumentation key.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0mwdmvtm5o6wliznypz.png" alt="Alt Text" width="800" height="176"&gt;
&lt;/li&gt;
&lt;li&gt;Add the Application Insights.NetCore SDK package to your application.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;dotnet&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="n"&gt;package&lt;/span&gt; &lt;span class="n"&gt;Microsoft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationInsights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AspNetCore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In the Startup class, inside the &lt;em&gt;ConfigureServices()&lt;/em&gt; method, add &lt;strong&gt;services.AddApplicationInsightsTelemetry()&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;      &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ConfigureServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IServiceCollection&lt;/span&gt; 
                                    &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddApplicationInsightsTelemetry&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddControllersWithViews&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Specify the instrumentation key in appsettings.json as shown below. This key points to your specific Application Insights instance on Azure.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"ApplicationInsights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"InstrumentationKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"yourinstrumentationkeyhere"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s"&gt;"Logging"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"LogLevel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Information"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"Microsoft"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"Microsoft.Hosting.Lifetime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Information"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s"&gt;"AllowedHosts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"*"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enabling Application Insights client-side
&lt;/h3&gt;

&lt;p&gt;We also need to enable telemetry for the client side of the application, so that we can collect information from the browser.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In _ViewImports.cshtml, add an injection
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;@inject&lt;/span&gt; &lt;span class="n"&gt;Microsoft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationInsights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AspNetCore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JavaScriptSnippet&lt;/span&gt; &lt;span class="n"&gt;JavaScriptSnippet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The _ViewImports.cshtml file in ASP.NET is imported by all the other pages of the application, so adding this code here injects it into the rest of the application's pages at runtime.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In _Layout.cshtml, insert the HtmlHelper at the end of the &amp;lt;head&amp;gt; section but before any other script.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt;  &lt;span class="n"&gt;@Html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JavaScriptSnippet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullScript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xetszwcoozqce8ahqos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xetszwcoozqce8ahqos.png" alt="Alt Text" width="800" height="222"&gt;&lt;/a&gt;&lt;br&gt;
Now that we have set up Application Insights in our application, let's make some tweaks to really see what it can help us with.&lt;/p&gt;

&lt;p&gt;I have modified my MVC app to add two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A call to a database service to load users.&lt;/li&gt;
&lt;li&gt;A call to an external API resource.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also added buttons on the client to interact with these services. As we use them, we can see how Application Insights helps monitor our application's state.&lt;br&gt;
The full source code is &lt;a href="https://github.com/GbengaElebs/AzureApplicationInsights" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the application and refresh it a couple of times.
Now go to your app insights instance on the Azure portal. 
Click on Live Metrics. You should see data from your application being displayed on the dashboard.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03rjlikzyfj2swvhpol6.png" alt="Alt Text" width="800" height="413"&gt;
&lt;/li&gt;
&lt;li&gt;On the Application Insights portal, navigate to the Performance blade. You can drill into the operations to get better insight.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer4g8wttn3izu8z8nlnk.png" alt="Alt Text" width="800" height="419"&gt;
&lt;/li&gt;
&lt;li&gt;Application Insights automatically tracks dependencies such as HTTP/HTTPS calls and calls made with SqlClient. You can see the full list &lt;a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-dependencies" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxijdrvew6rchds11vi0.png" alt="Alt Text" width="800" height="441"&gt;
In my case, I am using a MySQL DB, which Application Insights does not track automatically. You can enable tracking for it by adding your own custom dependency telemetry.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Adding Custom Dependency Tracking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inject a TelemetryClient instance through the constructor of the class you want to use it in, and store it in a private field.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;TelemetryClient&lt;/span&gt; &lt;span class="n"&gt;_telemetryClient&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Wrap your call to the MYSQL db like this.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_telemetryClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartOperation&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DependencyTelemetry&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"MySqlDb"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;@"SELECT UserName FROM Users"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Target&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;$"server=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - DB:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usersDataBase&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="c1"&gt;//sending details of the database call Application insights&lt;/span&gt;
                    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MySqlConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connString&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryFirstOrDefaultAsync&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CommandType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrEmpty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Success&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                        &lt;span class="c1"&gt;//informing Application Insights that the db call is successful&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="n"&gt;_telemetryClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;StopOperation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;See this article for more details on tracking custom dependencies &lt;a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-custom-events-metrics#trackdependency" rel="noopener noreferrer"&gt;trackdependency&lt;/a&gt;.&lt;/p&gt;
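&lt;p&gt;As a lighter-weight alternative to the &lt;code&gt;StartOperation&lt;/code&gt;/&lt;code&gt;StopOperation&lt;/code&gt; pattern above, you can report a dependency in a single &lt;code&gt;TrackDependency&lt;/code&gt; call. A minimal sketch; the timing code and dependency names here are illustrative, not from the sample app:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var startTime = DateTimeOffset.UtcNow;
var timer = System.Diagnostics.Stopwatch.StartNew();
var success = false;
try
{
    // the database call being tracked
    result = await connection.QueryFirstOrDefaultAsync&amp;lt;string&amp;gt;(query, commandType: CommandType.Text);
    success = true;
}
finally
{
    timer.Stop();
    // one call reports the type, name, command text, timing, and outcome
    _telemetryClient.TrackDependency("MySqlDb", "GetUserName", query, startTime, timer.Elapsed, success);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;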
&lt;h2&gt;
  
  
  Testing the Application
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Click on Call Database
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06fytzt3co9a00axdj3d.png" alt="Alt Text" width="800" height="319"&gt;
&lt;/li&gt;
&lt;li&gt;Navigate to Application Insights -&amp;gt; Live Metrics and check the details.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03rjlikzyfj2swvhpol6.png" alt="Alt Text" width="800" height="413"&gt;
&lt;/li&gt;
&lt;li&gt;You can see all application dependencies by going to the Dependencies tab.
&lt;strong&gt;Note: there is a lag of about 5 minutes when viewing logs on other tabs; only Live Metrics shows logs in real time, so you may need to wait about 5 minutes before analyzing the logs.&lt;/strong&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vat2kiiy3qcdxw4xion.png" alt="Alt Text" width="800" height="420"&gt;
See our custom dependency&lt;/li&gt;
&lt;li&gt;Click on View details -&amp;gt; Investigate performance -&amp;gt; Drill Into and check the details
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj61hg37gutzk7gik4q3u.png" alt="Alt Text" width="800" height="419"&gt;
&lt;/li&gt;
&lt;li&gt;We can see that Application Insights identifies the database service the operation was carried out on, along with the exact SQL queries and the request and response times.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Exception when the DB is not started
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9y3ajwnkfvxtehbwaeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9y3ajwnkfvxtehbwaeh.png" alt="Alt Text" width="800" height="436"&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  When the DB is up and running
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvt3b8omg4mm4fnwlfuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvt3b8omg4mm4fnwlfuj.png" alt="Alt Text" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can even see what happened before this exception occurred by clicking on User Flows.
The User Flows tool starts from an initial page view, custom event, or exception that you specify. Given this initial event, User Flows shows the events that happened before and afterward during user sessions.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz43stobdjldk5y3fyaxm.png" alt="Alt Text" width="800" height="393"&gt;
&lt;/li&gt;
&lt;li&gt;Let's test an external API call.
You can see the request and response times as well as the URL being called.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltfufegw9zto98gmnl7d.png" alt="Alt Text" width="800" height="439"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For client-side statistics, go to Users. Check the number of users who have visited your site and when. You can also filter by events or by properties like country, browser version, and device type.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6ohnnorzr2hme4khrb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6ohnnorzr2hme4khrb8.png" alt="Alt Text" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Log Analytics
&lt;/h2&gt;

&lt;p&gt;All Application Insights data is pushed to Log Analytics, where you can query the logs directly using the Kusto Query Language (KQL). Access it by clicking Logs in the side menu. On the left-hand side of the Log Analytics window, under Tables, you will see the telemetry data being queried.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Let's query the number of exceptions that occurred in the last hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;
&lt;span class="n"&gt;exceptions&lt;/span&gt;
&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedq3ahgwd4nbcdtlej7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedq3ahgwd4nbcdtlej7m.png" alt="Alt Text" width="800" height="442"&gt;&lt;/a&gt;&lt;br&gt;
You can create alert rules for your logs by clicking &lt;em&gt;New Alert Rule&lt;/em&gt;, so that you are alerted when a certain event happens. In my case, I created an alert rule for when there are more than 9 exceptions in 5 minutes.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rg6836pajr4veb1gwtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rg6836pajr4veb1gwtm.png" alt="Alt Text" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Play around with your own custom tweaks. Monitor, Measure and Optimize!&lt;/p&gt;

&lt;p&gt;RESOURCES&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://azuredevopslabs.com/labs/azuredevops/appinsights/" rel="noopener noreferrer"&gt;https://azuredevopslabs.com/labs/azuredevops/appinsights/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=C4G1rRgY9OI" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=C4G1rRgY9OI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-custom-events-metrics#trackdependency" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-custom-events-metrics#trackdependency&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>azure</category>
      <category>csharp</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Setting up a CI-CD Pipeline Using Azure DevOps</title>
      <dc:creator>Daniel Elegberun</dc:creator>
      <pubDate>Sun, 11 Apr 2021 18:52:04 +0000</pubDate>
      <link>https://forem.com/gbengelebs/setting-up-a-ci-cd-pipeline-using-azure-devops-4gb</link>
      <guid>https://forem.com/gbengelebs/setting-up-a-ci-cd-pipeline-using-azure-devops-4gb</guid>
<description>&lt;p&gt;Continuous Integration is the process of merging all code changes into a central repository. Continuous Deployment, on the other hand, is the practice of automatically deploying code changes that have passed predefined checks to a specified environment. It uses a set of automated tools to run tests that ensure the codebase is stable and deployable. This approach helps deliver faster, more efficient deployments.&lt;/p&gt;

&lt;p&gt;Azure Pipelines is a fantastic tool for setting up CI-CD pipelines; in this article, I will walk through how to set up a CI-CD pipeline using Azure DevOps Pipelines. This is a follow-up to my previous article, &lt;a href="https://dev.to/gbengelebs/deploying-an-asp-net-webapi-and-mysql-database-container-to-azure-kubernetes-service-3ca"&gt;Deploying An Asp.Net WebApi and MySql DataBase Container to Azure Kubernetes Service&lt;/a&gt;, and I will be referring to some information discussed there. If you are already familiar with Azure Kubernetes deployments, you can continue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Azure DevOps &lt;a href="https://azure.microsoft.com/en-us/services/devops/pipelines/" rel="noopener noreferrer"&gt;Login&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;An Azure account. You can create one for free &lt;a href="https://portal.azure.com/free" rel="noopener noreferrer"&gt;here&lt;/a&gt;. 
&lt;em&gt;P.S. you get $200 in free Azure credits and 12 months of pay-as-you-go services.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Docker. Download &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;VSCODE. Download &lt;a href="https://code.visualstudio.com/download" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my previous article, I discussed how to deploy a .NET Core web API and MySQL database to Azure Kubernetes Service. Here I will go one step further by automating the deployment process with an Azure pipeline. The tool checks for changes in the git repository and automates the build and deployment process. I could also add a series of checks within the pipeline, like approvals and tests, but for this article I will just set up a basic build-and-deploy CI-CD pipeline.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2For4tcgop7thnru2mz5e6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2For4tcgop7thnru2mz5e6.jpeg" alt="Azure Pipeline" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Download the source code from GitHub&lt;/strong&gt; &lt;a href="https://github.com/GbengaElebsDev/AksTutorial.git" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;2. Create a repository for the project and push your code to GitHub&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. If you followed the previous article, revert the manual steps by deleting the services and pod deployments and the images we pushed to Azure Container Registry.&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deleting Container Images
To delete a container image, log in to the Azure portal -&amp;gt; Container Registry -&amp;gt; Repositories and click the three-dot (ellipsis) menu.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jfb23omg3daq7kb1ohk.png" alt="Deleting Images" width="800" height="171"&gt;
&lt;/li&gt;
&lt;li&gt;Deleting Kubernetes Services and Pods
Run this command to get your current Kubernetes context
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl config get-contexts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Ensure you are in your Azure Kubernetes context.&lt;br&gt;
Then delete all the resources in that context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl delete daemonsets,replicasets,services,deployments,pods,rc,pv,pvc --all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Check that everything has been deleted:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods

No resources found in the default namespace.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;4. The deployment YAML files are in a folder called manifest&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is to ensure that the pipeline can refer to all the deployments in one folder.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Login to Azure Pipelines&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://azure.microsoft.com/en-us/services/devops/pipelines/" rel="noopener noreferrer"&gt;Azure Pipelines&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to pipelines -&amp;gt; Create Pipeline
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5n4o5ofzli3ggudlpt8.png" alt="Azure Pipelines" width="800" height="230"&gt;
&lt;/li&gt;
&lt;li&gt;Connect to your GitHub and Authenticate&lt;/li&gt;
&lt;li&gt;Select your GitHub Project&lt;/li&gt;
&lt;li&gt;Select Deploy to Azure Kubernetes Service&lt;/li&gt;
&lt;li&gt;Login to your azure portal from the app&lt;/li&gt;
&lt;li&gt;Fill in your details like container registry and image name
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faievc8253o2oza72zzp7.png" alt="Alt Text" width="800" height="436"&gt;
&lt;/li&gt;
&lt;li&gt;A YAML file will be automatically generated.&lt;/li&gt;
&lt;li&gt;Edit the contents of the YAML file with this gist
&lt;strong&gt;Remember to replace the environment variables with your own details and credentials&lt;/strong&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reviewing the YAML file
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;trigger&lt;/strong&gt;: This command tells the pipeline to listen for changes pushed to the main branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;variables&lt;/strong&gt;: Stores reusable entries, such as the &lt;strong&gt;containerRegistry&lt;/strong&gt; and secrets that you want to reuse in multiple places in the pipeline.&lt;br&gt;
I define a &lt;strong&gt;mysqltag&lt;/strong&gt; variable to specify the MySQL version I want created.&lt;/p&gt;
&lt;h3&gt;
  
  
  Build Stage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stages&lt;/strong&gt;: are the major divisions in a pipeline: "build this app", "run these tests", and "deploy to pre-production" are good examples of stages. In our case, we have only "build" and "deploy".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job&lt;/strong&gt;: A deployment job is a collection of steps that are run sequentially against the environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pool&lt;/strong&gt;: The agent pool on which your Azure pipeline carries out its operations. By default, Azure assigns a Microsoft-hosted agent to each pipeline, but you can create your own. In my case, the pool uses the "ubuntu-latest" image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;steps&lt;/strong&gt;: The build stage has a &lt;strong&gt;Docker@2&lt;/strong&gt; task, which performs two actions (Build &amp;amp; Push).&lt;br&gt;
It builds the service and then pushes it to my Azure Container Registry using the environment variables as defined. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We add two build-and-push steps because we have two images to build: the MySQL image and the API image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stages have jobs -&amp;gt; each job has steps -&amp;gt; each step has a task assigned to it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;publish&lt;/strong&gt;: This command tells the pipeline to find the manifest folder and create a pipeline artifact called manifest from the resources in that folder.&lt;/p&gt;
&lt;h3&gt;
  
  
  Deploy Stage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;dependsOn&lt;/strong&gt;: This stage depends on the build stage, meaning the build stage has to complete before this one can run.&lt;br&gt;
&lt;strong&gt;strategy&lt;/strong&gt;: How you want to deploy. &lt;strong&gt;runOnce&lt;/strong&gt; means all the steps run sequentially. &lt;/p&gt;

&lt;p&gt;In the task section, there is a task: &lt;strong&gt;KubernetesManifest@0&lt;/strong&gt;.&lt;br&gt;
The Kubernetes manifest task tells the pipeline to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download the files from the manifest artifact
&lt;/li&gt;
&lt;li&gt;Apply the manifests to the Kubernetes cluster, deploying the images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with the build section, we define two deployment tasks because we have two images.&lt;/p&gt;
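&lt;p&gt;Since the embedded gist may not render in every reader, here is a rough skeleton of what the pipeline YAML looks like; the registry, image, and artifact names are placeholders, so replace them with your own details:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trigger:
- main                              # run on pushes to the main branch

variables:
  containerRegistry: 'myregistry.azurecr.io'
  mysqltag: '8.0'                   # MySQL version to build

stages:
- stage: Build
  jobs:
  - job: Build
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: Docker@2                # build and push the API image
      inputs:
        command: buildAndPush
        repository: test-api
    - task: Docker@2                # build and push the MySQL image
      inputs:
        command: buildAndPush
        repository: mysql
    - publish: manifests            # publish the manifest folder as an artifact
      artifact: manifest

- stage: Deploy
  dependsOn: Build                  # build must complete first
  jobs:
  - deployment: Deploy
    pool:
      vmImage: 'ubuntu-latest'
    strategy:
      runOnce:                      # all steps run sequentially, once
        deploy:
          steps:
          - task: KubernetesManifest@0   # apply the manifests to the cluster
            inputs:
              action: deploy
              manifests: $(Pipeline.Workspace)/manifest/*.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;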

&lt;p&gt;Hit Save and run, and commit directly to the master branch. This stores the YAML file in your GitHub repo, where it can be modified as needed.&lt;/p&gt;

&lt;p&gt;You should see this stage stating that the build has begun&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos6hrnbw2vg305lfl7z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos6hrnbw2vg305lfl7z3.png" alt="BuildStage" width="800" height="325"&gt;&lt;/a&gt;&lt;br&gt;
-------------------------ERROR---------------------------&lt;br&gt;
&lt;strong&gt;This agent request is not running because you have reached the maximum number of requests that can run for parallelism type 'Microsoft-hosted Private'. Current position in queue: 1&lt;/strong&gt;&lt;br&gt;
-------------------------ERROR---------------------------&lt;br&gt;
You will probably run into this error. The root cause is that, in a recent update, Microsoft restricted the free Microsoft-hosted agents for public and private projects in new Azure DevOps organizations. In short, Microsoft has temporarily restricted hosted agents from running your jobs.&lt;/p&gt;
&lt;h3&gt;
  
  
  To resolve this for a private project
&lt;/h3&gt;

&lt;p&gt;You can send an email to &lt;a href="mailto:azpipelines-freetier@microsoft.com"&gt;azpipelines-freetier@microsoft.com&lt;/a&gt; to request your free tier, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your name&lt;/li&gt;
&lt;li&gt;Name of the Azure DevOps organization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  To resolve this for a public project
&lt;/h3&gt;

&lt;p&gt;You can send an email to &lt;a href="mailto:azpipelines-ossgrant@microsoft.com"&gt;azpipelines-ossgrant@microsoft.com&lt;/a&gt; to request the free grant, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your name&lt;/li&gt;
&lt;li&gt;Azure DevOps organization for which you are requesting the free grant&lt;/li&gt;
&lt;li&gt;Links to the repositories that you plan to build&lt;/li&gt;
&lt;li&gt;Brief description of your project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are impatient like me, watch this:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/psa8xfJ0-zI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The video above, Self-Hosted Private Agent on Linux (Ubuntu) for Azure Pipelines, shows how to create your own self-hosted agents and link them to run your pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After you create the agent, add it to the project&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sp4b6vdkcth945ntdzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sp4b6vdkcth945ntdzr.png" alt="Create Agent" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the agent pool name and edit this section of the YAML with it so that your agent runs the pipeline.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe85rbnvqasofcazfmp4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe85rbnvqasofcazfmp4o.png" alt="Alt Text" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Docker on your self-hosted agent using the first step in this guide: &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04" rel="noopener noreferrer"&gt;How To Install and Use Docker on Ubuntu 18.04&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cd into your agent's directory and restart the agent.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, all our build processes run just as before, but now everything runs automatically in the pipeline: once there is a push to the main branch, the pipeline triggers a build and deploys it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos6hrnbw2vg305lfl7z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos6hrnbw2vg305lfl7z3.png" alt="BuildStage" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3aukzi4ei2c6uu3bsxlm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3aukzi4ei2c6uu3bsxlm.png" alt="Deploy Stage" width="800" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The build and deploy stages have been completed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzfrfvavoyp6g91en28q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzfrfvavoyp6g91en28q.png" alt="build and Deploy Completed" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to pipelines -&amp;gt; Runs -&amp;gt; View Environment -&amp;gt; Resources -&amp;gt;Services&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvzrv33jg73nn8lljanp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvzrv33jg73nn8lljanp.png" alt="Alt Text" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fashxx9m4dvkk20oxmgsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fashxx9m4dvkk20oxmgsq.png" alt="Alt Text" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate through the deployment and explore the resources created.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the Test-API service External IP&lt;br&gt;
Copy the external IP and navigate to the URL&lt;br&gt;
(&lt;a href="http://external-ip:8080/swagger/index.html" rel="noopener noreferrer"&gt;http://external-ip:8080/swagger/index.html&lt;/a&gt;)&lt;br&gt;
Mine is (&lt;a href="http://20.62.158.83:8080/swagger/index.html" rel="noopener noreferrer"&gt;http://20.62.158.83:8080/swagger/index.html&lt;/a&gt;). Test the API deployment using the swagger page as shown. Our containerized API is now running inside the Azure Kubernetes service and exposed to the internet via the Load Balancer Service. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujvq49qqnzfzvhtzo7cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujvq49qqnzfzvhtzo7cc.png" alt="Alt Text" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have successfully set up an azure pipeline to build and deploy our containerized application to both Azure Container Registry and Azure Kubernetes Service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Add the Azure Pipelines build status badge to the Git repo
&lt;/h2&gt;

&lt;p&gt;In Azure DevOps, select the pipeline -&amp;gt; Runs -&amp;gt; click the three-dot (ellipsis) menu&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlpltf373qbuud584q1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlpltf373qbuud584q1w.png" alt="Alt Text" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Status Badge -&amp;gt; Copy the Sample markdown.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb4w9qgpixupa0a9stwg.png" alt="Alt Text" width="800" height="811"&gt;
&lt;/li&gt;
&lt;li&gt;Create a README.md file in your GitHub repo and paste the markdown there.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvw3g42jfolueopviav1.png" alt="Alt Text" width="800" height="206"&gt;
&lt;/li&gt;
&lt;/ul&gt;
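&lt;p&gt;The copied markdown has this general shape (the organization, project, pipeline name, and definition id below are placeholders for your own Azure DevOps details):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[![Build Status](https://dev.azure.com/&amp;lt;org&amp;gt;/&amp;lt;project&amp;gt;/_apis/build/status/&amp;lt;pipeline-name&amp;gt;?branchName=master)](https://dev.azure.com/&amp;lt;org&amp;gt;/&amp;lt;project&amp;gt;/_build/latest?definitionId=&amp;lt;id&amp;gt;&amp;amp;branchName=master)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;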

&lt;p&gt;Follow me here and across my social media for more content like this: &lt;a href="https://twitter.com/ElegberunDaniel?s=09" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; &lt;a href="https://www.linkedin.com/in/olugbenga-elegberun/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azureapril</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
