<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kshitij (kd)</title>
    <description>The latest articles on Forem by Kshitij (kd) (@dhingrachief).</description>
    <link>https://forem.com/dhingrachief</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F630195%2F5afea96c-8c73-433c-bfa9-a87b1c39ca55.jpg</url>
      <title>Forem: Kshitij (kd)</title>
      <link>https://forem.com/dhingrachief</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dhingrachief"/>
    <language>en</language>
    <item>
      <title>Abstract to Go: Lets create our own Ansible (Part 2)</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 13 Nov 2023 16:36:32 +0000</pubDate>
      <link>https://forem.com/dhingrachief/abstract-to-go-lets-create-our-own-ansible-part-2-33o3</link>
      <guid>https://forem.com/dhingrachief/abstract-to-go-lets-create-our-own-ansible-part-2-33o3</guid>
      <description>&lt;p&gt;In the previous &lt;a href="https://dev.to/dhingrachief/abstract-to-go-lets-create-our-own-ansible-o8n"&gt;post&lt;/a&gt; I shared what ansible is, how our code would look when it comes to parsing the information from the playbook and host, and how when we have code converted into commands, we can execute them on the servers and return the response.&lt;/p&gt;

&lt;p&gt;In this article, we will take a look at what strategies can be used for execution. The conversion of data from a YAML file into commands is not a part of this blog, but you can check the code &lt;a href="https://github.com/kdsama/gansible/tree/main/internal/modules"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategies
&lt;/h2&gt;

&lt;p&gt;There are two strategies to choose from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear &lt;/li&gt;
&lt;li&gt;Free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is a common denominator for both strategies: MaxConcurrency. It is not a good idea to spin up one goroutine per server when the playbook targets, say, 100 servers, so we cap how many run at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linear Strategy
&lt;/h3&gt;

&lt;p&gt;This is the default strategy. Let's say you have five tasks that are supposed to run on three machines. In this strategy, the next task starts only after the previous task has completed on all the servers. If a task fails on any server, we may or may not proceed with the next task for that server, depending on the metadata provided for the task (skip_errors or not).&lt;br&gt;
Again, if the number of hosts exceeds the maximum concurrency number, we will split our execution into batches and run these batches sequentially. Each batch contains a bounded number of hosts, which run the tasks in parallel. We could do something similar using &lt;a href="https://dev.to/dhingrachief/resilient-systems-using-go-semaphores-2h82"&gt;semaphores&lt;/a&gt; as well, but we will keep it simple here.&lt;br&gt;
This is what the flow looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse all the tasks into commands &lt;/li&gt;
&lt;li&gt;Create batches of hosts &lt;/li&gt;
&lt;li&gt;For each Batch 

&lt;ul&gt;
&lt;li&gt;Run tasks on each host in parallel &lt;/li&gt;
&lt;li&gt;Wait until all the tasks are finished &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Proceed to the next task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the waiting part, we can simply use the &lt;a href="https://pkg.go.dev/sync#WaitGroup:~:text=A%20WaitGroup%20waits%20for%20a%20collection%20of%20goroutines%20to%20finish.%20The%20main%20goroutine%20calls%20Add%20to%20set%20the%20number%20of%20goroutines%20to%20wait%20for.%20Then%20each%20of%20the%20goroutines%20runs%20and%20calls%20Done%20when%20finished.%20At%20the%20same%20time%2C%20Wait%20can%20be%20used%20to%20block%20until%20all%20goroutines%20have%20finished."&gt;waitgroup&lt;/a&gt;&lt;/p&gt;
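&lt;p&gt;As a minimal, self-contained sketch of that waiting pattern (the names here are illustrative, not from the project):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBatch fans a task out to every host in parallel and blocks until
// all of them are done, mirroring the Add/Done/Wait pattern.
func runBatch(hosts []string) int {
	var wg sync.WaitGroup
	var done atomic.Int64
	wg.Add(len(hosts)) // one Done expected per host
	for _, h := range hosts {
		h := h // capture the loop variable (needed before Go 1.22)
		go func() {
			defer wg.Done()
			_ = h // the real code would run the ssh commands for h here
			done.Add(1)
		}()
	}
	wg.Wait() // block until every goroutine has called Done
	return int(done.Load())
}

func main() {
	fmt.Println(runBatch([]string{"server1", "server2", "server3"}))
}
```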

&lt;p&gt;This is what the code may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (e *Engine) LinearStrategy(respObj PlayDoc) {

    opts := []ExecOutput{}
    for k := 0; k &amp;lt; len(respObj.hosts)/e.maxConcurrent; k += e.maxConcurrent {
        start, end := k*e.maxConcurrent, ((k + 1) * e.maxConcurrent)
        if end &amp;gt; len(respObj.hosts) {
            end = len(respObj.hosts)
        }
        for _, t := range respObj.tasks {
            e.wg.Add(len(respObj.hosts))

            for _, h := range respObj.hosts[start:end] {
                h := h
                t := t
                if !e.sameOS(t, h) {
                    continue
                }
                go func() {
                    // Executing the ssh commands for each server
                    defer e.wg.Done()
                    for _, c := range t.cmds {
                        res, err := e.sshService.execute(h, c)

                        if err != nil {

                            continue
                        }
                        // Checking if there is an error and flag for skipping error is false.
                        if strings.Trim(res.Err, " ") != "" &amp;amp;&amp;amp; !t.skip_errors {
                            break
                        }
                        opts = append(opts, res)
                    }

                }()
            }
            e.wg.Wait()
        }

    }

    fmt.Println(opts)

}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Free Strategy
&lt;/h3&gt;

&lt;p&gt;This is where all the hosts run the tasks in parallel. We wait until every host has finished all of its tasks before we proceed. &lt;br&gt;
The free strategy is faster than the linear strategy, as we don't have to wait after each task. Also, some servers may have better network bandwidth or more compute, yet under the linear strategy they would still have to wait for the task to complete on the other servers. &lt;br&gt;
This is what the code may look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (e *Engine) FreeStrategy(respObj PlayDoc) {

    e.wg.Add(len(respObj.hosts))
    opts := []ExecOutput{}
    for _, h := range respObj.hosts {
        h := h
        go func() {
            defer e.wg.Done()
            for _, t := range respObj.tasks {
                h := h
                if !e.sameOS(t, h) {
                    continue
                }

                for _, c := range t.cmds {
                    res, err := e.sshService.execute(h, c)
                    fmt.Println("Response is ", res)
                    if err != nil {
                        continue
                    }
                    if strings.Trim(res.Err, " ") != "" &amp;amp;&amp;amp; !t.skip_errors {
                        break
                    }
                    opts = append(opts, res)

                }

            }
        }()
    }

    e.wg.Wait()
    fmt.Println(opts)

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we are done. So we have covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing tasks&lt;/li&gt;
&lt;li&gt;Executing a bunch of tasks using different strategies&lt;/li&gt;
&lt;li&gt;Executing ssh commands on remote hosts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parsing of the inventory file is not covered here, but one thing is worth mentioning: while parsing the inventory file, we need to make sure there is no cycle, i.e. when grouping different hosts, or nesting groups inside groups, one may create a cycle. So we need to check for cycles before executing the tasks. This can be done using graph algorithms such as depth-first search or breadth-first search. You can check how I have validated the inventory data for this project &lt;a href="https://github.com/kdsama/gansible/blob/main/internal/hosts.go"&gt;here&lt;/a&gt;.&lt;/p&gt;
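&lt;p&gt;A cycle check over the group graph could be sketched like this. I am assuming a simplified inventory shape (a map from group name to child groups); the real parsing lives in the linked file:&lt;/p&gt;

```go
package main

import "fmt"

// hasCycle reports whether the group graph contains a cycle, using a
// three-color depth-first search. The inventory shape here (group name
// to child groups) is a simplification for illustration.
func hasCycle(groups map[string][]string) bool {
	const (
		unseen = iota
		visiting
		finished
	)
	state := map[string]int{}
	var dfs func(g string) bool
	dfs = func(g string) bool {
		if state[g] == visiting {
			return true // back edge found: this is a cycle
		}
		if state[g] == finished {
			return false
		}
		state[g] = visiting
		for _, child := range groups[g] {
			if dfs(child) {
				return true
			}
		}
		state[g] = finished
		return false
	}
	for g := range groups {
		if dfs(g) {
			return true
		}
	}
	return false
}

func main() {
	ok := map[string][]string{"all": {"web", "db"}, "web": {"db"}}
	bad := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(hasCycle(ok), hasCycle(bad))
}
```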

&lt;p&gt;And that's it. We have a project that resembles Ansible. It can run several tasks in parallel on a bunch of machines.&lt;/p&gt;

</description>
      <category>go</category>
      <category>ansible</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Abstract to Go: Lets create a hot reloader</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 06 Nov 2023 18:01:26 +0000</pubDate>
      <link>https://forem.com/dhingrachief/abstract-to-go-lets-create-a-hot-loader-48ij</link>
      <guid>https://forem.com/dhingrachief/abstract-to-go-lets-create-a-hot-loader-48ij</guid>
      <description>&lt;p&gt;Being part of a couple Go communities, there is one question that is asked pretty frequently: What's the best hot reloader for Go? Hot reloading automatically detects changes in code and restarts the application. So there is no need to go to the terminal to build and run the programme again and again.&lt;/p&gt;

&lt;p&gt;So this time, instead of finding out the best hot reload app for Go, we will create a simple hot reloader that just does exactly what we want: Restart on each update in any Go file in the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;p&gt;To have a working hot reload software, we need an application that is able to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watch files for changes &lt;/li&gt;
&lt;li&gt;Execute the go program &lt;/li&gt;
&lt;li&gt;Kill the existing program if there is any change, and start a new one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So from the command line, we would need to know the target, i.e. the file we need to run, as well as which directory it is in. This information is important, as we don't want our software to watch unnecessary folders for updates.&lt;br&gt;
We will be using the exec package to start and restart the application.&lt;br&gt;
To watch the files for changes, I am going to use the &lt;a href="https://pkg.go.dev/github.com/fsnotify/fsnotify"&gt;fsnotify&lt;/a&gt; package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Watcher struct {
    directory  string
    command    string
    w          *fsnotify.Watcher
    cmd        *exec.Cmd
    lastUpdate time.Time

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start/Restart the Application
&lt;/h3&gt;

&lt;p&gt;Whenever we execute a command using the Start method from the exec package, a process ID gets attached, which we can use as a reference when we kill the program and start it again. &lt;br&gt;
This process ID corresponds to the command that we are executing on the shell. It won't work if the command we are executing itself creates a child process. &lt;/p&gt;

&lt;p&gt;So "go run" won't work as it creates a child process. Instead, we will build an executable and run it.&lt;/p&gt;

&lt;p&gt;So whenever the method is called, we will just check if an instance is already running. If it is, we will kill that instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (wg *Watcher) startCommand() {
    cmdArgs := strings.Split(wg.command, " ")
// If the instance is running
    if wg.cmd != nil {
// Kill Process
        wg.cmd.Process.Kill()

    }
// build the executable and call it ff
    cmd := exec.Command("go", "build", "-o", "./ff", cmdArgs[0])
    cmd.Dir = wg.directory
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
// Run the build command
    cmd.Start()


    wg.cmd = exec.Command("./ff")
    wg.cmd.Dir = wg.directory
    wg.cmd.Stdout = os.Stdout
    wg.cmd.Stderr = os.Stderr
    wg.lastUpdate = time.Now()

// Run the executable
    err := wg.cmd.Start()
    if err != nil {
        fmt.Println("Process Killed", err)
    }

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Watch Events
&lt;/h3&gt;

&lt;p&gt;We need our application to restart whenever there is a write to any file with the ".go" extension.&lt;br&gt;
On such an event, we call the startCommand method, which starts/restarts our application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Start an event loop to handle events
    for {
        select {
        case event, ok := &amp;lt;-wr.w.Events:
            if !ok {
                return
            }
            if event.Op == fsnotify.Write {

                // if event.Op
                f := strings.Split(event.Name, ".")

                if f[len(f)-1] == "go" {

                    // if time.Since(wr.lastUpdate) &amp;gt; 1*time.Second {
                    wr.startCommand()
                    // }

                }

            }

        case err, ok := &amp;lt;-wr.w.Errors:
            if !ok {
                return
            }
            log.Printf("Error: %s\n", err)
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that would be enough for a minimal hot reloader. I am taking two input parameters: the working directory and the filename to be executed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go run main.go -d=../book_five --file="main.go"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
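&lt;p&gt;Those two inputs can be parsed with the standard flag package. A sketch, where the flag names match the invocation above but the defaults are my own assumption:&lt;/p&gt;

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs mirrors the reloader's two inputs: the directory to watch
// and the file to execute. Illustrative, not the project's exact code.
func parseArgs(args []string) (dir, file string) {
	fs := flag.NewFlagSet("golo", flag.ContinueOnError)
	d := fs.String("d", ".", "directory to watch")
	f := fs.String("file", "main.go", "file to run")
	if err := fs.Parse(args); err != nil {
		return ".", "main.go" // fall back to the defaults
	}
	return *d, *f
}

func main() {
	// The flag package accepts both -file and --file forms.
	dir, file := parseArgs([]string{"-d=../book_five", "--file=main.go"})
	fmt.Println(dir, file)
}
```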



&lt;h2&gt;
  
  
  Set the Hot Reload Software
&lt;/h2&gt;

&lt;p&gt;Now we wouldn't want to run the reloader from inside its own project every time. It's better to build a binary of the program and either set an alias for it or move it to /usr/local/bin/, so that we can reference our hot reloader directly. This works on Linux and should work on macOS as well.&lt;br&gt;
Let's name our hot reload executable &lt;strong&gt;golo&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go build -o ./golo
sudo mv ./golo /usr/local/bin/golo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it. Now go to the working directory of your Go application and run the command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;golo -d= ./ --file=main.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The source code can be found &lt;a href="https://github.com/kdsama/goloader"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>development</category>
      <category>tooling</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Abstract to Go: Lets create our own Ansible (Part 1)</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 30 Oct 2023 22:16:37 +0000</pubDate>
      <link>https://forem.com/dhingrachief/abstract-to-go-lets-create-our-own-ansible-o8n</link>
      <guid>https://forem.com/dhingrachief/abstract-to-go-lets-create-our-own-ansible-o8n</guid>
      <description>&lt;p&gt;If you like automation as much as I do, you must have spent hours automating tasks that probably take 5 minutes to do manually. Things get interesting when it comes to infrastructure automation. Tools like Ansible are used to make changes to several servers at the same time without logging into them. In this article, we will look into how Ansible usually works and then convert that abstract information into code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Ansible
&lt;/h2&gt;

&lt;p&gt;Ansible is an agent-less automation tool that can perform a wide range of tasks, such as deploying code, updating systems, and provisioning infrastructure. Agent-less means you don't have to install any additional software on your servers to make Ansible work. Behind all that abstraction, it uses SSH to execute commands.&lt;/p&gt;

&lt;p&gt;It's also important to know that most workflows using Ansible are designed to be idempotent. That means if you run the same Ansible script multiple times, such as one responsible for installing specific packages, those installations will typically occur just once.&lt;/p&gt;
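&lt;p&gt;For instance, a task that creates a user can be made idempotent by guarding the command. A toy sketch of the idea; the real Ansible modules are far more thorough:&lt;/p&gt;

```go
package main

import "fmt"

// useraddCmd builds a shell command that creates the user only if it
// does not already exist, so running it twice changes nothing.
func useraddCmd(name string) string {
	return fmt.Sprintf("id -u %s || useradd %s", name, name)
}

func main() {
	fmt.Println(useraddCmd("yourusername"))
}
```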

&lt;p&gt;For the sake of this article, we will focus on two significant components of Ansible:&lt;/p&gt;

&lt;h3&gt;
  
  
  Inventory File
&lt;/h3&gt;

&lt;p&gt;The inventory file is where all the information about the servers is stored. You can also group servers based on your needs. For example, you might want to run updates on all the backend servers while leaving the database servers as they are. Here's an example of how an inventory file may look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;all:  
  hosts:
    server1:  
      ansible_host: sv1.server.com  
      ansible_user: root  
      ansible_ssh_pass: Passw0rd  

    server2:  
      ansible_host: sv2.server.com  
      ansible_user: root  
      ansible_ssh_pass: Passw0rd  

    server3:  
      ansible_host: sv3.server.com  
      ansible_user: root  
      ansible_ssh_pass: Passw0rd  

    server4:  
      ansible_host: sv4.server.com  
      ansible_user: root  
      ansible_ssh_pass: Passw0rd  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inventory file is in INI format by default, but Ansible can also accept a YAML file as input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Playbook
&lt;/h3&gt;

&lt;p&gt;This contains execution information like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What tasks to run &lt;/li&gt;
&lt;li&gt;Where to run the tasks &lt;/li&gt;
&lt;li&gt;How to run the tasks (Strategy)&lt;/li&gt;
&lt;li&gt;Maximum number of hosts that are to be run at a time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of how a playbook file may look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
- name: example playbook
  hosts: server1,server2,server3,server4
  strategy: free
  tasks:
    - name: Create a group
      group:
        name: yourgroup
        state: present  
      skip_errors: true

    - name: Create a user
      user:
        name: yourusername
        password: yourpassword  
        groups: yourgroup    
        state: present       

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The playbook suggests that two tasks should be run on servers 1 to 4, using a free strategy. By default, Ansible uses a linear strategy, meaning all servers run the first task, then the second, and so on. A free strategy allows all servers to run tasks concurrently, and information about the execution is collected at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;p&gt;So these are the main things we would need to do to make our Ansible-like application work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse the inventory and the playbook files&lt;/li&gt;
&lt;li&gt;Run ssh commands on multiple servers at the same time &lt;/li&gt;
&lt;li&gt;Implement different strategies on how to run the tasks from playbook &lt;/li&gt;
&lt;li&gt;Ignore errors from commands if explicitly mentioned in the playbook &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SSH-Client
&lt;/h3&gt;

&lt;p&gt;So in the end, all the tasks in the playbook should be converted into commands that we run on the server's shell. We would like to capture both the error and the output.&lt;/p&gt;

&lt;p&gt;It is also important for us to know the operating system, because there is a possibility that among the set of hosts, there are a few servers that cannot run the command because they have a different operating system. So instead of trying to run these commands on the server, we should just skip the execution altogether.&lt;br&gt;
We also do not want to reconnect to the server for each task.&lt;br&gt;
This calls for a data structure that holds the login details of the SSH client.&lt;br&gt;
This is what the structure may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type sshConn struct {
    host   string
    os     string
    user   string
    pw     string
    pkey   string
    client *ssh.Client
    port   int

}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will have an execution method that will run a command, capture its output into a structure, and return it. This is what it may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (sc *sshConn) execute(cmd string) ExecOutput {
    ll := make([]byte, 0)
    mm := make([]byte, 0)
    sshOut := bytes.NewBuffer(ll)
    sshErr := bytes.NewBuffer(mm)

    session, err := sc.client.NewSession()
    if err != nil {
        log.Fatal(err)
    }

    defer session.Close()

    session.Stdout = sshOut
    session.Stderr = sshErr

    session.Run(cmd)
    co := ExecOutput{
        Out: sshOut.String(),
        Err: sshErr.String(),
        Cmd: cmd,
    }

    return co
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tasks
&lt;/h3&gt;

&lt;p&gt;Tasks in Ansible playbooks can have various structures. To handle this variability, it's efficient to use a map-based approach for task processing. This method involves parsing tasks as maps and then iterating through the keys to determine the type of task and how to handle it. &lt;/p&gt;

&lt;p&gt;This is what it may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func parseTask(task map[string]interface{}) (*Task, error) {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    // TODO: Work the other task level variables that may be present
    var result = &amp;amp;Task{}
    result.os = "any"
    for key := range task {
        switch key {
        // case "copy":
        //  res = modules.NewCopy(task[key].(map[string]interface{}))
        case LineinfileMod:
            cmds, err := modules.NewLineInFile(task[key].(map[string]interface{}))
            if err != nil {
                return result, err
            }
            result.cmds = cmds

        case fileMod:
            cmds, err := modules.NewFilePermissions(task[key].(map[string]interface{}))
            if err != nil {
                return result, err
            }
            result.cmds = cmds

        case userMod:
            cmds, err := modules.NewUser(task[key].(map[string]interface{}))
            if err != nil {
                return result, err
            }
            result.cmds = cmds
        case shellMod:
            cmds, err := modules.NewShell(task[key].(map[string]interface{}))
            if err != nil {
                return result, err
            }
            result.cmds = cmds
        case "skip_errors":
            result.skip_errors = true
        case "name":
            result.name = task[key].(string)
        default:
            fmt.Println("unknown key:", key)
        }
    }
    return result, nil
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see that we have methods for shell tasks, user and group manipulation, as well as lineinfile, which is used to add lines to existing files or check whether a line is present in a file.&lt;br&gt;
The implementation can be found &lt;a href="https://github.com/kdsama/gansible"&gt;here&lt;/a&gt;. &lt;/p&gt;
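&lt;p&gt;To make the map-based dispatch concrete, here is a toy version that only understands a shell module; the shapes and names are illustrative, not the project's real modules:&lt;/p&gt;

```go
package main

import "fmt"

// commandsFor shows the dispatch idea: the keys of the parsed task map
// decide which module turns the task into shell commands.
func commandsFor(task map[string]interface{}) []string {
	cmds := []string{}
	for key, val := range task {
		switch key {
		case "shell":
			// the shell module passes the command through untouched
			cmds = append(cmds, val.(string))
		case "name":
			// task metadata, nothing to execute
		}
	}
	return cmds
}

func main() {
	task := map[string]interface{}{
		"name":  "say hello",
		"shell": "echo hello",
	}
	fmt.Println(commandsFor(task))
}
```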

&lt;p&gt;In the next article, we will see how to run all the tasks together, using different strategies. &lt;/p&gt;

</description>
      <category>ansible</category>
      <category>go</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>Abstract to Go: Quad Trees</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Sun, 22 Oct 2023 22:07:03 +0000</pubDate>
      <link>https://forem.com/dhingrachief/abstract-to-go-quad-trees-44f7</link>
      <guid>https://forem.com/dhingrachief/abstract-to-go-quad-trees-44f7</guid>
      <description>&lt;p&gt;While going through important concepts for system design interviews, I came across quad trees, a data structure with numerous applications. But its always better to know the problem it solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;A basic understanding of trees and tree traversal will help you follow the code. &lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Let's talk about a 2-dimensional multiplayer game. We have a hilly landscape with two tanks on opposite sides, trying to destroy each other. For every miss, the bomb will probably explode on land, creating a crater. Let's assume for now that the blast is circular in nature. If we don't have to worry about the graphics, how do we change the state of the area to represent an empty space that used to be land?&lt;/p&gt;

&lt;p&gt;Let's take another example. You have to create an application that tells you how many people live within a 5 km radius of you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quad Tree
&lt;/h2&gt;

&lt;p&gt;In such situations, data structures like quad trees perform really well. Both the map and the 2D tank game can be visualised as a matrix of size n.&lt;/p&gt;

&lt;p&gt;The idea is pretty simple. Calculate the total value for the whole matrix. In our case, let's say the world population is 8 billion. So we have a node with a value of 8 billion. Now divide the matrix (or map) into four nodes: North West, North East, South West, and South East. Calculate the population of all these nodes. Connect these nodes to the root. What you did on the root, now do for each of the nodes.&lt;br&gt;
Keep doing this until you find a depth that is appropriate for the use case. Below is the diagram that will help you visualise the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WneC7XKn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwsrc8mkry0bkqeb959e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WneC7XKn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwsrc8mkry0bkqeb959e.jpg" alt="Quad Tree for World Map" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, for the tank game with explosives, we can divide the 2D scenery into smaller segments, up to a depth that may be equal to the minimum blast radius of a missile in the game. The smallest cell (from the maximum depth) can be marked as equal to 1. All the non-land areas can be marked zero. The higher depth of a quad tree will lead to more realistic explosions. Below is an example of the same. Each cell here can hold a value of 4. There is another depth level that is not shown for clarity. Each of those cells will have a value of 1. You can see that around the border between the land and non-land regions, the boxes have values of 1, 2, and 3, meaning that in those boxes there are 1, 2, or 3 land segments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iKyaaMNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/effzvrhkv55ltxcb7t2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iKyaaMNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/effzvrhkv55ltxcb7t2f.jpg" alt="Quad Tree for the tank game" width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Now this is the easy part. Let's skip the part where we have to map these problems onto a 2D matrix.&lt;/p&gt;
&lt;h3&gt;
  
  
  What should we be able to do with this quad tree?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Find Regions: Get all the areas with a population of at least X&lt;/li&gt;
&lt;li&gt;Add: As the population regularly increases, the values in the quad tree should be updated as well. The same can be used for the other example. The only difference would be that we would pass negative values instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quad Tree Structure
&lt;/h3&gt;

&lt;p&gt;The structure, in many ways, will be similar to a tree.  In each node, we have to save the coordinates, the value of the quadrant, and the depth level.&lt;br&gt;
For the coordinates, we need to know the minimum and maximum of the row and column values. &lt;br&gt;
This is how the structure may look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type QuadTree struct {
    lvl      int
    val      int
    x        [2]int // start and end index of matrix row
    y        [2]int // start and end index of matrix column
    children []*QuadTree
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Initialising the Quad Tree
&lt;/h3&gt;

&lt;p&gt;So we already have all the information laid out on a 2D matrix. &lt;br&gt;
These are the steps we would have to follow : &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate the sum of all cells in the matrix&lt;/li&gt;
&lt;li&gt;Create a node and set the sum. Set the depth level&lt;/li&gt;
&lt;li&gt;Split the matrix into four quadrants&lt;/li&gt;
&lt;li&gt;Repeat steps 1 to 3 for each quadrant until the specified depth level is reached.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is what the code will look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func New(arr [][]int, dp int) *QuadTree {

    var newquadtree func(arr [][]int, depth int, lc, rc, tr, br int) *QuadTree
    newquadtree = func(arr [][]int, depth int, lc, rc, tr, br int) *QuadTree {
        if lc+1 == rc || tr+1 == br {
            return nil
        }
        sum := 0
        for i := tr; i &amp;lt; br; i++ {
            for j := lc; j &amp;lt; rc; j++ {
                sum += arr[i][j]
            }
        }
        qt := &amp;amp;QuadTree{
            val: sum,
            x:   [2]int{tr, br},
            y:   [2]int{lc, rc},
            lvl: dp - depth,
        }
        if depth == 0 {
            return qt
        }
        midR := tr + (br-tr)/2
        midC := lc + (rc-lc)/2
        qt.children = append(qt.children, newquadtree(arr, depth-1, lc, midC, tr, midR))
        qt.children = append(qt.children, newquadtree(arr, depth-1, lc, midC, midR, br))
        qt.children = append(qt.children, newquadtree(arr, depth-1, midC, rc, tr, midR))
        qt.children = append(qt.children, newquadtree(arr, depth-1, midC, rc, midR, br))
        return qt
    }

    return newquadtree(arr, dp, 0, len(arr[0]), 0, len(arr))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here I am recursively generating the quad tree until the specified depth is reached or the current quadrant cannot be divided further. &lt;/p&gt;
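&lt;p&gt;To see the construction in action, here is a self-contained variant of the same recursion over a small 4x4 grid. The node type is stripped down, and the split uses the same half-open midpoints:&lt;/p&gt;

```go
package main

import "fmt"

// node is a stripped-down quad tree node for this demonstration.
type node struct {
	val      int
	children []*node
}

// build sums the half-open region of rows [tr, br) and columns
// [lc, rc), then splits it into four quadrants until depth hits zero.
func build(arr [][]int, depth, lc, rc, tr, br int) *node {
	if lc == rc || tr == br {
		return nil
	}
	sum := 0
	for i := tr; br > i; i++ {
		for j := lc; rc > j; j++ {
			sum += arr[i][j]
		}
	}
	n := new(node)
	n.val = sum
	if depth == 0 {
		return n
	}
	midR := tr + (br-tr)/2
	midC := lc + (rc-lc)/2
	n.children = append(n.children, build(arr, depth-1, lc, midC, tr, midR))
	n.children = append(n.children, build(arr, depth-1, lc, midC, midR, br))
	n.children = append(n.children, build(arr, depth-1, midC, rc, tr, midR))
	n.children = append(n.children, build(arr, depth-1, midC, rc, midR, br))
	return n
}

func main() {
	grid := [][]int{
		{1, 1, 0, 0},
		{1, 1, 0, 0},
		{0, 0, 2, 2},
		{0, 0, 2, 2},
	}
	root := build(grid, 1, 0, 4, 0, 4)
	fmt.Println(root.val) // sum over the whole grid
	for _, c := range root.children {
		fmt.Print(c.val, " ") // NW, SW, NE, SE quadrant sums
	}
	fmt.Println()
}
```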

&lt;h3&gt;
  
  
  Find Regions
&lt;/h3&gt;

&lt;p&gt;We will again be using DFS, this time to find the regions that hold at least the value passed to the function. The aim is to find the smallest quadrants that hold the value. These are the main things we have to check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the current node's value &amp;lt; target, don't traverse&lt;/li&gt;
&lt;li&gt;If all child quadrants hold a value &amp;lt; target, the current node is the smallest quadrant that holds a value &amp;gt;= target, so append the current node to the list of results.&lt;/li&gt;
&lt;li&gt;If the child quadrant holds value &amp;gt;= target, traverse the child quadrant. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what the code may look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (qt QuadTree) FindRegions(value int) []*QuadTree {
    arr := []*QuadTree{}
    var dfs func(node *QuadTree, depth int)
    dfs = func(node *QuadTree, depth int) {
        if node == nil {
            return
        }
        if node.val &amp;lt; value {
            return
        }
        flag := false
        for _, child := range node.children {
            fmt.Println("For ", node, "child value::", child.val)
            if child.val &amp;gt; value {

                dfs(child, depth+1)
                flag = true
            }
        }

        if !flag {
            fmt.Println("Flag", flag, "for ", node)
            arr = append(arr, node)
        }
    }
    dfs(&amp;amp;qt, 0)
    for _, v := range arr {
        fmt.Println(v.val, v.x, v.y)
    }
    return arr
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add Value to a region
&lt;/h3&gt;

&lt;p&gt;Here we may be asked to increase the value of a particular region. We would also need to update the value of all the parent quadrants that contain that region. &lt;br&gt;
This is pretty straightforward to figure out: keep traversing the quadrant that holds the region, and keep incrementing each of these quadrants with the value. &lt;/p&gt;

&lt;p&gt;This is what the code may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (qt *QuadTree) Add(value, x, y int) {
    var dfs func(node *QuadTree)
    dfs = func(node *QuadTree) {
        if node == nil {
            return
        }
        node.val += value
        for _, n := range node.children {
            if n == nil {
                continue
            }
            if n.x[0] &amp;lt;= x &amp;amp;&amp;amp; n.x[1] &amp;gt;= x {
                if n.y[0] &amp;lt;= y &amp;amp;&amp;amp; n.y[1] &amp;gt;= y {
                    dfs(n)
                }
            }

        }
    }
    dfs(qt)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! Remember, quadtrees have many more properties, and only the basics are covered here. You can find the code for the implementation &lt;a href="https://github.com/kdsama/quadtree"&gt;here&lt;/a&gt;. Do you believe the algorithm can be further optimised? Let me know. &lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>go</category>
      <category>programming</category>
    </item>
    <item>
      <title>Resilient Systems using Go: Semaphores</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Wed, 18 Oct 2023 14:29:49 +0000</pubDate>
      <link>https://forem.com/dhingrachief/resilient-systems-using-go-semaphores-2h82</link>
      <guid>https://forem.com/dhingrachief/resilient-systems-using-go-semaphores-2h82</guid>
      <description>&lt;p&gt;Previously, we talked about &lt;a href="https://dev.to/dhingrachief/resilient-systems-retry-mechanism-na3"&gt;retry mechanism&lt;/a&gt; and &lt;a href="https://dev.to/dhingrachief/resilient-systems-using-go-circuit-breaker-1gop"&gt;circuit-breaker&lt;/a&gt;, two resiliency techniques, and what their packages may look like. In this final chapter of the resilience series, we will take a look at semaphores and convert the abstract information to a working package.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Let's say we have to create a search page for our cyber security application. The search page shows the addresses and headers of all the possibly malicious emails matching a suspicious sender email and subject. To get the email information, the system has to interact with the email system API, which has a very high threshold for accepting requests. So a search scans your whole organisation for the given keywords, runs a malicious check on the matches, and returns the results.&lt;br&gt;
Now, the emails to be processed for malicious threats can be big or small. You would like to process as many mailboxes as possible at a time, but you also don't want the system to slow down by running too many concurrent tasks on a service that does not have unlimited bandwidth.&lt;/p&gt;
&lt;h2&gt;
  
  
  Semaphores
&lt;/h2&gt;

&lt;p&gt;What we would like to have is a mechanism that restricts the number of concurrent requests we can perform with our resources. The number should align with what the system can handle without interrupting the performance of other processes.&lt;br&gt;
We can achieve this by using semaphores. &lt;br&gt;
A semaphore is a mechanism to put an upper bound on the number of requests that can run at a time. If the semaphore is running under capacity, it can accept further requests; if it is already at full capacity, it returns an error. Whenever a request completes, the semaphore package can notify our system that it is available to take in more requests.&lt;/p&gt;
&lt;h2&gt;
  
  
  Designing the Semaphore Package
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Package Structure
&lt;/h3&gt;

&lt;p&gt;So a basic functioning package implementing a semaphore would require: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weight: maximum number of requests that can run concurrently. &lt;/li&gt;
&lt;li&gt;Count: count of requests currently in progress.&lt;/li&gt;
&lt;li&gt;Notifier: a notification function that tells the system whenever it is available to take more requests. &lt;/li&gt;
&lt;li&gt;Mutex: used to access and update the count variable, since two requests may try to update it at the same time, creating a race condition. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what the structure may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
type Semp struct {
    weight uint32
    count  uint32
    mu     *sync.Mutex
    notify NotifyFunc
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Functionality
&lt;/h3&gt;

&lt;p&gt;Functionality is pretty straightforward. There are two important methods &lt;/p&gt;

&lt;h4&gt;
  
  
  Acquire
&lt;/h4&gt;

&lt;p&gt;The system will call the acquire method whenever it wants the request to be processed. If the semaphore is at its full capacity, an error should be returned. Otherwise, the counter should increment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (s *Semp) Acquire(i int) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.count+uint32(i) &amp;gt; s.weight {
        return ErrCannotAcquire
    }

    s.count += uint32(i)
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Release
&lt;/h4&gt;

&lt;p&gt;Here we just need to decrement the counter and call the notify function passed by the user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (s *Semp) Release(i int) error {

    s.mu.Lock()
    defer s.mu.Unlock()
    s.count -= uint32(i)
    if s.count &amp;lt;= 0 {
        s.count = 0
    }
    go s.notify()
    return nil
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it. That's how you implement the basic functionality of a semaphore. The code can be found &lt;a href="https://github.com/kdsama/semp"&gt;here&lt;/a&gt;. It goes without saying that all the resiliency mechanisms should be context aware: one should be able to cancel any ongoing request if certain criteria can't be met. &lt;br&gt;
What other scenarios do you think we can use semaphores for?&lt;/p&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>programming</category>
      <category>development</category>
    </item>
    <item>
      <title>Resilient Systems using Go: Circuit Breaker</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 09 Oct 2023 21:05:49 +0000</pubDate>
      <link>https://forem.com/dhingrachief/resilient-systems-using-go-circuit-breaker-1gop</link>
      <guid>https://forem.com/dhingrachief/resilient-systems-using-go-circuit-breaker-1gop</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the previous post we talked about the &lt;a href="https://dev.to/dhingrachief/resilient-systems-retry-mechanism-na3"&gt;retry&lt;/a&gt; mechanism and which possibilities can be encapsulated together in a package. It's an important mechanism for preventing the whole system from going down when an external service fails. &lt;/p&gt;

&lt;p&gt;Let's take the example of a Twitter (or X)-like social media application that synchronously loads the website with all the important features, like recommended tweets, user recommendations, and trending hashtags.  &lt;/p&gt;

&lt;p&gt;It's the football World Cup, and England is playing Portugal. People stuck in the office are checking the hashtag #EngVsPor to get live reactions from others, and hence overloading the hashtag component of our system. The hashtag service is taking 7 seconds to return data instead of the usual 20ms. And because the call is synchronous, each web page reload is taking at least 7 seconds.&lt;br&gt;
On top of that, we now have a lot of concurrent requests stuck on our server waiting for the hashtag service to respond, ultimately leading to an outage. &lt;/p&gt;

&lt;p&gt;In this case, it would be better if we failed all the requests early. One way to do that is to reduce the timeouts for these requests, but coming up with a value is difficult: if the system is under load, all the requests will time out anyway.&lt;br&gt;
So a timeout is not an efficient way to manage this problem, and this is where the circuit breaker comes in. &lt;/p&gt;
&lt;h3&gt;
  
  
  Circuit Breaker
&lt;/h3&gt;

&lt;p&gt;What we can do is route requests to the hashtag service through the circuit breaker. If the number of errors goes beyond a specified threshold, the circuit breaker stops sending requests to the hashtag service. The circuit is now open. &lt;/p&gt;

&lt;p&gt;But how would we know when the service can start taking requests again? This is done by adding another state to the circuit breaker: the half-open state. After a specified duration, we let a few requests through to the service. If even a single request returns an error, the circuit opens again, and the cycle continues. &lt;/p&gt;

&lt;p&gt;If, in the half-open state, enough requests complete without errors, we can close the circuit again and resume the flow. &lt;br&gt;
But the flow through the circuit breaker is not controlled by the package, so we need a way to inform the system of the circuit breaker's current state.&lt;/p&gt;
&lt;h3&gt;
  
  
  Design
&lt;/h3&gt;

&lt;p&gt;So our circuit breaker structure must include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current state (open/half-open/closed).&lt;/li&gt;
&lt;li&gt;Threshold: when the number of errors reaches this threshold, the state changes to open. &lt;/li&gt;
&lt;li&gt;Duration: time after which the state changes from open to half-open. &lt;/li&gt;
&lt;li&gt;Good requests: total number of successful requests in the half-open state.&lt;/li&gt;
&lt;li&gt;halfStateThreshold: exceeding this threshold changes the state to closed, after which a full flow of requests can resume. A good idea is to express it as a percentage of the threshold variable above. &lt;/li&gt;
&lt;li&gt;NotifyFunc: function that is called whenever the state changes. &lt;/li&gt;
&lt;li&gt;StateMutex: concurrent access to the circuit breaker's state would cause race conditions, so we use a mutex to avoid that scenario. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's have a separate structure that will be used as an input to invoke our circuit breaker structure. The image below shows what the structures will look like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IEqmhZe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ituc1z6vexv2aqgqr5jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IEqmhZe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ituc1z6vexv2aqgqr5jq.png" alt="Configuration" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;So the execution will be somewhat similar to what was done in the previous blog about the retry mechanism.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// cb is the circuit breaker object.
        cb.Execute(context.Background(), func() (interface{}, error) {
            l := m.Func()

            if l == "" {
                return "", errors.New("Error found")
            }
            return "ok", nil
        })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The execute function will run the closure if it is in a half state or closed state. It will return an error if it is in the open state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Execute executes the user defined function in the circuit breaker
func (cb *CircuitBreaker) Execute(ctx context.Context, fn Action) (interface{}, error) {
    // Execute the function
    var state State

    cb.sLock.Lock()
    state = cb.state
    cb.sLock.Unlock()

    switch state {
    case Closed:
        return cb.run(fn)
    case Open:
        return nil, ErrCircuitOpen
    case Half:
        return cb.runInHalfState(fn)
    }
    return cb.run(fn)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The switch from a closed to an open circuit happens when the number of errors reaches the threshold. Once the state is set to Open, we wait for a specified amount of time and then change the state to Half.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// running the function in closed state
func (cb *CircuitBreaker) run(fn Action) (interface{}, error) {

    res, err := fn()
    if err != nil {
        cb.count++
    }
    if cb.count &amp;gt;= cb.threshold {
        go cb.openCircuit()
    }
    return res, err
}

// Open circuit 
func (cb *CircuitBreaker) openCircuit() {
    cb.setState(Open)
    go cb.halfCircuit()

}


// HalfOpen Circuit
func (cb *CircuitBreaker) halfCircuit() {

// Sleep for specified duration
    time.Sleep(cb.duration)

    cb.setState(Half)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, for the execution in the half-open state, we keep counting the good requests, and if the number exceeds a certain percentage of the threshold, we can close the circuit. &lt;br&gt;
If even a single error comes up, we open the circuit again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (cb *CircuitBreaker) runInHalfState(fn Action) (interface{}, error) {

    res, err := fn()
    if err != nil {
        cb.openCircuit()
        return res, err
    }

    cb.goodReqs++

    if cb.goodReqs &amp;gt;= (cb.hsThreshold*cb.threshold)/100 {
        cb.closeCircuit()
    }
    return res, err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And whenever we set the state, we need to notify the system about the change in state&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (cb *CircuitBreaker) setState(st State) {

    cb.sLock.Lock()
    cb.goodReqs = 0
    cb.count = 0
    cb.state = st
    cb.sLock.Unlock()

    //Notify the userDefined Function
    go cb.notifyFunc(stateMapping[st])

}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! The circuit breaker package is ready to use.&lt;br&gt;
The code, alongside test cases, can be found &lt;a href="https://github.com/kdsama/cbreak"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>distributedsystems</category>
      <category>resilientsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Resilient Systems using Go: Retry Mechanism</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Wed, 27 Sep 2023 17:47:18 +0000</pubDate>
      <link>https://forem.com/dhingrachief/resilient-systems-retry-mechanism-na3</link>
      <guid>https://forem.com/dhingrachief/resilient-systems-retry-mechanism-na3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, we will discuss the retry mechanism used to make systems more resilient and create a simple implementation in Go. The idea is to develop the mechanism from some abstract information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;There used to be a time when multiple instances of a monolith were enough to serve users. Applications today are a lot more complex, moving a lot of information and communicating with several other applications to provide users with a service. With so many moving parts, it becomes necessary to make sure your application doesn't break when interacting with third-party services. It's better to let the user know that the request cannot be processed than to make them wait. &lt;/p&gt;

&lt;p&gt;That's why we strategically place timeouts. In Go, we do that using the context package. The application's logic then decides whether it wants to try again or not. &lt;/p&gt;

&lt;p&gt;We also don't want to retry too many times in a short period. The third-party API may reject all the requests if the number goes beyond its threshold, or even block your application's IP from making any requests. &lt;br&gt;
It is a good idea to retry for a fixed amount of time, with some pre-defined time intervals. &lt;br&gt;
The most common way to do it is to retry every n seconds, with n configured by the user. This value can be derived from the rate-limiter threshold of the third-party API. &lt;/p&gt;

&lt;p&gt;One can also apply exponential backoff: if there is an error, the next request is made after n seconds, n*n seconds for the request after that, and so on. &lt;/p&gt;

&lt;p&gt;But do you want the retry mechanism to kick in when the application makes a bad request? &lt;br&gt;
Or to make a request when the third-party API has told the system that the service is unavailable for some time? &lt;/p&gt;

&lt;h3&gt;
  
  
  Design
&lt;/h3&gt;

&lt;p&gt;So our retry package should have these configurations &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Max number of retries &lt;/li&gt;
&lt;li&gt;Standard duration between each retry &lt;/li&gt;
&lt;li&gt;User Defined BackOffs (Exponential or Custom) &lt;/li&gt;
&lt;li&gt;Bad errors: when one of these errors occurs, we stop retrying.&lt;/li&gt;
&lt;li&gt;Retry errors: a list of errors we retry on; if an error outside of the list occurs, we stop retrying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One cannot have both bad errors and retry errors enabled for the retry functionality at the same time. Similarly, if custom or exponential intervals are given for the retries, we should omit the variable that sets the maximum number of retries. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gpk3y80gc18jsk4qwit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gpk3y80gc18jsk4qwit.png" alt="Image of a struct in Go with comments about the variables in it"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;For the user to run our retry package, they have to adhere to a function signature: the function call should only return an error. &lt;br&gt;
This can be easily done using closures. &lt;br&gt;
Let's say our retry package has a method called Run:&lt;br&gt;
&lt;code&gt;Run(fn func() error)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And we want to call the method Run on our own function, &lt;code&gt;ThirdPartyCall(input string)(string,error)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So a call should look like &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

obj := retry.New()
obj.Run(func() error {
    resp, err := ThirdPartyCall("input String")
    if err != nil {
        return err
    }
    // code logic using resp
    _ = resp
    return nil
})


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Run Function
&lt;/h4&gt;

&lt;p&gt;For the purpose of this blog, I have not implemented separate functions for a normal retry method and one with user-specified intervals. So we just check the intervals variable: if its length is zero, we run our function in normal mode; otherwise we run using the intervals specified. &lt;/p&gt;

&lt;p&gt;These are the few things that we need to keep in mind before the implementation &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep a count of the number of retries done so it can be compared to the threshold. &lt;/li&gt;
&lt;li&gt;Put the time gap not at the start of the function but after an error occurs. You don't want the system to wait before the function is called for the first time. &lt;/li&gt;
&lt;li&gt;Check for bad errors. If one occurs, don't try again.&lt;/li&gt;
&lt;li&gt;Check for retry errors. If the error isn't in the list, don't try again. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another thing to keep in mind is that the whole request, including retries, might take a couple of seconds. So we don't want the configuration variables to change while the retry process is going on, which is why we copy them into local variables first. The extra space occupied isn't much, so it isn't a worry. &lt;/p&gt;

&lt;p&gt;So the code may look like this &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

// Run runs the user method wrapped inside Action
// with a set number of retries according to the configuration
func (r *Retrier) Run(fn Action) error {

    if len(r.intervals) &amp;gt; 0 {
        return r.RunWithIntervals(fn)
    }
    var (
        count       int
        badErrors   = r.badErrors
        be          = r.be
        maxRetries  = r.maxRetries
        re          = r.re
        retryErrors = r.retryErrors
        sleep       = r.sleep
    )

    var rn func(fn Action) error
    rn = func(fn Action) error {

        if err := fn(); err != nil {
            if be {
                if _, ok := badErrors[err]; ok {
                    return err
                }
            }

            if re {
                if _, ok := retryErrors[err]; !ok {
                    return err
                }
            }
            count++

            if count &amp;gt; maxRetries {
                return ErrNoResponse
            }
            time.Sleep(sleep)
            return rn(fn)
        }
        return nil
    }

    e := rn(fn)
    return e
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Run With Intervals
&lt;/h4&gt;

&lt;p&gt;The code will be similar to our run function. As we already keep tabs on the count to compare it to the threshold, the same information can be used to determine how much time we need to sleep before we retry the same function. The code for this would look like  &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

// RunWithIntervals is similar to Run. The difference is that we have a slice
// of time durations corresponding to each retry, instead of maxRetries and
// a constant sleep gap: the number of retries is the length of the slice.
func (r *Retrier) RunWithIntervals(fn Action) error {
    var (
        count       int
        badErrors   = r.badErrors
        be          = r.be
        re          = r.re
        retryErrors = r.retryErrors
        intervals   = r.intervals
    )

    var rn func(fn Action) error
    rn = func(fn Action) error {

        if err := fn(); err != nil {
            if be {
                if _, ok := badErrors[err]; ok {
                    return err
                }
            }

            if re {
                if _, ok := retryErrors[err]; !ok {
                    return err
                }
            }
            count++
            // One retry per configured interval; stop when they run out.
            if count &amp;gt; len(intervals) {
                return ErrNoResponse
            }
            time.Sleep(intervals[count-1])
            return rn(fn)
        }
        return nil
    }

    e := rn(fn)
    return e
}



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that's it. The Retry package is ready to use. Configuration of the package is not in the scope of this post, but it can be found &lt;a href="https://github.com/kdsama/gogoretry/blob/main/options.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;. I have used the functional options pattern to set the configuration.&lt;/p&gt;

&lt;p&gt;You can check out the package and its test cases &lt;a href="https://github.com/kdsama/gogoretry" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>distributedsystems</category>
      <category>programming</category>
    </item>
    <item>
      <title>Parallel algorithms series. Part 2: PRAM Models</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 10 Jul 2023 23:12:35 +0000</pubDate>
      <link>https://forem.com/dhingrachief/parallel-algorithms-series-part-2-pram-models-184</link>
      <guid>https://forem.com/dhingrachief/parallel-algorithms-series-part-2-pram-models-184</guid>
      <description>&lt;p&gt;In the previous post, we discussed about sequential algorithms and the RAM Model. Here we will talk about some Parallel machine models that we can select for our task. But before that lets have an understanding what kind of parallelism we can achieve. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallelism on a single node&lt;/strong&gt;&lt;br&gt;
Unless you are using a computer that is a couple of decades old, you are probably on a system that has more than one core. And if we have more than one core, we can utilise all of them for our algorithm. As everything resides on the same node, we can use shared memory to run our algorithms. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed Computing&lt;/strong&gt;&lt;br&gt;
This is where one system is not enough, for several reasons, and we have to utilise multiple nodes to complete the task. Examples are supercomputers and high performance computing clusters. You must have done matrix multiplication on a 3x3 matrix; now imagine doing it on matrices that are shared as files of a couple of gigabytes each.&lt;/p&gt;

&lt;p&gt;We are going to focus on parallelism on a single node. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PRAM, or Parallel RAM&lt;/strong&gt;&lt;br&gt;
The attributes of the PRAM model are as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It has p processing elements, each with a unique id. The number of processors to be utilised can also depend upon the size of the input. &lt;/li&gt;
&lt;li&gt;Each processing element can run its own RAM-style program.&lt;/li&gt;
&lt;li&gt;Each processing element has its own registers but shares memory with the others. &lt;/li&gt;
&lt;li&gt;Synchronising the processing elements has some overhead. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we have several processing elements (PEs), each doing its own computations: some information is stored in registers, some operations are performed, and the resulting information is sent back to memory. &lt;br&gt;
But what if multiple processing elements write to the same memory location? We have a conflict! &lt;br&gt;
On the basis of this, the PRAM model has several sub-divisions. &lt;/p&gt;

&lt;h3&gt;
  
  
  EREW-PRAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Exclusive Read, Exclusive Write&lt;/strong&gt;&lt;br&gt;
No two PEs may access the same memory location at the same time, whether reading or writing; a program that does so is invalid in this model.&lt;/p&gt;

&lt;h3&gt;
  
  
  CREW-PRAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Concurrent Read, Exclusive Write&lt;/strong&gt;&lt;br&gt;
It's fine if multiple PEs read the same memory cell at the same time, but only one may write to it at a time. &lt;/p&gt;

&lt;h3&gt;
  
  
  CRCW-PRAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Concurrent Read and Write&lt;/strong&gt;&lt;br&gt;
Both concurrent reads and concurrent writes are allowed, but a rule is required to resolve concurrent writes: common variants require all writers to write the same value (Common), pick an arbitrary winner (Arbitrary), or let the highest-priority PE win (Priority).&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution Costs
&lt;/h2&gt;

&lt;p&gt;We saw that in the RAM model, the number of instructions executed gives the total computation required. &lt;br&gt;
There is a bit more to the cost metrics of a PRAM model, as the instructions now run in parallel. So on top of space (memory accessed) and time (the maximum time taken by any PE to finish its allocated task), we also count the work: the total number of instructions executed across all PEs combined. &lt;/p&gt;
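&lt;p&gt;A tiny Go sketch makes these metrics concrete. Treating each goroutine as a PE, the parallel sum below does O(n) total work, while the time is roughly the longest chunk, O(n/p), plus the cost of combining the partial results. (Goroutines are only an approximation of PRAM's lock-step PEs, of course.)&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum splits arr across p goroutines ("processing elements").
// Each PE writes only to its own slot of partial, so all writes are
// exclusive (EREW-friendly); the reads of arr are disjoint as well.
// Work: O(n) instructions in total. Time: roughly O(n/p) for the
// parallel phase, plus O(p) for the sequential combine.
func parallelSum(arr []int, p int) int {
	partial := make([]int, p)
	chunk := (len(arr) + p - 1) / p // ceil(n/p) elements per PE
	var wg sync.WaitGroup
	for id := 0; id < p; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			lo := id * chunk
			hi := lo + chunk
			if hi > len(arr) {
				hi = len(arr)
			}
			for i := lo; i < hi; i++ {
				partial[id] += arr[i]
			}
		}(id)
	}
	wg.Wait()
	total := 0 // sequential combine step
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	arr := make([]int, 100)
	for i := range arr {
		arr[i] = i + 1
	}
	fmt.Println(parallelSum(arr, 4)) // 1+2+...+100 = 5050
}
```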

&lt;p&gt;Remember, if the parallel algorithm takes more time than the best sequential algorithm you know of, it's better to forget about it, or at least try to optimise it. &lt;/p&gt;

&lt;p&gt;If interested, you can have a look at parallel implementation of the kmp string matching algorithm using &lt;a href="https://github.com/kdsama/go_leetcode_prep/blob/master/efficient_algorithms/parallel_kmp.go" class="ltag_cta ltag_cta--branded"&gt;WaitGroups&lt;/a&gt;
 and &lt;a href="https://github.com/kdsama/go_leetcode_prep/blob/master/efficient_algorithms/threadpool_kmp.go" class="ltag_cta ltag_cta--branded"&gt;Channels&lt;/a&gt;
. &lt;br&gt;
Let me know in the comments if you would like a deep dive into how to design parallel algorithms. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Parallel algorithms series. Part 1: A little bit about sequential algorithms.</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 10 Jul 2023 21:53:52 +0000</pubDate>
      <link>https://forem.com/dhingrachief/parallel-algorithms-series-part-1-a-little-bit-about-sequential-algorithms-289</link>
      <guid>https://forem.com/dhingrachief/parallel-algorithms-series-part-1-a-little-bit-about-sequential-algorithms-289</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;If you are reading this post, I am pretty sure you have already implemented some form of a sequential algorithm. It could be a sorting algorithm, a function that returns a Fibonacci sequence, or something as simple as adding all the items in a list.&lt;br&gt;
Most of the algorithms you have come across so far utilise a single core.&lt;br&gt;
And that is probably not what you are interested in. But to get into parallel algorithms, let's first look at the machine models we use for sequential computations.&lt;/p&gt;
&lt;h2&gt;
  
  
  RAM Model
&lt;/h2&gt;

&lt;p&gt;Let's talk about adding two numbers, each having d = 5 digits,&lt;br&gt;
e.g. 23568 + 98321.&lt;br&gt;
What will be the cost of adding these two numbers? &lt;br&gt;
For this particular case, you might have guessed "constant time", and you are probably right. &lt;br&gt;
We have two numbers that easily fit in a 64-bit integer, and the addition happens in one or two operation cycles. &lt;/p&gt;

&lt;p&gt;But things change if the number itself is too big and does not fit into 64 bits. In that case, there will be a split into multiple words, some carry-overs, and then we get the result. One can say the cost depends upon the number of digits we are adding. &lt;br&gt;
The analysis of code gets complex if we start considering these cases as well, hence we need a standard to make things easier for ourselves. Generally, we use the RAM model, a standard for sequential computation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the RAM model then?&lt;/strong&gt;&lt;br&gt;
The RAM (Random Access Machine) model is a machine model with the following attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlimited memory. Any size of input can be accommodated in the memory cells.&lt;/li&gt;
&lt;li&gt;A fixed number of registers. Registers are temporary storage sitting next to the processor. They cannot be unlimited, or else why would we even need memory? Save everything in the registers! Hence, we have a limited amount of storage to hold data at a time.&lt;/li&gt;
&lt;li&gt;Memory cells and registers hold w-bit integers (numbers from 0 to 2^w - 1), where w = log n: the word size grows logarithmically with the input size.&lt;/li&gt;
&lt;/ul&gt;
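The w = log n point is easy to check: the number of bits needed to store a value up to n grows logarithmically. A tiny sketch of my own using Go's math/bits:

```go
package main

import (
	"fmt"
	"math/bits"
)

// wordSize returns the number of bits needed to represent n,
// i.e. floor(log2(n)) + 1 -- the word size w the RAM model assumes.
func wordSize(n uint) int {
	return bits.Len(n)
}

func main() {
	for _, n := range []uint{100, 10_000, 1_000_000} {
		fmt.Printf("n = %7d needs w = %2d bits\n", n, wordSize(n))
	}
}
```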

&lt;p&gt;In the RAM model, the total cost is the number of instructions executed. These instructions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads into registers from memory, and vice versa.&lt;/li&gt;
&lt;li&gt;Arithmetic operations (+, -, %, *, AND, OR, XOR, etc.)&lt;/li&gt;
&lt;li&gt;Conditional/unconditional jumps (those if cases have some cost as well!)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the next time you see a program with n iterations, multiple operations, and if-else conditions, you can tell your friends that they aren't wrong, but they aren't entirely right either. Here is an example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    var i, n, sum int
    fmt.Print("Enter the value of n: ")
    fmt.Scan(&amp;amp;n)
    for i &amp;lt;= n {
        sum += i
        i++
    }
    fmt.Println(sum)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time complexity here is O(n). But really, it is about 2n instructions just for the additions (one on sum and one on i per iteration), plus the comparison and jump on every pass of the loop.&lt;/p&gt;
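To make the instruction-counting concrete, here is a sketch of my own comparing the loop with the closed-form n(n+1)/2, which the RAM model charges only a constant number of instructions:

```go
package main

import "fmt"

// sumLoop adds 1..n one step at a time: per iteration the RAM model
// charges roughly a compare, a jump, and two additions (sum and i).
func sumLoop(n int) int {
	sum := 0
	for i := 1; i <= n; i++ {
		sum += i
	}
	return sum
}

// sumClosedForm uses Gauss's formula: a multiply, an add, and a
// divide -- O(1) instructions regardless of n.
func sumClosedForm(n int) int {
	return n * (n + 1) / 2
}

func main() {
	n := 100
	fmt.Println(sumLoop(n), sumClosedForm(n)) // both 5050
}
```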

&lt;p&gt;I am trying to keep these articles short and sweet. In the next one, we will dive into the PRAM model.&lt;/p&gt;

</description>
      <category>algorithms</category>
    </item>
    <item>
      <title>Plugin multiple Mongodb sources to Prometheus and visualise them on Grafana</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 22 Nov 2021 07:00:08 +0000</pubDate>
      <link>https://forem.com/dhingrachief/plugin-multiple-mongodb-sources-to-prometheus-and-visualise-them-on-grafana-4phm</link>
      <guid>https://forem.com/dhingrachief/plugin-multiple-mongodb-sources-to-prometheus-and-visualise-them-on-grafana-4phm</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;I ran into a situation where I had to set up metric alerts for a couple of MongoDB instances. This was a challenge as I could not find direct resources for the setup we were trying to achieve. So this article will help you plug in multiple MongoDB sources and visualise them on Grafana.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;This setup requires Docker to be installed on your system/server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create user for all Mongo instances
&lt;/h2&gt;

&lt;p&gt;Go to your MongoDB shell and create a user. This user will be used by mongodb-exporter to export the metrics and forward them to Prometheus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;use admin
&amp;gt;db.createUser(
  {
    user: "mongodb_exporter",
    pwd: "your_unique_password",
    roles: [
        { role: "clusterMonitor", db: "admin" },
        { role: "read", db: "local" }
    ]
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing and Running Mongodb Exporter
&lt;/h2&gt;

&lt;p&gt;Here I am using &lt;a href="https://github.com/bitnami/bitnami-docker-mongodb-exporter"&gt;Bitnami's Docker image&lt;/a&gt; for exporting the metrics.&lt;br&gt;
I have two MongoDB sources that I need to plug in, so I will be running two mongodb-exporter containers.&lt;/p&gt;

&lt;p&gt;For instances 1 and 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name m1  -p 9216:9216 bitnami/mongodb-exporter:0.11.2 --mongodb.uri=mongodb://mongodb_exporter:your_unique_password@INSTANCE.1.IP:27017

docker run -d --name m2  -p 9215:9216 bitnami/mongodb-exporter:0.11.2 --mongodb.uri=mongodb://mongodb_exporter:your_unique_password_2@INSTANCE.2.IP:27017

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key differences above are the container name, the port on which each service runs, and the MongoDB URI.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; If you are trying this out on a server, make sure to add rules to your firewall so that the server running the mongodb exporters can access the MongoDB instances.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prometheus
&lt;/h2&gt;

&lt;p&gt;To install Prometheus using docker, we need to add a configuration file that will then be mounted to the docker container. &lt;/p&gt;
&lt;h3&gt;
  
  
  prom.yml
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['PRIVATE_IP:9090']
  - job_name: 'mongo-1'
    static_configs:
      - targets: ['PRIVATE_IP:9216']
        labels:
          instance: 'mongo1'
  - job_name: 'mongo-2'
    static_configs:
      - targets: ['PRIVATE_IP:9215']
        labels:
          instance: 'mongo2'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So we mentioned targets in the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#:~:text=A%20scrape_config%20section%20specifies%20a,advanced%20configurations%2C%20this%20may%20change."&gt;scrape_configs section&lt;/a&gt;,&lt;br&gt;
i.e. our mongodb-exporter instances that are scraping data from MongoDB.&lt;/p&gt;

&lt;p&gt;Now to run Prometheus&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run  -d -v /PATH/TO/CONFIG/prom.yml:/etc/prometheus/prometheus.yml --network=host  --name prom prom/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Grafana
&lt;/h2&gt;

&lt;p&gt;To run Grafana using docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d  --network=host --name grafana grafana/grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have added a &lt;a href="https://github.com/kdsama/PromGrafanaMongodb/blob/main/makefile"&gt;makefile&lt;/a&gt; for all the docker commands required to run each instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Prometheus and Grafana here are running on the host network, which is not advisable for production environments. It is better to create a separate network or expose only the ports that need to be accessed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signin
&lt;/h3&gt;

&lt;p&gt;Go to YOUR_IP:3000 and log in (username and password are both admin by default).&lt;/p&gt;

&lt;h3&gt;
  
  
  Add source
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;Configuration&lt;/strong&gt; &amp;gt; &lt;strong&gt;Data Sources&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add Source&lt;/strong&gt;.&lt;br&gt;
Select Prometheus.&lt;br&gt;
In the URL field, add http://YOUR_IP:9090 and save the configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  See multiple MongoDB instances and toggle between them
&lt;/h2&gt;

&lt;p&gt;There are several free-to-import &lt;a href="https://grafana.com/grafana/dashboards/"&gt;dashboards&lt;/a&gt; available, and we will use one of them to see the data and toggle between different instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add a Dashboard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hover on the + sign on the left &amp;gt; &lt;strong&gt;Import&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the Import via Grafana section, for this example, add &lt;a href="https://grafana.com/grafana/dashboards/7353"&gt;7353&lt;/a&gt;. Click Load on the right of the input field.&lt;/li&gt;
&lt;li&gt;At the bottom, select your Prometheus data source and save the dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s----w1JeNG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l9c1zv56f1jnhecyw84d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s----w1JeNG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l9c1zv56f1jnhecyw84d.png" alt="The image is a snippet of grafana dashboard. The image shows a few dropdown. The important one is the instance dropdown where you can change your mongodb source" width="710" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see an Instance filter where one of the sources is selected. From here you can toggle to the other source you added earlier.&lt;/p&gt;

&lt;p&gt;And that is all you need to do to add multiple MongoDB sources and visualise them on Grafana. I have added all the relevant files &lt;a href="https://github.com/kdsama/PromGrafanaMongodb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>Install and Run Mongodb using Ansible</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 23 Aug 2021 11:22:31 +0000</pubDate>
      <link>https://forem.com/dhingrachief/install-and-run-mongodb-using-ansible-77c</link>
      <guid>https://forem.com/dhingrachief/install-and-run-mongodb-using-ansible-77c</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;As a person who has to do several on-premise deployments, at a place where Docker is not accepted yet, I have had several issues installing MongoDB. Even when installing it in the same environment, or upgrading it, I face problems with permissions, ownerships, lockfiles, etc.&lt;br&gt;
In this article I will share an automated way to install MongoDB (version 4.2). Yes, we will be using Ansible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;You need to have &lt;a href="https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html"&gt;ansible&lt;/a&gt; installed on your system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosts
&lt;/h2&gt;

&lt;p&gt;When you run an ansible-playbook command, you may or may not mention the hosts file which has all the required host information. Let's create a new file named mongo-hosts and enter the names of all the servers on which you need MongoDB installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  mongo-hosts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[local]
localhost       ansible_connection=local

[mongo-server-1]
XX.YY.ZZ.AA   ansible_connection=ssh  ansible_user=user

[mongo-server-2]
XX.YY.ZZ.BB   ansible_connection=ssh ansible_user=user

[mongo-servers:children]
local
mongo-server-1
mongo-server-2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have mentioned all the instances, with their connection type and username.&lt;br&gt;
Here you can see that I have grouped all the servers at the end, so anywhere mongo-servers is mentioned in the playbook, all the child instances will be considered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Playbook - mongo-playbook.yml
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- hosts: mongo-servers
  become: true
  serial: 1

  tasks:
    - name: Install aptitude using apt
      apt:
        name: aptitude
        state: latest
        update_cache: yes

    - name: Import public key
      apt_key:
        url: 'https://www.mongodb.org/static/pgp/server-4.2.asc'
        state: present

    - name: Add repository
      apt_repository:
        filename: '/etc/apt/sources.list.d/mongodb-org-4.2.list'
        repo: 'deb https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.2 multiverse'
        state: present
        update_cache: yes

    - name: Install mongoDB
      apt:
        name: mongodb-org
        state: present
        update_cache: yes

    - name: Ensure mongodb is running and enabled to start automatically on reboots
      service:
        name: mongod
        enabled: yes
        state: started

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So here, as we can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hosts represents the instances on which this playbook will run.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.ansible.com/ansible/latest/user_guide/become.html"&gt;become&lt;/a&gt; is a privilege-escalation setting, as some commands can only be run with sudo. &lt;strong&gt;Note:&lt;/strong&gt; this only works if the user Ansible logs into the server as is a sudo user.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_strategies.html"&gt;serial&lt;/a&gt; controls how the playbook is executed. In our case, serial: 1 means the playbook runs on the servers one at a time.&lt;/li&gt;
&lt;li&gt;tasks are the steps that Ansible will take on each server. These are the commands that we would generally run to install MongoDB manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Execution:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ansible-playbook mongo-playbook.yml -vvvv -i ./mongo-hosts 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;-vvvv runs the ansible-playbook command in verbose mode.&lt;br&gt;
If we don't mention the hosts file (using the -i flag), ansible-playbook will pick the default one, which on Ubuntu is /etc/ansible/hosts.&lt;/p&gt;

&lt;p&gt;And that's it! This is a really basic case which only touches a simple installation. Let me know if you need examples of more complex ones (primary/secondary/arbiter, or the same using Docker).&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>devops</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>Run Docker commands inside Jenkins Docker container</title>
      <dc:creator>Kshitij (kd)</dc:creator>
      <pubDate>Mon, 19 Jul 2021 11:14:10 +0000</pubDate>
      <link>https://forem.com/dhingrachief/run-docker-commands-inside-jenkins-docker-k6d</link>
      <guid>https://forem.com/dhingrachief/run-docker-commands-inside-jenkins-docker-k6d</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;If you want to launch Docker containers from within your Jenkins container, this is what you have to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the Jenkins Dockerfile, add commands to install docker and docker-compose.&lt;/li&gt;
&lt;li&gt;Bind-mount the Docker socket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's it! This is what the Dockerfile looks like:&lt;/p&gt;

&lt;h2&gt;
  
  
  Jenkins Dockerfile
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM jenkins/jenkins

# Docker install
USER root
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y \
       apt-transport-https \
       ca-certificates \
       curl \
       gnupg2 \
       software-properties-common

RUN curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -
RUN apt-key fingerprint 0EBFCD88
RUN add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/debian \
       $(lsb_release -cs) \
       stable"

RUN curl -L https://github.com/docker/compose/releases/download/1.27.4/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose \
&amp;amp;&amp;amp; chmod +x /usr/local/bin/docker-compose

RUN apt-get update &amp;amp;&amp;amp; apt-get install -y docker-ce-cli

USER jenkins

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to build the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t jenkins-docker .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the docker-image, including the volume mounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker run -d --group-add $(stat -c '%g' /var/run/docker.sock) \
-v /var/run/docker.sock:/var/run/docker.sock -p 8080:8080 -p 50000:50000 \
-v `pwd`/jenkins:/var/jenkins_home --log-opt max-size=50k   --log-opt max-file=5   --name jenkins -P jenkins-docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see the docker.sock file has been mounted. Also, the jenkins_home folder has been mounted so that you can persist information about your pipelines/configuration/users, etc.&lt;br&gt;
Don't forget to take a backup of the jenkins_home directory!&lt;/p&gt;

&lt;p&gt;I have uploaded a &lt;a href="https://github.com/kdsama/Jenkins-Docker/blob/main/makefile"&gt;makefile&lt;/a&gt; and a similar &lt;a href="https://github.com/kdsama/Jenkins-Docker/blob/main/Dockerfile"&gt;Dockerfile&lt;/a&gt; on github.&lt;/p&gt;

&lt;p&gt;And that's it!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>jenkins</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
