Forem: Arunabh Gupta

How I Stopped Manually Setting Up Virtual Machines

Arunabh Gupta — Fri, 24 Apr 2026 08:33:36 +0000

I could spin up a virtual machine using Oracle VM VirtualBox.

Install the OS. Set up packages. Configure everything just the way I wanted.

But then recreating them again and again and again... you get my point. It's tiring.

Creating virtual environments is easy, but doing the same thing again and again is just stupidity.

What are VMs?

Let’s first clear what the hell even are vms and why do we use it.

VM stands for "virtual machines." In simple terms these are “computer within a computer”. It’s an emulation of a physical computer system that runs an operating system and applications like an actual physical machine would.

Why do we use VMs ?

The primary goal of using a VM is isolation. By running a VM, you create a dedicated environment that is completely separate from your host machines’s OS. This is useful because:

We can run dangerous scripts and test them without risking our host OS.
We can run a Linux environment on macOS/Windows OS to match the production environment.
We can literally run any version of an OS (even the very old ones) to support specific software that won’t run on modern systems.

Oracle VirtualBox

To actually run a VM, you need a "hypervisor." In simple words, a hypervisor is a manager that sits between the computer’s hardware and virtual machines.

Oracle VirtualBox is also one such hypervisor. It’s an industry standard for local virtualization since it’s free and open source and can be run on any host operating system.

How it works (The Manual Way):

The ISO Hunt: You have to go find a Linux .iso file (like Ubuntu or Debian) and download it.
The Wizard: You open the VirtualBox GUI and click through a "New VM" wizard, assigning RAM, CPU cores, and virtual hard drive space.
The Installation: You "insert" the virtual disk and walk through the entire OS installation process manually. Setting up the username, timezone, and partitions... every... single... time.

Click fatigue problem…

If you accidentally delete or misconfigure your VM, you have to delete it and do the entire process again, which is just too much pain.

Vagrant!!!

Here comes our savior. Vagrant is an open-source command-line tool by HashiCorp designed for creating, configuring, and managing portable VMs.

In simple words, Vagrant helps us automate the entire process of setting up a VM through code. Instead of clicking through a GUI to create a VM, you write a script called Vagrantfile. In DevOps terms, this is our first real encounter with Infrastructure as Code (IaC).

Vagrant is a huge improvement from the gui setup process due to the following reasons:

You no longer have to use the virtualbox window. Vagrant runs VMs in the background, and you can interact with them solely through the terminal.
If the configuration is messed up, just fix the script and restart or first destroy the old VM and then start the VM again using Vagrant CLI commands.

Instead of hunting for ISO files like a digital scavenger, Vagrant uses 'boxes'—pre-packaged images you can pull from Vagrant Cloud. Want Ubuntu 22.04? It’s one line of code away.

The command vagrant init hashicorp/bionic64 creates a base vagrant file with ubuntu/jammy64 box. Ex:

# -*- mode: ruby -*-
# vi: set ft=ruby :


# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
 # The most common configuration options are documented and commented below.
 # For a complete reference, please see the online documentation at
 # https://docs.vagrantup.com.


 # Every Vagrant development environment requires a box. You can search for
 # boxes at https://vagrantcloud.com/search.
 config.vm.box = "ubuntu/jammy64"


 # Disable automatic box update checking. If you disable this, then
 # boxes will only be checked for updates when the user runs
 # `vagrant box outdated`. This is not recommended.
 # config.vm.box_check_update = false


 # Create a forwarded port mapping which allows access to a specific port
 # within the machine from a port on the host machine. In the example below,
 # accessing "localhost:8080" will access port 80 on the guest machine.
 # NOTE: This will enable public access to the opened port
 # config.vm.network "forwarded_port", guest: 80, host: 8080


 # Create a forwarded port mapping which allows access to a specific port
 # within the machine from a port on the host machine and only allow access
 # via 127.0.0.1 to disable public access
 # config.vm.network "forwarded_port", guest: 80, host: 8080, host_ip: "127.0.0.1"


 # Create a private network, which allows host-only access to the machine
 # using a specific IP.
 # config.vm.network "private_network", ip: "192.168.33.10"


 # Create a public network, which generally matched to bridged network.
 # Bridged networks make the machine appear as another physical device on
 # your network.
 # config.vm.network "public_network"


 # Share an additional folder to the guest VM. The first argument is
 # the path on the host to the actual folder. The second argument is
 # the path on the guest to mount the folder. And the optional third
 # argument is a set of non-required options.
 # config.vm.synced_folder "../data", "/vagrant_data"


 # Disable the default share of the current code directory. Doing this
 # provides improved isolation between the vagrant box and your host
 # by making sure your Vagrantfile isn't accessible to the vagrant box.
 # If you use this you may want to enable additional shared subfolders as
 # shown above.
 # config.vm.synced_folder ".", "/vagrant", disabled: true


 # Provider-specific configuration so you can fine-tune various
 # backing providers for Vagrant. These expose provider-specific options.
 # Example for VirtualBox:
 #
 # config.vm.provider "virtualbox" do |vb|
 #   # Display the VirtualBox GUI when booting the machine
 #   vb.gui = true
 #
 #   # Customize the amount of memory on the VM:
 #   vb.memory = "1024"
 # end
 #
 # View the documentation for the provider you are using for more
 # information on available options.


 # Enable provisioning with a shell script. Additional provisioners such as
 # Ansible, Chef, Docker, Puppet and Salt are also available. Please see the
 # documentation for more information about their specific syntax and use.
 # config.vm.provision "shell", inline: <<-SHELL
 #   apt-get update
 #   apt-get install -y apache2
 # SHELL
end

A little bit of networking mumbo jumbo

In the following Vagrantfile (created from the command in the previous section), you can see some terms like "forwarded port," "private network," etc. Let’s discuss them 1 by 1.

Forwarded Port: Maps a port on your laptop to a port in the VM (e.g., see a website running on the VM via localhost:8080).
Private Network: Gives the VM a specific internal IP address (like 192.168.33.10) that only your laptop can see. This prevents the other person from accessing your VM from their laptop.
Public Network: Bridges the VM to your Wi-Fi, making it look like a real, separate computer on your home network. Your router sees the VM as a completely new physical device (like a new phone or a second laptop). It assigns the VM its own unique IP address from your home network's pool.

One 'complex' bit I found interesting is the host_ip parameter. By default, forwarding a port opens it up to your entire local network. If you're working on a public Wi-Fi, that's a security nightmare. By specifying host_ip: "127.0.0.1", you're essentially telling the VM, 'Only talk to me, and nobody else.' It’s a simple line of code that moves your setup from 'it works' to 'it’s secure.

Provisioning?

I manual setup of a VM, you often endup with a “Snowflake Server” - a machine that has unique, hand crafted configuration that no one can replicate because the steps weren’t documented.

Provisioning solves this by treating your server setup as a script. In Vagrant, you have three main ways to handle this ( btw I have only explored the first one in detail ):

1. Shell Provisioning (The Scripted Approach)

This is the most straightforward method. You write a standard Bash script that runs the moment the VM boots up.

The Process: Vagrant uploads the script to the guest machine and executes it with sudo privileges.
Use Case: Perfect for installing basic dependencies like git, curl, or setting up your Java or Go runtime environments.

Ex: In the vagrant file provided above, there is a section at the bottom of the file setting up basic provisioning. A better provisioning script would be something like below:

config.vm.provision "shell", inline: \<\<-SHELL  
  apt-get update  
  apt-get install \-y nginx  
  systemctl enable nginx  
  systemctl start nginx  
SHELL

This sets up nginx in an ubuntu VM.

2. File Provisioning (The Configuration Transfer)

Sometimes you don't need to run a command; you just need to move a file.

The Mechanism: It uses SCP/SFTP to move files from your host machine into the VM.
Use Case: Moving your custom nginx.conf or a .env file into the VM before the application starts.

3. Configuration Management (The Scalable Way)

Vagrant also supports "Heavyweight" provisioners like Ansible, Chef, or Puppet.

The Theory: Instead of just running a list of commands, these tools ensure the machine is in a specific "State." If a package is already installed, they skip it. This is known as Idempotency.

The beauty of provisioning is that it happens at the initial boot. If I need to reset my environment, I run vagrant destroy and vagrant up --provision. This ensures that my development environment is disposable and reproducible. I am no longer afraid to break things because I know I can recreate my entire infrastructure in seconds.

MultiVM Setup: Real Advantage of Vagrant !!!

MultiVM setup is where we define an entire infrastructure-like a load balancer, two web servers, a database - everything inside a singel Vagrant file.

Instead of having one block of configuration, you use config.vm.define to create separate "identities" within your Vagrantfile. This allows you to manage the entire lifecycle of a distributed system with a single command.

When you define multiple machines, Vagrant treats the Vagrantfile as a loop. Each machine can have its own OS, its own IP address, and its own provisioning scripts.

Here is an example script

Vagrant.configure("2") do |config|
 config.hostmanager.enabled = true
 config.hostmanager.manage_host = true
 ### DB vm  ####
 config.vm.define "db01" do |db01|
   db01.vm.box = "generic/centos9s"
   db01.vm.hostname = "db01"
   db01.vm.network "private_network", ip: "192.168.56.15"
   db01.vm.provider "virtualbox" do |vb|
    vb.memory = "600"
  end


 end
 ### Memcache vm  ####
 config.vm.define "mc01" do |mc01|
   mc01.vm.box = "generic/centos9s"
   mc01.vm.hostname = "mc01"
   mc01.vm.network "private_network", ip: "192.168.56.14"
   mc01.vm.provider "virtualbox" do |vb|
    vb.memory = "600"
  end
 end
 ### RabbitMQ vm  ####
 config.vm.define "rmq01" do |rmq01|
   rmq01.vm.box = "generic/centos9s"
   rmq01.vm.hostname = "rmq01"
   rmq01.vm.network "private_network", ip: "192.168.56.13"
   rmq01.vm.provider "virtualbox" do |vb|
    vb.memory = "600"
  end
 end
 ### tomcat vm ###
  config.vm.define "app01" do |app01|
   app01.vm.box = "generic/centos9s"
   app01.vm.hostname = "app01"
   app01.vm.network "private_network", ip: "192.168.56.12"
   app01.vm.provider "virtualbox" do |vb|
    vb.memory = "800"
  end
  end

 ### Nginx VM ###
 config.vm.define "web01" do |web01|
   web01.vm.box = "ubuntu/jammy64"
   web01.vm.hostname = "web01"
 web01.vm.network "private_network", ip: "192.168.56.11"
 web01.vm.provider "virtualbox" do |vb|
   #  vb.gui = true
    vb.memory = "800"
  end
end
 end

In this example we can use the following commands to manage our VMs using command line.

When you have multiple machines, your CLI commands change slightly:

vagrant up web01: Starts only the nginx server.
vagrant up: Starts the entire cluster.
vagrant ssh web01: Connects you specifically to the nginx server.

Conclusion: Why Vagrant Still Matters in a Containerized World

With the rise of Docker and Kubernetes, many assume that Virtual Machines are a relic of the past. However, Vagrant occupies a unique and necessary space in a developer's toolkit.

While containers are great for isolating applications, Vagrant is designed to isolate the entire environment. When you need to test kernel modules, experiment with different network topologies, or simulate a full-blown multi-node Linux cluster on your local machine, Vagrant is the industry standard for a reason.

The "Infrastructure as Code" Shift

The real power of Vagrant isn't just about avoiding the VirtualBox GUI; it’s about immutability. By defining your infrastructure in a Vagrantfile, you’ve moved away from "manual setup" and into Engineering.

Your environment is now Version Controlled: You can commit your Vagrantfile to Git.
Your environment is Sharable: A teammate can run vagrant up and have the exact same setup in minutes.
Your environment is Disposable: If a configuration goes sideways, you don't debug for hours; you destroy and up.

DP Isn’t Just About Big-O: How Cache Misses Killed My Knapsack

Arunabh Gupta — Tue, 06 Jan 2026 23:07:32 +0000

“I had a correct O(n·x) DP solution. Constraints were within limits. Still… TLE.”

I'm sure you must have said or heard of this phrase a lot of times before. Same happened with me as well. I was trying to solve the Book Shop Problem on CSES. It's a knapsack problem. I had my logic sorted out, but when I submitted the code, I was getting TLE for some reason. I checked multiple times, tried sorting and other techniques to somehow get better results but nothing worked. Also memory limit is not an issue since enough memory is given for this problem. So what the hell happened here ???

I would strongly suggest to try this problem first before reading ahead.

The DP that should have passed

Let's first discuss an approach for this problem. If you can observe here, you basically have two options at any index of the prices array. You can either take it or skip that book.

So, if we define a 2D dp array for this problem, any state dp[i][k] ( where i is index and k is budget ) can be defined as maximum value using values from i .... n-1 with a budget k.

Therefore the dp state would become

dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k])

Algorithmically, nothing seems to be wrong.

( I would suggest to watch some knapsack problem tutorials if you have difficulty understanding this. You can refer to this video for a better explanation. Although his approach is a bit different. He is thinking in prefix array terms whereas I my approach is derived from suffix array point of view. )

The mistake I didn’t know was a mistake

My initial code was something like this:

int dp[1002][100002];
int solve(int n, int x, int* h, int* s) {

    for(int k=0; k<=x; k++){
        dp[n-1][k] = k>=h[n-1] ? s[n-1] : 0;
    }

    for(int k=0; k<=x; k++){
        for(int i=n-2; i>=0; i--){
            if(k>=h[i]){
                dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k]);
            }
            else{
                dp[i][k] = dp[i+1][k];
            }
        }
    }

    return dp[0][x];
}

This looks fine, right?
Let me break this down a little bit. I am looping through all the k from 0 -> x and for every k, I am looping through the array, updating the dp array using the same logic we discussed before.

This looping seems fine but still it was slow and I was getting a TLE.

The missing concept: memory access

When we write code, it’s easy to think that accessing an array is a constant-time operation.
In theory, it is.

In practice, where the data lives matters more than the operation itself.

Modern CPUs are extremely fast. So fast, in fact, that reading from main memory (RAM) would slow them down constantly. To avoid this, CPUs use caches — small, very fast memory layers placed between the CPU and RAM.

Roughly speaking, access times look like this:

Location	Approximate cost
CPU register	1 cycle
L1 cache	~4 cycles
L2 cache	~12 cycles
L3 cache	~40 cycles
RAM	200–300 cycles

This means that a single memory access from RAM can cost as much as hundreds of CPU instructions.

Cache lines: why memory is fetched in chunks

Another important detail:
The CPU never fetches a single integer from memory.

Instead, it fetches a cache line, typically 64 bytes at a time.

For an int array, that’s:

64 bytes = 16 integers

So when you access:

dp[i][k]

the CPU actually loads:

dp[i][k], dp[i][k+1], dp[i][k+2], ..., dp[i][k+15]

Whether you use them or not.

Why sequential access is fast

If your code accesses memory sequentially, like this:

dp[i][0], dp[i][1], dp[i][2], dp[i][3], ...

then:

One cache line is loaded
All 16 integers are used
The next access is already in cache
The CPU runs at full speed

This is why simple loops over arrays are incredibly fast.

Why skipping memory is slow

Now consider this access pattern:

dp[n-2][k]
dp[n-3][k]
dp[n-4][k]

Because C++ stores 2D arrays in row-major order, each of these accesses is far apart in memory—often hundreds of kilobytes.

What happens in this case is

A cache line is loaded
Only one integer from it is used
The remaining 15 integers are wasted
The CPU has to fetch another cache line from RAM
The CPU stalls for hundreds of cycles

In other words, I was paying the cost of memory access 16 times more often than necessary.

At small input sizes, this doesn’t matter.
At 100 million DP transitions, it absolutely does.

You can also use the analogy of a book to get a better feel for this. Imagine reading word 1 from page 1, word 1 from page 2, word 1 from page 3 and so on. It's very inefficient right. Instead you'll read the entire page 1, then page 2, then page 3 and so on.

Why “skipping memory” is disastrous

At this point, it's clear that memory is fetched in cache lines, not as individual elements. So the real problem appears when we don't completely use what we have fetched.

So we know that a cache line holds 16 integers. If my code access the memory sequentially, then all 16 integers will be used. If my code is jumping around the memory, then only 1 integer will be used and the rest 15 integers are wasted.

This minor difference causes a theoretically correct solution to give TLE.

Let's understand this in a bit more detail.

One cache miss vs many cache misses

Consider two access patterns over an array of integers.

Pattern 1: Sequential access
a[0], a[1], a[2], a[3], ...

What happens:

One cache line is loaded
16 integers become available
The CPU uses all of them
Cache miss every 16 accesses

Pattern 2: Skipping access
a[0], a[100000], a[200000], ...

What happens:

A cache line is loaded
Only one integer is used
The rest are wasted
Cache miss every single access

Both patterns perform the same number of array reads.
But one causes 16× more cache misses. On a large scale this can cause some serious optimisation issues.

The fix (almost anticlimactic)

We just need to flip the loops inside out. This way we will fix the index and loop for all k from 0 -> x. This way we are looping in a row-first manner, avoiding cache misses on every single access. This is much more efficient than the first version.

for(int i=n-2; i>=0; i--){
        for(int k=x; k>=0; k--){
            if(k>=h[i]){
                dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k]);
            }
            else{
                dp[i][k] = dp[i+1][k];
            }
        }
    }

How to identify this mistake in other problems

By now, you might be thinking:
“How do I even know when to flip loops?”
“How do I decide which loop goes outside and which goes inside?”

The answer lies in DP state dependency — not in optimization tricks.

Let’s revisit the DP transition we used.

Step 1: Identify the dependency in the DP formula

From the knapsack recurrence, we need:

dp[i+1][k]
dp[i+1][k - h[i]]

So we can summarize the dependency as:

dp[i][k]  ←  dp[i+1][*]

This is the only dependency that matters.

Step 2: Translate dependency into an ordering constraint

In simple terms, this means:
To compute values for row i, the entire row i+1 must already be computed.
Now ask yourself a very direct question:
"How do I compute the table so that row i+1 is already available when I compute row i?"
There is only one correct answer:
i must go from n-1 down to 0

Which immediately gives us:

for (int i = n - 1; i >= 0; i--)

This loop order is not an optimization — it is required by correctness.

Step 3: What about the inner loop?

Once we are inside a fixed row i, we are computing:

dp[i][0], dp[i][1], dp[i][2], ..., dp[i][x]

Now ask the next crucial question:
Does dp[i][k] depend on dp[i][k-1]?
The answer is no.
It depends only on:

dp[i+1][k]
dp[i+1][k - h[i]]

Which means:

All k values inside the same row are independent
They can be computed in any order
As long as row i+1 already exists

Step 4: The natural loop structure

Putting this together, the correct loop order becomes:

for (int i = n - 2; i >= 0; i--) {   // dependency loop
    for (int k = 0; k <= x; k++) {  // free loop
        compute dp[i][k];
    }
}

This loop order matches the DP logic itself, not just performance concerns.

A crucial mental rule to avoid this confusion
The dimension that appears in the DP dependency must be the outer loop.
Look at the formula again:

dp[i][k] = f(dp[i+1][...])

i appears on the right-hand side
k does not

So we can say:

i controls the correctness order
k is a secondary, free dimension

Which gives us the rule:

for (i ...)        // dependency dimension
    for (k ...)    // free dimension

Once you start thinking in these terms, “flipping loops” stops being a trick and becomes a natural consequence of the DP definition.

Conclusion

At the end I came to a realization that big O notation only gives you an idea of how many operations are performed. It says nothing about how fast memory is accessed in these operations.
Once n × x gets large enough, cache behavior matters more than anything else.

Goroutines, OS Threads, and the Go Scheduler — A Deep Dive That Actually Makes Sense

Arunabh Gupta — Sun, 30 Nov 2025 22:31:20 +0000

If you’ve ever tried learning go routines, you’ve probably came across the line “They’re lightweight threads”. But then the questions start coming in:

“Are they real threads ?”
“How can go run millions of them ?”
“What is this GMP thingy ?"

I had the same questions and a lot of confusion when I started learning Go. Goroutines seems magical - almost too good to be true. The more research I did, things started to make some sense.

So I went deeper down the rabbit hole to understand more about how goroutines work along with OS threads. This article is a journey and my attempt to explain how everything works.

Introduction

Let’s first get out basics clear. We need to know about “Processes”, “Threads”, “Context-Switching” and some other basic concepts like stack size ( although they are not that important for the context of this article ).

Process

A process is an independent execution environment created by the operating system.

It consists of:

A private virtual address space (VAS) mapped by the OS and MMU ( Memory Management Unit )
Executable code (the program’s text segment)
A heap for dynamic memory allocation
One or more stacks, one per thread
File descriptors referencing kernel-managed resources (files, sockets)
Environment variables inherited from its parent
Process control metadata, including PID, scheduling priority, credentials, resource limits, and runtime statistics

Processes provide isolation: each process runs with its own memory mappings and cannot directly access the memory of other processes.

Context switching between processes requires switching address spaces, memory mappings, and other resource structures, which makes process switches relatively expensive.

In simple terms, A process is just a running program with it’s own isolated memory and system resources. Two processes will never interfere with each other’s execution and would stay isolated.

Threads

Now that we understand what a process is, the next piece is understanding threads, because goroutines are build on this concept.

A single process may contain one or many threads, all sharing the same:

virtual address space
heap
global variables
open file descriptors
code and libraries

However, each thread has:

its own stack
its own program counter (PC)
its own CPU register set
its own thread-local storage (TLS)
its own kernel scheduling metadata

Threads within the same process run independently and may execute concurrently on different CPU cores ( through context switching. We’ll come to this in next section ).

Because threads share memory, they can communicate cheaply — but must also use synchronization mechanisms (mutexes, semaphores) to avoid race conditions.

Race condition is a situation where 2 or more threads try to access the same shared data at the same time and atleast one of them is trying to update/modify it. The final outcome of the data depends on timing of those threads execution. Since the execution timings are unpredictable, results become random and inconsistent everytime we run the program making it difficult to debug the code. That’s why mutexes and semaphores are important to avoid race conditions since they lock the shared data once a thread has access to it. This way other threads don’t have access to that data till the execution of the first thread is finished and the lock is released.

In summary, threads are the kernel’s basic unit of CPU execution inside a process. Or in more simpler words , these are most basic sequence of instructions inside a process which run on cpu core 1 at a time. If you have a Octa core process ( means 8 cores ) then at a time, 8 threads can run simultaneously achieving true parallelism.

Even though switching between threads is cheaper than switching between processes, it’s still not free. Thread context switches involve several steps:

1. Saving/loading CPU registers: Each thread has its own execution state — registers like the program counter, stack pointer, general-purpose registers, etc.

When switching threads, the OS must save the registers of the outgoing thread and restore the registers of the incoming thread, which takes time.

2. Switching stacks: Every thread has its own stack.

A context switch requires switching the stack pointer from one thread’s stack to another’s.

This means the CPU must now begin reading/writing function frames from a completely different memory region. These stack sizes can range from 1Mb to 8 Mbs due to which it takes time in context switching.

3. Possible TLB and cache effects: TLB = Translation Lookaside Buffer, a small high-speed cache that stores recently used virtual-to-physical memory address translations.

Switching threads can cause TLB misses and cache invalidations, which forces the CPU to reload memory mappings or fetch data from slower memory levels, reducing performance.

4. Involvement of the OS scheduler Thread switching requires a tap into the operating system (kernel mode).

The OS scheduler must:

decide which thread runs next
update scheduling metadata
manage states like runnable, waiting, or blocked This kernel-mode transition adds a lot of overhead.

Context Switching

A context switch is the act of the operating system pausing one running thread or process and resuming another.

Because the CPU can run only one thread per core at a time, the OS must rapidly switch between multiple threads to provide concurrency.

When the OS switches from Thread A → Thread B, it must:

Save the CPU registers (program counter, stack pointer, general-purpose registers, flags) of Thread A
Save Thread A’s kernel metadata (scheduling state, priority, CPU usage stats)
Load Thread B’s saved register state
Switch to Thread B’s stack
Possibly switch address spaces (if switching between processes)
Update scheduling queues and bookkeeping (updating all the small pieces of internal data the OS keeps to track the state of each thread or process)

This entire procedure is performed by the OS scheduler and requires switching from user mode to kernel mode, running scheduling logic, then returning back to user mode.

A context switch ensures that each runnable thread gets a fair share of CPU time but comes with performance costs due to register saving, stack switching, TLB/cache effects, and kernel involvement. Following are the most common reasons why context switching is expensive:

Saving & loading registers: The CPU must store all registers of the old thread and restore the registers of the new one — its entire execution state.
Switching stacks: Each thread has its own stack. The CPU has to stop using one thread’s stack and start using another’s.
Kernel involvement: Switching threads requires entering the kernel, updating run queues and priorities, then returning to user mode.
TLB (Translation Lookaside Buffer) effects: Switching between processes requires switching the page table and flushing part of the TLB, slowing down memory access.
Cache disruption: Each thread often works on different memory regions. Switching threads may cause cache misses because the CPU has to load new data from memory.

All this makes context switching far from free, even though modern CPUs and OSes optimize it heavily. This is where goroutines shine !!!

What are Goroutines and how are they different from OS threads ?

Concurrency is one of the Go’s biggest strengths, and goroutines are at the center of it. But in order to appreciate why goroutines are special, we need to understand what they are and how are they different from OS threads.

What are Goroutines ?

A goroutine is a lightweight function that runs independently and concurrently within a go program. More technical definition would be - a goroutine is a user-space execution unit ( or user space threads ) managed entirely by go runtime. It has ->

Its own stack, starting with a very small size ( ~2KB) compared to OS threads
The ability to grow or shrink its stack size on demand
Scheduling performed by the go runtime scheduler, no the OS scheduler
Extremely low creation and context-switching cost

Goroutines are multiplexed on the OS threads using M:N mapping

M goroutines are mapped to N OS threads. This is why goroutines can scale to hundreds of thousands or even millions in a single process.

How are Goroutines different from OS Threads ?

Although they both represent unit of execution or a sequence of instructions, goroutines and OS threads differ in almost every important way.

Scheduling
- Threads: scheduled by OS kernel
- Goroutines: schedules by Go runtime ( this is why goroutine switching is way cheaper since it never traps into kernel )
Stack size
- OS threads: large, fixed-size stacks ( 1MB-8MB each )
- Goroutines: tiny, flexible stacks ( start ~2KB, size is flexible. This is one of the main reasons why go can handle such a massive concurrency )
Creation Cost:
- Thread creation is expensive since kernel and memory allocation is involved.
- Goroutine creation is extremely cheap ( just user space allocation )
Context switching cost
- Thread switching involves saving registers, switching stacks, kernel mode transitions, scheduler logic, and cache/TLB effects.
- Goroutine switching is done inside the Go runtime and requires only saving a small amount of state. Since no kernel is involved, creation is very fast.
Memory & Resource Usage
- Thread consumes megabytes of memory.
- Goroutines consume kilobyts. This allows go programs to use thousands of go routines safely.
Blocking behaviour:
- A blocking syscall blocks an entire OS thread
- A blocking I/O operation in Go parks the goroutine, not the OS thread. Runtime efficiently reassigns another goroutine to the freed OS thread. In this way OS threads are never sitting idle giving higher performance.

A goroutine is a lightweight, user-space thread scheduled by the Go runtime, while an OS thread is a heavyweight execution unit scheduled by the operating system.

Goroutines are cheaper, faster, and far more scalable than OS threads.

What is Go runtime ?

By now we’ve talked about processes, threads, and goroutines — but there’s a crucial piece sitting between goroutines and the operating system: the Go runtime.

This runtime is what actually makes goroutines possible.

The Go runtime is a mini operating system that runs inside your Go program. More formally , it is a user-space runtime system bundled with every Go program.

It takes care of everything the OS doesn’t handle for you, including:

running and scheduling goroutines ( GMP model )
managing memory and garbage collection
handling timers
dealing with network and system calls
growing and shrinking goroutine stacks
waking goroutines when events happen

When you run a Go program, you’re not just running your code — you’re also running this runtime, which works in the background and keeps the whole concurrency system running smoothly.

The key thing to understand:

Goroutines don’t run on the OS.

They run on top of the Go runtime, and the runtime runs on OS threads.

This is what makes goroutines so efficient.

You can think of go runtime as a middle layer between goroutines and OS threads.

GMP Model, M:N mapping ?

Goroutines don’t run directly on CPU cores, and they aren’t scheduled by the operating system.

Instead, Go uses a custom, high-performance scheduler called the GMP model, which is at the heart of Go’s concurrency design.

Understanding GMP is crucial because it explains how thousands of goroutines can be multiplexed onto a small number of OS threads efficiently.

What is the GMP model ?

G stands for goroutine ( obviously ). In other words, a lightweight, user-space execution unit.

It contains it’s own tiny stack, it’s program counter and rest of the meta data. Note the G cannot run on a CPU on itself.

M stands for Machine ( OS thread ). Basically, it is a real operating system thread. This is what the OS scheduler actually runs on a OS. 1 M can run only 1 G at a time. Switching between G’s happen inside M itself. If the M has to do a context switch, then all the G’s assigned to that M will be parked for time being. Once that M returns back, all the states of previously assigned G’s are loaded back.

P stands for Processor. Note that I am not talking about the actual CPU cores here. Processor represents a logical scheduler token which run a queue of goroutines. Only an M holding P is allowed to execute a Go code.

P1 → M1 → G1, G2, G3...

P2 → M2 → G4, G5...

P3 → M3 → G6...

...

If let’s say G1 is blocked due to some I/O ( api call or file read ), then M1 will park that goroutine ( it’s actually the go runtime which parks the G1 by marking it as waiting. Since go code is running on M1, we just say that M1 parks the G1 ) and picks the next available G in P1’s queue. This way OS thread is always busy avoiding wasting CPU time.

Also note that P’s run queue is managed by go runtime.

M:N mapping ?

Let’s say for example you have 100,000 goroutines and 8 cpu cores ( an octa core processor ), and runtime might create around 10-20 OS threads (M’s).

Then these M’s will get scheduled on the 8 cores by the OS. Each M runs one goroutine at a time. Each P has a queue of goroutines waiting to run. When one goroutine is blocked ( e.g. on a channel or syscall ), then M picks another goroutine from P’s run queue.

So M:N basically means that M number of goroutines and mapped to N number of OS threads. ( Please note that here M represent a NUMBER of goroutines, not OS thread of GMP model. I know… letter convention sucks ).

A simple example: How the go runtime schedules goroutines

This example will touch the surface level of scheduling. I won’t be discussing any scheduling algorithms here.

Let’s say we have one OS thread (M), one processor (P), and three goroutines (G1, G2, G3). When these go routines are started, these are placed in the P’s local run queue.

Step 1: M begins running G1

The OS scheduler picks M1 (an OS thread) to run on a CPU core.
M1 owns P1, which contains the run queue:

[G1, G2, G3].

M1 pops G1 from the front of the queue and begins executing it.

Step 2: G1 hits a blocking point

Let’s say G1 makes a channel receive or waits on a network read.

Because this is Go-managed blocking, the runtime notices that G1 cannot continue.

The go runtime parks G1
It moves G1 to some appropriate wait list (e.g., waiting on a channel or the network poller)

G1 is now blocked, but importantly:

M1 is NOT blocked. The OS thread stays free.

Step 3: M1 picks the next runnable goroutine

Since G1 is parked, M1 looks at P1’s run queue.

Remaining goroutines:

[G2, G3]

M1 selects G2 and begins executing it.

This is a goroutine context switch, done entirely in user space by the runtime (no OS involvement).

It switches:

G1’s PC and stack pointer saved
G2’s PC and stack pointer restored

This is extremely fast.

Step 4: OS preempts the thread (OS-level context switch)

While G2 is running, the OS timer interrupt fires.

The OS scheduler says:

“Time slice over. Let’s run another OS thread.”

M1 is paused
OS loads another OS thread, say M7, onto the CPU
M7 may belong to a totally different program

This is an OS context switch — heavier and more costly.

Inside the paused M1:

G2 is still waiting to resume
P1 is still attached to M1

When the OS eventually puts M1 back onto the CPU, Go continues running G2 from exactly where it left off.

Step 5: G1 becomes unblocked

Suppose the network poller signals that data arrived.

The runtime marks G1 as runnable again
G1 is placed back into P1’s run queue

Queue becomes:

[G3, G1]

Step 6: M1 finishes G2 and picks the next goroutine

When G2 yields or completes, M1 picks the next goroutine in P1’s queue.

Next is G3.

After G3 yields, M1 picks G1, which is runnable again.

Putting it All Together

Here’s what happened:

Go runtime context switched between goroutines (G1 → G2 → G3 → G1)

Done entirely in user space
Very cheap
No OS involvement

OS scheduler context switched M1 off the CPU

OS-level
Expensive
Paused whatever goroutine M1 was running

P’s run queue kept track of which goroutines were runnable

Managed by the Go runtime
M1 always pulled new work from P1

Blocked goroutines didn’t block the OS thread

Thanks to goroutine parking
OS thread stayed productive, always busy

Why in CPU bound tasks making more go routines doesn’t make any sense?

Goroutines shine when tasks involve waiting (network I/O, disk I/O, timers, channels, etc.), because they allow other goroutines to run while one is blocked.

But in CPU-bound tasks—like computing primes, hashing, compression, physics simulation, image processing—goroutines don’t help beyond a certain point.

Let’s break down the reasons

There are only N CPU cores, so only N goroutines can run at the same time. If your machine has 8 CPU cores, only 8 threads/G’s can run simultaneously—no matter how many goroutines you spawn. Everything else just waits in queues. So if your CPU can do 8 things at a time, spawning 100,000 CPU-bound goroutines won’t make it faster. It will only add overhead.
Extra goroutines increase scheduler overhead. Every extra runnable CPU-bound goroutine: must be queued, must be picked up by the scheduler, must eventually run, must involve G→G context switching. When the tasks are CPU-bound, each goroutine never blocks, so the scheduler has fewer opportunities to efficiently switch them. Too many unblocked goroutines = too many unnecessary context switches. This overhead can reduce total throughput.

You don’t need more than GOMAXPROCS goroutines for parallel CPU work. If GOMAXPROCS = 8, spawning 8 CPU-bound goroutines achieves maximum parallelism.

Anything more:

doesn’t increase speed
increases scheduling overhead
increases memory usage
causes more context switching
lowers cache locality

So ideal number of goroutines for CPU-bound tasks ≈ number of CPU cores.

Conclusion

Goroutines aren’t just lightweight threads — they’re part of a carefully designed runtime system that makes Go highly scalable. By understanding processes, threads, context switching, and the GMP model, it becomes clear why Go doesn’t rely on OS threads alone.

Goroutines work so well because they use tiny, growable stacks, fast user-space scheduling, and an efficient M:N mapping to OS threads. This lets Go run thousands or even millions of concurrent tasks without the cost of creating thousands of OS threads.

In the end, the message is simple:

OS threads handle parallelism; goroutines enable massive concurrency.

Together, they give Go its power, performance, and simplicity.

This brings us to the end of the blog. I’ve tried my best to explain everything I know in simple terms. AI helped a lot in shaping this article since I’m still improving my writing skills.

Your feedback would mean a lot — it helps me learn and write better.

Creating a very basic gRPC server

Arunabh Gupta — Mon, 03 Nov 2025 19:09:46 +0000

Hi everyone, I started out learning gRPC and created a very basic gRPC server. Although it doesn't do much, it still has a lot of technical stuff going on. So here is a simple breakdown of my attempt at making a gRPC server. I will try to explain everything in simple words and make everything crystal clear.

First of all, let's address the question "What the hell is gRPC ???"
In simple words gRPC is a opensource framework for remote procedure calls (RPC) developed by google. It allows a client to call a function on a server as if it is a local function making it ideal for distributed systems and microservices. It uses protocol buffers which can be serialized into smaller binary format, making it much more efficient and faster than JSON format.It also used HTTP/2 as the transport protocol, enabling faster, persistent connections and features essential for streaming.

Before we move further, there are some prerequisites that you need to be familiar with.

basics of golang
pointers, structs, struct embedding, interfaces
context
stubs
protocol buffers

If you don't know these, don't worry; I'll give a small explanation of all these concepts when used.

Let's first start with our file system. We will create a folder named "grpcdemo". Inside this directory,o pen up your terminal and run the following command:

go mod init grpcdemo

It creates a plain text file named "go.mod" in the current directory. This file is the root of your module and contains all the metadata Go needs to manage your project's dependencies.

Here is my folder structure. If you are following this aritcle then make sure to keep the folder structure consistent.

└── 📁grpcdemo
    └── 📁client
        ├── client.go
    └── 📁proto
        ├── greet.proto
    └── 📁server
        ├── server.go
    └── go.mod

We'll go over each one of them one by one starting with the greet.proto.

syntax="proto3";

package pb;

option go_package = "grpcdemo/pb";

service GreetService{
    rpc SayHello (HelloRequest) returns (HelloResponse);
}

message HelloRequest{
    string name = 1;
}

message HelloResponse{
    string message = 1;
}

Let's first discuss what a proto file is. A .proto file is a simple text file that defines the structure of data you want to serialize and transfer, commonly used in conjunction with Protocol Buffers (protobuf, in short). Protobuf is nothing but a data serialization mechanism. It's language agnostic (independent of the language we are writing our code in). It was developed to serialize data more efficiently across various internal services. Normally we use REST APIs which is fine and relatively easier to implement, but protobufs are preffered in more high performant services.

First line declares the syntax for proto file since there are versions available.

The second line specifies the package name for the Protocol Buffer definitions contained within the file.

Third line is a Go-specific instruction for the protoc compiler. Its primary purpose is to define the Go package name and import path for the generated files (greet.pb.go and greet_grpc.pb.go).

Protoc is a compiler which compiles the .proto files and generates functional source code from the schema defined in the proto file. It uses plugins for generating the executable code.

Protocol Buffer Compiler Installation | Protocol Buffers Documentation

How to install the protocol buffer compiler.

protobuf.dev

After installing protoc use the following command to generate executable go files from out greet.proto file. Make sure you are in the root directory.

protoc --go_out=../ --go-grpc_out=../ proto/greet.proto

It executes two separate code generation passes, dictated by the two output flags:

--go_out=../,

plugin: protoc-gen-go,

description: Generates files like greet.pb.go.,"Contains the Go structs (e.g., HelloRequest, HelloResponse) for your data messages, along with serialization and deserialization methods." ../ is the path we pass relative to which our go files will be generated. In this case files will be generated inside pb directory.

--go-grpc_out=../,

plugin: protoc-gen-go-grpc,

description: Generates files like greet_grpc.pb.go (name may vary).,Contains the Go interfaces (GreetServiceServer) for implementing the service and the client stubs for calling the service.

After running the command, file system should look something like this:

└── 📁grpcdemo
    └── 📁client
        ├── client.go
    └── 📁pb
        ├── greet_grpc.pb.go
        ├── greet.pb.go
    └── 📁proto
        ├── greet.proto
    └── 📁server
        ├── server.go
    ├── go.mod
    └── go.sum

You can see that it generated a new directory pb along with two go files inside. We will use these go files in our server.go and client.go. Now let's discuss about the services and messages in our proto file.

messages & services are the most important parts of a .proto file since they defined the schema of the data and function we need to serialize.

In messages we define the structure of the data/payload we need to pass between server and client.

Notice the numbers assigned to each field: name = 1; and message = 1;. In Protocol Buffers, these numbers (not the field names) are used to identify the field in the serialized binary data. They are known as field tags. These tags must be unique within the message and must never be changed once your service is in production, as changing them will break backward compatibility with older clients or servers.

In services we define the rpc function signature. This is the function that can be called by the client, in our case it's SayHello.

Now the greet_grpc.pb.go and greet.pb.go files need not to be meddled with since they are automatically generated. If you want to change anything, make the changes in .proto file.

It's time for server.go file.

package main

import (
    "context"
    "grpcdemo/pb"
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/protobuf/types/known/emptypb"
)

type Server struct {
    pb.UnimplementedGreetServiceServer
}

func (s *Server) SayHello(ctx context.Context, req *pb.HelloRequest) (*pb.HelloResponse, error) {
    name := req.GetName()

    result := "Hello " + name + ", Welcome to grpc tutorial !!!"

    res := &pb.HelloResponse{Message: result}

    return res, nil
}

func main() {
    lis, err := net.Listen("tcp", ":12345")
    log.Print("gRPC server is running on port 12345")
    if err != nil {
        log.Fatalf("failed to listen : %v", err)
    }

    s := grpc.NewServer()

    pb.RegisterGreetServiceServer(s, &Server{})

    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to server: %v", err)
    }
}

There are a bunch of things happening in this code.

type Server struct {
    pb.UnimplementedGreetServiceServer
}

The protocol buffer compiler (protoc) generated an Interface called GreetServiceServer. This interface lists every RPC method defined in your .proto file (currently, just SayHello). To build a valid gRPC server, your Server struct must satisfy this entire interface.

By embedding the UnimplementedGreetServiceServer struct (which provides a default, error-returning implementation for all methods), we accomplish two things:

Initial Satisfaction: You instantly satisfy the entire GreetServiceServer interface without explicitly writing SayHello implementation yet. Doing this would ensure forward compatibility. It's like a default implementation for GreetServiceServer interface inside greet_grpc.pb.go file
The Override: Your manually written SayHello method overrides the placeholder version, making your implementation the one that actually runs.

(Do check out the greet_grpc.pb.go file to find the GreetServiceServer interface)

func (s *Server) SayHello(ctx context.Context, req *pb.HelloRequest) (*pb.HelloResponse, error) {
    name := req.GetName()

    result := "Hello " + name + ", Welcome to grpc tutorial !!!"

    res := &pb.HelloResponse{Message: result}

    return res, nil
}

This is the implementation of the SayHello method we defined in the .proto file.

func main() {
    lis, err := net.Listen("tcp", ":12345")
    log.Print("gRPC server is running on port 12345")
    if err != nil {
        log.Fatalf("failed to listen : %v", err)
    }

    s := grpc.NewServer()

    pb.RegisterGreetServiceServer(s, &Server{})

    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to server: %v", err)
    }
}

Here comes the fun part. The net.Listen function creates a TCP listener (lis) on the specified port (:12345). This step effectively tells the operating system, "I want to reserve this port and listen for incoming connections." If the port is already in use or access is denied, the program exits with an error.

s := grpc.NewServer() line creates an instance of gRPC server which handles all the complex internal tasks of grpc.

pb.RegisterGreetServiceServer(s, &Server{}) line links/registers our Server implementation to grpc server instance. In other words it tells the grpc server that if any request comes targetting the "GreetService" then redirect that request to our defined Server struct.

Finally s.Serve(lis) launches our server. It takes the network listener (lis) and begins an infinite loop, continuously accepting new client connections.

Now let's move on to the client implementation. We will make a function call to the SayHello as if it was a local function.

package main

import (
    "context"
    "grpcdemo/pb"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    conn, err := grpc.NewClient(
        "localhost:12345",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }

    defer conn.Close()

    greetClient := pb.NewGreetServiceClient(conn)

    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    res, err := greetClient.SayHello(ctx, &pb.HelloRequest{Name: "Billy"})
    if err != nil {
        log.Fatalf("Failed to greet: %v", err)
    }

    log.Printf("Greeting: %s", res.GetMessage())
}

Great !!!, now let's first test the code. Open two terminals. In first one run the server.go file using go run main.go and similarly run client.go file as well. You should see the following output on client terminal.

2025/11/03 22:36:20 Greeting: Hello Billy, Welcome to grpc tutorial !!!

Great Work !!!
Here is a breakdown of the client.go file.

grpc.NewClient("localhost:12345",grpc.WithTransportCredentials(insecure.NewCredentials()),)
This line creates a persistent connection to the open server running on port ":12345". Since this is a simple implementation, I haven't passed any proper credentials.

greetClient := pb.NewGreetServiceClient(conn)

The function pb.NewGreetServiceClient() is generated by the Protocol Buffer compiler. It takes the established network connection (conn) and returns an object (greetClient). All methods called on this stub (like SayHello) are automatically handled by gRPC, taking care of all the complicated underlying networking stuff.

res, err := greetClient.SayHello(ctx, &pb.HelloRequest{Name: "Billy"})

Finally, greetClient is used to call the SayHello method on the client stub. This is what we call an RPC. Print out the response variable to check if everything worked out fine or not.

That's the end of this short project. I definitely got to learn a lot about protocol buffers and how they work with gRPC. The biggest takeaways were realizing how Protocol Buffers define the contract and how the client stub hides the network overhead. Understanding the need for struct embedding to achieve forward compatibility was the critical Go-specific lesson. Please let me know if I missed out something or went wrong somewhere.

Here's the source code for this project. You can play around using this as a base. Try to create more services, more methods inside the GreetService only and interact with the gRPC server.

Forem: Arunabh Gupta

How I Stopped Manually Setting Up Virtual Machines

What are VMs?

Why do we use VMs ?

Oracle VirtualBox

Vagrant!!!

A little bit of networking mumbo jumbo

Provisioning?

1. Shell Provisioning (The Scripted Approach)

2. File Provisioning (The Configuration Transfer)

3. Configuration Management (The Scalable Way)

MultiVM Setup: Real Advantage of Vagrant !!!

Conclusion: Why Vagrant Still Matters in a Containerized World

The "Infrastructure as Code" Shift

DP Isn’t Just About Big-O: How Cache Misses Killed My Knapsack

The DP that should have passed

The mistake I didn’t know was a mistake

The missing concept: memory access

Cache lines: why memory is fetched in chunks

Why sequential access is fast

Why skipping memory is slow

Why “skipping memory” is disastrous

One cache miss vs many cache misses

The fix (almost anticlimactic)

How to identify this mistake in other problems

Step 1: Identify the dependency in the DP formula

Step 2: Translate dependency into an ordering constraint

Step 3: What about the inner loop?

Step 4: The natural loop structure

Conclusion

Goroutines, OS Threads, and the Go Scheduler — A Deep Dive That Actually Makes Sense

Introduction

Process

Threads

Context Switching

What are Goroutines and how are they different from OS threads ?

What are Goroutines ?

How are Goroutines different from OS Threads ?

What is Go runtime ?

GMP Model, M:N mapping ?

What is the GMP model ?

M:N mapping ?

A simple example: How the go runtime schedules goroutines

Step 1: M begins running G1

Step 2: G1 hits a blocking point

Step 3: M1 picks the next runnable goroutine

Step 4: OS preempts the thread (OS-level context switch)

Step 5: G1 becomes unblocked

Step 6: M1 finishes G2 and picks the next goroutine

Putting it All Together

Go runtime context switched between goroutines (G1 → G2 → G3 → G1)

** OS scheduler context switched M1 off the CPU**

** P’s run queue kept track of which goroutines were runnable**

** Blocked goroutines didn’t block the OS thread**

Why in CPU bound tasks making more go routines doesn’t make any sense?

Conclusion

Creating a very basic gRPC server

Protocol Buffer Compiler Installation | Protocol Buffers Documentation

OS scheduler context switched M1 off the CPU

P’s run queue kept track of which goroutines were runnable

Blocked goroutines didn’t block the OS thread