<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ernesto Enriquez</title>
    <description>The latest articles on Forem by Ernesto Enriquez (@ernesto905).</description>
    <link>https://forem.com/ernesto905</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3820288%2F8589ffa5-726f-4653-a7f2-db3fd0bbf476.png</url>
      <title>Forem: Ernesto Enriquez</title>
      <link>https://forem.com/ernesto905</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ernesto905"/>
    <language>en</language>
    <item>
      <title>Lifecycle of a process</title>
      <dc:creator>Ernesto Enriquez</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:34:37 +0000</pubDate>
      <link>https://forem.com/ernesto905/lifecycle-of-a-process-1amm</link>
      <guid>https://forem.com/ernesto905/lifecycle-of-a-process-1amm</guid>
      <description>&lt;p&gt;With Linux version 7 just around the corner, I thought it would be interesting to trace process creation in as much painstaking detail as one weekend and 6 cups of coffee would allow. &lt;/p&gt;

&lt;p&gt;Namely, I want to answer the question of what &lt;em&gt;exactly&lt;/em&gt; happens when we open a new browser tab or start our favorite video game. I’ll be focusing on processes spawned by other processes&lt;sup id="fnref1"&gt;1&lt;/sup&gt; through libc, since the C standard library is about as close as we're going to get to the bedrock of the very universe. If you use CPython, this mechanism applies as well. &lt;/p&gt;

&lt;p&gt;We'll be taking the following journey together: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fdiagrame.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fdiagrame.png" width="639" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I like to think of this as expanding on &lt;a href="https://thorstenball.com/blog/2014/06/13/where-did-fork-go/" rel="noopener noreferrer"&gt;Thorsten Ball’s “Where did fork go?”&lt;/a&gt;. In his post, Thorsten gives insight into what happens when you fork a process. It’s a great read, you should check it out. &lt;/p&gt;

&lt;p&gt;The mechanism I’ll be describing also covers processes started by a programmer. Indeed, this is what most people &lt;em&gt;see&lt;/em&gt; when they use a computer to create some application. Under the hood, however, there is always a process required to start another process, such as a shell (e.g., &lt;code&gt;./&amp;lt;process&amp;gt;&lt;/code&gt; in bash) or an init system (e.g., creating a service with systemd).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fastronaut-process.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fastronaut-process.png" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Simple.c
&lt;/h2&gt;

&lt;p&gt;Consider a C program whose only purpose in life is to spawn a child process (id est, &lt;a href="https://www.youtube.com/watch?v=3ht-ZyJOV2k" rel="noopener noreferrer"&gt;it passes butter&lt;/a&gt;). &lt;br&gt;
Why so simple? I figured this program would impose the least cognitive load. I do this for your understanding. I don’t think you’re stupid. I think you’re smart, and beautiful, and precious, and worth letting merge into my lane during rush hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simple.c&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;unistd.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before continuing, I recommend reading the &lt;a href="https://www.man7.org/linux/man-pages/man2/fork.2.html" rel="noopener noreferrer"&gt;fork man page&lt;/a&gt;, at least up to the errors section. It’s quite a short read, and if I recall correctly, there was a Harvard health study claiming that programmers who read Linux man pages are 78% less likely to develop carpal tunnel [//TODO: citation needed]&lt;/p&gt;

&lt;p&gt;The tl;dr is that fork() is a function in libc&lt;sup id="fnref2"&gt;2&lt;/sup&gt; that spawns a child process from the parent process that called it. This child is &lt;em&gt;almost&lt;/em&gt; identical to the parent. &lt;br&gt;
Like the parent, the child will capture the return value of fork() and execute any code after it. &lt;/p&gt;

&lt;p&gt;Unlike the parent, however, the child will not execute fork() or any code before it. Moreover, the child’s fork() will return a different value, the child will have a different process ID than the parent, and a couple of other things will differ that you’d know about if you read the manual. &lt;/p&gt;

&lt;p&gt;Fork is a system call (not to be confused with fork(), the libc function). System calls are services the Linux kernel provides to processes in user space. You can also think of system calls as an API provided by the kernel. The API &lt;em&gt;contract&lt;/em&gt; in this sense includes things like which values are expected in which registers and which assembly instruction (e.g., &lt;code&gt;int 0x80&lt;/code&gt;, &lt;code&gt;syscall&lt;/code&gt;, etc.) to execute. The C standard library (libc) also provides an API that simplifies working with system calls, in the form of functions that wrap the kernel’s system call API.&lt;/p&gt;

&lt;p&gt;You could, in theory, request these services directly from the kernel, bypassing libc completely. If you’re particularly keen on managing thread safety, thinking about register values, and programming in assembly then please, by all means, knock yourself out. &lt;/p&gt;

&lt;p&gt;Now here’s the kicker: fork(), the libc function, does &lt;em&gt;not&lt;/em&gt; call Fork, the system call. Rather, fork() &lt;em&gt;eventually&lt;/em&gt; calls Clone, yet another system call. &lt;/p&gt;

&lt;p&gt;Why is this? &lt;/p&gt;

&lt;p&gt;Let’s dive into fork() and see what we find. &lt;/p&gt;
&lt;h2&gt;
  
  
  A journey of a thousand miles
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// unistd.h&lt;/span&gt;
&lt;span class="cm"&gt;/* Clone the calling process, creating an exact copy.
   Return -1 for errors, 0 to the new process,
   and the process ID of the new process to the old process.  */&lt;/span&gt;
&lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="n"&gt;__pid_t&lt;/span&gt; &lt;span class="n"&gt;fork&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;__THROWNL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is the prototype for the fork() function. If you’re on Linux (congrats!), this file is typically found at &lt;/p&gt;

&lt;p&gt;&lt;code&gt;/usr/include/unistd.h&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The implementation of this peculiar little function takes up just over 140 lines of code at the time of this writing. You won't find said implementation in your file system, though. By the time you use it, it’s already a shared object file, ready to be linked to your lovely programs at runtime. The object file in question is libc.so, found in /usr/lib/.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/kraj/glibc/blob/9da7ad6d74700811c9b4c82b5f5eb555e39241a7/posix/fork.c" rel="noopener noreferrer"&gt;fork.c file&lt;/a&gt; in the POSIX portion of glibc if you want to take a look at the implementation. &lt;/p&gt;

&lt;p&gt;The first thing you might notice is that fork.c does not implement: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;__pid_t fork (void)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kraj/glibc/blob/9da7ad6d74700811c9b4c82b5f5eb555e39241a7/posix/fork.c#L142" rel="noopener noreferrer"&gt;Towards the bottom of the file&lt;/a&gt; you’ll find:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weak_alias (__libc_fork, fork)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Okay, so calling the fork() function in your code actually executes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pid_t __libc_fork (void)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You’ll also notice inside the __libc_fork function that &lt;a href="https://github.com/kraj/glibc/blob/9da7ad6d74700811c9b4c82b5f5eb555e39241a7/posix/fork.c#L75" rel="noopener noreferrer"&gt;line 75&lt;/a&gt; makes a call to _Fork(). &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pid_t pid = _Fork ();&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So fork() in your code actually runs __libc_fork(), whose job is (among other things) to handle threading and eventually call _Fork(). &lt;/p&gt;

&lt;p&gt;Funny enough, _Fork() actually exists in a few places within the standard library. At build time, the _Fork() you execute is decided by your machine’s OS and CPU&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. &lt;/p&gt;

&lt;p&gt;We’ll be focusing on the &lt;a href="https://github.com/kraj/glibc/blob/9da7ad6d74700811c9b4c82b5f5eb555e39241a7/sysdeps/nptl/_Fork.c#L25" rel="noopener noreferrer"&gt;_Fork.c file within sysdeps/nptl&lt;/a&gt;. Here, we see a call is made to: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pid_t pid = arch_fork (&amp;amp;THREAD_SELF-&amp;gt;tid);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;arch_fork() is defined in its own &lt;a href="https://github.com/kraj/glibc/blob/master/sysdeps/unix/sysv/linux/arch-fork.h#L35" rel="noopener noreferrer"&gt;header file&lt;/a&gt;, and here is where we start to see how the sausage is made.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* Call the clone syscall with fork semantic.  The CTID address is used
   to store the child thread ID at its location, to erase it in child memory
   when the child exits, and do a wakeup on the futex at that address.

   The architecture with non-default kernel abi semantic should correctly
   override it with one of the supported calling convention (check generic
   kernel-features.h for the clone abi variants).  */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;pid_t&lt;/span&gt;
&lt;span class="nf"&gt;arch_fork&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLONE_CHILD_SETTID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;CLONE_CHILD_CLEARTID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;SIGCHLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef __ASSUME_CLONE_BACKWARDS
# ifdef INLINE_CLONE_SYSCALL
&lt;/span&gt;  &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INLINE_CLONE_SYSCALL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;# else
&lt;/span&gt;  &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INLINE_SYSCALL_CALL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;# endif
#elif defined(__ASSUME_CLONE_BACKWARDS2)
&lt;/span&gt;  &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INLINE_SYSCALL_CALL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#elif defined(__ASSUME_CLONE_BACKWARDS3)
&lt;/span&gt;  &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INLINE_SYSCALL_CALL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#elif defined(__ASSUME_CLONE_DEFAULT)
&lt;/span&gt;  &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INLINE_SYSCALL_CALL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
# error "Undefined clone variant"
#endif
&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember what I was saying about fork() calling Clone? Well, behold! &lt;/p&gt;

&lt;p&gt;Clone is used since it’s a more versatile system call. &lt;a href="https://unix.stackexchange.com/a/199695" rel="noopener noreferrer"&gt;You can read more about their differences here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At this point it seems what lies before us is a maze of macros. I encourage you to try making your way through, starting at the &lt;a href="https://elixir.bootlin.com/glibc/glibc-2.43.9000/source/sysdeps/unix/sysv/linux/arch-fork.h#L35" rel="noopener noreferrer"&gt;arch_fork function&lt;/a&gt;. See if you can find your way to the assembly. If you keep following this thread, starting at &lt;code&gt;INLINE_SYSCALL_CALL&lt;/code&gt;, you’ll eventually arrive at &lt;code&gt;internal_syscall0&lt;/code&gt;, &lt;code&gt;internal_syscall1&lt;/code&gt;, …, all the way up to &lt;code&gt;internal_syscall6&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;The clone system call takes five arguments, so we’re dealing with &lt;a href="https://elixir.bootlin.com/glibc/glibc-2.43.9000/source/sysdeps/unix/sysv/linux/x86_64/sysdep.h#L322" rel="noopener noreferrer"&gt;this guy&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#undef internal_syscall5
#define internal_syscall5(number, arg1, arg2, arg3, arg4, arg5) \
({                                  \
    unsigned long int resultvar;                    \
    TYPEFY (arg5, __arg5) = ARGIFY (arg5);              \
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);              \
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);              \
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);              \
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);              \
    register TYPEFY (arg5, _a5) asm ("r8") = __arg5;            \
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;           \
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;           \
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;           \
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;           \
    asm volatile (                          \
    "syscall\n\t"                           \
    : "=a" (resultvar)                          \
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4),     \
      "r" (_a5)                             \
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);            \
    (long int) resultvar;                       \
})
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There you have it. We’re loading up our registers and executing the syscall instruction directly in assembly. &lt;/p&gt;

&lt;p&gt;What a ride, huh? &lt;/p&gt;

&lt;p&gt;I think that’s enough digging through our boy Stallman et al’s magnum opus. It’s a wonderful repository, to say the least, and I implore you to spend some time &lt;a href="https://sourceware.org/glibc/sources.html" rel="noopener noreferrer"&gt;exploring it&lt;/a&gt; time permitting. I recommend starting with the &lt;a href="https://sourceware.org/glibc/manual/latest/html_mono/libc.html#Main-Menu" rel="noopener noreferrer"&gt;table of contents&lt;/a&gt; in the manual, picking whatever seems interesting, and digging, digging, digging!&lt;/p&gt;

&lt;p&gt;At this point, you might be asking: “Okay dude, but what does the assembly &lt;em&gt;actually&lt;/em&gt; look like?” &lt;/p&gt;

&lt;p&gt;Were you actually thinking that? What a nerd. &lt;/p&gt;

&lt;p&gt;But fair enough, looking at this from a lower level of abstraction should help solidify what in Davy Jones’ locker is actually happening. Let’s look at some assembly.&lt;/p&gt;

&lt;h2&gt;
  
  
  A lower level of abstraction
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcc &lt;span class="nt"&gt;-S&lt;/span&gt; simple.c &lt;span class="nt"&gt;-o&lt;/span&gt; simple.S
&lt;span class="nb"&gt;cat &lt;/span&gt;simple.S
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    .file    "simple.c"
    .text
    .globl    main
    .type    main, @function
main:
.LFB0:
    .cfi_startproc
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    call    fork@PLT        # look at me, look at me, im mr meeseeks look at me!!
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size    main, .-main
    .ident    "GCC: (GNU) 15.2.1 20260209"
    .section    .note.GNU-stack,"",@progbits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime, when the &lt;code&gt;call&lt;/code&gt; assembly instruction is executed, our CPU will start executing fork in libc. Where do we find fork? We can use the &lt;a href="https://www.man7.org/linux/man-pages/man1/ldd.1.html" rel="noopener noreferrer"&gt;ldd command&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcc simple.c &lt;span class="nt"&gt;-o&lt;/span&gt; simple
ldd simple 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    linux-vdso.so.1 (0x00007f7edfa0f000)
    libc.so.6 =&amp;gt; /usr/lib/libc.so.6 (0x00007f7edf7f2000)
    /lib64/ld-linux-x86-64.so.2 =&amp;gt; /usr/lib64/ld-linux-x86-64.so.2 (0x00007f7edfa11000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Awesome!  Next, let’s disassemble the libc.so.6 object file and find &lt;em&gt;exactly&lt;/em&gt; where the system call is made. Remember the _Fork() function? That should give us a clue of where to look. I’ll spare you the trouble of finding the precise location of the system call. This is what worked for me, feel free to tweak it to your heart’s content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;objdump &lt;span class="nt"&gt;-d&lt;/span&gt; /usr/lib/libc.so.6 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 1 &lt;span class="s2"&gt;"&amp;lt;_Fork@@GLIBC_2.34&amp;gt;"&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 56 | &lt;span class="nb"&gt;nl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     1    00000000000e4840 &amp;lt;_Fork@@GLIBC_2.34&amp;gt;:
     2       e4840:    f3 0f 1e fa              endbr64
     3       e4844:    55                       push   %rbp
     …
    20       e4881:    b8 38 00 00 00           mov    $0x38,%eax
    21       e4886:    0f 05                    syscall
     …

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See line 21? Lovely, isn't it? &lt;/p&gt;

&lt;p&gt;How do we know &lt;em&gt;that's&lt;/em&gt; the system call we're looking for? You can tell what system call a program makes by looking at the value in the EAX register right before the &lt;code&gt;syscall&lt;/code&gt; assembly instruction (or &lt;code&gt;int 0x80&lt;/code&gt; on a 32-bit ISA). &lt;/p&gt;

&lt;p&gt;In our case, we're moving the value &lt;code&gt;0x38&lt;/code&gt; into our EAX register. That’s decimal 56: the syscall number for clone on x86-64.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;syscall&lt;/code&gt; assembly instruction is executed, a couple of things happen very, very quickly. The important bits are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the value of an &lt;a href="https://en.wikipedia.org/wiki/Model-specific_register" rel="noopener noreferrer"&gt;MSR&lt;/a&gt; into the CS register for the privilege switch from user mode to kernel mode. &lt;/li&gt;
&lt;li&gt;Save the instruction pointer in the RIP register to the RCX register for returning from our syscall.&lt;/li&gt;
&lt;li&gt;Read the value of another MSR into the RIP register to point to the kernel's virtual address where we’ll resume execution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are just the highlights; if you want to learn more, I recommend taking a look at your CPU manufacturer’s Software Developer Manual. I also recommend a dark roast triple shot espresso – you’ll need it. &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#:~:text=Intel%C2%AE%2064%20and%20IA%2D32%20Architectures%20Software%20Developer,767375.%20*%20Updated%202/27/2026.%20*%20Version%20Latest." rel="noopener noreferrer"&gt;For my ISA, the manual is provided by Intel.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That pretty much wraps things up! At this point you’ll find yourself executing code within the kernel itself. What code exactly depends on your architecture, but for 64-bit x86 systems you’ll want to take a look at the &lt;a href="https://elixir.bootlin.com/linux/v6.19.8/source/arch/x86/entry/entry_64.S" rel="noopener noreferrer"&gt;entry_64.S assembly file&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Here is that diagram again; it should make a lot more sense now!&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fdiagrame.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2F2026%2FLife%2520of%2520a%2520process%2Fdiagrame.png" width="639" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel POV
&lt;/h2&gt;

&lt;p&gt;For the curious, this is what things look like from a very, very high level on the kernel side: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;From the entry point, you’re routed to the system call handler &lt;/li&gt;
&lt;li&gt;From the system call handler you’re routed to the function for cloning your parent process:  &lt;code&gt;kernel_clone&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;code&gt;kernel_clone&lt;/code&gt; you’ll first clone your process and then ship it off to the scheduler. &lt;/li&gt;
&lt;li&gt;The scheduler will run the child process eventually. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks for reading, and keep digging!&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Another approach worth mentioning, but beyond our scope, are processes created by the OS at boot time (i.e., &lt;a href="https://unix.stackexchange.com/a/361255" rel="noopener noreferrer"&gt;the idle process with PID 0&lt;/a&gt; and &lt;a href="https://linux.wiki/docs/commands/system-service/init/" rel="noopener noreferrer"&gt;the init system&lt;/a&gt; with PID 1). The mechanisms behind these are quite interesting, and if you’d like to learn more I recommend &lt;a href="https://0xax.gitbooks.io/linux-insides/content/Initialization/" rel="noopener noreferrer"&gt;this excellent chapter from the Linux Insides online textbook&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;To be more precise, fork() is a function in the POSIX portion of the standard library, which is part of libc. Libc is part of glibc; also known as GNU libc (or as I've recently taken to calling it, GNU plus libc).   ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Glibc has an algorithm for figuring out what code to compile for your system beforehand. &lt;a href="https://sourceware.org/glibc/manual/latest/html_mono/libc.html#Layout-of-the-sysdeps-Directory-Hierarchy" rel="noopener noreferrer"&gt;It’s not too complicated.&lt;/a&gt;  ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>assembly</category>
      <category>c</category>
    </item>
    <item>
      <title>How much torment can my little homelab take? Part 1.</title>
      <dc:creator>Ernesto Enriquez</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:22:04 +0000</pubDate>
      <link>https://forem.com/ernesto905/how-much-torment-can-my-little-homelab-take-part-1-hlg</link>
      <guid>https://forem.com/ernesto905/how-much-torment-can-my-little-homelab-take-part-1-hlg</guid>
      <description>&lt;h4&gt;
  
  
  My setup ain’t much. I have a laptop running Arch and a desktop running Debian. I'm worth a grand total of 32 gigs of RAM, 24 CPU cores, and 6 feet of Cat 8 Ethernet cable.
&lt;/h4&gt;




&lt;p&gt;There’s a gnarly little question gnawing at my &lt;a href="https://www.youtube.com/watch?v=SmaTPPB-T_s&amp;amp;t=318s" rel="noopener noreferrer"&gt;nucleus accumbens&lt;/a&gt;. How many requests per second can my $700 setup handle? What about reads per second, or writes!?&lt;/p&gt;

&lt;p&gt;Assuming a Java web application and relational database, can it handle, say, 10,000 of each?&lt;/p&gt;

&lt;p&gt;Probably not! In fact, it’s a ridiculous suggestion. I mean, what am I, crazy? Naive? Blissfully unaware of the economic state of consumer hardware? Well, I’m going to try it anyway! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fun fact:&lt;/strong&gt; 10k req/sec is about four times what &lt;a href="https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/" rel="noopener noreferrer"&gt;Stack Overflow was doing back in 2016&lt;/a&gt; with bare-metal, enterprise-level hardware.&lt;br&gt;
&lt;strong&gt;Fun fact 2:&lt;/strong&gt; One of my homelab’s fans doesn’t work. Thought you might find that mildly amusing.&lt;/p&gt;

&lt;p&gt;The architecture is simple. &lt;a href="https://github.com/spring-projects/spring-petclinic" rel="noopener noreferrer"&gt;Spring pet clinic&lt;/a&gt; is a sample MVC app that uses PostgreSQL for storage. &lt;a href="https://spring-petclinic-889427590364.europe-west9.run.app/" rel="noopener noreferrer"&gt;Here is what it looks like deployed&lt;/a&gt;. I’ll be using Grafana K6 for stress testing. &lt;/p&gt;

&lt;p&gt;I’ll be targeting the following endpoints: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /&lt;/code&gt; – requests per second. &lt;br&gt;
&lt;code&gt;GET /owners/{ownerID}&lt;/code&gt; – triggers a read from Postgres. &lt;br&gt;
&lt;code&gt;POST /owners/new&lt;/code&gt; – triggers a write to Postgres. &lt;/p&gt;

&lt;p&gt;Given that the JVM is notorious for its cold-start problem&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, we’ll include a ramp-up period for each stress test. We’ll go from 1% of max rps -&amp;gt; 10% -&amp;gt; 50% -&amp;gt; 100%, spending 30 seconds ramping up to each stage and holding the max stage for 60 seconds. &lt;/p&gt;

&lt;p&gt;For example, at 1k rps: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Floadexample.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Floadexample.png"&gt;&lt;/a&gt;&lt;/p&gt;
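&lt;p&gt;For reference, that ramp maps onto a k6 scenario roughly like this. This is a sketch using the ramping-arrival-rate executor; the host and the VU pool sizes are placeholders, not my actual config:&lt;/p&gt;

```javascript
// load.js - a sketch of the 1k rps ramp profile described above.
// The target host and VU pool sizes are placeholders.
import http from 'k6/http';

export const options = {
  scenarios: {
    ramp_to_1k: {
      executor: 'ramping-arrival-rate',
      startRate: 0,            // begin from zero arrivals
      timeUnit: '1s',          // rates below are per second
      preAllocatedVUs: 200,    // placeholder pool sizes
      maxVUs: 2000,
      stages: [
        { target: 10, duration: '30s' },   // ramp to 1% of 1k rps
        { target: 100, duration: '30s' },  // 10%
        { target: 500, duration: '30s' },  // 50%
        { target: 1000, duration: '30s' }, // 100%
        { target: 1000, duration: '60s' }, // hold max for a minute
      ],
    },
  },
};

export default function () {
  http.get('http://petclinic.local/'); // placeholder host for the cluster
}
```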

&lt;p&gt;I decided to containerize everything from the get-go and deploy it on Minikube, but I’ll eventually move to bare metal or k3s, since those are more resource-efficient. Also, I’ll start with a single node and add a second machine later (probably in a continuation post). It’s a journey, after all. &lt;/p&gt;

&lt;p&gt;Anyways, did you catch all that? Here, look, some Excalidraw: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Farchitecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Farchitecture.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ll be writing down findings, headaches, and optimizations I make along the way.  &lt;/p&gt;

&lt;p&gt;Let’s see, am I forgetting anything before we start? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[x] State goal &lt;/li&gt;
&lt;li&gt;[x] Describe application and infrastructure &lt;/li&gt;
&lt;li&gt;[x] Scatter droll remarks across the article and mention that I use Arch Linux at least twice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Oh yeah! Our only SLO is a 0% request drop rate under the target load. In other words, a single queued request left unhandled by the end of a run yields a big fat failure.&lt;/p&gt;
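&lt;p&gt;In k6 terms, that SLO could be expressed as thresholds along these lines (my illustration, not the script from this post): &lt;code&gt;dropped_iterations&lt;/code&gt; counts requests an arrival-rate executor couldn’t start in time, and &lt;code&gt;http_req_failed&lt;/code&gt; tracks failed responses.&lt;/p&gt;

```javascript
// Hedged sketch of the 0%-drop SLO as k6 thresholds. Any dropped iteration
// or failed request marks the whole run as a failure.
const options = {
  thresholds: {
    dropped_iterations: ["count==0"], // nothing left sitting in the queue
    http_req_failed: ["rate==0"],     // and nothing errored out
  },
};

console.log(options.thresholds);
```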
&lt;h3&gt;
  
  
  1 X per second
&lt;/h3&gt;

&lt;p&gt;As you might have guessed, this was pretty much smooth sailing. I’m going to show the 9 experiments back to back for this one, since the throughput is so low. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fonexpersec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fonexpersec.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first three tests are req/sec, then reads, then writes. &lt;/p&gt;

&lt;p&gt;That latency spike at the start was the JVM warming up. Two more (smaller) latency spikes follow when we switch to reads and writes. They’re different code paths, after all! &lt;/p&gt;

&lt;p&gt;Here are the Postgres metrics: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fonexpersec2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fonexpersec2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s turn up the heat. &lt;/p&gt;
&lt;h3&gt;
  
  
  1000 X per second
&lt;/h3&gt;
&lt;h4&gt;
  
  
  &lt;em&gt;Requests&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000req.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000req.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All systems are nominal. Take a look at those latency numbers, though! Each time we started the load test, you’d see a hit to performance, followed by a massive, sharp improvement. They don’t call it HotSpot for nuthin. &lt;/p&gt;
&lt;h4&gt;
  
  
  Next, let’s do some reads.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000jvmreads.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000jvmreads.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000writes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000writes.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All systems nominal part 2.&lt;/p&gt;

&lt;p&gt;Honestly, I didn't think I'd make it this far with the default Minikube limits (2 cores and 2GB of memory). 1k requests per second is a little over 86 million requests per day. Pretty grande numero, compadre. &lt;/p&gt;
&lt;h3&gt;
  
  
  Writes:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000writesjvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F1000writesjvm.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F100writespostgres.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F100writespostgres.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It didn't work! Of the 3 experiments, only the second managed to write the 60k records into Postgres. &lt;/p&gt;

&lt;p&gt;So, why did runs 1 and 3 fail? Two metrics immediately stand out. &lt;/p&gt;

&lt;p&gt;First, take a look at the thread state (chart 1 row 2 column 3). For both of the poo-poo runs, the number of threads in the “Timed waiting” state hit 200.&lt;/p&gt;

&lt;p&gt;Second, the average memory usage (chart 2 row 1 column 2) for the database dashboard tells a similar story, peaking at around the same time our number of timed waiting threads hit 200.&lt;/p&gt;

&lt;p&gt;We know that Spring Boot defaults to Tomcat as its embedded web server. Out of the box, Tomcat caps its worker pool at 200 threads. Each request gets its own worker thread, and each worker thread wraps an OS thread. We’re hitting that limit around the same time everything breaks.&lt;/p&gt;


&lt;p&gt;&lt;br&gt;
    &lt;strong&gt; Hypothesis #1:&lt;/strong&gt; We can solve this problem by increasing Tomcat’s thread limit.&lt;br&gt;
  &lt;/p&gt;

&lt;p&gt;To be perfectly honest, increasing the number of worker threads smells a little funky. It feels like slapping a band-aid on a gash that needs stitches. Or maybe closer to slapping a band-aid on a radiation burn. More operating system threads means worrying about the cost of thread context switches, which means higher CPU utilization, which introduces latency, which, as we’ve discussed before, makes puppies cry&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. You wouldn’t want to make puppies cry, would you? &lt;/p&gt;

&lt;p&gt;There is a more modern answer to thread pool exhaustion: virtual threads, a Java 21+ feature. Here, &lt;a href="https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html" rel="noopener noreferrer"&gt;read the friendly manual to learn more&lt;/a&gt;. &lt;/p&gt;


&lt;p&gt;&lt;br&gt;
    &lt;strong&gt; Hypothesis #2:&lt;/strong&gt; We can solve this problem by introducing virtual threads. &lt;br&gt;
  &lt;/p&gt;

&lt;p&gt;Another culprit may be the nature of the work itself. What if our threads are caught up in some sort of I/O? Say a request comes in. One of our threads might be like “i got this bro”. That thread then goes to the Postgres connection pool and grabs a connection. Nine more threads do just this, exhausting our default connection pool of 10.&lt;/p&gt;

&lt;p&gt;Requests are still coming in, though. Any thread that can’t get a Postgres connection goes into the TIMED_WAITING state. So even if we increase the number of threads or switch to virtual threads, that wouldn’t help as much as, say, increasing the size of the connection pool. I could be wrong though; let’s throw some scientific method at it and see what happens!&lt;/p&gt;


&lt;p&gt;&lt;br&gt;
    &lt;strong&gt; Hypothesis #3:&lt;/strong&gt; We can solve this problem by increasing the HikariCP connection pool size.&lt;br&gt;
  &lt;/p&gt;

&lt;p&gt;Now, say none of these work. In that case, the problem is most likely CPU throttling. Postgres is trying to commit a thousand records per second on two CPU cores, the poor little guy. I’d prefer to delay throwing more compute at the problem for as long as possible. But if we simply can’t handle the load, then we’ll brute-force our way through and leave more time-consuming optimizations for later. &lt;/p&gt;


&lt;p&gt;&lt;br&gt;
    &lt;strong&gt; Hypothesis #4:&lt;/strong&gt; We can solve this problem by increasing the number of CPUs (cores).&lt;br&gt;
  &lt;/p&gt;

&lt;p&gt;Pause here and take a crack at guessing what happens when we: &lt;/p&gt;

&lt;p&gt;(1) increase the maximum number of Tomcat worker threads 200-&amp;gt;400 &lt;br&gt;
(2) switch to virtual threads &lt;br&gt;
(3) increase the HikariCP connection pool size 10-&amp;gt;50&lt;br&gt;
(4) increase the number of cores 2-&amp;gt;12&lt;/p&gt;
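&lt;p&gt;For reference, the first three knobs map to Spring Boot configuration roughly like this (a sketch of the settings involved, not my exact config; &lt;code&gt;spring.threads.virtual.enabled&lt;/code&gt; requires Spring Boot 3.2+):&lt;/p&gt;

```properties
# (1) raise Tomcat's worker thread cap
server.tomcat.threads.max=400
# (2) serve requests on virtual threads instead
spring.threads.virtual.enabled=true
# (3) grow the HikariCP connection pool
spring.datasource.hikari.maximum-pool-size=50
```

&lt;p&gt;Change (4) happens at the Minikube level, e.g. &lt;code&gt;minikube start --cpus 12&lt;/code&gt;.&lt;/p&gt;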

&lt;p&gt;&lt;em&gt;Did you take a guess? Please take a guess. Please. Come on, man. Think of the kids.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hypothesis #1 result: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres1part1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres1part1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It didn’t work! Oddly enough, we ended up using the same number of threads. I did a little digging, and it turns out the thread state metric captures Tomcat workers plus the JVM’s internal threads. In other words, either we weren’t reaching the limit on the Tomcat side at all, Tomcat decided it didn’t need to spawn more worker threads, or something else entirely. I humbly regret to inform you that I'm leaning towards something else. Figuring out exactly why would require further observability/instrumentation (with my luck, it will turn out to be something quite obvious), and knowing the cause (probably) wouldn’t get us a boost in performance, so I'm moving on! &lt;/p&gt;

&lt;p&gt;Hypothesis #2 result: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres1part2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres1part2.png"&gt;&lt;/a&gt;&lt;br&gt;
Well, we fixed the thread pool exhaustion issue. But no cigar; we’re still not hitting those 60k writes. Interestingly enough, our p99 latencies are looking pretty scrumptious compared to the non-virtual threads. Seems virtual threads may be part of a late-game meta? &lt;/p&gt;

&lt;p&gt;Hypothesis #3 result: Great news everyone! It didn’t work! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part2.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres3part3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think this may have actually been our worst performance. The mechanics of why that happened are pretty interesting, albeit beyond the scope of this article. But here, if you want to learn more, &lt;a href="https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing" rel="noopener noreferrer"&gt;click me!&lt;/a&gt; &lt;a href="https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing" rel="noopener noreferrer"&gt;Or me!&lt;/a&gt; Or &lt;a href="https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing" rel="noopener noreferrer"&gt;even me!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TL;DR: you should set the number of connections within a pool to&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(cpu cores * 2) + effective spindle count&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;If you make it any bigger, then you’re increasing the thread count to your detriment. In other words, context switches are gonna get cha’. Sometimes, less is more!&lt;/p&gt;
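&lt;p&gt;As a quick sanity check on that formula (a toy calculator, not anything from HikariCP itself; “effective spindle count” is roughly how many concurrent disk I/Os your storage can serve, often taken as 1 for a single-disk box):&lt;/p&gt;

```javascript
// The HikariCP wiki's pool-sizing rule of thumb: (cores * 2) + spindles.
function recommendedPoolSize(cpuCores, effectiveSpindles) {
  return cpuCores * 2 + effectiveSpindles;
}

console.log(recommendedPoolSize(2, 1));  // Minikube's default 2 cores -> 5
console.log(recommendedPoolSize(12, 1)); // after bumping to 12 cores -> 25
```

&lt;p&gt;On the default 2-core setup, the rule suggests around 5 connections, so cranking the pool from 10 to 50 moved us further from the sweet spot, not closer.&lt;/p&gt;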

&lt;p&gt;Hypothesis #4 result: Okay, this one &lt;em&gt;actually&lt;/em&gt; worked! &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It seems 1k writes per second really is too much for Minikube’s default 2 CPU cores after all. Whooda thunk?  &lt;/p&gt;

&lt;p&gt;And as a final sanity check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fres4part3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perfect! &lt;/p&gt;

&lt;p&gt;Okay, let’s do 2.5k now. We’ll keep 12 cores and up the memory to 12GiB. We’ll also stick to the default thread settings for now. I want to see how far that’ll take us. &lt;/p&gt;

&lt;h3&gt;
  
  
  2500 X per second
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Requests
&lt;/h4&gt;

&lt;p&gt;Utter failure. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F2500requests.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2F2500requests.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice anything different about these charts compared to our initial results for 1k writes per second? First, we cap out at a little over 2,200 requests per second; that is, we never actually hit our goal. Second, our thread states are spiky, whereas during the 1k writes we’d plateau at our limit of 200 threads in the timed-waiting state. Third, our latency distributions are suspiciously consistent. Not really something you’d expect from a system crumbling under an unbearably girthy throughput. &lt;/p&gt;

&lt;p&gt;Another thing you might notice from our thread states is the new spike of threads in the “Runnable” state. &lt;/p&gt;

&lt;h3&gt;
  
  
  I give up.
&lt;/h3&gt;

&lt;p&gt;I don’t think we’re going to hit 2500 RPS today, but I want to leave things off with a few observations and next steps. &lt;/p&gt;

&lt;p&gt;There are a couple of questions whose answers would greatly help going into part 2. I ran 3 more tests for each category and here are the numbers (worst case):&lt;/p&gt;

&lt;h4&gt;
  
  
  Requests
&lt;/h4&gt;

&lt;p&gt;What is our p(99) at a manageable load (1000 RPS)?&lt;/p&gt;

&lt;p&gt;-&amp;gt; 15.26ms&lt;/p&gt;

&lt;p&gt;What is the max latency at a manageable load (1000 RPS)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 191.84ms &lt;/p&gt;

&lt;p&gt;At what throughput does the request queue start growing (i.e., the arrival rate &amp;gt; queue service rate)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 2083 requests per second&lt;/p&gt;

&lt;h4&gt;
  
  
  Reads
&lt;/h4&gt;

&lt;p&gt;What is our p(99) at a manageable load (1000 Reads per second)?&lt;/p&gt;

&lt;p&gt;-&amp;gt; 10.58ms&lt;/p&gt;

&lt;p&gt;What is the max latency at a manageable load (1000 Reads per second)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 27ms&lt;/p&gt;

&lt;p&gt;At what throughput does the request queue start growing (i.e., the arrival rate &amp;gt; queue service rate)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 1818 reads per second&lt;/p&gt;

&lt;h4&gt;
  
  
  Writes
&lt;/h4&gt;

&lt;p&gt;What is our p(99) at a manageable load (1000 writes per second)?&lt;/p&gt;

&lt;p&gt;-&amp;gt; 10.25ms&lt;/p&gt;

&lt;p&gt;What is the max latency at a manageable load (1000 writes per second)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 209ms&lt;/p&gt;

&lt;p&gt;At what throughput does the request queue start growing (i.e., the arrival rate &amp;gt; queue service rate)? &lt;/p&gt;

&lt;p&gt;-&amp;gt; 2041 writes per second &lt;/p&gt;

&lt;h4&gt;
  
  
  On growing queues
&lt;/h4&gt;

&lt;p&gt;We can tell the request queue is growing by the number of virtual users (VUs) that dynamically spin up during our load tests. Virtual users are an abstraction provided by Grafana k6 that simulates requests from real users. I used &lt;a href="https://en.wikipedia.org/wiki/Little%27s_law" rel="noopener noreferrer"&gt;Little’s law&lt;/a&gt; to set the initial number of VUs, and k6 dynamically increases the VU count up to some maximum (which I set to 500). My guess is they use Little’s law or something similar. In other words: more latency -&amp;gt; more VUs dynamically allocated.&lt;/p&gt;

&lt;p&gt;Under the hood, virtual users are goroutines. Goroutines are similar to Java’s virtual threads in their purpose – concurrency through abstraction – and the cost of too many goroutines (and therefore VUs) is memory.&lt;/p&gt;

&lt;p&gt;All this to say: I’m monitoring the k6 logs for spikes in the number of dynamically allocated VUs as a sign/symptom that requests are arriving faster than the Java application can handle. For example, within a matter of 5 seconds, the number of VUs shoots from 3 to our maximum of 500. Across three runs, the experiment below began to queue around the 1950 iterations per second mark. &lt;/p&gt;
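&lt;p&gt;Little’s law says the average number of requests in flight equals the arrival rate times the average time each request spends in the system (L = λW). A toy version of the VU estimate looks like this (illustrative numbers, not my exact k6 math):&lt;/p&gt;

```javascript
// Little's law: concurrent requests = arrival rate * time in system.
// Each in-flight request occupies a VU, so this doubles as a VU estimate.
function vusNeeded(arrivalRatePerSec, avgLatencyMs) {
  return Math.ceil((arrivalRatePerSec * avgLatencyMs) / 1000);
}

console.log(vusNeeded(1000, 15));  // 1k rps at ~15ms -> about 15 VUs
console.log(vusNeeded(1000, 500)); // same rate at 500ms -> 500 VUs
```

&lt;p&gt;Which is exactly the failure signature above: latency blows up, the VU count chases it, and we slam into the 500-VU ceiling.&lt;/p&gt;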

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fvuexplanation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyoeahnfntijezaoargpm.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2FBlog%2520images%2FTormenting%2520homelab%2520series%2Fvuexplanation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Moving forward
&lt;/h3&gt;

&lt;p&gt;I think I’ve run out of low hanging fruit. I’m planning on making the following changes for part two: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase to two nodes. I’ll run pgbench against a containerized Postgres instance on both of my machines. Whichever SSD performs better in transactions per second (TPS) will host Postgres. &lt;/li&gt;
&lt;li&gt;Switch off Minikube to a lighter distribution. Namely, k3s, which is made for IoT, edge devices, and sad, penurious homelabs :( &lt;/li&gt;
&lt;li&gt;Come in with some application-level data on where the bottlenecks are at runtime. I’ll be using async-profiler. &lt;/li&gt;
&lt;li&gt;Generally drink more water and get more sleep&lt;/li&gt;
&lt;li&gt;Go back to virtual threads. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until next time! &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;At runtime, the JVM’s JIT compiler turns bytecode into “optimized” native machine code. Optimized code is good because it uses fewer CPU cycles to perform the same instructions, meaning we have more CPU cycles left over to do more work. However, if we bombard the REST API with an onslaught of requests before this optimization takes place, CPU utilization skyrockets, leading to CPU throttling -&amp;gt; higher latency -&amp;gt; unhappy users -&amp;gt; crying baby puppies. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;See footnote one right above me. What, you don't read footnotes? Do you think you're better than me or something? ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>linux</category>
      <category>sre</category>
      <category>devops</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
