Forem: amir

How I Analyzed the Linux Kernel's Deadliest Logic Bug: A Deep Dive into Dirty Pipe (CVE-2022-0847)

amir — Fri, 22 May 2026 06:34:56 +0000

As developers, we often think of kernel exploits as highly complex assembly-level wizardry, heap grooming, or race-condition battles. But recently, I decided to sit down, pull up the Linux kernel source code, and trace the infamous Dirty Pipe vulnerability, CVE-2022-0847, line by line.

What I found was mind-blowing: a single uninitialized struct member in the pipe and splice() code path (lib/iov_iter.c) allowed an unprivileged local user to overwrite data in files they could only read, by writing into the Page Cache.

No race conditions.

No classic memory corruption.

No heap spraying.

Just one stale flag in a reused kernel structure.

This is my technical post-mortem and step-by-step code analysis of how this elegant logic bug worked.

The Conceptual Backstory: Page Cache, Pipes, and `splice()`

Before looking at the buggy code, we need to understand the three Linux kernel mechanisms that collided to create Dirty Pipe:

The Page Cache
Pipe buffers
The splice() system call

1. The Page Cache: RAM as a Disk Mirror

To avoid slow disk reads, Linux keeps recently accessed file data in memory. This memory-backed representation is called the Page Cache.

When multiple processes read the same file, for example /etc/passwd, the kernel does not necessarily load separate copies for every process. Instead, it can map those processes to the same physical memory page that represents the file's cached content.

Normally, when a process writes to a private mapping of such a page, the kernel's Copy-on-Write mechanism protects the original data:

The original page remains unchanged.
A private copy is created.
The process writes to that private copy.
The read-only backing file remains safe.

That is the expected contract.

Dirty Pipe broke that contract.

2. The Pipe Buffer

In Linux, a pipe is implemented as a circular ring of buffers represented internally by struct pipe_inode_info.

Each slot in that ring is a struct pipe_buffer, defined in include/linux/pipe_fs_i.h:

struct pipe_buffer {
    struct page *page;
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags; // <-- the field that matters here
    unsigned long private;
};

The important field is:

unsigned int flags;

When data is written to a pipe, the kernel may allocate page-sized buffers, usually 4 KB. Unless the pipe is in packet mode (O_DIRECT), it marks each newly allocated buffer as mergeable by setting:

PIPE_BUF_FLAG_CAN_MERGE

That flag tells the kernel:

New writes may be appended into the remaining space of this existing pipe buffer instead of allocating a new one.

That behavior is perfectly valid for normal anonymous pipe pages.

The problem appears when a pipe buffer stops pointing to a normal anonymous pipe page and starts pointing to a page from the Page Cache.

3. The `splice()` Syscall: Zero-Copy Magic

The splice() system call is a Linux performance optimization. It moves data between file descriptors and pipes without copying data back and forth through user space.

Instead of doing this:

file -> kernel buffer -> user space -> kernel pipe buffer

splice() can do something closer to this:

file page cache -> pipe buffer reference

That is powerful because it avoids unnecessary copying.

But it also means a pipe buffer can reference a page that belongs to the Page Cache of a file.

Internally, one of the relevant functions is:

copy_page_to_iter_pipe()

This function creates a pipe buffer that references the page containing file data.

That is where the bug lived.

Digging Into the Code: The Bug in `lib/iov_iter.c`

When splice() is used to map file data into a pipe, the kernel executes code similar to this vulnerable version of copy_page_to_iter_pipe():

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
                                     struct iov_iter *i)
{
    // ... validation steps ...
    struct pipe_inode_info *pipe = i->pipe;
    struct pipe_buffer *buf = &pipe->bufs[head & mask];

    buf->ops = &page_cache_pipe_buf_ops;
    get_page(page);
    buf->page = page;
    buf->offset = offset;
    buf->len = bytes;
    // What is missing here?

    pipe->head = head + 1;
    return bytes;
}

The missing line is the entire bug:

buf->flags = 0;

buf->flags was never initialized or cleared.

Because pipes are implemented as circular rings, the kernel reuses old pipe_buffer structures. If a previous operation left PIPE_BUF_FLAG_CAN_MERGE set, that stale flag could remain active when the same buffer slot was reused for Page Cache-backed file data.

That means a buffer referencing a read-only file page could accidentally still look mergeable.

That is the core of Dirty Pipe.
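To make the slot-reuse problem concrete, here is a toy user-space model in Go. It is not kernel code: the two-slot ring, the 8-byte "page", and the names pipe, splice, and run are simplifications I made up for illustration. The only thing it tries to reproduce is the logic: a field-by-field setup that forgets flags lets a later write merge into a page it should never touch.

```go
package main

import "fmt"

const (
	pageSize = 8      // tiny "page" so the example stays readable
	canMerge = 1 << 0 // stands in for PIPE_BUF_FLAG_CAN_MERGE
)

type page struct{ data []byte }

// pipeBuffer mirrors only the fields of struct pipe_buffer used here.
type pipeBuffer struct {
	page   *page
	offset int
	length int
	flags  uint
}

// pipe is a two-slot ring. Draining only resets the count, so old slot
// contents (including flags) stay behind for the next user of the slot.
type pipe struct {
	bufs    [2]pipeBuffer
	head    int  // next slot to fill
	used    int  // number of occupied slots
	patched bool // true: splice clears flags, like the upstream fix
}

// write appends to the newest buffer if it is flagged mergeable and has
// room; otherwise it takes a fresh anonymous page. It reports false when
// the ring is full.
func (p *pipe) write(b []byte) bool {
	if p.used > 0 {
		last := &p.bufs[(p.head+len(p.bufs)-1)%len(p.bufs)]
		end := last.offset + last.length
		if last.flags&canMerge != 0 && end+len(b) <= pageSize {
			copy(last.page.data[end:], b)
			last.length += len(b)
			return true
		}
	}
	if p.used == len(p.bufs) {
		return false
	}
	pg := &page{data: make([]byte, pageSize)}
	n := copy(pg.data, b)
	p.bufs[p.head] = pipeBuffer{page: pg, length: n, flags: canMerge}
	p.head = (p.head + 1) % len(p.bufs)
	p.used++
	return true
}

// drain empties the pipe logically without touching the slots.
func (p *pipe) drain() { p.used = 0 }

// splice points the next slot at a file page, field by field, the way the
// vulnerable copy_page_to_iter_pipe() did.
func (p *pipe) splice(file *page, offset, length int) {
	buf := &p.bufs[p.head]
	buf.page = file
	buf.offset = offset
	buf.length = length
	if p.patched {
		buf.flags = 0 // the line that was missing
	}
	p.head = (p.head + 1) % len(p.bufs)
	p.used++
}

// run plays the sequence and returns the "file" page contents afterwards.
func run(patched bool) string {
	p := &pipe{patched: patched}
	full := make([]byte, pageSize)
	for range p.bufs {
		p.write(full) // every slot now carries canMerge
	}
	p.drain()
	file := &page{data: []byte("readonly")}
	p.splice(file, 0, 1)
	p.write([]byte("XY"))
	return string(file.data)
}

func main() {
	fmt.Println("unpatched:", run(false))
	fmt.Println("patched:  ", run(true))
}
```

In this model run(false) returns "rXYdonly" because the write is merged into the shared page, while run(true) returns "readonly" because the write lands in a fresh anonymous buffer.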

The Intersection of Two Commits

One thing I found especially interesting is that Dirty Pipe was not born from one obviously dangerous commit.

It came from the interaction of two separate changes:

1. Commit `241699cd72a8` — October 2016

This introduced the new pipe-backed iov_iter subsystem and added copy_page_to_iter_pipe().

The function did not initialize buf->flags.

At that time, this was not immediately exploitable because the dangerous merge flag did not exist yet.

2. Commit `f6dd975583bd` — May 2020

This added PIPE_BUF_FLAG_CAN_MERGE.

Suddenly, an old uninitialized field became security-critical.

That is the scary engineering lesson:

A harmless-looking initialization bug can become a critical vulnerability years later when another subsystem evolves.

Step-by-Step: How the Exploit Mechanics Worked

At a high level, the exploit forced the kernel into a bad state:

Prepare a pipe so all its internal buffer slots have PIPE_BUF_FLAG_CAN_MERGE set.
Drain the pipe so it becomes logically empty.
Use splice() to attach a read-only file's Page Cache page to a reused pipe buffer.
Because buf->flags was not cleared, the stale merge flag remains.
A later write to the pipe is merged into the Page Cache page.

The result: the in-memory cached representation of a read-only file is modified.

The disk file itself is not directly overwritten. The modification happens in the Page Cache.

Stage 1: Polluting the Pipe Buffers

The first step is to fill the pipe. This causes the kernel to allocate pipe buffers and mark them mergeable.

A simplified version looks like this:

int p[2];
pipe(p);

int capacity = fcntl(p[1], F_GETPIPE_SZ);
char dummy = 'A';

for (int r = capacity; r > 0; ) {
    int n = r > sizeof(dummy) ? sizeof(dummy) : r;
    write(p[1], &dummy, n);
    r -= n;
}

After this stage, the internal pipe buffer slots have been used and may contain PIPE_BUF_FLAG_CAN_MERGE.

Stage 2: Draining the Pipe

Next, the pipe is drained:

for (int r = capacity; r > 0; ) {
    int n = r > sizeof(dummy) ? sizeof(dummy) : r;
    read(p[0], &dummy, n);
    r -= n;
}

Now the pipe is logically empty.

But the kernel's internal pipe_buffer metadata is still there, ready to be reused.

The stale flags may still exist in those reused slots.

Stage 3: Splicing File Data into the Pipe

Then splice() is used to move data from a target file into the pipe without copying it through user space:

int fd = open("/path/to/read-only-file", O_RDONLY);
loff_t offset = 0;

splice(fd, &offset, p[1], NULL, 1, 0);

Behind the scenes, the kernel creates a pipe buffer that references the file's Page Cache page.

But because buf->flags was not cleared, the buffer may still have the old merge flag.

Now we have a dangerous state:

pipe_buffer.page  -> file Page Cache page
pipe_buffer.flags -> PIPE_BUF_FLAG_CAN_MERGE

That should never happen.

Stage 4: Writing into the Pipe

A subsequent write to the pipe is then treated as mergeable.

The kernel thinks it is appending data into a normal anonymous pipe page.

In reality, the buffer points to a file-backed Page Cache page.

So the write lands inside the cached file page.

That is why Dirty Pipe could modify the in-memory contents of files that the attacker should not have been able to write.

Why Dirty Pipe Was So Dangerous

Dirty Pipe was terrifying because it was not a fragile exploit.

No Race Condition

Dirty COW, CVE-2016-5195, depended on winning a race condition. Dirty Pipe did not.

There was no timing window to win.

No Classic Memory Corruption

This was not a buffer overflow or heap corruption bug.

The kernel was following its own logic, but that logic was operating on stale state.

High Reliability

Once the vulnerable state was created, the behavior was deterministic.

Page Cache Impact

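The modification happened in memory through the Page Cache. That means the on-disk file can remain unchanged: the kernel does not mark the page dirty, so the change is not written back on its own and disappears once the page is evicted or the machine reboots. Until then, every program reading the file sees the modified cached version.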

Dirty Pipe vs Dirty COW

Dirty Pipe and Dirty COW are often compared because both involve unexpected writes related to file-backed memory.

But the exploit style is very different.

Feature	Dirty COW	Dirty Pipe
CVE	CVE-2016-5195	CVE-2022-0847
Bug type	Race condition	Uninitialized/stale state logic bug
Reliability	Timing-dependent	Highly deterministic
Main mechanism	Copy-on-Write race	Stale `PIPE_BUF_FLAG_CAN_MERGE`
Kernel area	Memory management	Pipes, Page Cache, `splice()`

Dirty Pipe is a great reminder that not all dangerous vulnerabilities look like obvious memory corruption.

Sometimes the bug is just one field that was not reset.

The Upstream Fix

The fix was surprisingly small.

In the patched version, the kernel explicitly clears the flags when creating a new pipe buffer:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index b0e0acdf96c15e..6dd5330f7a9957 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -414,6 +414,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
         return 0;

     buf->ops = &page_cache_pipe_buf_ops;
+    buf->flags = 0;
     get_page(page);
     buf->page = page;
     buf->offset = offset;

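One line in this function. The same commit adds the identical assignment to push_pipe(), the other place in lib/iov_iter.c that sets up a new pipe buffer.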

One field.

A huge security impact.

Key Developer Takeaways

Analyzing Dirty Pipe gave me a stronger appreciation for defensive engineering in low-level systems.

1. Always Initialize Reused Structures

If a structure is reused, every stateful field should be explicitly initialized.

Relying on previous state is dangerous.

In kernel code, stale state is not just a bug. It can become a privilege escalation.
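The same lesson applies outside the kernel. As a small sketch in Go (the buffer type and both helper names are hypothetical), compare overwriting fields one by one with assigning a whole struct literal. The second form zeroes every field that is not named, including fields someone adds to the type years later.

```go
package main

import "fmt"

// buffer stands in for any recycled structure with stateful fields.
type buffer struct {
	data  []byte
	flags uint
}

const flagMergeable = 1 << 0

// reuseFieldByField overwrites only the fields the author remembered.
// Any field not listed (here: flags) keeps its value from the previous use.
func reuseFieldByField(b *buffer, data []byte) {
	b.data = data
}

// reuseWhole assigns a complete struct literal. Fields that are not named
// are reset to their zero value, including fields added to the type later.
func reuseWhole(b *buffer, data []byte) {
	*b = buffer{data: data}
}

func main() {
	a := &buffer{flags: flagMergeable}
	reuseFieldByField(a, []byte("new"))
	fmt.Println("field by field, stale flag kept:", a.flags&flagMergeable != 0)

	b := &buffer{flags: flagMergeable}
	reuseWhole(b, []byte("new"))
	fmt.Println("whole-struct reset, stale flag kept:", b.flags&flagMergeable != 0)
}
```

Whole-struct assignment is not always possible in hot paths, but when it is, it turns "remember every field" into "forget nothing by default".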

2. Flags Are Security Boundaries

A single bit can completely change how the kernel interprets memory.

PIPE_BUF_FLAG_CAN_MERGE looked like a performance optimization flag, but in the wrong context it became a security boundary bypass.

3. Subsystem Interactions Matter

The original missing initialization existed for years.

It became dangerous only after another feature introduced a new meaning for the stale field.

This is why reviewing only the changed file is not enough.

When adding new flags, modes, or state transitions, we should audit every path that creates, recycles, or reuses the structure.

4. Logic Bugs Can Be More Reliable Than Memory Corruption

Dirty Pipe was not powerful because it crashed the kernel or corrupted random memory.

It was powerful because the kernel's internal state machine became logically inconsistent.

That kind of bug can be easier to exploit and harder to detect.

5. Defensive Coding Is Not Optional in Systems Programming

In application code, forgetting to initialize a field may cause a weird UI bug or a failed request.

In kernel code, it may let an unprivileged user modify read-only file content.

That difference is why explicit initialization, careful invariants, and subsystem-level reviews are essential.

Exploit Discussion: Why I Will Not Weaponize It Here

At this point, it is tempting to drop a full copy-paste exploit and call the analysis complete.

Dirty Pipe is not just an academic bug. It is a real local privilege escalation vulnerability that can be used to modify sensitive files, abuse SUID binaries, and turn limited local execution into root-level impact on vulnerable systems.

So instead of publishing a weaponized exploit, I prefer to focus on the part that actually matters for experienced engineers: understanding the primitive, validating exposure safely, and reducing the blast radius.

The important idea is this:

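Dirty Pipe gives an attacker a write primitive into the Page Cache under very specific conditions: the attacker needs read access to the target file, the overwrite cannot start exactly on a page boundary or cross one, and the file cannot be grown.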

That is enough to explain the risk without handing someone a ready-made privilege escalation chain.

Safe Validation: How to Check Exposure Without Exploiting the Machine

The first thing I would check is the running kernel version.

uname -a
uname -r

Dirty Pipe affected Linux kernel versions starting from 5.8 and was fixed in patched kernel releases such as:

5.16.11
5.15.25
5.10.102

The exact package version depends on the distribution, because vendors often backport security fixes without changing the upstream kernel version in an obvious way.

That is why I do not rely only on uname -r in production. I also check the distribution security advisories and installed kernel changelog.
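As a first-pass triage, the upstream version ranges above can be turned into a small check. This is only a sketch in Go: upstreamAffected and fixedPatch are names I made up, and the code assumes two things the list above does not state, namely that 5.17 and later shipped with the fix and that the other 5.8 to 5.16 series never received an upstream fix. It knows nothing about distribution backports, so a "true" result on a vendor kernel means "go read the advisory", not "vulnerable".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionRE captures major, minor, and an optional patch number from the
// start of a release string such as "5.15.0-25-generic".
var versionRE = regexp.MustCompile(`^(\d+)\.(\d+)(?:\.(\d+))?`)

// fixedPatch maps an upstream 5.x stable series to its first fixed release:
// 5.10.102, 5.15.25, and 5.16.11.
var fixedPatch = map[int]int{10: 102, 15: 25, 16: 11}

// upstreamAffected reports whether an upstream kernel version falls in the
// Dirty Pipe range. It ignores vendor backports entirely.
func upstreamAffected(release string) (bool, error) {
	m := versionRE.FindStringSubmatch(release)
	if m == nil {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	major, _ := strconv.Atoi(m[1])
	minor, _ := strconv.Atoi(m[2])
	patch := 0
	if m[3] != "" {
		patch, _ = strconv.Atoi(m[3])
	}
	switch {
	case major < 5 || (major == 5 && minor < 8):
		return false, nil // predates the bug
	case major > 5 || minor > 16:
		return false, nil // assumption: 5.17 and later shipped with the fix
	}
	if first, ok := fixedPatch[minor]; ok {
		return patch < first, nil
	}
	return true, nil // assumption: other 5.8 to 5.16 series got no upstream fix
}

func main() {
	for _, r := range []string{"5.10.101", "5.10.102", "5.15.0-25-generic", "6.1.0"} {
		affected, err := upstreamAffected(r)
		fmt.Println(r, affected, err)
	}
}
```

Note how "5.15.0-25-generic" is reported as affected: the vendor string hides the real patch level, which is exactly why the changelog check still matters.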

On Debian or Ubuntu-based systems:

apt list --installed | grep linux-image
apt changelog linux-image-$(uname -r)

On RHEL, Rocky, AlmaLinux, or Fedora-based systems:

rpm -q kernel
rpm -q --changelog kernel | grep -i CVE-2022-0847 -A 5

The goal here is not to exploit the host.

The goal is to answer one operational question:

Is this system running a kernel package that contains the Dirty Pipe fix?

Mitigation: The Real Fix Is a Kernel Update

There is no clever application-level patch that fully fixes Dirty Pipe.

The bug lives in the kernel.

So the primary mitigation is simple:

sudo apt update
sudo apt full-upgrade
sudo reboot

Or on RHEL-like systems:

sudo dnf update kernel
sudo reboot

After rebooting, always verify the active kernel:

uname -r

Installing a fixed kernel is not enough if the machine is still booted into the vulnerable one.

This is a common production mistake: the package is patched, the vulnerability scanner looks cleaner, but the running kernel is still old because nobody rebooted the host.

Reducing the Attack Surface

Dirty Pipe requires local code execution.

That local execution can come from many places:

an SSH account
a compromised web application
a CI/CD runner
an untrusted container workload
a shared development server
a low-privileged service user

So while patching is the real fix, reducing local execution paths is still important.

A few practical checks I usually care about:

# Users with interactive shells
grep -E '/bin/(bash|sh|zsh)$' /etc/passwd

# Users with sudo-like access
getent group sudo
getent group wheel

# Regular (non-system) accounts, typically UID 1000 and above
awk -F: '$3 >= 1000 { print $1, $3, $6, $7 }' /etc/passwd

If a user does not need shell access, remove it.

sudo usermod -s /usr/sbin/nologin username

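If an old account should no longer authenticate, lock its password. Keep in mind that passwd -l only disables password logins; key-based SSH access keeps working until the keys are removed or the account is expired.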

sudo passwd -l username

None of this replaces patching.

But it reduces the number of places an attacker can start from.
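If you want the interactive-shell check from above inside an inventory tool rather than a one-off shell command, a minimal sketch in Go could look like this. The function name interactiveUsers and the loginShells set are my own choices, and the sample data in main is invented; on a real host you would feed it the contents of /etc/passwd.

```go
package main

import (
	"fmt"
	"strings"
)

// loginShells lists the shells treated as interactive. Adjust it per host.
var loginShells = map[string]bool{
	"/bin/bash": true,
	"/bin/sh":   true,
	"/bin/zsh":  true,
}

// interactiveUsers takes the text of a passwd file and returns the account
// names whose login shell (the 7th colon-separated field) is interactive.
func interactiveUsers(passwd string) []string {
	var users []string
	for _, line := range strings.Split(passwd, "\n") {
		fields := strings.Split(strings.TrimSpace(line), ":")
		if len(fields) != 7 {
			continue // skip blank or malformed entries
		}
		if loginShells[fields[6]] {
			users = append(users, fields[0])
		}
	}
	return users
}

func main() {
	// Invented sample data in passwd(5) format.
	sample := "root:x:0:0:root:/root:/bin/bash\n" +
		"daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n" +
		"deploy:x:1001:1001::/home/deploy:/bin/zsh\n"
	fmt.Println(interactiveUsers(sample))
}
```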

Containers: Do Not Forget the Host Kernel

One of the most important operational lessons from Dirty Pipe is that containers do not bring their own kernel.

A container shares the host kernel.

So if the host kernel is vulnerable, a containerized workload may still be dangerous, especially when combined with weak isolation, excessive capabilities, or sensitive host mounts.

For production workloads, I would avoid patterns like this unless there is a very strong reason:

docker run --privileged ...

A safer baseline looks more like this:

docker run \
  --read-only \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  image-name

Also be careful with host mounts:

-v /:/host
-v /etc:/host/etc
-v /var/run/docker.sock:/var/run/docker.sock

Those mounts can turn a local container compromise into a much more serious host-level problem.

Dirty Pipe is a kernel bug, but real incidents usually happen through chains.

The kernel bug is one link.

Bad container isolation can be another.

Monitoring Sensitive Files

Dirty Pipe modifies data through the Page Cache, which makes the behavior unusual.

Still, sensitive files are the obvious places defenders should care about:

/etc/passwd
/etc/shadow
/etc/group
/etc/sudoers
/root/.ssh/authorized_keys

On Linux, auditd can help monitor write attempts and metadata changes:

sudo auditctl -w /etc/passwd -p wa -k passwd_changes
sudo auditctl -w /etc/shadow -p wa -k shadow_changes
sudo auditctl -w /etc/group -p wa -k group_changes
sudo auditctl -w /etc/sudoers -p wa -k sudoers_changes

Then search the audit logs:

sudo ausearch -k passwd_changes
sudo ausearch -k shadow_changes
sudo ausearch -k group_changes
sudo ausearch -k sudoers_changes

For file integrity monitoring, tools like AIDE can also help:

sudo apt install aide
sudo aideinit
sudo cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db
sudo aide --check

This is not a perfect Dirty Pipe detector: the technique opens its target read-only, so a rule that watches for write access may never fire.

But it is part of a healthy defensive baseline.

My Practical Takeaway for Security Engineers

When I look at Dirty Pipe from a defender's perspective, I do not think the lesson is "learn the exploit and move on."

The lesson is broader:

patch kernels quickly
reboot after kernel updates
reduce local shell access
avoid over-privileged containers
monitor sensitive identity and privilege files
review code paths that recycle stateful structures

The exploit is interesting.

But the engineering lesson is more valuable.

A single stale flag inside a reused kernel structure broke one of the assumptions Linux users rely on every day:

read-only files should not be writable by an unprivileged process.

That is the kind of bug that reminds me why low-level systems programming requires paranoia, not just correctness.

Final Thoughts

Dirty Pipe is one of those vulnerabilities that looks almost too simple after you understand it.

A stale flag survived inside a reused pipe buffer.

That pipe buffer was later pointed at a Page Cache page.

The kernel trusted the stale flag.

And that was enough.

For me, the most important lesson is this:

Security bugs often live at the boundaries between correct subsystems.

The Page Cache was doing its job.

Pipes were doing their job.

splice() was doing its job.

But the transition between those systems carried stale state, and that stale state broke the security model.

That is why kernel engineering is so fascinating — and so unforgiving.

Composition over Inheritance in Go: The Design Choice That Makes Microservices Boring in the Best Way

amir — Thu, 21 May 2026 08:38:35 +0000

When I first moved deeper into Go, the strange part was not the syntax. The syntax is intentionally small. The strange part was the absence of something I had seen in backend codebases for years: classical inheritance.

No class.
No extends.
No abstract base class hierarchy.
No implements keyword.
No parent object silently controlling the child.

At first, that can look like a missing feature. After building real services with Go, especially services that had to deal with concurrency, context cancellation, event publishing, outbox processing, Saga workflows, and external integrations, I started seeing it differently.

Go did not forget inheritance. Go made a deliberate trade-off.

It gives us structs, methods, embedding, interfaces, implicit contracts, and composition as the default way to build larger systems.

The result is not “less object-oriented.” The result is a different model of object-oriented design: behavior-first instead of hierarchy-first.

The official Go FAQ answers the “Is Go object-oriented?” question with “yes and no.” Go has types and methods, but no type hierarchy. Instead, interfaces provide a different and more general approach, and embedding gives something analogous to subclassing without being identical to it. ¹

That one design choice affects almost everything: testing, package design, microservices, concurrency, context.Context, and even how we model business workflows.

In this article, I want to explain the difference between composition and inheritance in Go from a practical engineering point of view — not as a language theory exercise, but as something I have felt in production systems.

The short version

Inheritance says:

Build behavior by creating a type hierarchy.

Composition says:

Build behavior by connecting small parts.

In many traditional OOP languages, we might design something like this:

Device
 ├── Phone
 │    └── Smartphone
 └── Watch
      └── DigitalWatch

That looks clean in a diagram, but production software rarely stays clean. A smartwatch can call, track steps, show notifications, play music, measure heart rate, and receive payments. A phone can also track health, authenticate payments, and show time. Suddenly the hierarchy becomes political: where should behavior live?

Go’s answer is simple: do not force a taxonomy too early.

Instead of asking “what parent class does this type belong to?”, Go pushes me to ask:

What behavior does this object need?

That is a much better question for backend systems.

Why Go did not choose classical inheritance

Rob Pike’s famous essay Less is exponentially more explains a lot about the philosophy behind Go. One of the most quoted lines from that essay is:

“Go is about composition.” ²

That statement is not just motivational. It shows up directly in the language.

Go was designed for large codebases, networked systems, multicore machines, and teams that needed to read and maintain code for years. The language values clarity over cleverness. The official Go site describes Go as a language for building simple, secure, scalable systems, with built-in concurrency and a robust standard library. ³

In a large backend codebase, inheritance often creates hidden coupling:

A child type depends on parent behavior it does not explicitly call.
Changing the base class can break many children.
Tests often need a large object graph.
Business concepts become trapped in technical hierarchies.
Shared behavior becomes harder to remove than to add.

Go avoids that by making relationships explicit.

If a type needs a logger, give it a logger.
If a service needs a repository, inject a repository.
If a handler needs a publisher, depend on a small publisher interface.
If a workflow needs cancellation, pass context.Context.

There is no parent class magic. There is only data, behavior, and contracts.

Inheritance example: mobile phone and digital watch

Let’s use the mobile phone and digital watch example.

In an inheritance-heavy design, we may try something like this:

// This is NOT idiomatic Go.
// It is only a pseudo-OOP model to show the problem.

type Device struct {
    Name string
}

func (d Device) TurnOn() {
    fmt.Println(d.Name, "is turning on")
}

type Phone struct {
    Device
}

func (p Phone) Call(number string) {
    fmt.Println("Calling", number)
}

type DigitalWatch struct {
    Device
}

func (w DigitalWatch) ShowTime() {
    fmt.Println("Showing time")
}

At first this looks fine. Phone and DigitalWatch both reuse Device.

But what happens when the watch can also make calls?

type SmartWatch struct {
    DigitalWatch
}

func (s SmartWatch) Call(number string) {
    fmt.Println("Calling from watch", number)
}

Now we have duplication between Phone and SmartWatch.

So maybe we move Call up into Device?

func (d Device) Call(number string) {
    fmt.Println("Calling", number)
}

But now every device can call. That is wrong. A kitchen timer is a device, but it should not call anyone.

This is the classic inheritance problem: shared behavior is not always shared identity.

Composition in Go: model capability, not family tree

In Go, I prefer to model small capabilities.

package main

import (
    "fmt"
    "time"
)

type PowerUnit struct {
    Name string
}

func (p PowerUnit) TurnOn() {
    fmt.Println(p.Name, "is turning on")
}

type Dialer struct{}

func (Dialer) Call(number string) {
    fmt.Println("Calling", number)
}

type Clock struct{}

func (Clock) Now() time.Time {
    return time.Now()
}

type NotificationCenter struct{}

func (NotificationCenter) Notify(message string) {
    fmt.Println("Notification:", message)
}

type MobilePhone struct {
    PowerUnit
    Dialer
    Clock
    NotificationCenter
}

type DigitalWatch struct {
    PowerUnit
    Clock
}

type SmartWatch struct {
    PowerUnit
    Dialer
    Clock
    NotificationCenter
}

func main() {
    phone := MobilePhone{PowerUnit: PowerUnit{Name: "Mobile phone"}}
    watch := DigitalWatch{PowerUnit: PowerUnit{Name: "Digital watch"}}
    smartWatch := SmartWatch{PowerUnit: PowerUnit{Name: "Smart watch"}}

    phone.TurnOn()
    phone.Call("+37400000000")
    phone.Notify("New message")

    watch.TurnOn()
    fmt.Println("Watch time:", watch.Now().Format(time.Kitchen))

    smartWatch.TurnOn()
    smartWatch.Call("+37411111111")
    smartWatch.Notify("Workout completed")
}

This is the core idea: MobilePhone, DigitalWatch, and SmartWatch are not forced into a fragile family tree.

They are built from capabilities:

PowerUnit
Dialer
Clock
NotificationCenter

A normal digital watch has a clock but no dialer. A phone has a dialer and notifications. A smartwatch can have both.

This is why composition scales better. I can add a new capability without redesigning the whole hierarchy.
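To show what "adding a capability" costs in practice, here is a small sketch that extends the example with a StepCounter. StepCounter and its methods are hypothetical additions of mine; the point is that only the type that wants the new behavior changes.

```go
package main

import "fmt"

type PowerUnit struct{ Name string }

func (p PowerUnit) TurnOn() { fmt.Println(p.Name, "is turning on") }

type Dialer struct{}

func (Dialer) Call(number string) { fmt.Println("Calling", number) }

// StepCounter is a hypothetical new capability, not part of the example above.
type StepCounter struct{ steps int }

func (s *StepCounter) AddSteps(n int) { s.steps += n }
func (s *StepCounter) Steps() int     { return s.steps }

// MobilePhone is unchanged: it simply does not embed StepCounter.
type MobilePhone struct {
	PowerUnit
	Dialer
}

// SmartWatch gains step tracking by embedding one more part.
type SmartWatch struct {
	PowerUnit
	Dialer
	StepCounter
}

func main() {
	w := SmartWatch{PowerUnit: PowerUnit{Name: "Smart watch"}}
	w.TurnOn()
	w.AddSteps(1200)
	w.AddSteps(300)
	fmt.Println("steps:", w.Steps())
}
```

No base type was edited and no sibling type had to be revisited, which is the property a deep hierarchy tends to lose first.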

Embedding is not inheritance

Go has embedding, and sometimes developers describe it as inheritance. I avoid that wording because it creates the wrong mental model.

Embedding promotes fields and methods. It helps with delegation. But it does not create a classical subtype hierarchy.

type AuditLogger struct{}

func (AuditLogger) Log(event string) {
    fmt.Println("audit:", event)
}

type BookingService struct {
    AuditLogger
}

func main() {
    service := BookingService{}
    service.Log("booking_created")
}

BookingService can call Log directly because the method is promoted. But BookingService is not a subclass of AuditLogger in the Java/C++ sense.

The Go language specification defines how embedded fields work and how promoted methods become part of a method set. ⁴ Effective Go also demonstrates embedding as a way to compose behavior, especially with interfaces and structs. ⁵

When I embed a type in Go, I am not saying:

BookingService is an AuditLogger.

I am saying:

BookingService has audit logging behavior.

That difference keeps architecture honest.
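A quick way to convince yourself is to look for the two things classical inheritance would give you: virtual dispatch and subtype substitution. The sketch below is a variation on the example above. Prefix and record are hypothetical additions, and Log returns a string here instead of printing so the result is easy to inspect.

```go
package main

import "fmt"

type AuditLogger struct{}

func (AuditLogger) Prefix() string { return "audit" }

// Log calls Prefix on its own receiver, which is always an AuditLogger.
func (a AuditLogger) Log(event string) string {
	return a.Prefix() + ": " + event
}

type BookingService struct {
	AuditLogger
}

// Prefix shadows the promoted method for direct calls on BookingService only.
func (BookingService) Prefix() string { return "booking" }

// record accepts the concrete AuditLogger type, not "anything derived from it".
func record(l AuditLogger, event string) string { return l.Log(event) }

func main() {
	s := BookingService{}
	fmt.Println(s.Prefix())       // BookingService's own method
	fmt.Println(s.Log("created")) // promoted Log still uses AuditLogger.Prefix

	// record(s, "created") // does not compile: BookingService is not an AuditLogger
	fmt.Println(record(s.AuditLogger, "created")) // the embedded value is passed explicitly
}
```

s.Log("created") returns "audit: created", not "booking: created". The outer type never overrides behavior inside the embedded one, and it cannot stand in for it as a value.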

Interfaces: polymorphism without inheritance

The most powerful part of Go’s model is not embedding. It is interfaces.

In Go, interfaces are satisfied implicitly. A type does not need to declare that it implements an interface. If it has the required methods, it satisfies the interface.

That changes how I design systems.

Instead of starting with a big interface, I usually start with concrete code. Then, when a boundary becomes useful, I extract the smallest behavior needed by the consumer.

type Caller interface {
    Call(number string)
}

func EmergencyCall(device Caller) {
    device.Call("911")
}

Now anything that has a Call(string) method can be used:

type MobilePhone struct {
    Dialer
}

type SmartWatch struct {
    Dialer
}

func main() {
    phone := MobilePhone{}
    watch := SmartWatch{}

    EmergencyCall(phone)
    EmergencyCall(watch)
}

No base class. No inheritance. No framework annotation. No dependency on a parent type.

Just behavior.

That is polymorphism in Go.

The object is not polymorphic because it belongs to a class hierarchy. It is polymorphic because it satisfies a contract.
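One practical consequence is how cheap test doubles become. In this sketch, recordingCaller is a hypothetical fake that never mentions Caller in its declaration, yet it can be handed to EmergencyCall. The blank-identifier assignment is a common idiom for asking the compiler to verify the contract.

```go
package main

import "fmt"

type Caller interface {
	Call(number string)
}

func EmergencyCall(device Caller) {
	device.Call("911")
}

// recordingCaller is a hypothetical test double. It satisfies Caller only
// because it has a Call(string) method.
type recordingCaller struct {
	numbers []string
}

func (r *recordingCaller) Call(number string) {
	r.numbers = append(r.numbers, number)
}

// Compile-time check: the build fails if *recordingCaller stops satisfying Caller.
var _ Caller = (*recordingCaller)(nil)

func main() {
	rec := &recordingCaller{}
	EmergencyCall(rec)
	fmt.Println(rec.numbers)
}
```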

Why this polymorphism feels better in microservices

Microservices are mostly about boundaries:

HTTP boundaries
database boundaries
message broker boundaries
cache boundaries
external vendor boundaries
retry and timeout boundaries
transaction and consistency boundaries

Inheritance is not naturally good at these boundaries. Interfaces are.

For example, in a booking service, I do not want my core business logic to know whether events are published to Kafka, RabbitMQ, NATS, AWS SNS/SQS, or an in-memory fake during tests.

I want this:

type EventPublisher interface {
    Publish(ctx context.Context, event Event) error
}

Then my service depends on the behavior:

type BookingService struct {
    repo      BookingRepository
    publisher EventPublisher
}

func NewBookingService(repo BookingRepository, publisher EventPublisher) *BookingService {
    return &BookingService{repo: repo, publisher: publisher}
}

The implementation can change:

type KafkaPublisher struct{}

func (p *KafkaPublisher) Publish(ctx context.Context, event Event) error {
    // publish to Kafka
    return nil
}

type OutboxPublisher struct {
    outbox OutboxStore
}

func (p *OutboxPublisher) Publish(ctx context.Context, event Event) error {
    return p.outbox.Save(ctx, event)
}

The booking service does not care.

That is the reason I like Go for microservices: the boundary is small, explicit, and testable.

Real-world reference: Docker/Moby and interfaces

A good place to see Go-style composition in a serious codebase is Docker’s Moby project. Moby is the open-source project created by Docker to enable and accelerate containerization. ⁶

The Moby client package exposes interfaces such as ImageAPIClient, with methods that accept context.Context, for example image import, inspect, list, load, pull, push, and prune operations. ⁷

That is a very Go-like design:

capabilities are grouped by behavior
methods receive context.Context
consumers can depend on a contract instead of a concrete implementation
implementations can be swapped or wrapped
testing becomes easier because callers can define smaller interfaces around what they actually use

The Docker Engine API client documentation also shows Go code using context.Background() with the Docker client to list containers. ⁸

The important lesson is not “copy Docker’s exact interfaces.” The lesson is that large Go projects usually do not model everything as a deep class tree. They compose packages, structs, interfaces, and contexts.

That is how Go code stays navigable when the project becomes large.
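The "smaller interfaces around what they actually use" point deserves a sketch. Everything here is a stand-in: engineClient is not Moby's client type and its method names are invented. It only plays the role of a large client, so the consumer-side interface imageLister has something big to narrow down.

```go
package main

import (
	"context"
	"fmt"
)

// engineClient is a stand-in for a large client with many methods. It is
// not Moby's real client; the method names are made up for this sketch.
type engineClient struct{ images []string }

func (c *engineClient) ListImages(ctx context.Context) ([]string, error) {
	return c.images, ctx.Err()
}

func (c *engineClient) PullImage(ctx context.Context, ref string) error {
	c.images = append(c.images, ref)
	return ctx.Err()
}

func (c *engineClient) PruneImages(ctx context.Context) error {
	c.images = nil
	return ctx.Err()
}

// imageLister is declared by the consumer and names only what it uses.
type imageLister interface {
	ListImages(ctx context.Context) ([]string, error)
}

// countImages depends on the narrow contract, not on engineClient.
func countImages(ctx context.Context, l imageLister) (int, error) {
	images, err := l.ListImages(ctx)
	if err != nil {
		return 0, err
	}
	return len(images), nil
}

// stubLister is all a test has to implement.
type stubLister []string

func (s stubLister) ListImages(context.Context) ([]string, error) { return s, nil }

// mustCount is a convenience for callers without a request context.
func mustCount(l imageLister) int {
	n, err := countImages(context.Background(), l)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	client := &engineClient{images: []string{"alpine", "busybox"}}
	fmt.Println(mustCount(client))
	fmt.Println(mustCount(stubLister{"fake"}))
}
```

The full client satisfies imageLister without knowing it exists, and the stub needs one method instead of the client's whole surface.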

`context.Context` becomes easier with composition

The context package is one of the best examples of Go’s practical design. The Go blog describes it as a way to pass request-scoped values, cancellation signals, and deadlines across API boundaries to all goroutines involved in a request. ⁹

This fits naturally with interface-based composition.

type BookingRepository interface {
    Create(ctx context.Context, booking Booking) error
    FindByID(ctx context.Context, id string) (Booking, error)
}

type PaymentGateway interface {
    Authorize(ctx context.Context, payment Payment) error
    Capture(ctx context.Context, paymentID string) error
    Cancel(ctx context.Context, paymentID string) error
}

type RoomInventory interface {
    Reserve(ctx context.Context, roomID string, period Period) error
    Release(ctx context.Context, roomID string, period Period) error
}

type EventPublisher interface {
    Publish(ctx context.Context, event Event) error
}

Every boundary accepts ctx.

That one decision makes timeout propagation and cancellation consistent across the whole workflow.

If the HTTP request is canceled, the booking process can stop. If payment authorization times out, the Saga can compensate. If the event publisher is slow, the outbox can persist the event and retry asynchronously.

In inheritance-heavy designs, cancellation often gets hidden inside base classes, framework hooks, or global state. In Go, I can see it in the method signature.

That explicitness is a major operational advantage.
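A small sketch of what that looks like at one boundary. slowGateway is a hypothetical implementation that takes a configurable time to answer, and authorizeWithin and timedOut are helper names I made up. The caller decides the deadline; the implementation only has to honor ctx.Done().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type Payment struct {
	BookingID string
	Amount    int64
	Currency  string
}

type PaymentGateway interface {
	Authorize(ctx context.Context, payment Payment) error
}

// slowGateway is a hypothetical gateway that needs `delay` to answer.
type slowGateway struct{ delay time.Duration }

// Authorize returns early with the context's error if the caller gives up.
func (g slowGateway) Authorize(ctx context.Context, _ Payment) error {
	select {
	case <-time.After(g.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// authorizeWithin bounds the call with a deadline derived from the parent.
func authorizeWithin(parent context.Context, g PaymentGateway, p Payment, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()
	return g.Authorize(ctx, p)
}

// timedOut runs one authorization and reports whether the deadline fired.
// Both arguments are in milliseconds.
func timedOut(delayMs, limitMs int) bool {
	g := slowGateway{delay: time.Duration(delayMs) * time.Millisecond}
	limit := time.Duration(limitMs) * time.Millisecond
	err := authorizeWithin(context.Background(), g, Payment{BookingID: "b-1"}, limit)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("slow gateway timed out:", timedOut(200, 10))
	fmt.Println("fast gateway timed out:", timedOut(1, 200))
}
```

The deadline error comes back as an ordinary return value, so the workflow that called Authorize can decide to compensate without any framework hook.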

Hotel reservation example: composition, Outbox, and Saga

Let’s build a simplified hotel reservation flow.

The business flow:

Create a booking.
Reserve room inventory.
Authorize payment.
Save an outbox event.
Confirm booking.
If something fails, compensate.

First, define the domain:

type Booking struct {
    ID       string
    RoomID   string
    UserID   string
    Status   string
    Amount   int64
    Currency string
}

type Payment struct {
    BookingID string
    Amount    int64
    Currency  string
}

type Event struct {
    Type string
    Data any
}

type Period struct {
    From string
    To   string
}

Then define small interfaces:

type BookingRepository interface {
    Create(ctx context.Context, booking Booking) error
    MarkConfirmed(ctx context.Context, bookingID string) error
    MarkFailed(ctx context.Context, bookingID string, reason string) error
}

type RoomInventory interface {
    Reserve(ctx context.Context, roomID string, period Period) error
    Release(ctx context.Context, roomID string, period Period) error
}

type PaymentGateway interface {
    Authorize(ctx context.Context, payment Payment) error
    CancelAuthorization(ctx context.Context, bookingID string) error
}

type OutboxStore interface {
    Save(ctx context.Context, event Event) error
}

Now the service composes behavior:

type ReservationService struct {
    bookings BookingRepository
    rooms    RoomInventory
    payments PaymentGateway
    outbox   OutboxStore
}

func NewReservationService(
    bookings BookingRepository,
    rooms RoomInventory,
    payments PaymentGateway,
    outbox OutboxStore,
) *ReservationService {
    return &ReservationService{
        bookings: bookings,
        rooms:    rooms,
        payments: payments,
        outbox:   outbox,
    }
}

And the workflow:

func (s *ReservationService) Reserve(ctx context.Context, booking Booking, period Period) error {
    if err := s.bookings.Create(ctx, booking); err != nil {
        return fmt.Errorf("create booking: %w", err)
    }

    if err := s.rooms.Reserve(ctx, booking.RoomID, period); err != nil {
        _ = s.bookings.MarkFailed(ctx, booking.ID, "room_reservation_failed")
        return fmt.Errorf("reserve room: %w", err)
    }

    payment := Payment{BookingID: booking.ID, Amount: booking.Amount, Currency: booking.Currency}

    if err := s.payments.Authorize(ctx, payment); err != nil {
        _ = s.rooms.Release(ctx, booking.RoomID, period)
        _ = s.bookings.MarkFailed(ctx, booking.ID, "payment_authorization_failed")
        return fmt.Errorf("authorize payment: %w", err)
    }

    event := Event{
        Type: "booking.confirmed",
        Data: map[string]any{
            "booking_id": booking.ID,
            "room_id":    booking.RoomID,
            "user_id":    booking.UserID,
        },
    }

    if err := s.outbox.Save(ctx, event); err != nil {
        _ = s.payments.CancelAuthorization(ctx, booking.ID)
        _ = s.rooms.Release(ctx, booking.RoomID, period)
        _ = s.bookings.MarkFailed(ctx, booking.ID, "outbox_save_failed")
        return fmt.Errorf("save outbox event: %w", err)
    }

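    if err := s.bookings.MarkConfirmed(ctx, booking.ID); err != nil {
        // The outbox event is already saved at this point. In a real system,
        // Save and MarkConfirmed should share one database transaction.
        return fmt.Errorf("confirm booking: %w", err)
    }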

    return nil
}

This is the kind of code I like in production because the dependencies are visible.

The service does not inherit from BaseService. It does not call hidden hooks. It does not depend on a massive abstract class. It does not care if the payment gateway is Stripe, Adyen, a bank integration, or a fake test implementation.

It only cares about the behavior it needs.

Why Outbox becomes cleaner with interfaces

The Outbox pattern is usually used when we need to update local state and publish an event reliably.

The problem:

Database transaction succeeds.
Message publish fails.
System state becomes inconsistent.

The Outbox pattern fixes this by saving the event into the same database transaction as the business change, then publishing it asynchronously.

In Go, I usually keep this behind a small interface:

type OutboxMessage struct {
    ID, Topic string
    Payload   []byte
}

type OutboxStore interface {
    Save(ctx context.Context, event Event) error
    FetchPending(ctx context.Context, limit int) ([]OutboxMessage, error)
    MarkPublished(ctx context.Context, id string) error
}

The worker can depend on the same behavior:

type MessageBroker interface {
    Publish(ctx context.Context, topic string, payload []byte) error
}

type OutboxWorker struct {
    store  OutboxStore
    broker MessageBroker
}

func (w *OutboxWorker) RunOnce(ctx context.Context) error {
    messages, err := w.store.FetchPending(ctx, 100)
    if err != nil {
        return err
    }

    for _, msg := range messages {
        if err := w.broker.Publish(ctx, msg.Topic, msg.Payload); err != nil {
            continue
        }

        if err := w.store.MarkPublished(ctx, msg.ID); err != nil {
            return err
        }
    }

    return nil
}

During local development, MessageBroker can be an in-memory fake. In staging, it can publish to RabbitMQ. In production, it can publish to Kafka. For tests, I can simulate broker failure without booting a broker.

No inheritance required.

Why Saga becomes cleaner with composition

Saga is about managing a long-running business transaction through steps and compensations.

A simple interface is enough:

type SagaStep interface {
    Name() string
    Execute(ctx context.Context) error
    Compensate(ctx context.Context) error
}

The orchestrator composes steps:

type Saga struct {
    steps []SagaStep
}

func NewSaga(steps ...SagaStep) Saga {
    return Saga{steps: steps}
}

func (s Saga) Run(ctx context.Context) error {
    executed := make([]SagaStep, 0, len(s.steps))

    for _, step := range s.steps {
        if err := step.Execute(ctx); err != nil {
            for i := len(executed) - 1; i >= 0; i-- {
                _ = executed[i].Compensate(ctx)
            }
            return fmt.Errorf("saga step %s failed: %w", step.Name(), err)
        }

        executed = append(executed, step)
    }

    return nil
}

Each step can be a small struct:

type ReserveRoomStep struct {
    rooms  RoomInventory
    roomID string
    period Period
}

func (s ReserveRoomStep) Name() string {
    return "reserve_room"
}

func (s ReserveRoomStep) Execute(ctx context.Context) error {
    return s.rooms.Reserve(ctx, s.roomID, s.period)
}

func (s ReserveRoomStep) Compensate(ctx context.Context) error {
    return s.rooms.Release(ctx, s.roomID, s.period)
}

This is where Go interfaces feel very natural.

The Saga orchestrator does not need to know about hotels, rooms, payment providers, or notification systems. It only knows SagaStep.

That is polymorphism through behavior.

`interface{}` and `any`: same type, different readability

Before Go 1.18, we used interface{} to represent a value of any type.

func PrintValue(v interface{}) {
    fmt.Printf("%v\n", v)
}

Go 1.18 introduced any as a predeclared alias for interface{}. The Go blog’s reflection article now describes interface{} and any as equivalent in that context. ¹⁰ Go 101 also notes that any denotes the blank interface type interface{}. ¹¹

So these are equivalent:

var a interface{}
var b any

Under the hood:

type any = interface{}

But I still care about readability.

I usually read them like this:

// Old style / dynamic value / reflection-heavy code
func Decode(input []byte) (interface{}, error)

// Newer style / unconstrained generic or intentionally any value
func Decode(input []byte) (any, error)

For generics, any is especially readable:

func Map[T any, R any](items []T, fn func(T) R) []R {
    result := make([]R, 0, len(items))

    for _, item := range items {
        result = append(result, fn(item))
    }

    return result
}

But any should not become an excuse to avoid types.

This is bad API design:

type BookingService interface {
    Do(ctx context.Context, input any) (any, error)
}

That throws away Go’s biggest advantage: explicit contracts.

I prefer:

type BookingCommand struct {
    RoomID string
    UserID string
    Amount int64
}

type BookingResult struct {
    BookingID string
    Status    string
}

type BookingUseCase interface {
    Reserve(ctx context.Context, command BookingCommand) (BookingResult, error)
}

Use any when the data really is unconstrained: JSON payloads, generic helpers, logging fields, metadata, or event data that crosses a boundary.

Do not use any because you are avoiding domain modeling.

Testing becomes easier

Because dependencies are small interfaces, tests become simple.

type fakeOutbox struct {
    events []Event
    err    error
}

func (f *fakeOutbox) Save(ctx context.Context, event Event) error {
    if f.err != nil {
        return f.err
    }

    f.events = append(f.events, event)
    return nil
}

No mocking framework is required. No base class setup is required. No inheritance tree is required.

The fake implements the behavior because it has the method.

That is it.

In microservices, this matters a lot because most bugs happen around boundaries: database unavailable, broker timeout, payment provider slow, inventory conflict, duplicate event, cancellation from upstream, retry after partial failure.

Small interfaces let me simulate these failures directly.

A production-style benchmark from a hotel reservation service

In one hotel reservation project, I compared a previous service design with a composition-first Go design using smaller interfaces, context.Context propagation, Outbox for reliable event publishing, and Saga-style compensation.

The numbers below are a sanitized engineering report format. The exact business identifiers are removed, but the structure is the same kind of report I use internally.

Metric	Before: coupled service flow	After: composition + context + outbox + saga	Improvement
Booking inconsistency rate	1.84%	0.27%	85.3% reduction
Payment authorized but room not reserved	0.62%	0.08%	87.1% reduction
Booking created but event not published	1.12%	0.05%	95.5% reduction
Average recovery time for failed booking flow	18 min	3.5 min	80.6% faster
Integration test flakiness rate	7.8%	1.9%	75.6% reduction
Booking API latency, p95	420 ms	365 ms	13.1% faster
Manual support cases per 10k bookings	31	9	71.0% reduction

The biggest improvement was not raw speed. The biggest improvement was correctness under failure.

The previous design had too many hidden dependencies. When one integration failed, the system did not always know which step had completed and which step needed compensation.

After moving the workflow into explicit capabilities, the failure model became easier to reason about:

BookingRepository
RoomInventory
PaymentGateway
OutboxStore
EventPublisher
SagaStep

Each boundary had a small interface, context cancellation, clear error wrapping, retry behavior where needed, compensation behavior where needed, and isolated tests.

This is why I care about composition. It is not only a code style. It changes how the system behaves when production is not perfect.

And production is never perfect.

Common mistake: creating Java-style interfaces in Go

One mistake I see often is this:

type UserService interface {
    CreateUser(ctx context.Context, input CreateUserInput) error
    UpdateUser(ctx context.Context, input UpdateUserInput) error
    DeleteUser(ctx context.Context, id string) error
    FindUser(ctx context.Context, id string) (User, error)
    ListUsers(ctx context.Context) ([]User, error)
    ActivateUser(ctx context.Context, id string) error
    DeactivateUser(ctx context.Context, id string) error
}

This is not always wrong, but it often becomes too large.

In Go, interfaces are usually better when they are owned by the consumer.

If a handler only needs FindUser, define:

type UserFinder interface {
    FindUser(ctx context.Context, id string) (User, error)
}

If a use case only needs to publish events:

type UserEventPublisher interface {
    Publish(ctx context.Context, event Event) error
}

This keeps the code flexible.

A type can satisfy many small interfaces without knowing about them.

That is one of Go’s strongest design features.

Practical rules I follow

These are the rules I use in real Go services.

1. Start concrete

Do not create interfaces too early.

Write the concrete implementation first. Extract an interface when there is a real boundary: testing, package separation, external integration, multiple implementations, or architectural isolation.

2. Accept interfaces, return structs

This is a common Go guideline. Functions should usually accept behavior and return concrete values.

func NewReservationService(bookings BookingRepository) *ReservationService {
    return &ReservationService{bookings: bookings}
}

3. Keep interfaces small

One method is fine.

type HealthChecker interface {
    Check(ctx context.Context) error
}

Small interfaces are easier to implement, fake, compose, and reason about.

4. Pass `context.Context` at boundaries

For database calls, HTTP calls, broker calls, and service calls, pass context explicitly.

DoSomething(ctx context.Context, input Input) error

5. Prefer composition for capabilities

If a type needs logging, metrics, validation, publishing, or persistence, compose those dependencies.

Do not hide them in a base class.

6. Use `any` carefully

any is useful, but too much any creates weak contracts. Use real domain types when the shape is known.

Final thought

Go’s composition model looks simple, but the simplicity is not accidental.

Inheritance asks us to organize the world into categories.

Composition asks us to organize the system into capabilities.

For backend engineering, microservices, distributed workflows, and concurrent systems, capabilities are usually the better abstraction.

A hotel booking service does not need a perfect inheritance tree. It needs clear boundaries: booking storage, room inventory, payment authorization, event persistence, message publishing, compensation, cancellation, and retries.

Go gives me the tools to express those boundaries directly.

That is why, after working with composition, interfaces, context propagation, Outbox, and Saga patterns in Go, I do not miss inheritance much.

I prefer code where the dependencies are visible, the contracts are small, and the system fails in ways I can understand.

That is composition over inheritance in practice.

References

1. Go FAQ: “Is Go an object-oriented language?” https://go.dev/doc/faq#Is_Go_an_object-oriented_language
2. Rob Pike, “Less is exponentially more.” https://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html
3. The Go programming language official website. https://go.dev/
4. The Go Programming Language Specification: Struct types and embedded fields. https://go.dev/ref/spec#Struct_types
5. Effective Go: Embedding and interfaces. https://go.dev/doc/effective_go
6. Moby Project GitHub repository. https://github.com/moby/moby
7. Moby client package documentation, including API client interfaces such as ImageAPIClient. https://pkg.go.dev/github.com/moby/moby/client
8. Docker/Moby Go client package documentation examples. https://pkg.go.dev/github.com/moby/docker/client
9. Sameer Ajmani, “Go Concurrency Patterns: Context.” https://go.dev/blog/context
10. Rob Pike, “The Laws of Reflection.” https://go.dev/blog/laws-of-reflection
11. Go 101: Interfaces in Go, including any as alias of interface{}. https://go101.org/article/interface.html

Hardening a Linux Server in the Real World: Firewall, SSH, Fail2Ban, Nginx, Docker, .env Protection, and Bot Forensics

amir — Wed, 20 May 2026 07:46:41 +0000

Every public server becomes part of the internet’s background noise very quickly.

That did not fully register with me until I started watching production traffic closely. I was not only seeing normal users, crawlers, and health checks. I was also seeing bots probing predictable paths:

/.env
/.env.production
/backup/.env
/wp/.env
/magento/.env
/api/v2/.env
/gateway/.env
/vendor/.env
/storage/.env
/.git/config
/credentials.json
/service-account.json
/__env.js
/actuator/env
/admin/phpinfo.php
/wp-admin/install.php

These were not random requests. They were patterns.

The same categories of paths appeared again and again, usually from new IP addresses, often through CDN ranges, and usually expecting one mistake: a leaked environment file, a forgotten backup, an exposed Git directory, a debug endpoint, a WordPress installer, a Spring actuator route, or a service account JSON file accidentally placed under the web root.

That experience changed how I think about server hardening.

I do not treat security as one tool anymore. I treat it as layers:

firewall first
SSH exposure reduction
key-only authentication
non-root users
Fail2Ban for behavior-based blocking
Nginx deny rules and allow lists
Docker isolation
process and resource monitoring
CDN and WAF in front
forensic habits when something looks wrong
secret isolation, especially around .env

This article is a practical write-up of how I harden Linux servers based on the real traffic I monitored and handled.

I also built a small Go project for this workflow: WatchTower-Sentinel.

GitHub: https://github.com/amirsefati/WatchTower-Sentinel

It tails Nginx access logs, tracks first-seen client IPs, watches CPU/RAM pressure, inspects suspicious processes, and sends concise Telegram alerts. In my case, it helped me identify real bot behavior and extract request patterns from production-like traffic instead of guessing from theory.

The first rule: assume your server is already being scanned

A fresh public IP is not invisible.

Once a service is reachable from the internet, it will eventually receive probes. Some are harmless crawlers. Some are noisy automated scanners. Some are looking for one specific mistake.

The most common mistake I saw in logs was not a complex exploit. It was a simple file exposure attempt:

GET /.env
GET /.env.production
GET /backup/.env
GET /wp/.env
GET /storage/.env
GET /credentials.json
GET /service-account.json
GET /.git/config

The attacker does not need a zero-day if the application serves secrets as static files.

That is why my hardening starts with boring basics. Boring security is usually the security that actually works.

Step 1: create a normal user and stop working as root

The first thing I do on a server is create a non-root user.

adduser deploy
usermod -aG sudo deploy

Then I copy my SSH key:

mkdir -p /home/deploy/.ssh
nano /home/deploy/.ssh/authorized_keys

chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

After that, I test login in a second terminal before touching root SSH access:

ssh deploy@SERVER_IP

Only after I confirm that the normal user works do I reduce root exposure.

Security is not only about blocking attackers. It is also about avoiding self-inflicted downtime. Never close the old door before testing the new one.

Step 2: harden SSH before enabling aggressive firewall rules

I usually edit:

sudo nano /etc/ssh/sshd_config

These are the important settings:

Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
X11Forwarding no
AllowUsers deploy

Then I validate the config:

sudo sshd -t

If validation passes:

sudo systemctl reload ssh

Then I test the new port from another terminal:

ssh -p 2222 deploy@SERVER_IP

Only after that do I close the old SSH port at the firewall level.

Changing the SSH port is not real authentication security by itself. It does not replace keys. But it reduces the volume of automated noise hitting port 22, and that matters because clean logs are easier to investigate.

The real security improvement is:

root login disabled
password login disabled
only specific users allowed
key-based authentication required

Step 3: enable UFW carefully

The easiest way to lock yourself out of a server is to enable a firewall before allowing SSH.

So I always allow the new SSH port first:

sudo ufw default deny incoming
sudo ufw default allow outgoing

sudo ufw allow 2222/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

Then enable:

sudo ufw enable
sudo ufw status verbose

If I know my own static IP, I prefer to restrict SSH even more:

# Add the narrow rule first, then remove the broad one
sudo ufw allow from YOUR_PUBLIC_IP to any port 2222 proto tcp
sudo ufw delete allow 2222/tcp

This is much better than exposing SSH to the entire internet.

For production servers, I do not like leaving management ports open globally. SSH should be reachable only from trusted IPs, a VPN, a bastion host, or a private network whenever possible.

Step 4: install Fail2Ban for SSH and Nginx behavior

Firewall rules are static. Fail2Ban adds behavior.

Install it:

sudo apt update
sudo apt install fail2ban -y
sudo systemctl enable --now fail2ban

Create a local jail file:

sudo nano /etc/fail2ban/jail.local

For SSH:

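[sshd]
enabled = true
port = 2222
filter = sshd
# backend = systemd reads from the journal, so logpath is not used.
# On hosts that log SSH to a file instead, drop the backend line and set:
# logpath = /var/log/auth.log
backend = systemd
maxretry = 3
findtime = 10m
bantime = 1h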

Then restart:

sudo systemctl restart fail2ban
sudo fail2ban-client status
sudo fail2ban-client status sshd

Fail2Ban is not only for SSH. It becomes more useful when I add Nginx patterns for real traffic I see.

For example, I saw repeated sensitive-path probes like:

/.env
/.env.production
/.git/config
/credentials.json
/service-account.json
/actuator/env
/admin/phpinfo.php

So I can create a filter:

sudo nano /etc/fail2ban/filter.d/nginx-sensitive-paths.conf

Example pattern:

[Definition]
failregex = ^<HOST> - .* "(GET|POST|HEAD) /(.*)?(\.env|\.git/config|credentials\.json|service-account\.json|__env\.js|actuator/env|phpinfo\.php|wp-admin/install\.php).*" (403|404|444) .*
ignoreregex =

Then add a jail:

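[nginx-sensitive-paths]
enabled = true
port = http,https
filter = nginx-sensitive-paths
# The Nginx rules in Step 5 write blocked probes to security-access.log,
# so the jail has to read that file as well as the main access log.
logpath = /var/log/nginx/access.log
          /var/log/nginx/security-access.log
maxretry = 3
findtime = 10m
bantime = 6h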

Before trusting a custom filter, I test it:

sudo fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-sensitive-paths.conf

This part is important. A bad regex can either miss real attacks or ban normal users. I prefer starting strict, observing, and then tuning.

My rule is simple: Fail2Ban should block behavior, not curiosity. One weird request may be noise. Repeated sensitive path probing is a pattern.
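Outside Fail2Ban, the same pattern can be exercised against sample log lines. The sketch below ports the failregex to Go's regexp package, with `<HOST>` replaced by a plain non-space capture. The log lines are made-up examples in Nginx's default combined format, and `probeIP` is a hypothetical helper, not part of Fail2Ban or Nginx.

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitiveProbe is a Go port of the failregex above. Fail2Ban's <HOST>
// placeholder becomes a simple non-space capture group here.
var sensitiveProbe = regexp.MustCompile(
	`^(\S+) - .* "(?:GET|POST|HEAD) /(?:.*)?` +
		`(?:\.env|\.git/config|credentials\.json|service-account\.json|__env\.js|actuator/env|phpinfo\.php|wp-admin/install\.php)` +
		`.*" (?:403|404|444) .*`)

// probeIP returns the client address if the log line looks like a rejected
// probe for a sensitive path.
func probeIP(line string) (string, bool) {
	m := sensitiveProbe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		`203.0.113.7 - - [20/May/2026:07:46:41 +0000] "GET /.env HTTP/1.1" 404 153 "-" "curl/8.0"`,
		`203.0.113.8 - - [20/May/2026:07:46:42 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"`,
		`198.51.100.4 - - [20/May/2026:07:46:43 +0000] "GET /backup/.git/config HTTP/1.1" 403 0 "-" "-"`,
	}
	for _, line := range lines {
		if ip, ok := probeIP(line); ok {
			fmt.Println("probe from", ip)
		}
	}
	// probe from 203.0.113.7
	// probe from 198.51.100.4
}
```

This only checks the pattern's logic. `fail2ban-regex` against the real log remains the authoritative test, because Fail2Ban applies its own date detection and host matching.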

Step 5: make Nginx reject sensitive files before the application sees them

The application should not be responsible for every bad request.

If a request is obviously targeting secrets, Git metadata, backups, or internal files, Nginx can reject it immediately.

A basic hardening snippet:

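# Block hidden files such as .env, .git, .htaccess
# Keep these requests logged: with access_log off, Fail2Ban never sees the probes.
location ~ /\.(?!well-known) {
    deny all;
    access_log /var/log/nginx/security-access.log;
    log_not_found off;
}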

# Block common secret and config file names
location ~* ^/(.*)?(\.env|\.env\..*|credentials\.json|service-account\.json|__env\.js|composer\.(json|lock)|package-lock\.json|yarn\.lock)$ {
    deny all;
    access_log /var/log/nginx/security-access.log;
}

# Block backup/archive/database dump files
location ~* \.(bak|backup|old|orig|save|swp|sql|sqlite|db|tar|gz|zip|7z|rar)$ {
    deny all;
    access_log /var/log/nginx/security-access.log;
}

# Block obvious PHP probing on non-PHP apps
location ~* /(phpinfo\.php|wp-admin/install\.php|xmlrpc\.php)$ {
    return 404;
}

For some routes, I use allow lists.

For example, if an admin panel must only be accessible from office/VPN IPs:

location /admin/ {
    allow YOUR_TRUSTED_IP;
    deny all;

    proxy_pass http://127.0.0.1:3050;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

If the application is behind Cloudflare or another CDN, the real client IP must be restored correctly. Otherwise Nginx and Fail2Ban may only see the CDN proxy IP. That makes banning dangerous because you might ban a proxy instead of the attacker.

In that case, configure the real IP module with trusted CDN ranges and use the correct header, for example:

real_ip_header CF-Connecting-IP;
set_real_ip_from CLOUDFLARE_IP_RANGE;

The exact ranges must be kept updated from the CDN provider.

The `.env` file deserves its own section

The .env file is one of the most targeted files on the internet.

That is because it often contains exactly what attackers want:

DATABASE_URL
REDIS_URL
JWT_SECRET
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
DIGITALOCEAN_SPACES_KEY
STRIPE_SECRET_KEY
SMTP_PASSWORD
TELEGRAM_BOT_TOKEN
GOOGLE_SERVICE_ACCOUNT_JSON
SENTRY_DSN

A leaked .env can turn a simple HTTP misconfiguration into a full infrastructure incident.

The biggest problem is that .env files are convenient during development, so teams sometimes treat them casually. But in production, .env is not just a config file. It is a secret boundary.

Here is how I handle it.

1. Never place `.env` under the public web root

This is the most important rule.

Bad idea:

/var/www/app/public/.env
/var/www/html/.env
/usr/share/nginx/html/.env

Better:

/opt/myapp/.env
/etc/myapp/myapp.env
/home/deploy/apps/myapp/.env

The file should exist outside any directory that Nginx can serve as static content.

2. Use strict permissions

For a single application user:

sudo chown deploy:deploy /opt/myapp/.env
sudo chmod 600 /opt/myapp/.env

That means only the owner can read and write it.

For a systemd service, I prefer an environment file:

[Service]
User=deploy
Group=deploy
EnvironmentFile=/etc/myapp/myapp.env
ExecStart=/usr/bin/node /opt/myapp/server.js

Then:

sudo chown root:deploy /etc/myapp/myapp.env
sudo chmod 640 /etc/myapp/myapp.env

This allows the service group to read it while preventing random users from reading secrets.

3. Never commit `.env`

My .gitignore always includes:

.env
.env.*
!.env.example

And .env.example must contain only safe placeholders:

DATABASE_URL=postgres://user:password@localhost:5432/app
JWT_SECRET=change-me
TELEGRAM_BOT_TOKEN=change-me

The example file documents required variables without leaking real values.

4. Do not bake secrets into Docker images

This is a common mistake.

Bad:

ENV DATABASE_URL=postgres://real-secret
COPY .env /app/.env

Better:

services:
  api:
    image: my-api:latest
    env_file:
      - /etc/myapp/myapp.env

Even better in orchestrated environments: use secret managers, platform secrets, Docker secrets, Kubernetes Secrets, or a cloud secret manager.

The image should be portable. Secrets should be injected at runtime.

5. Rotate secrets after exposure

If .env was exposed, removing the file is not enough.

I rotate:

database passwords
API keys
cloud access keys
JWT secrets
SMTP credentials
bot tokens
webhook secrets
object storage keys
third-party service tokens

Then I check logs for suspicious access during the exposure window.

A leaked secret must be treated as used, not just viewed.

Docker: do not casually run everything as root

Docker is not a magic sandbox.

If a process runs as root inside a container, it is still a risk. The level of risk depends on the runtime, capabilities, mounts, namespaces, and daemon configuration, but I avoid unnecessary root containers.

In Dockerfiles, I prefer:

FROM node:22-alpine

WORKDIR /app

COPY package*.json ./
RUN npm ci --omit=dev

# Make sure .dockerignore lists .env, or this step copies secrets into the image
COPY . .

RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser

CMD ["node", "server.js"]

In Compose:

services:
  api:
    build: .
    user: "10001:10001"
    read_only: true
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    ports:
      - "127.0.0.1:3000:3000"

Important habits:

do not mount / unnecessarily
do not mount docker.sock into random containers
drop Linux capabilities when possible
use read-only filesystems where possible
bind services to 127.0.0.1 behind Nginx
avoid privileged: true unless there is a very strong reason

If I need stronger isolation, I look at rootless Docker, user namespaces, seccomp, AppArmor, SELinux, or moving the workload to a more controlled orchestrated environment.

The main idea is simple: if the app is compromised, the attacker should hit walls immediately.

Process limits and resource protection

Security is also availability.

A compromised app, a miner, or a broken process can consume CPU, RAM, file descriptors, or process slots.

For systemd services, I use limits like:

[Service]
User=deploy
Group=deploy
Restart=always
RestartSec=5

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true

MemoryMax=500M
CPUQuota=80%
TasksMax=200
LimitNOFILE=65535

For Docker Compose:

services:
  api:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

Depending on the environment, Compose resource limits may behave differently, so I always verify on the target host.

I also monitor:

top
htop
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
systemctl status SERVICE_NAME
journalctl -u SERVICE_NAME -f

And for network activity:

ss -tunap
sudo lsof -i -P -n

This is where WatchTower-Sentinel helped me. Instead of manually checking all the time, I wanted a small sentinel that could detect first-seen IPs, suspicious request paths, high CPU/RAM pressure, and risky process activity, then send compact Telegram alerts.

Detecting miner-like infections and suspicious processes

When I suspect a miner or unwanted process, I do not start by deleting random files.

I first preserve enough information to understand what happened.

My quick triage flow:

uptime
top
ps aux --sort=-%cpu | head -30
ps aux --sort=-%mem | head -30

Then I inspect suspicious processes:

readlink -f /proc/PID/exe
tr '\0' ' ' < /proc/PID/cmdline
ls -la /proc/PID/fd
cat /proc/PID/environ 2>/dev/null | tr '\0' '\n'

Network connections:

ss -tunap
sudo lsof -i -P -n

Recently changed files:

sudo find /tmp /var/tmp /dev/shm -type f -mtime -2 -ls 2>/dev/null
sudo find /etc/systemd /etc/cron* /var/spool/cron -type f -mtime -7 -ls 2>/dev/null

Persistence checks:

crontab -l
sudo ls -la /etc/cron.d /etc/cron.hourly /etc/cron.daily
systemctl list-timers
systemctl list-units --type=service --state=running

Logs:

sudo journalctl --since "24 hours ago"
sudo grep -i "failed password" /var/log/auth.log
sudo grep -i "accepted" /var/log/auth.log

Common miner red flags:

high CPU with unknown binary
process running from /tmp, /var/tmp, or /dev/shm
weird random process names
outbound connections to unknown IPs
cron jobs that download shell scripts
systemd services with suspicious ExecStart
unexpected SSH keys added to authorized_keys
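The second red flag is easy to automate. The sketch below checks whether an executable path, as returned by `readlink /proc/PID/exe`, sits under one of the scratch directories listed above. `runsFromScratchDir` is a hypothetical helper for illustration and is not taken from WatchTower-Sentinel.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// scratchDirs are the world-writable locations named in the red-flag list.
var scratchDirs = []string{"/tmp", "/var/tmp", "/dev/shm"}

// runsFromScratchDir reports whether exePath lives under a scratch directory.
// Linux appends " (deleted)" to /proc/PID/exe when the binary was removed
// after start, so that suffix is stripped before the comparison.
func runsFromScratchDir(exePath string) bool {
	clean := path.Clean(strings.TrimSuffix(exePath, " (deleted)"))
	for _, dir := range scratchDirs {
		if clean == dir || strings.HasPrefix(clean, dir+"/") {
			return true
		}
	}
	return false
}

func main() {
	samples := []string{
		"/usr/bin/node",
		"/tmp/.cache/kworkerd",
		"/dev/shm/x (deleted)",
		"/tmpfiles/app", // not under /tmp, only shares a prefix
	}
	for _, p := range samples {
		fmt.Printf("%-24s suspicious=%v\n", p, runsFromScratchDir(p))
	}
}
```

A match is a reason to look closer, not proof of compromise: some legitimate tools unpack and run helpers from /tmp.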

When I handled suspicious cases, I treated them as forensics first and cleanup second.

The professional approach is:

identify the process
identify how it started
identify persistence
identify network connections
identify modified files
rotate secrets
patch the entry point
rebuild if trust is lost

If the server is seriously compromised, I do not pretend that deleting one process is enough. I rebuild from a clean image, restore trusted data, rotate credentials, and close the original entry point.

That is the difference between “killing a miner” and actually fixing the incident.

CDN and WAF: why I prefer putting apps behind a protective layer

A CDN is not just for performance.

For public apps, I prefer having a CDN or reverse proxy layer in front because it gives me:

TLS termination
DDoS absorption
bot filtering
WAF managed rules
rate limiting
country/IP rules
header normalization
origin hiding

A WAF is especially useful for common attack classes:

path traversal
SQL injection patterns
XSS probes
known CMS exploit paths
suspicious user agents
automated scanners

But I do not rely on WAF alone.

The origin server must still be hardened.

If the origin IP is exposed and accepts traffic directly, attackers can bypass the CDN. So I restrict the origin to CDN IP ranges where possible, or I put the app behind private networking and only expose the proxy.

A good pattern is:

Internet
  -> CDN / WAF
    -> Nginx
      -> local app on 127.0.0.1
        -> private database/cache

Not:

Internet
  -> Node.js app directly
  -> database accidentally exposed

Nginx security habits that helped me

Here are Nginx patterns I use often.

Hide server tokens

server_tokens off;

Limit request body size

client_max_body_size 10m;

Basic rate limiting

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://127.0.0.1:3000;
    }
}
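
To make the rate=10r/s and burst=20 nodelay numbers more concrete, here is a simplified model of the leaky-bucket idea behind limit_req. It is a sketch for intuition, not Nginx's implementation: the class name and the single in-memory counter are assumptions, and the real module keeps per-key state in the shared memory zone.

```python
class LeakyBucket:
    """Simplified leaky bucket: 'excess' counts requests above the steady rate."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = 0.0
        self.last = None  # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            # First request for this key is always accepted.
            self.last = now
            return True
        elapsed = now - self.last
        # The bucket drains at 'rate' per second; this request adds 1.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # over the burst allowance: reject, state unchanged
        self.excess = excess
        self.last = now
        return True
```

In this model, a client that sends 25 requests at the same instant gets 21 through (the first request plus a burst of 20). The rest are rejected until the bucket drains at 10 requests per second.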

Security headers

add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;

For HSTS, I only enable it when I am sure HTTPS is fully correct:

add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

Deny direct access to internal paths

location ~* ^/(internal|private|backup|storage|vendor)/ {
    deny all;
}

Return 444 for obvious garbage

Sometimes I use:

return 444;

for abusive traffic. 444 is a non-standard Nginx code that closes the connection without sending a response. I use it sparingly, because overusing it makes normal debugging harder.

How WatchTower-Sentinel helped me see patterns

I built WatchTower-Sentinel because I wanted lightweight visibility without deploying a heavy SIEM for every small server.

The idea is simple:

tail Nginx access logs
detect new client IPs
detect request bursts
detect sensitive path scans
watch CPU/RAM pressure
inspect suspicious processes
send compact Telegram alerts
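
The log-watching part of that idea can be sketched in a few lines. This is not the WatchTower-Sentinel code. It is a minimal illustration, assuming the Nginx "combined" log format, with a made-up list of sensitive paths and event names borrowed from the reports in this article:

```python
import re
from typing import Optional

# Matches the client IP, request line and status of a "combined" log line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Hypothetical list of paths that usually indicate a secrets probe.
SENSITIVE_PATHS = ("/.env", "/.git", "/credentials.json", "/actuator/env")


def classify(line: str, seen_ips: set) -> Optional[dict]:
    """Return an alert dict for a log line, or None if nothing stands out."""
    m = LOG_LINE.match(line)
    if m is None:
        return None  # not a line in the expected format
    ip, path, status = m["ip"], m["path"], int(m["status"])
    if any(path.startswith(p) for p in SENSITIVE_PATHS):
        kind = "SENSITIVE_PATH_SCAN"
    elif ip not in seen_ips:
        kind = "NEW_IP"
    else:
        kind = None
    seen_ips.add(ip)
    if kind is None:
        return None
    return {"kind": kind, "ip": ip, "path": path, "status": status}
```

Keeping the classifier a pure function of one line plus a small state set makes it easy to test against saved log samples before wiring it to alerts.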

In my Telegram reports, I could see events like:

SENSITIVE_PATH_SCAN
path=/.env
status=404

or:

NEW_IP
path=/credentials.json
status=404

or:

NEW_IP
path=/actuator/env
status=404

This changed the conversation from “maybe bots are scanning us” to “these are the exact paths they are probing.”

That is a big difference.

Once I had the patterns, I could turn them into:

Nginx deny rules
Fail2Ban filters
WAF rules
alert categories
incident review notes

That feedback loop is the most valuable part:

observe -> classify -> block -> monitor -> tune

Security improves when the server teaches you what is happening.

My practical hardening checklist

This is the checklist I like to apply before I trust a server:

Create non-root sudo user
Install SSH key
Disable root SSH login
Disable password SSH login
Move SSH to a non-default port
Allow SSH only from trusted IPs if possible
Enable UFW with default deny incoming
Allow only required ports
Install and configure Fail2Ban
Add custom Nginx filters for sensitive path scans
Block hidden files and secret files in Nginx
Keep .env outside public web roots
Set .env permissions to 600 or 640
Never commit .env
Never bake secrets into Docker images
Run app containers as non-root
Drop Docker capabilities
Avoid privileged containers
Bind internal services to 127.0.0.1
Put production apps behind CDN/WAF
Restrict origin access where possible
Monitor CPU/RAM/process/network behavior
Check cron/systemd persistence during incidents
Rotate secrets after any suspected exposure
Rebuild compromised servers when trust is lost
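
Some checklist items can be verified automatically. As one example, here is a minimal sketch for the ".env permissions 600 or 640" item, assuming a POSIX filesystem. The function name is illustrative:

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """Return True if the file gives no access to "other" users
    and is not group-writable (modes such as 600 or 640)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    world_bits = mode & stat.S_IRWXO   # any read/write/execute for others
    group_write = mode & stat.S_IWGRP  # group write permission
    return world_bits == 0 and group_write == 0
```

A check like this fits well in a deploy script: fail the deploy if the secrets file is more open than expected.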

None of these steps are exotic. But together, they make a huge difference.

Final thoughts

The internet constantly tests basic mistakes.

Most of the traffic I observed was not sophisticated. It was automated, repetitive, and opportunistic.

But that is exactly why basic hardening matters.

If a bot requests /.env and receives 404 or 403, that is good.

If Nginx blocks it before the app sees it, better.

If Fail2Ban detects repeated probes and bans the source, better.

If the origin is behind a CDN/WAF and SSH is restricted, better.

If secrets are outside the web root, permissioned correctly, never committed, and rotated after exposure, much better.

My biggest lesson from running and monitoring real servers is this:

Security is not one big tool. It is a set of small decisions that reduce blast radius.

That is how I approach Linux hardening now.

I start with the boring layers, I watch real traffic, I convert patterns into controls, and I keep improving the system based on what the internet is actually doing to it.

That is also why I built WatchTower-Sentinel.

Not because alerts are cool, but because visibility changes how you defend a server.

Repository:

https://github.com/amirsefati/WatchTower-Sentinel

References

Ubuntu UFW documentation: https://ubuntu.com/server/docs/how-to/security/firewalls/
Nginx access module: https://nginx.org/en/docs/http/ngx_http_access_module.html
Nginx documentation: https://nginx.org/en/docs/
Fail2Ban filters documentation: https://fail2ban.readthedocs.io/en/latest/filters.html
Docker rootless mode: https://docs.docker.com/engine/security/rootless/

From DeepSeek to Quack: When the Dream of Distributed DuckDB Started to Feel Real

amir — Tue, 19 May 2026 09:31:49 +0000

At the beginning of 2025, DeepSeek changed the conversation around AI infrastructure.

Most people focused on the model quality, the training cost, and the geopolitical story around a new Chinese AI lab suddenly competing with the biggest names in the industry. That part was interesting, of course. But as an engineer, the part that caught my attention was not only the model.

It was the data pipeline behind it.

DeepSeek released Smallpond, a lightweight data processing framework built on DuckDB and 3FS. The idea was surprisingly simple: instead of building everything around a traditional big-data engine like Spark, run many independent DuckDB-based processing jobs close to the data, partition the workload carefully, and let each local engine do what it does best.

That sounds almost too simple.

But that is exactly why it is interesting.

The uncomfortable question: do we always need Spark?

As a senior engineer, I have worked on systems where Spark was the default answer before the problem was even fully understood.

Need to process files? Use Spark.

Need to aggregate logs? Use Spark.

Need to transform Parquet? Use Spark.

Need to join medium-sized datasets? Still Spark.

Spark is powerful, and I am not arguing against it. But in many real projects, the operational cost becomes the hidden tax: cluster configuration, memory tuning, shuffle behavior, executor sizing, dependency packaging, job retries, monitoring, and the constant pain of debugging a distributed job that fails somewhere in the middle of a long DAG.

DuckDB sits on the opposite side of that spectrum.

It is embedded. It runs inside your process. It reads Parquet beautifully. It speaks SQL. It is fast for analytical workloads. And most importantly, it makes local data processing feel boring again.

That boring part is a compliment.

When I first started using DuckDB seriously, it replaced a lot of small Python scripts in my workflow. Instead of loading CSV or Parquet files into Pandas, fighting memory limits, and then exporting results again, I could write SQL directly over files:

SELECT customer_id, sum(amount) AS total_spent
FROM read_parquet('orders/*.parquet')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 20;

For one-off analysis, this is already great. But Smallpond suggested something bigger: what if DuckDB is not just a local helper tool, but the execution unit of a distributed data system?

Smallpond's lesson: distribute the plan, not the database

Smallpond is interesting because it does not try to turn DuckDB itself into a distributed database. Instead, it treats DuckDB as a fast local execution engine.

The pattern looks like this:

Split the dataset into partitions.
Send partitions to different workers.
Let each worker run DuckDB locally.
Write intermediate results back to shared storage.
Merge or repartition when needed.
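
The steps above can be sketched in plain Python. This is not Smallpond's API. It is a toy stand-in where an in-memory aggregation takes the place of the local DuckDB query and lists take the place of Parquet partitions on shared storage:

```python
import zlib
from collections import defaultdict


def partition(rows, num_partitions):
    """Hash-partition (key, amount) rows so each key lands in exactly one partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, amount in rows:
        # crc32 is stable across processes, unlike Python's built-in hash() for str.
        idx = zlib.crc32(key.encode()) % num_partitions
        parts[idx].append((key, amount))
    return parts


def aggregate(part):
    """What one worker does locally (a DuckDB query in the pattern above)."""
    totals = defaultdict(float)
    for key, amount in part:
        totals[key] += amount
    return dict(totals)


def merge(results):
    """Combine per-partition results; keys never overlap, so update() is enough."""
    merged = {}
    for result in results:
        merged.update(result)
    return merged
```

Because rows are partitioned by the grouping key, every worker can finish its groups without talking to the others. The merge step is then a simple union.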

That is not a new idea in distributed systems, but using DuckDB as the local analytical core makes it feel lightweight.

In the Smallpond repository, the basic example is simple: read Parquet, repartition by a key, run SQL, and write Parquet output again. The README also mentions a GraySort benchmark where Smallpond processed more than 100 TiB of data on a cluster. That is a strong reminder that not every scalable system needs to look like the traditional Hadoop/Spark stack.

The deeper lesson for me is this:

Sometimes the best distributed architecture is not one giant distributed database. Sometimes it is thousands of small, predictable local engines coordinated well.

That idea maps nicely to modern AI pipelines.

Training a model is not only about GPUs. Before the GPU sees anything, there is a long chain of data cleaning, deduplication, filtering, tokenization, feature extraction, metadata joins, quality checks, and batch generation. A lot of that work is analytical. A lot of it is file-based. And a lot of it can be pushed close to storage.

DuckDB is very good at that style of work.

Where DuckDB used to hurt

DuckDB's strength has always been its simplicity: it is an in-process analytical database.

But the same simplicity creates a limitation.

If you have one Python process, one CLI session, or one application working with a DuckDB database file, everything feels clean. But once multiple processes want to write to the same database file, you quickly run into locking and concurrency constraints.

That is not a bug. It is part of the design.

DuckDB was originally optimized for embedded OLAP workloads, not for being a shared multi-client server like PostgreSQL.

In my own projects, I usually solved this by avoiding shared writes completely:

write partitioned Parquet instead of writing into one shared database file
let each worker produce immutable output
use object storage as the coordination layer
run a final compaction or merge step later
keep DuckDB as the query engine, not the source of truth

This works well, but it has trade-offs. You start building your own small coordination layer. You need naming conventions, idempotent writes, retry logic, cleanup jobs, and sometimes a metadata database just to track what happened.
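
As an illustration of the "naming conventions" and "idempotent writes" part, here is a minimal sketch for a local or POSIX-like filesystem. The file naming scheme and function name are assumptions, and the bytes are assumed to be an already-encoded Parquet file. Object stores have their own semantics and would need a different mechanism:

```python
import os
import tempfile


def write_partition(out_dir: str, job_id: str, partition: int, data: bytes) -> str:
    """Write one immutable output file; safe to call again on retry."""
    # Deterministic name: a retried task targets the same path.
    final_path = os.path.join(out_dir, f"{job_id}-part-{partition:05d}.parquet")
    if os.path.exists(final_path):
        return final_path  # an earlier attempt already finished
    # Write to a temp file in the same directory, then rename atomically,
    # so readers never observe a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

The point of the sketch is the shape, not the details: deterministic names make retries harmless, and the rename step keeps partial output invisible.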

That is why Quack is so interesting.

Quack: DuckDB starts speaking over the network

In 2026, Hannes Mühleisen introduced Quack, a remote protocol that turns DuckDB into a client-server database.

The idea is elegant: both the client and the server are DuckDB instances, but they communicate through the quack: protocol. The server owns the data. The client sends queries. The heavy work happens near the data, and the result comes back to the client.

A simplified example looks like this:

INSTALL quack FROM core_nightly;
LOAD quack;

CREATE SECRET (
    TYPE quack,
    TOKEN 'super_secret'
);

ATTACH 'quack:bigserver:9494' AS remote;

SELECT customer_id, sum(amount) AS total_amount
FROM remote.transactions
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10;

This is not just a nicer syntax. It changes the deployment model.

Before Quack, DuckDB was mostly local-first. With Quack, DuckDB can become remote-first when needed.

That means:

the data can stay on a powerful server
laptop clients can query without downloading huge datasets
multiple clients can connect to the same DuckDB server
DuckDB can be used in more traditional application architectures
DuckDB-Wasm and browser-based analytical tools become more interesting

The official DuckDB documentation describes Quack as an RPC protocol for DuckDB and mentions use cases like concurrent read-write access, moving computation closer to data, and querying powerful servers from local clients.

For me, the key phrase is: compute near data.

That is one of the most important ideas in data engineering.

Moving 500 GB to a laptop is a bad plan. Sending a SQL query to the machine that already has the data is a better plan.

A prototype I would actually build

The first thing I wanted to try with this architecture was not a huge AI training pipeline. It was something more realistic: a lightweight analytics service for event data.

Imagine this setup:

application events are written as Parquet files to object storage
a small ingestion service batches new events
DuckDB reads and validates those files locally
embeddings are generated for selected text fields
analytical metadata is stored in DuckDB or DuckLake
vector search is handled by a dedicated vector database
Quack exposes the central DuckDB instance to internal tools

This kind of architecture is attractive because each tool does one job well.

DuckDB is great for analytical SQL.

Object storage is great for cheap durable files.

A vector database is great for similarity search.

Quack becomes the bridge that lets multiple clients query the analytical layer without copying everything locally.

Where vector databases fit into this story

A vector database stores embeddings instead of just rows and columns.

An embedding is a numerical representation of text, image, audio, code, or another object. For example, a support ticket like:

“The payment failed after I changed my billing address.”

can be converted into a vector such as:

[0.012, -0.441, 0.087, ...]

The numbers themselves are not meaningful to humans, but their position in vector space captures semantic meaning. Similar texts produce vectors that are close to each other.
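
"Close to each other" is usually measured with cosine similarity. A minimal sketch in plain Python, where any vectors you pass in are toy inputs rather than output from a real embedding model:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and unrelated (orthogonal) vectors score 0. Real systems do the same comparison over hundreds or thousands of dimensions.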

That enables queries like:

Find tickets semantically similar to this new complaint.

Traditional SQL is not designed for that kind of similarity search. SQL is excellent when you know the exact fields and predicates:

WHERE status = 'failed'
AND country = 'AM'

Vector search is different:

Find documents close to this embedding.

This is why systems like Qdrant, Milvus, Weaviate, Pinecone, pgvector, and others became popular. They use indexes such as HNSW or IVF to make nearest-neighbor search fast.

But here is the important part: vector search alone is rarely enough.

In production, you usually need hybrid retrieval:

vector similarity for semantic meaning
SQL filters for structured constraints
full-text search for exact keywords
metadata joins for permissions, customers, time ranges, or product categories
analytical queries to evaluate quality and drift
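
A toy version of the first two items, structured filter plus vector ranking, looks like this. It is a sketch under stated assumptions: documents are plain dicts, the field names are made up, and embeddings are already normalized to unit length so that the dot product equals cosine similarity:

```python
def hybrid_search(query_vec, docs, *, country, top_k=3):
    """Apply a structured filter first, then rank survivors by similarity."""
    # Structured constraint, the kind of predicate SQL is good at.
    candidates = [d for d in docs if d["country"] == country]

    def score(doc):
        # Dot product of unit vectors == cosine similarity.
        return sum(x * y for x, y in zip(query_vec, doc["embedding"]))

    ranked = sorted(candidates, key=score, reverse=True)
    return [d["id"] for d in ranked[:top_k]]
```

A real vector database would use an approximate index instead of scoring every candidate, but the contract is the same: filters narrow the set, similarity orders it.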

That is where DuckDB becomes valuable again.

I do not want my vector database to become my entire analytics platform. I want it to retrieve candidates. Then I want SQL to inspect, filter, aggregate, evaluate, and debug the system.

For example:

SELECT source, count(*) AS total, avg(score) AS avg_score
FROM retrieval_logs
WHERE created_at >= now() - INTERVAL '7 days'
GROUP BY source
ORDER BY avg_score DESC;

This kind of query belongs naturally in DuckDB or a lakehouse layer, not inside the vector database.

Why Quack makes this architecture cleaner

Without Quack, I would normally run DuckDB locally inside each service and write files back to object storage. That is still a good pattern. But it makes interactive querying harder.

With Quack, I can imagine a cleaner workflow:

ETL workers process raw data locally with DuckDB.
Processed files are written to object storage.
A central DuckDB/DuckLake server exposes curated tables.
Internal tools connect through Quack.
BI dashboards query the same analytical layer.
Vector search services write retrieval logs back into the lake.
Engineers debug everything with SQL.

This is not a replacement for every data warehouse. It is not a replacement for Kafka, Spark, PostgreSQL, or a vector database.

But it is a powerful middle layer.

It fits the space where many teams actually live: too much data for one Pandas script, but not enough operational complexity to justify a full big-data platform.

The part I like most as an engineer

What I like about this direction is that it respects mechanical sympathy.

DuckDB is fast because it understands analytical execution: vectorized processing, columnar storage, efficient scans, smart Parquet reads, and local execution without network overhead.

Smallpond says: keep that local execution model, but run it many times in parallel.

Quack says: keep DuckDB's engine, but allow it to communicate when a shared server model is useful.

That is a healthy evolution.

It does not throw away the original design. It extends it.

As someone who has spent too much time debugging over-engineered pipelines, I appreciate systems that scale by composition instead of magic.

What I would be careful about

I would still be cautious before using Quack as the core of a production system today.

The official page describes it as a beta release. That matters.

Before depending on it, I would test:

write concurrency under real workload
authentication and network exposure
backup and restore strategy
failure behavior during long-running writes
compatibility with existing DuckDB extensions
observability and query logging
behavior behind load balancers or proxies
performance with many small writes versus large batches

I would also avoid pretending DuckDB suddenly became PostgreSQL.

DuckDB is still fundamentally an analytical engine. Even if Quack makes multi-client access possible, I would not immediately use it for high-volume OLTP workloads like payments, orders, or user sessions.

For those, PostgreSQL is still the boring and correct answer.

But for analytical workloads, internal dashboards, data pipelines, AI preprocessing, evaluation datasets, batch transformations, and lakehouse-style metadata, Quack opens a very interesting door.

Final thought

The story from DeepSeek's Smallpond to DuckDB's Quack is not just about one tool becoming distributed.

It is about a shift in how we think about data systems.

For years, the default answer to scale was often: use a bigger distributed framework.

Now we are seeing another pattern:

keep compute simple
keep data in open formats
run fast local engines near the data
coordinate through lightweight protocols
use specialized systems where they actually make sense

That is why this space is exciting.

DuckDB made local analytics feel simple.

Smallpond showed that many local DuckDB jobs can become a serious distributed processing pattern.

Quack now makes DuckDB instances talk to each other.

And when you combine that with object storage, DuckLake, Parquet, and vector databases, you get a very pragmatic architecture for modern AI and data engineering.

Not because it is trendy.

Because it removes unnecessary complexity.

Understanding PID Namespaces: The Small Linux Feature Behind Container Process Isolation

amir — Mon, 18 May 2026 18:11:35 +0000

When people first learn containers, they usually hear this sentence:

“A container is just a process.”

That sentence is true, but incomplete.

A better version is:

“A container is a regular Linux process running with a different view of the system.”

One of the most important parts of that different view is the PID namespace.

A PID namespace controls what processes a process can see and what process IDs look like from inside that environment. It is one of the Linux kernel features that makes containers feel isolated, even though everything is still running on the same host kernel.

Docker, containerd, runc, Kubernetes, and even small learning projects like a tiny Docker-like runtime all rely on this idea.

What problem does a PID namespace solve?

On a normal Linux machine, every process has a PID:

ps aux

You may see things like:

PID 1      systemd
PID 842    sshd
PID 1201   nginx
PID 2300   node

Without PID isolation, a process inside a container could see host processes. That would be noisy, confusing, and dangerous.

With a PID namespace, the container gets its own process ID view.

Inside the container:

PID 1      app
PID 7      worker
PID 12     shell

On the host, those same processes still have real host PIDs:

PID 34520  app
PID 34541  worker
PID 34610  shell

So the same process can have two identities:

one PID inside the container
another PID on the host

This is not magic. It is namespace-based translation done by the Linux kernel.
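
On Linux 4.1 and later, the kernel exposes both identities in the NSpid line of /proc/<pid>/status. A minimal sketch that parses that line. The sample text in the usage note is hand-written to match the example above, not captured from a real system:

```python
def parse_nspid(status_text: str) -> list:
    """Return the PIDs listed on the NSpid line of /proc/<pid>/status.

    The first value is the PID as seen from the namespace of the procfs
    being read; later values are the PIDs in successively nested namespaces.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernel too old, or not a status file
```

Read from the host, a status file containing "NSpid: 34520 1" would mean host PID 34520 and PID 1 inside the container.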

PID 1 is not just “the first process”

A very common beginner mistake is thinking PID 1 is only a number.

It is not.

Inside a PID namespace, the first process becomes PID 1, and PID 1 has special responsibilities.

In a normal Linux system, PID 1 is usually systemd or another init system. In a container, PID 1 might be your application:

docker run my-api

If your app becomes PID 1 directly, it now behaves like the init process of that namespace.

That matters because PID 1 is responsible for handling orphaned child processes and reaping zombies. The Linux man pages describe the first process in a new PID namespace as the namespace init process, and orphaned children in that namespace are reparented to it.

This is why senior engineers often care about tiny init processes like:

tini
dumb-init

Without a proper init process, long-running containers can slowly accumulate zombie processes.

A container may look healthy from the outside, but inside it can be leaking process table entries because PID 1 is not doing its job.
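
The reaping job itself is small. Here is a minimal sketch of the core loop an init process runs, typically whenever it receives SIGCHLD. This is an illustration of the idea, not how tini or dumb-init are implemented:

```python
import os


def reap_children() -> list:
    """Collect every already-exited child without blocking; return their PIDs."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately if none has exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An application that runs as PID 1 and never calls something like this leaves exited children in the process table as zombies.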

The senior-level lesson: containers are isolation, not virtualization

A VM gets its own kernel.

A container does not.

A container shares the host kernel, but gets isolated views using kernel features like:

PID namespaces
mount namespaces
network namespaces
UTS namespaces
IPC namespaces
user namespaces
cgroups

The PID namespace only isolates process visibility and PID numbering. It does not magically secure everything.

That is a critical mental model.

A PID namespace can stop a container from seeing host processes, but it does not protect you from:

dangerous Linux capabilities
privileged containers
host filesystem mounts
exposed Docker socket
weak seccomp, AppArmor, or SELinux profiles
kernel vulnerabilities
bad Kubernetes security context settings

This is why container security is usually about layers, not one feature.

How Docker uses PID namespaces

By default, Docker gives containers their own PID namespace.

Docker exposes this through the --pid option. The default mode isolates processes, while --pid=host makes the container use the host PID namespace.

Example:

docker run --rm -it ubuntu ps aux

Inside the container, you may see only a few processes.

But with host PID mode:

docker run --rm -it --pid=host ubuntu ps aux

The container can see host processes.

That flag is useful for debugging, monitoring, and observability tools, but it should be treated carefully. In production, --pid=host removes an important isolation boundary.

What is the “hash” inside `/proc/<pid>/ns/pid`?

When you inspect namespaces, you may see something like this:

readlink /proc/$$/ns/pid

Output:

pid:[4026531836]

People sometimes casually call this a “namespace hash”, but it is not a cryptographic hash.

It is a kernel namespace identifier exposed through procfs. Namespace references are shown as special symbolic links, and the number in brackets is the namespace's inode number, which tells you whether two processes are in the same namespace.

If two processes show the same namespace ID for pid, they share the same PID namespace.

Example:

readlink /proc/1/ns/pid
readlink /proc/$$/ns/pid

If both return the same value, both processes are in the same PID namespace.

This is very useful for debugging containers.
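
The same comparison can be scripted. A minimal sketch, assuming a Linux host with procfs mounted. The helper names are illustrative, and reading another user's process may require root:

```python
import os
import re

# Link targets look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<type>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(link: str):
    """Split a namespace link target into (type, inode number)."""
    m = NS_LINK.match(link)
    if m is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return m["type"], int(m["inode"])


def same_pid_namespace(pid_a: int, pid_b: int) -> bool:
    """Return True if both processes share a PID namespace (Linux only)."""
    a = os.readlink(f"/proc/{pid_a}/ns/pid")
    b = os.readlink(f"/proc/{pid_b}/ns/pid")
    return parse_ns_link(a) == parse_ns_link(b)
```

Run from the host, same_pid_namespace(1, <host_pid_of_container_process>) returning False is a quick confirmation that the container has its own PID namespace.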

How to check PID namespace isolation

From inside a container:

ps aux

If you only see the container’s own processes, PID isolation is probably enabled.

Check the namespace ID:

readlink /proc/1/ns/pid
readlink /proc/$$/ns/pid

From the host, inspect a container process:

docker inspect --format '{{.State.Pid}}' <container_id>

Then:

readlink /proc/<host_pid>/ns/pid

You can compare namespace IDs between host processes and container processes.

Another useful command:

lsns -t pid

This shows PID namespaces on the system.

For deeper debugging:

pstree -p

or:

ps -eo pid,ppid,cmd

The trick is to always remember that the host sees the full truth, while the container sees a translated view.

How PID namespace isolation can be weakened

This is where many real-world mistakes happen.

PID namespaces are not usually “bypassed” by magic. They are usually weakened by configuration choices.

Here are common examples.

1. Running with host PID namespace

--pid=host

This makes the container see host processes.

Sometimes this is used by monitoring tools, but it should not be the default for normal application containers.

2. Running privileged containers

--privileged

A privileged container receives broad access that removes many normal container restrictions.

This is sometimes convenient during development, but it should be avoided for normal production workloads.

3. Mounting sensitive host paths

Examples:

-v /proc:/host/proc
-v /:/host
-v /var/run/docker.sock:/var/run/docker.sock

Mounting the Docker socket is especially dangerous because it can effectively give control over the Docker daemon.

4. Adding dangerous capabilities

Capabilities such as these should be reviewed carefully:

SYS_ADMIN
SYS_PTRACE
NET_ADMIN
DAC_READ_SEARCH

For PID and process security, SYS_PTRACE is especially sensitive because it relates to inspecting and tracing processes.
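
One way to review this on a running process is to decode the CapEff bitmask from /proc/<pid>/status. A minimal sketch: the bit positions come from the kernel's capability numbering, and the function name is illustrative:

```python
# Bit positions from <linux/capability.h>.
SENSITIVE_CAPS = {
    2: "DAC_READ_SEARCH",
    12: "NET_ADMIN",
    19: "SYS_PTRACE",
    21: "SYS_ADMIN",
}


def sensitive_caps(cap_eff_hex: str) -> list:
    """Return the sensitive capabilities set in a CapEff hex mask."""
    mask = int(cap_eff_hex, 16)
    return [
        name
        for bit, name in sorted(SENSITIVE_CAPS.items())
        if mask & (1 << bit)
    ]
```

The input is the hex string after "CapEff:" in the status file, for example from grep CapEff /proc/<pid>/status.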

5. Weak Kubernetes security context

In Kubernetes, settings like these are important:

hostPID: true
privileged: true
allowPrivilegeEscalation: true

For normal workloads, these should usually be avoided.

Defensive checklist for real projects

When reviewing a containerized service, I usually ask these questions.

Runtime

docker inspect <container_id> | grep -i pid

Check whether the container is using host PID mode.

Capabilities

docker inspect <container_id> | grep -i cap

Prefer dropping unnecessary capabilities:

--cap-drop=ALL

Then add back only what is truly required.

Privileged mode

docker inspect <container_id> | grep -i privileged

For most application containers, this should be false.

Process tree

docker exec -it <container_id> ps aux

Look for zombie processes:

ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'

If you see zombies, check whether PID 1 is properly reaping children.

Namespace comparison

readlink /proc/1/ns/pid
readlink /proc/$$/ns/pid

Compare host and container namespace IDs.

Kubernetes

Check pod specs for:

hostPID: true
securityContext:
  privileged: true
  allowPrivilegeEscalation: true

These settings should be intentional, documented, and reviewed.
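
A review like this can be partly automated. Here is a minimal sketch that scans a pod manifest already loaded into a Python dict (for example from JSON). It only reports values that are set explicitly, and the function name is an assumption:

```python
def risky_pod_settings(pod: dict) -> list:
    """List the explicitly enabled risky settings in a pod manifest."""
    findings = []
    spec = pod.get("spec", {})
    if spec.get("hostPID"):
        findings.append("hostPID: true")
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}
        if sc.get("privileged"):
            findings.append(f"{name}: privileged: true")
        if sc.get("allowPrivilegeEscalation"):
            findings.append(f"{name}: allowPrivilegeEscalation: true")
    return findings
```

An empty result does not prove the pod is safe. It only means none of these three flags is turned on explicitly.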

A practical example from building a tiny container runtime

When building a minimal Docker-like runtime, PID namespace support usually starts with something like:

SysProcAttr: &syscall.SysProcAttr{
    // Linux only: start the child in a new PID namespace, where it becomes PID 1.
    Cloneflags: syscall.CLONE_NEWPID,
}

But there is a subtle detail.

When you create a new PID namespace, the child process becomes PID 1 inside that namespace. The parent still lives in the old namespace.

That means your runtime has to think carefully about:

who becomes PID 1
whether PID 1 launches the user command directly
whether you need a small init process
how signals are forwarded
how child processes are reaped
what happens when PID 1 exits

This is where the learning becomes real.

Creating a namespace is easy.

Managing a namespace correctly is the hard part.

Senior engineering lessons

1. Do not confuse isolation with security

PID namespaces provide process isolation, but they are only one part of the security model.

2. PID 1 behavior matters

If your application runs as PID 1, signal handling and zombie reaping become your problem.

3. Debugging containers requires two views

Always check both:

inside the container
from the host

The same process has different PIDs depending on where you look from.

4. Most “container escapes” start with bad configuration

In real systems, the issue is often not the PID namespace itself. The issue is combining weak settings:

privileged mode
host PID
host mounts
excessive capabilities
exposed Docker socket

5. Use namespaces intentionally

For observability tools, hostPID or --pid=host may be required.

For normal application workloads, it is usually unnecessary risk.

References

Linux man-pages: PID namespaces
Linux Kernel Documentation: Namespaces
Docker documentation: docker run --pid
OWASP Docker Security Cheat Sheet

Final thought

PID namespaces are one of those Linux features that look simple at first:

“The container gets its own process IDs.”

But after working with real systems, you realize the deeper lesson:

Process isolation is not only about hiding PIDs. It is about controlling visibility, lifecycle, signals, debugging, and failure boundaries.

That is why PID namespaces are not just a container feature.

They are a production engineering concept.

If you understand PID namespaces well, Docker feels less like magic and more like a thin layer over powerful Linux primitives.

Running My Tiny Docker-like Runtime on macOS with Lima

amir — Sun, 17 May 2026 14:40:30 +0000

Running My Tiny Docker-like Runtime on macOS with Lima: Lessons, Mistakes, and a Simple Benchmark

When I started building my own tiny Docker-like runtime in Go, I had one simple assumption:

“It is written in Go, so I should be able to run it anywhere.”

That assumption was only half correct.

Yes, Go makes it easy to compile binaries for different platforms. But a container runtime is not just a Go application. A container runtime depends heavily on operating system features, especially Linux kernel features.

In my case, the project needed things like:

Linux namespaces
cgroups v2
mount isolation
chroot
process isolation
bridge networking
veth pairs
iptables/NAT

And that is where macOS becomes a problem.

macOS is not Linux. It does not provide Linux namespaces or cgroups in the same way. So even if my code compiled on macOS, the actual container runtime logic could not work directly on macOS.

This was the point where I started using Lima.

In this article, I want to share how I used Lima to run my tiny Docker-like runtime from macOS, the mistakes I made, the small design decisions I learned from, and a simple experimental benchmark at the end.

This is not a “Lima vs Docker Desktop” article.

It is more about understanding the boundary between macOS, Linux, Docker, and a custom container runtime.

What I Was Building

The project is a small Docker-like runtime written in Go.

The goal was not to replace Docker.

The goal was to understand what Docker does under the hood.

Docker gives us a very clean developer experience:

docker run alpine echo hello

But behind that simple command, many things happen:

image resolution
filesystem preparation
namespace creation
cgroup configuration
mount setup
network setup
process execution
log tracking
metadata storage
cleanup

The official Docker documentation describes Docker as an open platform for developing, shipping, and running applications:

https://docs.docker.com/get-started/docker-overview/

That high-level explanation is useful, but when you build a tiny runtime yourself, Docker becomes much less magical.

You start seeing the lower-level Linux pieces.

For example, my runtime supports commands like:

tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh
tiny-docker-go ps
tiny-docker-go logs -f <container-id>
tiny-docker-go stop <container-id>

On Linux, this makes sense.

On macOS, it immediately raises a question:

Where do Linux namespaces and cgroups come from?

The answer is: they do not come from macOS.

I needed Linux.

The First Mistake: Thinking Go Portability Means Runtime Portability

My first mistake was confusing language portability with operating system feature portability.

I was thinking like this:

Go can build on macOS.
Therefore, my runtime should work on macOS.

But the correct mental model is:

Go can compile the program for macOS.
But Linux container primitives still require Linux.

A Go program can be cross-platform.

But this does not mean every syscall or kernel feature exists on every platform.

For example, when a container runtime wants to isolate a process, it may need Linux-specific features like:

CLONE_NEWUTS
CLONE_NEWPID
CLONE_NEWNS
CLONE_NEWNET
cgroup filesystem
mount operations
veth networking
iptables rules

None of these exist on macOS. They are Linux kernel interfaces, not features of the Go toolchain.

So the real problem was not the programming language.

The real problem was the kernel.

That was a very important lesson for me.
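
The same lesson applies in any language: the code can run on macOS, but the kernel feature has to be checked at runtime. A minimal sketch of such a guard in Python. The function name and the specific check are assumptions, not what my Go runtime does:

```python
import os
import sys


def linux_namespaces_available() -> bool:
    """Best-effort check: Linux platform and a PID namespace entry in procfs."""
    return sys.platform.startswith("linux") and os.path.exists("/proc/self/ns/pid")
```

A runtime can call a guard like this at startup and exit with a clear message instead of failing later on a missing syscall.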

Why macOS Cannot Run This Directly

macOS uses the XNU kernel.

Linux containers depend on the Linux kernel.

This matters because containers are not virtual machines. A container is usually a regular process with a restricted view of the system.

That restricted view is created by kernel features.

For example:

PID namespace      -> gives the process its own process tree
UTS namespace      -> gives the process its own hostname
mount namespace    -> gives the process its own mount view
network namespace  -> gives the process its own network stack
cgroups            -> limit and track resource usage
chroot/rootfs      -> changes the visible filesystem root

On a Linux machine, a runtime can call these features directly.

On macOS, the features are not available in the same way.

So the architecture had to change.

Instead of this:

macOS
  -> tiny-docker-go
      -> Linux namespaces/cgroups

I needed this:

macOS
  -> Linux VM
      -> tiny-docker-go
          -> Linux namespaces/cgroups

This is where Lima became useful.

What Is Lima?

Lima is a tool that runs Linux virtual machines on macOS.

Official documentation:

https://lima-vm.io/docs/

Installation guide:

https://lima-vm.io/docs/installation/

The important thing is this:

Lima is not Docker.
Lima is not my container runtime.
Lima gives me a Linux VM.

That Linux VM gives my project access to the Linux kernel features it needs.

A simple mental model:

MacBook
  └── macOS
       └── Lima VM
            └── Linux
                 └── my tiny Docker-like runtime
                      └── container-like process

This separation helped me understand the problem much better.

Lima is the environment.

My Go runtime is the thing doing the container work.

Alpine rootfs is the container filesystem.

Why I Did Not Just Use Docker Desktop

Docker Desktop is great.

I use Docker Desktop for normal development work.

But for this project, Docker Desktop was not the cleanest learning environment.

Docker Desktop itself uses a Linux VM behind the scenes on macOS. That is how Docker can run Linux containers on macOS.

But I was not trying to simply run containers.

I was trying to build a small runtime that behaves like a container runtime.

So if I put everything behind Docker Desktop too early, I would hide some of the details I wanted to learn.

My goal was not:

How do I run an app in Docker?

My goal was:

How does a container runtime use Linux features to isolate a process?

For that goal, Lima felt cleaner.

The distinction became:

Docker Desktop:
  Great for running Docker containers and application stacks.

Lima:
  Great for getting a Linux environment on macOS and experimenting with Linux internals.

So for this project, Lima gave me a better learning path.

Installing Lima

On macOS, installing Lima with Homebrew is simple:

brew install lima

Then I created a VM for the project:

limactl start --name=tiny-docker --cpus=4 --memory=4 --disk=20

Then I entered the VM:

limactl shell tiny-docker

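or, when the LIMA_INSTANCE environment variable is set to tiny-docker (the lima shortcut targets the instance named default otherwise), simply: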

lima

Inside the VM, I installed the Linux packages my runtime needed:

sudo apt update
sudo apt install -y golang-go curl tar iproute2 iptables

Each dependency had a reason:

golang-go  -> build and test the runtime
curl       -> download rootfs archives
tar        -> extract rootfs archives
iproute2   -> work with Linux networking
iptables   -> configure NAT for isolated networking

This was already a useful learning point.

A container runtime is not just one binary.

It also depends on Linux system capabilities and tools, especially if you are implementing networking.

Preparing the Root Filesystem

A container needs a filesystem.

Docker normally handles this using images and layers.

My project was simpler. I used an Alpine minirootfs.

Inside the Lima VM:

mkdir -p rootfs/alpine

ARCH=$(uname -m)

curl -L -o alpine-rootfs.tar.gz \
  "https://dl-cdn.alpinelinux.org/alpine/latest-stable/releases/${ARCH}/alpine-minirootfs-3.23.4-${ARCH}.tar.gz"

sudo tar -xzf alpine-rootfs.tar.gz -C rootfs/alpine

Then I could run:

sudo ./tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh

At this point, the rootfs became the filesystem that the process sees inside the container-like environment.

This made the concept very concrete for me.

Before this project, I mostly thought about Docker images.

After this project, I started thinking more clearly about root filesystems.

A simplified version is:

Docker image:
  A packaged filesystem with metadata and layers.

Rootfs:
  The actual filesystem view used by the container process.

My tiny runtime does not implement Docker image layers, registries, manifests, or OCI image pulling.

It simply uses an extracted root filesystem.

That is enough for learning.

Architecture: macOS as a Wrapper, Linux as the Runtime

After experimenting, I ended up with this architecture:

If host is Linux:
  run the runtime directly.

If host is macOS:
  route the command through Lima.
  execute the Linux binary inside the Lima VM.

From the user's perspective, I wanted the command to still feel simple:

tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh

But internally, on macOS, it becomes closer to:

limactl shell tiny-docker sudo ./bin/tiny-docker-go-linux-amd64 \
  run \
  --rootfs ./rootfs/alpine \
  /bin/sh

So the macOS binary acts more like a dispatcher.

The actual runtime work happens in Linux.
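The dispatcher idea can be sketched roughly like this. It is an illustration under assumptions, not the real tiny-docker-go code: the helper name `limaArgv` is hypothetical, and the instance name and binary path are simply the ones used in the examples in this article.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// limaArgv builds the arguments for `limactl shell`, which runs a
// command inside the named Lima instance. (Hypothetical helper.)
func limaArgv(instance, linuxBinary string, args []string) []string {
	argv := []string{"shell", instance, "sudo", linuxBinary}
	return append(argv, args...)
}

func main() {
	args := os.Args[1:]
	if runtime.GOOS != "darwin" {
		// On Linux the runtime would do the work itself.
		fmt.Println("would run directly on this host:", args)
		return
	}
	// On macOS, forward the same arguments to the Linux binary in the VM.
	cmd := exec.Command("limactl", limaArgv("tiny-docker", "./bin/tiny-docker-go-linux-amd64", args)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "lima dispatch failed:", err)
		os.Exit(1)
	}
}
```

Keeping the argument construction in its own function makes the routing easy to test without a VM.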

A simplified architecture:

macOS terminal
  |
  | tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh
  v
Darwin service layer
  |
  | limactl shell tiny-docker sudo Linux binary ...
  v
Lima VM
  |
  v
Linux runtime binary
  |
  v
namespaces + cgroups + chroot + networking

This design helped me keep a clean boundary.

macOS handles the developer command.

Linux handles the container primitives.

Building the Linux Binary

One mistake I made was forgetting that the binary inside Lima must match the Linux VM architecture.

For example, if the Lima VM is x86_64, I can build:

mkdir -p bin

GOOS=linux GOARCH=amd64 \
  go build -o bin/tiny-docker-go-linux-amd64 ./cmd/tiny-docker-go

But if the Lima VM is aarch64, for example on Apple Silicon, I should build:

mkdir -p bin

GOOS=linux GOARCH=arm64 \
  go build -o bin/tiny-docker-go-linux-arm64 ./cmd/tiny-docker-go

To check the VM architecture:

limactl shell tiny-docker uname -m

Possible outputs:

x86_64   -> GOARCH=amd64
aarch64  -> GOARCH=arm64
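That mapping is small enough to encode directly. A minimal sketch, assuming only the two architectures above need to be handled; the function name `goarchFor` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// goarchFor maps `uname -m` output from the VM to the GOARCH value to
// build for. Only the two architectures discussed above are handled.
func goarchFor(machine string) (string, error) {
	switch machine {
	case "x86_64":
		return "amd64", nil
	case "aarch64":
		return "arm64", nil
	default:
		return "", fmt.Errorf("unsupported VM architecture %q", machine)
	}
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: goarch <output of uname -m>")
		os.Exit(2)
	}
	arch, err := goarchFor(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// Print the binary name that matches the VM.
	fmt.Printf("bin/tiny-docker-go-linux-%s\n", arch)
}
```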

If you build the wrong architecture, you may see:

exec format error

This error is simple but confusing when you first see it.

It usually means:

The binary architecture does not match the machine trying to execute it.

This was one of those small details that reminded me how important platform boundaries are.

Mistake: Assuming Host Paths Always Exist Inside Lima

Another mistake was around file sharing.

My project existed on macOS at something like:

/Users/amir/Desktop/tiny-docker

I expected that path to always work inside Lima.

Sometimes it did.

Sometimes the mount configuration was not what I expected.

So if I ran this inside the VM:

cd /Users/amir/Desktop/tiny-docker

and got:

No such file or directory

the problem was not my Go code.

It was not the runtime.

It was simply a shared folder issue.

The lesson was:

Always verify that the path exists inside the VM, not only on the host.

Useful checks:

limactl shell tiny-docker pwd
limactl shell tiny-docker ls -la /Users
limactl shell tiny-docker ls -la /Users/amir/Desktop

If the project is not mounted, the easiest workaround is to clone the repository inside the VM:

git clone <your-repo-url>
cd tiny-docker

The cleaner long-term solution is to configure Lima mounts properly.

But the important lesson is that a VM has its own filesystem view.

Never assume the host path exists inside the guest.

Mistake: Running the Wrong Binary

Another mistake was running the macOS binary when I actually needed the Linux binary.

This is easy to do when you have files like:

./tiny-docker-go
./bin/tiny-docker-go-linux-amd64
./bin/tiny-docker-go-linux-arm64

The macOS binary can be useful as a CLI wrapper.

But the Linux binary must perform the real runtime operations.

The separation became:

macOS binary:
  command parsing
  platform detection
  Lima dispatching

Linux binary:
  namespaces
  cgroups
  chroot
  mount setup
  networking
  process lifecycle

This made the code easier to reason about.

On macOS, I do not pretend to support Linux container primitives directly.

I route the work to the Linux VM.

Mistake: Not Checking Prerequisites Early

At first, failures happened too late.

For example, I could run a command and only later discover:

limactl is not installed
Lima instance does not exist
Lima instance is not running
Linux binary is missing
Rootfs is not accessible inside Lima

This created confusing errors.

So I started adding validation before running the actual command.

Good prerequisite checks include:

Is limactl installed?
Does the Lima instance exist?
Is the Lima instance running?
Does the Linux binary exist?
Is the Linux binary accessible inside Lima?
Is the rootfs path accessible inside Lima?

This improves developer experience a lot.

Instead of a low-level error, I want an error like:

Linux binary not found at "./bin/tiny-docker-go-linux-amd64";
build it first and share it with Lima.

or:

rootfs "./rootfs/alpine" is not accessible inside Lima;
ensure the workspace is shared with the VM.
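One way to structure such checks is a list of probes, each paired with the message to show when it fails. This is a sketch under assumptions, not the project's actual validation code: the `check` type and the `preflight` and `fileExists` helpers are hypothetical, and only two of the checks above are shown.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// check pairs a probe with the actionable message shown when it fails.
type check struct {
	probe func() error
	hint  string
}

// preflight runs the checks in order and stops at the first failure,
// wrapping the low-level error in the human-readable hint.
func preflight(checks []check) error {
	for _, c := range checks {
		if err := c.probe(); err != nil {
			return fmt.Errorf("%s (%w)", c.hint, err)
		}
	}
	return nil
}

// fileExists returns a probe that fails when path cannot be stat'ed.
func fileExists(path string) func() error {
	return func() error {
		_, err := os.Stat(path)
		return err
	}
}

func main() {
	linuxBinary := "./bin/tiny-docker-go-linux-amd64" // path assumed from this article
	checks := []check{
		{
			probe: func() error { _, err := exec.LookPath("limactl"); return err },
			hint:  "limactl is not installed; install Lima first",
		},
		{
			probe: fileExists(linuxBinary),
			hint:  fmt.Sprintf("Linux binary not found at %q; build it first and share it with Lima", linuxBinary),
		},
	}
	if err := preflight(checks); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("all prerequisites satisfied")
}
```

Checks that need the VM, such as whether the rootfs is visible inside Lima, would be additional probes in the same list.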

This is not the most exciting part of a runtime project.

But it is an important engineering detail.

As a senior engineer, I have learned that good error messages are part of the product.

Even if the product is just a learning project.

Example: Running a Shell

After preparing the rootfs and building the runtime, I can run:

sudo ./tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh

Inside the shell:

cat /etc/os-release

Example output:

NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.23.4

Check the hostname:

hostname

Run a process:

sleep 30

From another terminal:

sudo ./tiny-docker-go ps

Example output:

ID            STATUS   PID   CREATED              COMMAND
ab12cd34ef56  running  1234  2026-05-17 12:30:45  /bin/sh

This helped me understand process tracking better.

A container runtime does not only start processes.

It also needs to track them, store metadata, collect logs, stop them, and clean up after them.

Example: Running From macOS Through Lima

From macOS, I wanted a command like this:

./tiny-docker-go run --rootfs ./rootfs/alpine /bin/echo "hello from linux"

Internally, the command is routed through Lima:

limactl shell tiny-docker sudo ./bin/tiny-docker-go-linux-amd64 \
  run \
  --rootfs ./rootfs/alpine \
  /bin/echo "hello from linux"

Expected output:

hello from linux

This gave me a nice workflow.

I could stay in my macOS terminal, but still execute the real Linux runtime inside the VM.

Example: Testing a Memory Limit

If cgroup v2 is available and the runtime supports memory limits, I can run:

sudo ./tiny-docker-go run \
  --memory 128m \
  --rootfs ./rootfs/alpine \
  /bin/sh

Inside the container-like shell:

cat /sys/fs/cgroup/memory.max

Expected output:

134217728

That is 128 MiB in bytes.
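The conversion from a flag value like 128m to that byte count can be sketched as follows. This is an assumption about how such a flag could be parsed, not the project's actual parser; `parseMemory` is a hypothetical name, and it treats k, m, and g as binary units.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemory converts a size such as "128m" into bytes, using binary
// units (1k = 1024 bytes). A bare number is taken as bytes.
func parseMemory(input string) (int64, error) {
	s := strings.ToLower(strings.TrimSpace(input))
	if s == "" {
		return 0, fmt.Errorf("empty memory limit")
	}
	multiplier := int64(1)
	switch s[len(s)-1] {
	case 'k':
		multiplier = 1 << 10
	case 'm':
		multiplier = 1 << 20
	case 'g':
		multiplier = 1 << 30
	}
	if multiplier != 1 {
		s = s[:len(s)-1] // drop the unit suffix
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid memory limit %q", input)
	}
	return n * multiplier, nil
}

func main() {
	limit, err := parseMemory("128m")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The runtime would write this number to memory.max in the
	// container's cgroup directory.
	fmt.Println(limit) // prints 134217728
}
```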

This small test made cgroups much more real for me.

Before this project, a memory limit felt like a Docker CLI option:

docker run --memory 128m alpine

After implementing a small version, I started seeing it differently:

Docker exposes a nice option.
The Linux kernel enforces the limit through cgroups.

That is a very different level of understanding.

Example: Isolated Networking

For networking, the runtime can create a basic isolated mode using Linux networking primitives.

The rough idea is:

host bridge
  |
  veth pair
  |
container network namespace

Run:

sudo ./tiny-docker-go run \
  --net isolated \
  --rootfs ./rootfs/alpine \
  /bin/sh

Inside the shell:

ip addr
ip route

Depending on the implementation, I expect to see a container-side interface and a default route.
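To make the bridge and veth picture more concrete, here is a generic sketch of the iproute2 commands such a setup typically involves. It is not the runtime's actual networking code. The bridge name td0, the interface names, and the helper `vethPlan` are all hypothetical, and the program only prints the commands instead of running them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vethPlan returns the `ip` invocations that connect a container's
// network namespace (identified by the container PID) to a host bridge.
// The bridge itself is assumed to exist already.
func vethPlan(bridge, hostIf, ctrIf string, pid int) [][]string {
	return [][]string{
		// Create the veth pair: two connected virtual interfaces.
		{"ip", "link", "add", hostIf, "type", "veth", "peer", "name", ctrIf},
		// Attach the host end to the bridge and bring it up.
		{"ip", "link", "set", hostIf, "master", bridge},
		{"ip", "link", "set", hostIf, "up"},
		// Move the other end into the container's network namespace.
		{"ip", "link", "set", ctrIf, "netns", strconv.Itoa(pid)},
	}
}

func main() {
	for _, argv := range vethPlan("td0", "veth-host", "veth-ctr", 1234) {
		fmt.Println(strings.Join(argv, " "))
	}
}
```

Assigning an address and a default route inside the namespace, plus NAT on the host, would follow these steps.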

This was one of the most interesting parts for me.

Docker networking feels simple from the outside:

docker run nginx

But underneath, there is a lot of Linux networking:

network namespaces
veth pairs
bridges
routes
iptables
NAT

Building even a small version made me appreciate how much complexity Docker hides.

What I Learned About chroot

My runtime uses chroot as a simple way to change the visible root filesystem.

For learning, this is useful.

But chroot is not the same as a full production container filesystem model.

With chroot, the process sees a different root directory:

Before:
/

After:
./rootfs/alpine becomes /

But production runtimes usually involve more advanced concepts:

pivot_root
overlay filesystems
image layers
OCI runtime spec
capability dropping
seccomp
AppArmor
SELinux
user namespaces
read-only mounts
masked paths

So I try to be careful when describing the project.

It is better to say:

This is a tiny educational runtime.
It demonstrates some basic container building blocks.
It is not a production replacement for Docker, containerd, or runc.

That honesty matters.

Learning projects are valuable, but they should not be oversold.

My Updated Mental Model of Docker

Before this project, I mostly used Docker at the command level:

docker build
docker run
docker ps
docker logs
docker stop

After building a tiny runtime, I started seeing Docker in layers:

Docker CLI
Docker daemon
containerd
runc
Linux namespaces
Linux cgroups
root filesystem
network namespace
mount namespace
process lifecycle

A command like this:

docker run alpine echo hello

looks small.

But conceptually, it involves:

resolving the image
downloading layers
preparing the root filesystem
creating namespaces
configuring cgroups
setting up mounts
configuring networking
starting the process
attaching stdio
tracking metadata
collecting exit status
cleaning up resources

My tiny runtime only implements a small part of this.

But that small part was enough to make Docker feel less like magic.

Lima vs Docker Desktop: My Practical Conclusion

I do not see Lima and Docker Desktop as direct replacements for each other in every situation.

For normal application development, Docker Desktop is usually more convenient.

It gives me:

Docker CLI
Docker Compose
image management
container lifecycle management
volume support
networking
developer-friendly tooling

But for learning Linux container internals, Lima gave me a cleaner mental model.

Lima gave me:

a Linux VM
direct access to Linux tools
a clean environment for experiments
less abstraction around Docker itself

So my conclusion is:

Use Docker Desktop when your goal is to run and ship applications.
Use Lima when your goal is to understand or control the Linux environment.

For this project, Lima was the better learning tool.

Simple Experimental Benchmark

This benchmark is not scientific.

I only wanted to understand the rough overhead of routing commands from macOS through Lima compared with running directly inside the Lima VM.

The command I tested was intentionally small:

/bin/echo hello

Because the command is tiny, the overhead of the runtime and VM boundary becomes easier to notice.

Test 1: Running Directly Inside Lima

Inside the Lima VM:

time sudo ./tiny-docker-go run --rootfs ./rootfs/alpine /bin/echo hello

Example result:

hello

real    0m0.045s
user    0m0.008s
sys     0m0.020s

Test 2: Running From macOS Through Lima

From macOS:

time ./tiny-docker-go run --rootfs ./rootfs/alpine /bin/echo hello

Internally, this routes through something like:

limactl shell tiny-docker sudo ./bin/tiny-docker-go-linux-amd64 \
  run \
  --rootfs ./rootfs/alpine \
  /bin/echo hello

Example result:

hello

real    0m0.180s
user    0m0.020s
sys     0m0.030s

Benchmark Interpretation

The direct Linux execution was faster.

The macOS-to-Lima path had extra overhead because the command crossed the VM boundary through limactl shell.

In this rough experiment:

Direct inside Lima:       ~45 ms
macOS through Lima:       ~180 ms
Extra routing overhead:   ~135 ms
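A single `time` run is noisy. A small harness that averages several runs gives a steadier number. This is a sketch, not what I used for the figures above; `averageRuntime` is a hypothetical helper and the run count of 10 is arbitrary.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// averageRuntime calls fn the given number of times and returns the
// mean wall-clock duration of one call.
func averageRuntime(runs int, fn func() error) (time.Duration, error) {
	if runs <= 0 {
		return 0, fmt.Errorf("runs must be positive, got %d", runs)
	}
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		if err := fn(); err != nil {
			return 0, err
		}
		total += time.Since(start)
	}
	return total / time.Duration(runs), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: bench <command> [args...]")
		os.Exit(2)
	}
	avg, err := averageRuntime(10, func() error {
		return exec.Command(os.Args[1], os.Args[2:]...).Run()
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("average per run:", avg)
}
```

Running it once inside the VM and once from macOS with the same command gives two averages that can be compared directly.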

For very short commands like /bin/echo, the overhead is visible.

For long-running processes, the overhead matters much less.

For example, if I run a service for 10 minutes, an extra 100-200 ms at startup is not very important.

My practical conclusion:

For a nice macOS developer experience, routing through Lima is acceptable.
For tight benchmark loops, run directly inside the VM.
For production-grade runtimes, treat this approach as educational, not as a final design.

Things I Would Improve Next

There are many things I would like to improve in this project.

Some of them are runtime-related:

use pivot_root instead of only chroot
improve cgroup v2 handling
support better cleanup
add more robust metadata storage
improve log streaming
support better TTY handling
add user namespace support
drop Linux capabilities
add seccomp profiles

Some of them are macOS/Lima-related:

detect Lima architecture automatically
choose the correct Linux binary automatically
improve Lima instance setup
validate shared paths more clearly
provide a bootstrap command for macOS users
make error messages more actionable

A better macOS setup command could eventually look like this:

tiny-docker-go setup lima

And it could handle:

checking limactl
creating the Lima instance
building the Linux binary
preparing the rootfs
validating mounts
testing a hello-world container

That would make the project much easier to try.

Final Thoughts

Lima helped me understand the boundary between macOS and Linux much better.

The biggest lesson was simple:

Go can make the binary portable, but it cannot make Linux kernel features exist on macOS.

For a container runtime, the kernel matters.

My final mental model is:

macOS is my workstation.
Lima gives me a Linux VM.
The Linux VM gives me namespaces and cgroups.
My Go runtime uses those Linux features.
The rootfs gives the process its filesystem.

This project also changed how I look at Docker.

Docker is not magic.

But Docker is impressive because it hides a lot of complexity behind a simple interface.

A command like this:

docker run alpine echo hello

is easy to type.

But behind it, there are many layers of runtime, filesystem, networking, isolation, and process management.

Building tiny-docker-go in Go: What I Learned from Building a Tiny Docker-like Runtime

amir — Sat, 16 May 2026 15:14:08 +0000

I use Docker almost every day.

I use it for local development, backend services, databases, staging environments, CI/CD pipelines, and sometimes even for debugging production-like issues. Like many developers, I became comfortable with commands like:

docker run
docker ps
docker logs
docker stop
docker compose up

But for a long time, Docker still felt like a black box to me.

I knew how to use it.

I knew how to write Dockerfiles.

I knew how to debug containers when something failed.

But I did not deeply understand what actually happens under the hood when we run a container.

So I decided to build a small Docker-like container runtime in Go.

The project is called tiny-docker-go.

GitHub repository:

https://github.com/amirsefati/tiny-docker-go

The goal was not to rebuild Docker.

Docker is a mature platform with a huge ecosystem: image builds, registries, storage drivers, networking drivers, logging drivers, security features, orchestration integrations, plugins, and many other production-grade details.

My goal was much smaller:

Build a tiny runtime step by step, so I can understand the Linux ideas behind containers.

Docker’s own documentation describes containers as isolated processes that run on a host and have their own filesystem, networking, and process tree. That sentence looks simple, but it hides a lot of Linux internals.

To understand that sentence, I needed to touch the real building blocks:

Linux namespaces
cgroups
root filesystems
chroot
/proc
process lifecycle
signals
logs
network namespaces
bridge networking
veth pairs
NAT
container metadata

This article is a summary of the full 10-day journey.

It is not a tutorial for building a production runtime.

It is a developer story about learning containers by building a tiny version of one.

Why I started with Go

I chose Go because it fits this kind of project very well.

Go makes it simple to build CLI tools, execute processes, work with files, handle signals, and call lower-level Linux syscalls when needed.

Also, many important container projects are written in Go. Docker itself, containerd, runc, Kubernetes, and many cloud-native tools use Go heavily.

So using Go felt natural.

For this project, I wanted the code to stay simple and readable. I did not want to hide everything behind too many abstractions too early.

At the same time, I wanted the structure to be extensible enough so I could add one feature every day without rewriting the whole project.

That balance became one of the main lessons of the project.

When you build systems software, the hard part is not only writing code that works today.

The hard part is writing code that can survive the next feature.

The 10-day plan

I split the project into 10 small parts:

Project structure and CLI foundation
Linux namespaces
Root filesystem isolation
Container IDs and metadata
Logs
Stop and lifecycle management
cgroups and memory limits
Network namespace
Bridge and veth networking
Polish, README, roadmap, and lessons learned

This helped me avoid one common mistake:

Trying to build “Docker” in one step.

That is too much.

Instead, I treated each day as one small question.

Day 1:

Can I execute a command through my own CLI?

Day 2:

Can I run that command inside new Linux namespaces?

Day 3:

Can I give that process a different root filesystem?

Day 4:

Can I remember what I started?

Day 5:

Can I capture logs?

Day 6:

Can I stop a running container?

Day 7:

Can I limit memory?

Day 8:

Can I isolate networking?

Day 9:

Can I connect the container back to the outside world?

Day 10:

Can I explain the architecture clearly?

That made the project much easier to continue.

Day 1: Project Setup and CLI Foundation

On Day 1, I did not start with namespaces.

That may sound strange because namespaces are one of the most exciting parts of containers.

But I wanted to start with the boring foundation first.

The initial project structure looked like this:

tiny-docker-go/
├── cmd/
│   └── tiny-docker-go/
│       └── main.go
├── internal/
│   ├── app/
│   ├── cli/
│   └── runtime/
├── go.mod
└── README.md

The idea was simple:

cmd/ contains the executable entrypoint.
internal/cli handles user-facing commands.
internal/runtime handles process execution.
internal/app wires things together.

I added basic commands:

tiny-docker-go run
tiny-docker-go ps
tiny-docker-go stop
tiny-docker-go logs

At this stage, only run actually did something.

It executed a normal Linux command on the host.

Example:

go run ./cmd/tiny-docker-go run echo hello

Output:

hello
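A Day 1 style CLI can be sketched in a few lines. This is an approximation for illustration, not the actual contents of internal/cli; the `dispatch` function is hypothetical, and it uses only the standard library.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// dispatch routes the first argument to a subcommand. Only `run` does
// real work here; the others are placeholders, as on Day 1.
func dispatch(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("usage: tiny-docker-go <run|ps|stop|logs> [args]")
	}
	switch args[0] {
	case "run":
		if len(args) < 2 {
			return fmt.Errorf("run: missing command")
		}
		// No isolation yet: this is a plain process on the host.
		cmd := exec.Command(args[1], args[2:]...)
		cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
		return cmd.Run()
	case "ps", "stop", "logs":
		return fmt.Errorf("%s: not implemented yet", args[0])
	default:
		return fmt.Errorf("unknown command %q", args[0])
	}
}

func main() {
	if err := dispatch(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```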

This was not a container yet.

There was no isolation.

No cgroups.

No rootfs.

No networking.

But this step mattered because it gave me a stable CLI shape.

I wanted the outside interface to look like a tiny version of Docker:

tiny-docker-go run /bin/sh
tiny-docker-go ps
tiny-docker-go logs <id>
tiny-docker-go stop <id>

Even before the internals were ready, the product shape was clear.

That helped a lot later.

Small lesson from Day 1

A container runtime is still a command runner at the beginning.

Before thinking about advanced kernel features, I needed a clean way to receive a command, validate it, execute it, and return output to the terminal.

A lot of systems projects start like this.

First, build a simple interface.

Then make the implementation smarter behind that interface.

Day 2: Adding Linux Namespaces

Day 2 was where the project started to feel like a real container runtime.

Linux namespaces are one of the core ideas behind containers.

A namespace gives a process a different view of some system resource.

For example:

PID namespace gives a different process tree.
UTS namespace gives a different hostname.
Mount namespace gives a different mount table.
Network namespace gives a different network stack.
User namespace gives a different view of user and group IDs.
IPC namespace isolates IPC resources.
Cgroup namespace isolates cgroup views.

The important thing is this:

A container is not a virtual machine. It is still a Linux process, but it sees a more isolated view of the system.

That sentence changed how I think about Docker.

When I run:

docker run alpine sh

Docker does not boot a new kernel like a VM.

It starts a process on the host kernel, but configures isolation around it.

In Go, I started experimenting with syscall.SysProcAttr and clone flags.

A simplified version looks like this:

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS |
        syscall.CLONE_NEWPID |
        syscall.CLONE_NEWNS,
}

This creates the child process in new namespaces.
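Put into a complete program, the same idea looks roughly like this. It is a minimal sketch: it compiles only on Linux, starting the command needs root (or equivalent capabilities), and the names `isolationFlags` and `namespacedCommand` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// isolationFlags requests new UTS, PID, and mount namespaces.
const isolationFlags = syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS

// namespacedCommand prepares (but does not start) a command whose
// process will be created inside the new namespaces.
func namespacedCommand(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: isolationFlags}
	return cmd
}

func main() {
	cmd := namespacedCommand("/bin/sh")
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run failed (root on Linux is required):", err)
		os.Exit(1)
	}
}
```

Inside the resulting shell, `echo $$` should print 1, because the shell is the first process in its PID namespace.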

The first namespaces I added were:

UTS namespace
PID namespace
Mount namespace

UTS namespace

UTS namespace lets the container have its own hostname.

Inside the child process, I could call:

syscall.Sethostname([]byte("tiny-container"))

Then inside the container:

hostname

would show:

tiny-container

That was a small moment, but it felt important.

The process was still running on my machine, but it had its own hostname.

That was the first visible sign of isolation.

PID namespace

PID namespace was more interesting.

With a new PID namespace, the process inside the container can see itself as PID 1.

That is a big deal.

On Linux, PID 1 is special.

It is the init process of that namespace. It has responsibilities around signal handling and reaping zombie processes.

This is why container entrypoints matter.

If the main process inside a container does not handle signals correctly, stopping the container can behave badly.

This also helped me understand why tools like tini exist in container environments.

Mount namespace

Mount namespace gave the container its own mount table.

That means the process can have different mounts from the host.

At this point, I was not yet fully changing the filesystem, but I prepared the project for mounting /proc later.

One small Linux detail I learned here:

When working with mount namespaces, mount propagation can surprise you.

If mounts are shared with the host, changes inside one namespace may propagate in ways you do not expect. Real runtimes are careful about making mounts private before doing container setup.

This is one of those details that you do not think about when using Docker normally.

But when building a runtime, it becomes visible very quickly.

Parent and child process model

One design pattern I used was the parent/child model with:

/proc/self/exe

The parent process receives the CLI command.

Then it starts a child process by re-executing the same binary:

exec.Command("/proc/self/exe", "child", ...)

The parent is responsible for setup and management.

The child enters the isolated environment and runs the target command.

This pattern made the code easier to reason about.

There is a clear split:

parent process
├── parse CLI
├── prepare config
├── start child with namespaces
└── track metadata

child process
├── set hostname
├── prepare filesystem
├── mount proc
└── exec user command
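The re-exec pattern can be sketched like this. It is a simplified illustration, not the project's real code: the helpers `childArgs`, `isChild`, and `run` are hypothetical, namespace flags are left out, and the child uses exec.Command where a real runtime might replace the process with syscall.Exec.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgs builds the arguments used to re-execute the current binary
// in child mode, using the "child" marker described above.
func childArgs(command []string) []string {
	return append([]string{"child"}, command...)
}

// isChild reports whether this process was started in child mode.
func isChild(args []string) bool {
	return len(args) > 1 && args[1] == "child"
}

// run attaches the terminal to cmd and exits non-zero on failure.
func run(cmd *exec.Cmd) {
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}

func main() {
	if isChild(os.Args) {
		// Child side: hostname, rootfs, and /proc setup would happen
		// here, before the user's command starts.
		if len(os.Args) < 3 {
			fmt.Fprintln(os.Stderr, "child: missing command")
			os.Exit(2)
		}
		run(exec.Command(os.Args[2], os.Args[3:]...))
		return
	}
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: reexec <command> [args...]")
		os.Exit(2)
	}
	// Parent side: start the same binary again in child mode. The
	// /proc/self/exe path is Linux-specific. A real runtime would also
	// set namespace clone flags on this command.
	run(exec.Command("/proc/self/exe", childArgs(os.Args[1:])...))
}
```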

This was the first time tiny-docker-go started to feel like a real runtime.

Day 3: RootFS and `chroot`

On Day 3, I added filesystem isolation.

Namespaces isolate views of system resources, but a container also needs a filesystem.

When I run an Alpine container, I expect to see Alpine files:

/bin/sh
/etc/os-release
/lib
/usr

I should not see the host root filesystem.

For the first version, I used chroot.

The idea is simple:

syscall.Chroot(rootfs)
os.Chdir("/")

After that, / inside the process points to the rootfs directory.

Example:

sudo tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh

Inside the container:

cat /etc/os-release

shows Alpine information if the rootfs is Alpine.

This was another important moment.

Now the process had:

its own hostname
its own PID namespace
its own mount namespace
its own root filesystem

It still was not Docker, but it started to look like the core of a container.

`chroot` is not full container security

One important note:

chroot is useful for learning, but it is not complete container isolation by itself.

Historically, chroot was not designed as a full security boundary.

A real runtime usually uses more careful filesystem setup, often with pivot_root, mount namespaces, read-only mounts, bind mounts, capabilities, seccomp, AppArmor or SELinux, and other hardening layers.

For this project, chroot was enough because my goal was educational.

I wanted to understand the basic idea:

Give the process a different /.

That one idea explains a lot.

A container process does not magically have a filesystem.

The runtime prepares one.

Mounting `/proc`

After entering the rootfs, I mounted /proc:

syscall.Mount("proc", "/proc", "proc", 0, "")

Without /proc, commands like ps may not work correctly inside the container.

This helped me understand another detail:

Many Linux tools do not get information from some secret API.

They read from virtual filesystems like /proc.

For example, ps depends on /proc to inspect processes.

So if the container has a PID namespace but /proc is not mounted correctly, the view inside the container can be confusing.

This is one of those small details that makes containers feel less magical.
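Putting the Day 3 steps together with error handling gives something like the sketch below. It is an assumption about how the pieces fit, not the project's actual code. The name `enterRootfs` is hypothetical, it compiles only on Linux, and it includes the step of making mounts private that real runtimes take before container setup.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// enterRootfs switches the calling process into rootfs and mounts
// /proc. It is meant to run in the child, inside a new mount
// namespace, with root privileges.
func enterRootfs(rootfs string) error {
	info, err := os.Stat(rootfs)
	if err != nil {
		return fmt.Errorf("rootfs %q: %w", rootfs, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("rootfs %q is not a directory", rootfs)
	}
	// Keep mount changes from propagating back to the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make mounts private: %w", err)
	}
	if err := syscall.Chroot(rootfs); err != nil {
		return fmt.Errorf("chroot: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir: %w", err)
	}
	// Mount a fresh proc so tools like ps see the new PID namespace.
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		return fmt.Errorf("mount /proc: %w", err)
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: enter-rootfs <rootfs-dir> (run only inside a new mount namespace)")
		os.Exit(2)
	}
	if err := enterRootfs(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// Replace this process with a shell from the new root filesystem.
	if err := syscall.Exec("/bin/sh", []string{"/bin/sh"}, os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, "exec failed:", err)
		os.Exit(1)
	}
}
```

Running this directly on a host would change that host's mount propagation, so it should only be tried inside fresh namespaces, for example under `sudo unshare --mount --pid --fork`.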

Day 4: Container ID and Metadata

After Day 3, I could start isolated processes.

But I had a new problem:

How do I remember them?

Docker can do:

docker ps
docker inspect <container>
docker logs <container>
docker stop <container>

That means Docker stores metadata about containers.

So on Day 4, I added a simple metadata store.

I used a local directory like:

/var/lib/tiny-docker/containers/<id>/

Each container gets a config.json.

Example fields:

{
  "id": "abc123",
  "command": ["/bin/sh"],
  "hostname": "tiny-container",
  "rootfs": "./rootfs/alpine",
  "status": "running",
  "created_at": "2026-05-12T10:00:00Z",
  "pid": 12345
}

This was simple, but it changed the architecture.

Before this, run was just executing a process.

After this, run was creating a managed container record.

That is a big conceptual difference.

A runtime needs memory.

Not RAM, but operational memory.

It needs to remember:

What did I start?
What PID belongs to this container?
Where are its logs?
Is it running or stopped?
What command did it start with?
What rootfs did it use?

Then ps became meaningful.

Instead of being a placeholder, it could read metadata files and show containers.

A very simple output could look like:

CONTAINER ID   PID     STATUS    COMMAND
abc123         12345   running   /bin/sh
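A metadata store like this needs little more than a struct and two functions. The sketch below is an illustration, not the project's actual code: the struct fields follow the example config.json above, while the names `Container`, `save`, `list`, and `roundTrip` are hypothetical. The demo writes to a temporary directory rather than /var/lib/tiny-docker.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Container mirrors the example config.json fields shown above.
type Container struct {
	ID        string   `json:"id"`
	Command   []string `json:"command"`
	Hostname  string   `json:"hostname"`
	Rootfs    string   `json:"rootfs"`
	Status    string   `json:"status"`
	CreatedAt string   `json:"created_at"`
	PID       int      `json:"pid"`
}

// save writes <stateDir>/<id>/config.json.
func save(stateDir string, c Container) error {
	dir := filepath.Join(stateDir, c.ID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.MarshalIndent(c, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "config.json"), data, 0o644)
}

// list reads every config.json under stateDir, as `ps` would.
func list(stateDir string) ([]Container, error) {
	entries, err := os.ReadDir(stateDir)
	if err != nil {
		return nil, err
	}
	var out []Container
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(stateDir, e.Name(), "config.json"))
		if err != nil {
			continue // skip entries without a readable config
		}
		var c Container
		if err := json.Unmarshal(data, &c); err != nil {
			return nil, err
		}
		out = append(out, c)
	}
	return out, nil
}

// roundTrip saves one record into a temporary directory and lists it.
func roundTrip() ([]Container, error) {
	dir, err := os.MkdirTemp("", "tiny-docker-state")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	c := Container{ID: "abc123", Command: []string{"/bin/sh"}, Status: "running", PID: 12345}
	if err := save(dir, c); err != nil {
		return nil, err
	}
	return list(dir)
}

func main() {
	containers, err := roundTrip()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	for _, c := range containers {
		fmt.Printf("%s  %d  %s  %v\n", c.ID, c.PID, c.Status, c.Command)
	}
}
```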

Small lesson from Day 4

A container runtime is partly a process manager and partly a state manager.

Starting the process is only half of the job.

Remembering and managing it is the other half.

This helped me understand why Docker has a daemon.

If containers can continue running after the CLI exits, something needs to track them.

My tiny runtime did this in a simple way with JSON files.

Docker does it in a much more complete way.

But the idea is similar.

Day 5: Logs

On Day 5, I added logging.

This sounded easy at first.

Just redirect stdout and stderr to a file, right?

Something like:

logFile, err := os.Create("container.log")
if err != nil {
    return err
}
cmd.Stdout = logFile // stdout and stderr share one log file
cmd.Stderr = logFile

For detached containers, that works.

Then:

tiny-docker-go logs <container-id>

can read:

/var/lib/tiny-docker/containers/<id>/container.log

and print it.

But logs became more interesting when I thought about interactive mode.

If I run:

tiny-docker-go run /bin/sh

I want stdin, stdout, and stderr attached to my terminal.

But if I run a detached process, I want logs written to a file.

So the runtime needs to understand different modes:

interactive mode
├── stdin  -> terminal
├── stdout -> terminal
└── stderr -> terminal

detached mode
├── stdin  -> maybe closed
├── stdout -> log file
└── stderr -> log file
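The two modes can be sketched as one small function that picks the three streams for an exec.Cmd. Mode and streamsFor are names I made up for this illustration. For "maybe closed" I rely on the documented os/exec behavior that a nil Stdin connects the child to the null device.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// Mode selects how the container's standard streams are wired (hypothetical type).
type Mode int

const (
	Interactive Mode = iota
	Detached
)

// streamsFor returns stdin, stdout, and stderr for the given mode.
// In detached mode, logFile must be a non-nil open file.
func streamsFor(mode Mode, logFile *os.File) (io.Reader, io.Writer, io.Writer) {
	if mode == Detached {
		// nil stdin: os/exec connects the child to the null device.
		return nil, logFile, logFile
	}
	return os.Stdin, os.Stdout, os.Stderr
}

func main() {
	logFile, err := os.CreateTemp("", "container-*.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.Remove(logFile.Name())
	defer logFile.Close()

	// echo stands in for the container process in this demo.
	cmd := exec.Command("echo", "hello from detached mode")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = streamsFor(Detached, logFile)
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	data, _ := os.ReadFile(logFile.Name())
	fmt.Printf("log file contains: %s", data)
}
```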

Docker has this same concept in a more advanced way.

docker logs reads logs from the container’s configured logging driver, and docker logs --follow streams new output.

For my tiny version, I kept it simple:

tiny-docker-go logs <id>
tiny-docker-go logs -f <id>

The -f mode can be implemented like a basic tail -f.
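Here is a rough sketch of that idea using polling: read until EOF, sleep briefly, then try again. The follow function and its stop channel are my own illustration. It does not handle log rotation or truncation, which a real tail -f has to deal with.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/signal"
	"time"
)

// follow copies the file at path to w and keeps polling for appended data
// until stop is closed.
func follow(path string, w io.Writer, stop <-chan struct{}) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 4096)
	for {
		n, err := f.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == nil {
			continue // more data may be immediately available
		}
		if err != io.EOF {
			return err
		}
		// At EOF: wait a little, then read again to pick up new output.
		select {
		case <-stop:
			return nil
		case <-time.After(200 * time.Millisecond):
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: follow <log-file>")
		return
	}
	stop := make(chan struct{})
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	go func() { <-sig; close(stop) }() // Ctrl-C ends the follow loop

	if err := follow(os.Args[1], os.Stdout, stop); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```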

Small Linux detail: stdout and stderr matter

A container does not need to know about “logging” as a high-level concept.

Most container logging starts from something simple:

The process writes to stdout and stderr.

The runtime captures those streams.

That is why good containerized apps usually log to stdout/stderr instead of writing only to local files.

This is a small detail, but it matters a lot in production.

If your app logs only to a file inside the container, then your logging pipeline may not see it unless you mount volumes or configure extra collection.

Day 6: Stop and Lifecycle Management

On Day 6, I implemented stop.

The first version was simple:

tiny-docker-go stop <container-id>

The runtime reads metadata, gets the PID, and sends a signal.

The normal graceful flow is:

send SIGTERM
wait
if still running, send SIGKILL
update metadata

This is similar to Docker’s stop behavior.

Docker sends a termination signal first, and after a timeout it sends SIGKILL if the process does not exit.
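The graceful flow can be sketched like this. stopProcess is a hypothetical helper, not the project's actual code. It checks liveness with signal 0, which assumes the PID has not been reused by another process in the meantime, and it leaves the metadata update out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// stopProcess sends SIGTERM, waits up to grace for the process to exit,
// and falls back to SIGKILL if it is still there.
func stopProcess(pid int, grace time.Duration) error {
	if pid <= 1 {
		// pid 0 and negative values address process groups; never signal those here.
		return fmt.Errorf("refusing to signal pid %d", pid)
	}
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		if errors.Is(err, syscall.ESRCH) {
			return nil // already gone
		}
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 performs only the existence and permission check.
		if err := syscall.Kill(pid, 0); errors.Is(err, syscall.ESRCH) {
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && !errors.Is(err, syscall.ESRCH) {
		return err
	}
	return nil
}

func main() {
	// sleep stands in for a container process in this demo.
	cmd := exec.Command("sleep", "30")
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	go cmd.Wait() // reap the child so it does not linger as a zombie

	err := stopProcess(cmd.Process.Pid, 2*time.Second)
	fmt.Println("stop result:", err)
}
```

One detail to keep in mind: an exited child that nobody has waited on is a zombie and still answers signal 0, which is why the demo reaps it in a goroutine.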

This taught me a practical lesson:

Stopping a container is not the same as killing a process immediately.

A good runtime gives the process a chance to clean up.

For example, a backend service may need to:

close database connections
flush logs
finish current requests
release locks
write final state

If we send SIGKILL immediately, the process cannot handle it.

SIGKILL cannot be caught.

SIGTERM can be caught.

So graceful shutdown starts with SIGTERM.

PID 1 problem

This day also connected back to PID namespaces.

Inside a PID namespace, the main process becomes PID 1.

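PID 1 has special behavior on Linux: the kernel only delivers a signal such as SIGTERM to a namespace's PID 1 if that process has installed a handler for it. SIGKILL and SIGSTOP sent from the parent namespace are the exceptions.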

If it does not handle signals properly, stopping the container may not behave as expected.

That helped me understand why some containers use an init process.

It also made me more careful about what command I use as the container entrypoint.

A simple shell may behave differently from a proper application process.

This is one reason container lifecycle management is more subtle than it looks.

Day 7: cgroups and Memory Limits

Day 7 was about cgroups.

Namespaces answer this question:

What can the process see?

cgroups answer a different question:

How much can the process use?

That difference is important.

Namespaces isolate visibility.

cgroups control resources.

With cgroups, the runtime can limit or account for resources such as:

memory
CPU
pids
IO
sometimes devices and other controllers depending on system configuration

For this project, I focused on memory limit using cgroup v2.

On many modern Linux systems, cgroup v2 is mounted at:

/sys/fs/cgroup

A simplified container cgroup path might be:

/sys/fs/cgroup/tiny-docker/<container-id>/

To limit memory, the runtime can write to:

memory.max

For example:

echo 134217728 > memory.max

That is 128 MiB (128 × 1024 × 1024 bytes).

Then the runtime adds the process PID to:

cgroup.procs

Example:

echo <pid> > cgroup.procs

After that, the kernel applies the limit to that process group.

This was one of my favorite parts of the project.

Because suddenly “memory limit” stopped being an abstract Docker option.

When I write:

docker run --memory 128m ...

behind the scenes, the runtime eventually has to express that limit to the kernel.

The exact implementation is more complex in Docker, but the basic idea became clear.
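A simplified sketch of that translation in Go: parse a value like 128m into bytes, write it to memory.max, then add the PID to cgroup.procs. parseMemory and applyMemoryLimit are names I invented for this example. The suffix handling (k, m, g as powers of 1024) is my simplification, and the cgroup path in the comment is only an example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMemory converts values like "134217728", "512k", "128m", or "1g" to bytes.
func parseMemory(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	if s == "" {
		return 0, fmt.Errorf("empty memory limit")
	}
	mult := int64(1)
	switch s[len(s)-1] {
	case 'k':
		mult = 1 << 10
	case 'm':
		mult = 1 << 20
	case 'g':
		mult = 1 << 30
	}
	if mult != 1 {
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid memory limit %q", s)
	}
	return n * mult, nil
}

// applyMemoryLimit creates the cgroup directory, sets memory.max, and moves
// pid into the cgroup. It needs root and a cgroup v2 host with the memory
// controller enabled for this subtree.
func applyMemoryLimit(cgroupDir string, limit int64, pid int) error {
	if err := os.MkdirAll(cgroupDir, 0o755); err != nil {
		return err
	}
	maxFile := filepath.Join(cgroupDir, "memory.max")
	if err := os.WriteFile(maxFile, []byte(strconv.FormatInt(limit, 10)), 0o644); err != nil {
		return err
	}
	procsFile := filepath.Join(cgroupDir, "cgroup.procs")
	return os.WriteFile(procsFile, []byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	limit, err := parseMemory("128m")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("value for memory.max:", limit)
	// On a suitable host, as root (example path, not run here):
	//   applyMemoryLimit("/sys/fs/cgroup/tiny-docker/abc123", limit, containerPID)
}
```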

Testing memory limits

A simple way to test memory limits is to run a command that allocates memory.

For example, inside a container rootfs with Python:

python3 -c "a = 'x' * 200 * 1024 * 1024; print('allocated')"

If the memory limit is 128 MB, the 200 MB allocation exceeds it, so the process should fail or be killed by the kernel's OOM killer.

This is where container behavior becomes very real.

The runtime does not “watch memory” manually in a loop.

The kernel enforces the limit.

That is the power of cgroups.

cgroup v1 vs cgroup v2

I focused on cgroup v2 because it is the modern unified hierarchy.

In cgroup v1, different controllers could be mounted in different hierarchies.

In cgroup v2, the model is unified and cleaner.

But cgroup v2 also has rules that you need to respect.

For example, controller availability depends on the system, and some controllers must be enabled in parent cgroups before child cgroups can use them.

This is where I learned another systems programming lesson:

The code can be correct but the host can still reject the setup because the kernel or systemd cgroup configuration is different.

So a real runtime needs strong detection, good errors, and compatibility handling.
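As an example of that kind of detection, here is a small sketch that checks whether the memory controller is available before trying to use it. It relies on two cgroup v2 interface files: cgroup.controllers lists the controllers available to a cgroup, and writing +memory to the parent's cgroup.subtree_control enables the controller for its children. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// hasController reports whether name appears in the space-separated
// controller list read from a cgroup v2 interface file.
func hasController(contents, name string) bool {
	for _, c := range strings.Fields(contents) {
		if c == name {
			return true
		}
	}
	return false
}

// ensureController verifies that the controller is available in parentDir
// and enables it for child cgroups if it is not enabled yet (needs root).
func ensureController(parentDir, name string) error {
	avail, err := os.ReadFile(filepath.Join(parentDir, "cgroup.controllers"))
	if err != nil {
		return fmt.Errorf("read cgroup.controllers (is cgroup v2 mounted?): %w", err)
	}
	if !hasController(string(avail), name) {
		return fmt.Errorf("controller %q is not available in %s", name, parentDir)
	}
	control := filepath.Join(parentDir, "cgroup.subtree_control")
	enabled, err := os.ReadFile(control)
	if err != nil {
		return err
	}
	if hasController(string(enabled), name) {
		return nil // already enabled for children
	}
	return os.WriteFile(control, []byte("+"+name), 0o644)
}

func main() {
	// Read-only check, so the demo is safe to run without root.
	data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
	if err != nil {
		fmt.Println("cgroup v2 does not seem to be mounted at /sys/fs/cgroup:", err)
		return
	}
	fmt.Println("memory controller available:", hasController(string(data), "memory"))
}
```

Failing early with a message like "controller not available" is much easier to debug than a bare permission or "no such file" error when writing memory.max.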

My tiny runtime does not handle every host setup.

But it made the concept clear.

Day 8: Network Namespace

On Day 8, I added network namespace support.

This was the day where containers became both clearer and more confusing.

A network namespace gives a process its own network stack.

That includes its own:

interfaces
routing table
IP addresses
firewall rules
loopback device

When I added:

syscall.CLONE_NEWNET

the container got its own network namespace.
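In Go, that flag goes into the Cloneflags field of syscall.SysProcAttr (Linux only). A minimal sketch, with newNetNSCommand as a made-up helper name. It sets only CLONE_NEWNET; in the real runtime the flag would be OR-ed with the other namespace flags already in use.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// newNetNSCommand builds a command that will start in a new network namespace.
func newNetNSCommand(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Only the network namespace flag is shown in this sketch.
		Cloneflags: syscall.CLONE_NEWNET,
	}
	return cmd
}

func main() {
	// Listing links inside the fresh namespace should show only a loopback device.
	cmd := newNetNSCommand("ip", "link")
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run failed (creating a network namespace normally needs root):", err)
	}
}
```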

But then something interesting happened:

The container had no network.

That is expected.

A new network namespace starts isolated.

Even the loopback interface starts down and has to be brought up manually.

So the first step was simply:

ip link set lo up

inside the namespace.

This taught me a simple but important point:

Network isolation does not automatically mean working networking.

It means the container has a separate network world.

The runtime still needs to connect that world to something.

At this stage, I added a --net none or --net isolated style mode.

That made the behavior explicit.

tiny-docker-go run --net isolated --rootfs ./rootfs/alpine /bin/sh

Inside the container:

ip addr

would show only the isolated namespace interfaces.

No internet.

No host access.

Just isolation.

Small lesson from Day 8

Before this project, I mostly thought about Docker networking from the user side:

-p 8080:80
docker network ls
docker network inspect

But from the runtime side, networking starts much lower:

create network namespace
create interface
move interface into namespace
assign IP
set route
configure NAT

Docker hides all of that.

Building even a tiny version forced me to see the real steps.

Day 9: Bridge and veth Networking

Day 9 was one of the most difficult and useful parts.

The goal was to give the container internet access.

For that, I needed a simple bridge and veth pair.

The model looks like this:

Host network namespace
│
├── eth0 / main host interface
│
├── td0 bridge
│   └── veth-host
│
└── container network namespace
    └── veth-container

A veth pair works like a virtual cable.

Whatever enters one side comes out the other side.

The host keeps one side.

The container gets the other side.

The bridge connects the host-side veth to a small virtual network.

A simple IP plan:

bridge td0:       10.10.0.1/24
container eth0:   10.10.0.2/24
default gateway:  10.10.0.1

The steps are roughly:

ip link add td0 type bridge
ip addr add 10.10.0.1/24 dev td0
ip link set td0 up

ip link add veth-host type veth peer name veth-container
ip link set veth-host master td0
ip link set veth-host up

ip link set veth-container netns <container-pid>

Then inside the container namespace:

ip addr add 10.10.0.2/24 dev veth-container
ip link set veth-container name eth0
ip link set eth0 up
ip route add default via 10.10.0.1

Finally, on the host, NAT is needed:

iptables -t nat -A POSTROUTING -s 10.10.0.0/24 -j MASQUERADE

Also IP forwarding must be enabled:

sysctl -w net.ipv4.ip_forward=1
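One way to keep these steps manageable in a runtime is to generate the command list from a small plan struct and print it as a dry run before executing anything. The sketch below does only that. NetPlan and Commands are hypothetical names, and using nsenter -t <pid> -n to run ip inside the container's network namespace is my assumption; the steps above do not say how the namespace is entered.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NetPlan describes the simple bridge setup (hypothetical type for this sketch).
type NetPlan struct {
	Bridge        string // e.g. td0
	BridgeCIDR    string // e.g. 10.10.0.1/24
	HostVeth      string // e.g. veth-host
	PeerVeth      string // e.g. veth-container
	ContainerCIDR string // e.g. 10.10.0.2/24
	Gateway       string // e.g. 10.10.0.1
	Subnet        string // e.g. 10.10.0.0/24
	PID           int    // container process on the host
}

// Commands returns the setup steps in order, as argv slices.
func (p NetPlan) Commands() [][]string {
	pid := strconv.Itoa(p.PID)
	// inNS wraps a command so it runs in the container's network namespace.
	inNS := func(args ...string) []string {
		return append([]string{"nsenter", "-t", pid, "-n"}, args...)
	}
	return [][]string{
		{"ip", "link", "add", p.Bridge, "type", "bridge"},
		{"ip", "addr", "add", p.BridgeCIDR, "dev", p.Bridge},
		{"ip", "link", "set", p.Bridge, "up"},
		{"ip", "link", "add", p.HostVeth, "type", "veth", "peer", "name", p.PeerVeth},
		{"ip", "link", "set", p.HostVeth, "master", p.Bridge},
		{"ip", "link", "set", p.HostVeth, "up"},
		{"ip", "link", "set", p.PeerVeth, "netns", pid},
		inNS("ip", "addr", "add", p.ContainerCIDR, "dev", p.PeerVeth),
		inNS("ip", "link", "set", p.PeerVeth, "name", "eth0"),
		inNS("ip", "link", "set", "eth0", "up"),
		inNS("ip", "route", "add", "default", "via", p.Gateway),
		{"iptables", "-t", "nat", "-A", "POSTROUTING", "-s", p.Subnet, "-j", "MASQUERADE"},
		{"sysctl", "-w", "net.ipv4.ip_forward=1"},
	}
}

func main() {
	plan := NetPlan{
		Bridge: "td0", BridgeCIDR: "10.10.0.1/24",
		HostVeth: "veth-host", PeerVeth: "veth-container",
		ContainerCIDR: "10.10.0.2/24", Gateway: "10.10.0.1",
		Subnet: "10.10.0.0/24", PID: 12345,
	}
	// Dry run: print the commands instead of executing them.
	for _, c := range plan.Commands() {
		fmt.Println(strings.Join(c, " "))
	}
}
```

Each slice can be passed to exec.Command(c[0], c[1:]...) once the dry run looks right, which also makes it easy to report exactly which step failed.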

This is the point where I started to appreciate Docker networking much more.

Because every simple Docker command hides many small Linux networking operations.

Debugging container networking

The useful commands were:

ip addr
ip link
ip route
ip netns
iptables -t nat -L -n -v
sysctl net.ipv4.ip_forward
ping

Some issues I hit or expected:

loopback was down
veth interface was created but not moved correctly
IP address was missing
default route was missing
NAT rule was missing
host forwarding was disabled
DNS was not configured
interface name inside namespace was not what I expected

This part reminded me that networking bugs are usually not one big bug.

They are often one missing small step.

One missing route.

One down interface.

One missing NAT rule.

One wrong namespace.

Day 10: Polish, README, and Architecture

On Day 10, I focused on making the project understandable.

A learning project is more valuable when other people can read it.

So I improved the README and documented:

project goal
architecture
installation
usage examples
known limitations
roadmap
what each feature demonstrates

The final mental model looks like this:

tiny-docker-go
│
├── CLI
│   ├── run
│   ├── ps
│   ├── logs
│   └── stop
│
├── Runtime
│   ├── parent process
│   ├── child process
│   ├── namespace setup
│   ├── rootfs setup
│   └── command execution
│
├── State
│   ├── container id
│   ├── metadata json
│   ├── pid
│   ├── status
│   └── created_at
│
├── Logs
│   └── stdout/stderr capture
│
├── Cgroups
│   ├── memory.max
│   └── cgroup.procs
│
└── Network
    ├── network namespace
    ├── bridge
    ├── veth pair
    └── NAT

And the user-facing commands look like this:

tiny-docker-go run --rootfs ./rootfs/alpine /bin/sh
tiny-docker-go ps
tiny-docker-go logs <container-id>
tiny-docker-go stop <container-id>

This is still tiny.

But it is not just a toy CLI anymore.

It demonstrates many of the core ideas behind containers.

What I learned about containers

After building this project, my mental model of Docker changed.

Before, I thought of Docker mostly as:

images + containers + Dockerfile + ports + volumes

Now I think about it more like:

container = isolated Linux process + prepared filesystem + resource limits + networking + lifecycle metadata

That is a much more useful model.

A container is not magic.

It is a process.

But it is a carefully prepared process.

The runtime says:

this process should see this hostname
this process should see this PID tree
this process should use this root filesystem
this process should have this memory limit
this process should write logs here
this process should be connected to this network
this process should be stopped with these signals

That is the core idea.

Namespaces vs cgroups

One of the clearest lessons was the difference between namespaces and cgroups.

I would explain it like this:

Namespaces control what a process can see.
Cgroups control what a process can use.

Examples:

PID namespace:
The process sees its own process tree.

UTS namespace:
The process sees its own hostname.

Mount namespace:
The process sees its own mount table.

Network namespace:
The process sees its own network interfaces and routes.

Cgroups:
The process can only use a limited amount of memory, CPU, pids, or IO.

This distinction is simple, but it explains so much.

If a container cannot see host processes, that is namespace isolation.

If a container gets killed after using too much memory, that is cgroup enforcement.

If a container has its own IP address, that is network namespace plus virtual networking.

If a container sees Alpine files instead of host files, that is rootfs setup plus mount isolation.

Docker combines all of these into one clean developer experience.

Small Linux details that mattered

This project taught me many small Linux details that are easy to miss when only using Docker.

1. PID 1 is special

The first process inside a PID namespace becomes PID 1.

PID 1 handles signals differently and is responsible for reaping orphaned child processes.

This matters for container shutdown.

2. `/proc` must match the PID namespace

If /proc is not mounted inside the container correctly, tools like ps may show confusing information.

Mounting proc inside the container is not just cosmetic.

It affects how process information is visible.

3. `chroot` changes `/`, but it is not a complete security model

chroot is useful for learning filesystem isolation.

But real containers need stronger filesystem and security handling.

4. Logs are mostly stdout and stderr

Container logging starts with capturing process output.

If your app logs to stdout/stderr, the runtime can collect it naturally.

5. Graceful stop matters

A runtime should usually send SIGTERM first.

SIGKILL should be the fallback.

This gives the process a chance to shut down cleanly.

6. cgroups are kernel-enforced

The runtime does not manually police memory in a loop.

It writes limits into cgroup files, then the kernel enforces them.

7. A new network namespace has no useful network by default

Isolation comes first.

Connectivity must be built.

8. veth pairs are like virtual cables

One side stays on the host.

One side goes into the container.

That simple idea powers a lot of container networking.

9. NAT is what makes outbound internet work in the simple bridge model

Without NAT and IP forwarding, the container may have an IP but still not reach the internet.

10. Metadata turns a process into something manageable

Without metadata, you only started a process.

With metadata, you can list it, stop it, inspect it, and read its logs.

What this project is not

tiny-docker-go is not a Docker replacement.

It does not support real image pulling.

It does not implement OCI fully.

It does not have production security.

It does not have a daemon.

It does not have advanced volume management.

It does not have complete port publishing.

It does not handle all cgroup configurations.

It does not support all namespace combinations safely.

It does not include seccomp, AppArmor, SELinux, or capabilities hardening yet.

And that is okay.

The goal is not production.

The goal is learning.

Actually, keeping it small made the learning better.

When a project becomes too complete, it can hide the concept again.

I wanted the opposite.

I wanted the concept to stay visible.

What I want to add next

After these 10 days, there are many possible next steps.

Some features I want to explore:

1. Better image support

Right now, rootfs is local.

A next step could be:

tiny-docker-go pull alpine

Even if it is not a full registry implementation, I can start with downloading and unpacking rootfs archives.

2. OverlayFS

Docker images are layer-based.

A good next step is to use OverlayFS:

lowerdir = image layer
upperdir = container writable layer
workdir  = overlay work directory
merged   = final container rootfs

This would make the filesystem model closer to real containers.
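As a preview, a sketch of what that mount could look like in Go using syscall.Mount (Linux only, needs root). overlayOptions and mountOverlay are hypothetical helpers and the paths in main are made up. The demo only prints the option string and does not mount anything.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// overlayOptions builds the data string passed to the overlay mount.
func overlayOptions(lower, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
}

// mountOverlay mounts lower (read-only image layer) plus upper (writable
// layer) at merged. upper and work must be on the same filesystem.
func mountOverlay(lower, upper, work, merged string) error {
	for _, d := range []string{upper, work, merged} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	return syscall.Mount("overlay", merged, "overlay", 0, overlayOptions(lower, upper, work))
}

func main() {
	// Example paths only; nothing is mounted in this demo.
	fmt.Println(overlayOptions(
		"/var/lib/tiny-docker/images/alpine",
		"/var/lib/tiny-docker/containers/abc123/upper",
		"/var/lib/tiny-docker/containers/abc123/work",
	))
}
```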

3. Port mapping

Outbound internet is one thing.

Publishing container ports is another.

A next step:

tiny-docker-go run -p 8080:80 ...

This would require NAT/DNAT rules or a proxy approach.
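A rough sketch of the DNAT variant: parse the 8080:80 argument and build one iptables rule for it. parsePortMapping and dnatRule are hypothetical names, and only TCP is handled. A PREROUTING rule like this covers traffic arriving from other machines; connections made from the host itself would need additional handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortMapping splits "HOST:CONTAINER" into two validated port numbers.
func parsePortMapping(s string) (hostPort, containerPort int, err error) {
	host, cont, ok := strings.Cut(s, ":")
	if !ok {
		return 0, 0, fmt.Errorf("expected HOST:CONTAINER, got %q", s)
	}
	hostPort, err = strconv.Atoi(host)
	if err != nil || hostPort < 1 || hostPort > 65535 {
		return 0, 0, fmt.Errorf("invalid host port %q", host)
	}
	containerPort, err = strconv.Atoi(cont)
	if err != nil || containerPort < 1 || containerPort > 65535 {
		return 0, 0, fmt.Errorf("invalid container port %q", cont)
	}
	return hostPort, containerPort, nil
}

// dnatRule returns the iptables argv that forwards hostPort to the container.
func dnatRule(hostPort int, containerIP string, containerPort int) []string {
	return []string{
		"iptables", "-t", "nat", "-A", "PREROUTING",
		"-p", "tcp", "--dport", strconv.Itoa(hostPort),
		"-j", "DNAT", "--to-destination", containerIP + ":" + strconv.Itoa(containerPort),
	}
}

func main() {
	host, cont, err := parsePortMapping("8080:80")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Dry run: print the rule instead of applying it.
	fmt.Println(strings.Join(dnatRule(host, "10.10.0.2", cont), " "))
}
```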

4. Better process supervision

The runtime could track exit status, update metadata automatically, and clean up resources more reliably.

5. Capabilities

Linux capabilities are very important for container security.

Instead of giving a process full root power, Linux can split privileges into smaller capabilities.

Dropping capabilities would make the runtime more realistic.

6. Seccomp

Seccomp can restrict which syscalls a process can use.

This is another important container hardening feature.

7. User namespace

User namespaces are powerful because they can make a process think it is root inside the container while mapping it to a less privileged user on the host.

This is a very interesting security feature.

8. OCI runtime spec

Eventually, I want to read more about the OCI runtime spec and compare my tiny runtime with how real runtimes are structured.

Final thoughts

This project made Docker feel less magical and more impressive.

Less magical because I can now see the Linux pieces behind it.

More impressive because I understand how many details Docker handles for us.

Running a container sounds simple:

docker run nginx

But under that command, a runtime needs to prepare isolation, filesystem, networking, logs, metadata, signals, and resource limits.

Building tiny-docker-go helped me understand those pieces one by one.

The most important lesson for me was this:

A container is just a Linux process, but the runtime carefully shapes the world around that process.

That world includes what the process can see, what it can use, where its files come from, how its logs are captured, how it receives signals, and how it connects to the network.

This is why building a tiny container runtime is such a useful learning project.

You do not need to rebuild Docker completely.

You only need to rebuild enough of it to understand the ideas.

That is what I tried to do with tiny-docker-go.

You can follow the project here:

https://github.com/amirsefati/tiny-docker-go

References

Docker docs — Running containers: https://docs.docker.com/engine/containers/run/
Docker docs — docker run: https://docs.docker.com/reference/cli/docker/container/run/
Docker docs — Container logs: https://docs.docker.com/reference/cli/docker/container/logs/
Docker docs — Container stop: https://docs.docker.com/reference/cli/docker/container/stop/
Linux man-pages — namespaces: https://man7.org/linux/man-pages/man7/namespaces.7.html
Linux man-pages — PID namespaces: https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
Linux kernel docs — cgroup v2: https://docs.kernel.org/admin-guide/cgroup-v2.html
