Forem: 0xAlphaSecurity

Chapter 5: Linux Control Groups (cgroups)

0xAlphaSecurity — Sun, 22 Mar 2026 20:23:15 +0000

This post is part of the Ultimate Container Security Series, a structured, multi-part guide covering container security from foundational concepts to runtime protection. For an overview of the series structure, scope, and update schedule, see the series introduction post here.

When dozens or hundreds of applications share the same Linux system, managing their access to hardware resources, like CPU, memory, and disk I/O, becomes an absolute necessity. Without strict boundaries, a single misbehaving or compromised process can easily consume all available resources. This starves other applications, degrades system performance, and can even bring the entire host down.

From a security perspective, an attacker exploiting an unbounded application can intentionally cause this resource exhaustion, resulting in a severe Denial of Service (DoS). Because containers are ultimately just processes running on a shared host kernel, they are equally susceptible to this risk. To keep services stable and secure, we need a way to enforce fairness and strict isolation.

In this chapter, we will explore Linux Control Groups (cgroups), a powerful kernel feature that allows us to limit and isolate the resource usage of processes.

Introduction to cgroups

At its core, cgroups v2 is a Linux kernel mechanism that allows the system to organize processes into hierarchical groups and apply strict resource limits to them. With cgroups, administrators and container runtimes can precisely dictate how much CPU time, memory, and disk I/O throughput a specific set of processes is allowed to consume.

Understanding how cgroups operate is essential because they are the mechanism Linux uses to enforce resource fairness at the kernel level.

Consider a scenario where a process is allowed to consume unlimited memory. It will eventually starve other critical processes on the same host. This might happen inadvertently due to a bug, like a memory leak in a poorly written application. However, from a security perspective, an attacker can deliberately trigger or exploit this leak to perform a resource exhaustion attack. By strictly capping the memory and other resources a containerized process can access, you neutralize the blast radius of this kind of attack, ensuring the rest of the host system continues operating normally.

cgroups v1 vs. cgroups v2

Control groups have been around for a long time, but the ecosystem has fundamentally shifted. Version 2 of cgroups has been in the Linux kernel since 2016 (kernel 4.5), and Fedora 31 became the first major distribution to enable it by default in late 2019. Today it is the default on modern Linux distributions and is supported by the major container runtimes and orchestration platforms.

The biggest architectural difference lies in how processes are grouped. In cgroups v1, controllers (the mechanisms that actually govern resources like memory or PIDs) were completely independent. A single process could belong to entirely different groups for different resources. For example, a process could simultaneously join /sys/fs/cgroup/memory/mygroup and /sys/fs/cgroup/pids/yourgroup. This fragmented design led to incredibly complex, confusing hierarchies that were hard to manage and secure.

Cgroups v2 fixes this by introducing a single unified hierarchy. The semantics are much cleaner: a process joins one specific group (e.g., /sys/fs/cgroup/ourgroup) and is automatically subject to all the active controllers configured for that group.

Beyond making resource management much easier to reason about, cgroups v2 brings several massive improvements to stability and security:

Safer Sub-tree Delegation: It safely allows delegating cgroup management to less-privileged users. This is a crucial feature that makes rootless containers possible, allowing resource limits to be applied without requiring root privileges.
Unified Memory Accounting: It properly accounts for different types of memory usage that v1 missed or handled poorly, including network memory, kernel memory, and non-immediate resource changes like page cache write-backs.
Pressure Stall Information (PSI): A newer feature that provides rich, real-time metrics on system resource pressure, allowing systems to proactively detect and respond to resource shortages before a crash occurs.
Enhanced Isolation: Better cross-resource allocation management prevents edge-case scenarios where high usage of one resource unexpectedly impacts another.

Throughout this guide, we will focus entirely on cgroups v2, as it is the modern implementation used by secure container environments.

Before we dive deeper, you should verify that your host system is actually running cgroups v2. You can easily check this by querying the filesystem type of the cgroup mount point:

user@container-security:~$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

If the output reads cgroup2fs, you are ready to go. If the output is tmpfs or cgroupfs, your system is still using the legacy cgroups v1 hierarchy.
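
If you want to run this check from a script, it can be wrapped in a small helper. This is only a minimal sketch: the function name cgroup_version_from_fstype is made up for illustration, and it treats every filesystem type other than the three mentioned above as unknown.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
# The function name is hypothetical; the type strings come from `stat -fc %T`.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs)      echo "v2" ;;
        tmpfs|cgroupfs) echo "v1" ;;
        *)              echo "unknown" ;;
    esac
}

# Fall back to "missing" when the mount point cannot be inspected.
fstype="$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo missing)"
echo "filesystem type: ${fstype} -> cgroups $(cgroup_version_from_fstype "$fstype")"
```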

Exploring Cgroups

⚠️ Warning: Always run the commands in this guide on a disposable Virtual Machine (VM) and never on your personal host machine. Playing with kernel resource limits can easily freeze or crash your system! The examples in this course were run on an Ubuntu Server 24.04 VM.

The core idea behind cgroups is elegantly simple: Processes are organized into hierarchical groups, and each group is assigned specific resource limits.

In Linux, "everything is a file," and cgroups are no exception. There is no special CLI tool you must use to interact with them. Instead, cgroups are exposed directly through a virtual filesystem, usually mounted at /sys/fs/cgroup. Inside this directory, groups are represented as folders, and resource limits are represented as plain text files.

Writing values into these text files directly changes the kernel's behavior.

Let's look at the root of the cgroups v2 filesystem by running ls /sys/fs/cgroup/.

This directory is the root control group. Every process on your Linux machine belongs either to this root group or to one of the groups nested beneath it.

In modern Linux systems that use systemd, the cgroups v2 filesystem mounted at /sys/fs/cgroup forms a hierarchical tree where processes are organized and managed. The root directory represents the root control group, and systemd automatically creates subgroups such as init.scope (which contains the system’s PID 1 process), system.slice (which holds system services and daemons), and user.slice (which organizes user sessions).

Because systemd manages most services on the system, container runtimes like Docker or orchestration platforms like Kubernetes typically run as system services under system.slice. The containers they start appear as additional cgroup directories in the same tree. With Docker's systemd cgroup driver, for example, each container gets its own scope such as system.slice/docker-<container-id>.scope, next to docker.service itself. The exact location depends on the runtime and the cgroup driver in use, but containers are always part of the same overall cgroup hierarchy:

/sys/fs/cgroup (root)
│
├── init.scope
│
├── system.slice
│   ├── docker.service
│   ├── docker-<container-id>.scope
│   │
│   └── ssh.service
│
└── user.slice
    └── user-1000.slice

Whenever a new subdirectory is created here, it represents a new child cgroup that inherits from its parent.

If you look closely at the files in a cgroup directory, you'll notice a strict naming convention. Files are divided into two main categories: Core files and Controller files.

Core Files (cgroup.*): Files prefixed with cgroup. manage the mechanics of the cgroup hierarchy itself, rather than specific hardware resources.
- cgroup.procs: The most important file. It contains a list of Process IDs (PIDs) that belong to this group. To move a process into a cgroup, you simply echo its PID into this file.
- cgroup.controllers: A read-only file showing which resource controllers (cpu, memory, io) are currently available to this specific group.
- cgroup.kill: A v2 feature that lets you instantly kill all processes within the cgroup by writing 1 to it.
Controller Files (cpu.*, memory.*, pids.*, etc.): Controllers are the actual engines that distribute and limit system resources. Files prefixed with a controller name dictate how that specific resource is managed. Furthermore, these files generally fall into two types:
- Configuration (Read-Write): Files you modify to set limits. (e.g., memory.max)
- Status (Read-Only): Files you read to get live metrics (e.g., memory.stat). For example, watch cat /sys/fs/cgroup/memory.stat refreshes the memory usage statistics of the root cgroup every two seconds.
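
To connect these files to a real process, you can ask the kernel which cgroup your current shell belongs to. On a cgroups v2 system, /proc/<pid>/cgroup contains a line of the form 0::/path. The sketch below parses that line; the helper name v2_path_from_line is hypothetical.

```shell
# Extract the cgroup v2 path from one line of /proc/<pid>/cgroup.
# On the unified hierarchy that entry has the form "0::/some/path".
v2_path_from_line() {
    case "$1" in
        0::*) printf '%s\n' "${1#0::}" ;;
        *)    return 1 ;;
    esac
}

# Print the cgroup directory of the current shell, if the file is readable.
if [ -r /proc/self/cgroup ]; then
    while IFS= read -r line; do
        if path="$(v2_path_from_line "$line")"; then
            echo "this shell lives in /sys/fs/cgroup${path}"
            echo "its members are listed in /sys/fs/cgroup${path}/cgroup.procs"
            break
        fi
    done < /proc/self/cgroup
fi
```

Lines that do not start with 0:: belong to legacy v1 hierarchies, so the helper simply skips them.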

Key Controller Files

While the Linux kernel supports many controllers, a few are absolutely critical for securing containerized workloads against resource exhaustion and Denial of Service (DoS) attacks.

Memory (memory.*): Regulates RAM usage.
- memory.max sets an absolute hard limit. If the processes in the cgroup try to use more memory than this, the kernel's Out-Of-Memory (OOM) killer will step in and terminate them.
- memory.high is a softer throttle limit. If breached, the kernel heavily throttles the processes and forces them to reclaim memory, but avoids outright killing them.
CPU (cpu.*): Regulates processor time.
- cpu.max limits the absolute maximum amount of CPU time the group can use (bandwidth).
- cpu.weight dictates proportional share. If the system is busy, a cgroup with a higher weight gets priority over one with a lower weight.
PIDs (pids.*): Regulates process creation.
- pids.max sets a hard limit on how many processes can exist inside the cgroup. From a security standpoint, this is your primary defense against a Fork Bomb attack, where a malicious script rapidly clones itself to crash the host.
Block I/O (io.*): Regulates disk read/write bandwidth.
- io.max can prevent a compromised container from thrashing the host's storage drives and starving other containers of database reads or log writes.
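
A quick way to get familiar with these files is to print them for an existing group. The following is a minimal sketch with a hypothetical helper called show_limits. It only reads files, and it skips any file that is missing, since a controller's files only exist when that controller is enabled for the group.

```shell
# Print the key limit files of one cgroup directory, skipping any that are
# absent (a file only exists when its controller is enabled for that group).
show_limits() {
    dir="$1"
    for f in memory.max memory.high cpu.max cpu.weight pids.max io.max; do
        if [ -r "$dir/$f" ]; then
            printf '%-12s %s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}

# Example: inspect the cgroup of the SSH service, if it exists on this machine.
show_limits /sys/fs/cgroup/system.slice/ssh.service
```

A value of max in any of these files means that no limit is set at that level.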

For highly specialized workloads, cgroups v2 offers several other controllers. While you might not interact with these daily, it's good to know they exist:

Cpuset (cpuset.*): Pins tasks to specific CPU cores and Memory Nodes. This is crucial for high-performance computing on NUMA architectures where memory access latency matters.
Devices: Controls which device nodes (like /dev/sda or /dev/random) a cgroup can access. In v2, this is actually implemented using eBPF programs rather than standard text files.
HugeTLB (hugetlb.*): Limits the usage of Huge Pages (large blocks of memory) to prevent a single group from exhausting them.
RDMA (rdma.*): Manages Remote Direct Memory Access resources, often used in high-speed clustered networking.

Creating cgroups

Now that we understand how the cgroups filesystem works, let's create a custom cgroup hierarchy.

⚠️ Warning: As mentioned earlier, do not run these commands on your host machine. Use a VM (examples work with Ubuntu Server 24.04).

Most of the commands we are about to run require root privileges. Let's switch to the root user and install cgroup-tools, which provides useful utilities like cgcreate.

sudo su
apt update && apt -y install cgroup-tools

Next, let's export some environment variables to make our commands easier to read. We are going to create:

a parent cgroup called scripts (A parent cgroup is the higher-level group that can contain one or more subgroups. It usually defines the overall resource limits that apply to everything inside it.)
a child cgroup called production (A child cgroup is a subgroup created inside the parent group. Processes can be placed into the child group, and it can have its own additional limits, but it can never exceed the limits set by its parent.)

export PARENT_CGROUP="scripts"
export CHILD_CGROUP="production"

If the parent scripts group had a memory limit of 2 GB, the child production group could never use more than that 2 GB, even if its own memory.max were set to a higher value. The child can restrict resources further, but it cannot escape the limits of its parent.

So the structure will look like this:

/sys/fs/cgroup/
└── scripts        (parent cgroup)
    └── production    (child cgroup)
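
The rule that a child can never exceed its parent means the effective limit of a group is the smallest limit found anywhere on the path up to the root. The sketch below illustrates that idea for memory.max. The function name effective_memory_max is an assumption for this example, and the second argument (the hierarchy root) defaults to /sys/fs/cgroup.

```shell
# Walk from a cgroup directory up to the hierarchy root and report the smallest
# memory.max found on the way. "max" means "no limit at this level".
effective_memory_max() {
    dir="$1"
    root="${2:-/sys/fs/cgroup}"
    effective="max"
    while :; do
        if [ -r "$dir/memory.max" ]; then
            value="$(cat "$dir/memory.max")"
            if [ "$value" != "max" ]; then
                if [ "$effective" = "max" ] || [ "$value" -lt "$effective" ]; then
                    effective="$value"
                fi
            fi
        fi
        # Stop at the hierarchy root (or at "/" if the path was not under it).
        if [ "$dir" = "$root" ] || [ "$dir" = "/" ] || [ "$dir" = "." ]; then
            break
        fi
        dir="$(dirname "$dir")"
    done
    echo "$effective"
}

# Only meaningful once the example hierarchy from this chapter exists.
target="/sys/fs/cgroup/scripts/production"
if [ -d "$target" ]; then
    echo "effective memory limit: $(effective_memory_max "$target")"
fi
```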

While you can create cgroups using standard Linux commands (e.g., mkdir /sys/fs/cgroup/scripts), using the cgcreate utility allows us to explicitly request which controllers we want to enable.

Let's create our parent cgroup and request only the memory and cpu controllers:

cgcreate -g memory,cpu:/${PARENT_CGROUP}

If the command returns no output, it was successful. Let's look inside the newly created directory: ls /sys/fs/cgroup/${PARENT_CGROUP}

You will see a large list of files representing the parameters and statistics for this new group. However, if you look closely at the active controllers, you might notice something unexpected:

root@container-security:/home/user# cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.controllers
cpu memory pids

The pids controller is active, even though we only requested memory and cpu.

To understand why pids showed up, we need to look at the root cgroup (/sys/fs/cgroup/). Run this command: cat /sys/fs/cgroup/cgroup.subtree_control

root@container-security:~# cat /sys/fs/cgroup/cgroup.subtree_control
cpu memory pids

In cgroups v2, resource controllers are strictly delegated top-down. The cgroup.subtree_control file dictates which controllers are passed down to a group's immediate children. Because the root cgroup is configured to delegate cpu, memory, and pids, our new ${PARENT_CGROUP} automatically inherited all three.
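
Delegation is changed by writing controller names prefixed with + (enable) or - (disable) into cgroup.subtree_control. The sketch below only builds that request string, so it is safe to run anywhere. The helper name subtree_control_request is hypothetical, and the commented path is just an example.

```shell
# Build the string written to cgroup.subtree_control: each controller name is
# prefixed with "+" to enable it for children ("-" would disable it).
subtree_control_request() {
    request=""
    for controller in "$@"; do
        request="${request:+$request }+${controller}"
    done
    printf '%s\n' "$request"
}

subtree_control_request cpu memory   # prints: +cpu +memory

# As root, inside a disposable VM, the request would be applied like this
# (the path is only an example):
#   subtree_control_request cpu memory > /sys/fs/cgroup/scripts/cgroup.subtree_control
```

A controller can only be enabled for children if it is listed in the group's own cgroup.controllers file, which is why delegation always flows from the top of the tree downwards.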

The pids controller in cgroups limits the number of processes (PIDs) that a group can create. A PID is simply a process identifier used by the Linux kernel to track running processes. It is usually enabled by default to prevent fork bombs and runaway process creation. Without it, cgroups could limit CPU and memory, but not process count, which is a safety risk.

Before we create our child cgroup, there is a crucial cgroups v2 rule you must know: The No Internal Process Constraint.

In v2, a cgroup can either have processes assigned to it or delegate controllers to child cgroups; it cannot do both. (The only exception is the root cgroup.)

Because our ${PARENT_CGROUP} is going to delegate cpu and memory to its children, the kernel will refuse to let you assign any running processes directly to ${PARENT_CGROUP}. Instead, processes must be assigned to the leaf nodes of the tree (the final child directories).
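
One way to see this rule in practice is to check whether a group delegates anything. The sketch below uses a hypothetical helper, can_hold_processes, that treats an empty cgroup.subtree_control as "this group may hold processes". It is a simplification that ignores threaded cgroups.

```shell
# Return success when a cgroup directory delegates no controllers to children,
# i.e. its cgroup.subtree_control is empty and it may hold processes itself.
can_hold_processes() {
    control_file="$1/cgroup.subtree_control"
    [ -r "$control_file" ] || return 2   # not a cgroup directory
    [ -z "$(cat "$control_file")" ]
}

# The paths below assume the example hierarchy from this chapter.
for dir in /sys/fs/cgroup/scripts /sys/fs/cgroup/scripts/production; do
    if can_hold_processes "$dir"; then
        echo "$dir: leaf group, processes allowed"
    else
        echo "$dir: delegates controllers (or is not a cgroup)"
    fi
done
```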

Let's create the child cgroup where our actual demo processes will live:

cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}

Although we created the parent and child cgroups in two separate steps, this was mainly for demonstration purposes. In practice, the first cgcreate command is technically redundant because running the second command (cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}) would automatically create both the parent (scripts) and the child (production) cgroups if the parent does not already exist.

When we ran this, cgcreate automatically updated the cgroup.subtree_control file in the parent directory to delegate the requested controllers down to the child. We can verify this:

root@container-security:/home/user# cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control
cpu memory

Finally, let's look inside our new child cgroup: ls /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}

If you check the files here, you will see cpu.* and memory.* files, but absolutely no pids.* or io.* files. We now have a perfectly isolated, highly specific leaf cgroup ready to constrain our applications.

Setting Resource Limits

Having created our isolated cgroup hierarchy, it is time to actually enforce some boundaries. This is where the core security value of cgroups shines: by setting strict resource limits, we protect the host system from resource exhaustion attacks and ensure predictable performance.

While you can configure these limits by directly writing to the files with echo (e.g., echo "20000 50000" > /sys/fs/cgroup/my_group/cpu.max), we will use the cgset utility from the cgroup-tools package we installed earlier, as it provides a cleaner syntax for setting multiple limits at once.

Before we apply the limits to our cgroup, let's understand exactly what we are controlling.

CPU Throttling (cpu.max): In cgroups v2, CPU limits use a simple quota-based model formatted as $MAX $PERIOD. If you set the value to 100000 1000000, you are telling the kernel: For every 1,000,000 microseconds (1 second) of time, this group is allowed to use the CPU for 100,000 microseconds (a tenth of a second). This effectively limits the cgroup to 10% of a single CPU core. Security Note: Unlike memory limits, CPU limits act as a throttle. If a process hits its CPU limit, the kernel simply pauses it until the next period begins. CPU throttling slows applications down, but it never outright kills them.
Memory Limits (memory.max & memory.swap.max): Memory limits set an absolute ceiling on RAM usage. When a cgroup's usage reaches the value in memory.max, the kernel first tries to reclaim memory from that cgroup by dropping cached data or swapping memory pages out to disk. However, if the processes keep demanding memory and the kernel cannot reclaim enough (or if swap is also exhausted), the kernel triggers the Out-Of-Memory (OOM) killer. It scores the processes in that cgroup and terminates the worst offender to protect the rest of the host system.
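
The quota and period arithmetic behind cpu.max is easy to check by hand. Here is a minimal sketch of that calculation; the function name cpu_max_percent is made up for illustration, and the result is truncated to a whole percent.

```shell
# Convert a cpu.max value ("QUOTA PERIOD", both in microseconds) into the
# percentage of one CPU core it allows. "max" as quota means unlimited.
cpu_max_percent() {
    quota="${1%% *}"
    period="${1##* }"
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( $quota * 100 / $period ))"
    fi
}

cpu_max_percent "100000 1000000"   # prints: 10
cpu_max_percent "150000 1000000"   # prints: 15
cpu_max_percent "max 100000"       # prints: unlimited
```

A result above 100 simply means the group may use more than one core's worth of CPU time per period.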

For the tests, we want to intentionally induce an early OOM kill. To guarantee this happens, we need to strictly limit both the physical memory and the swap memory. Otherwise, the kernel might just push our runaway process into swap space, delaying the crash.

Let's apply a 15% CPU limit and a roughly 200MB limit for both RAM and swap to our production child cgroup:

# Memory values here are in bytes, but you could also use suffixes like 100M or 1G.
cgset -r memory.max=200000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r memory.swap.max=200000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r cpu.max="150000 1000000" ${PARENT_CGROUP}/${CHILD_CGROUP}

Let’s verify that the kernel accepted our new limits by reading the files directly:

cat /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/{memory,cpu,memory.swap}.max

You should see an output similar to this:

199999488
150000 1000000
199999488

You might be wondering why the 200000000 bytes we assigned for memory suddenly changed to 199999488.

The kernel manages memory in fixed-size blocks called "pages." On most standard systems, a memory page is exactly 4096 bytes (you can verify your system's page size by running getconf PAGE_SIZE).

When you request a memory limit, the kernel rounds your request down to the nearest whole page. If you divide our requested 200,000,000 bytes by 4096, you get exactly 48,828.125 pages. The kernel drops the fractional part, granting you 48,828 whole pages. Multiply 48,828 by 4096, and you get 199,999,488, the exact byte limit the kernel applied.
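
The same rounding can be reproduced with shell integer arithmetic, because integer division already drops the fractional part. This is a small sketch; round_down_to_page is a hypothetical helper name, and the page size defaults to 4096 bytes unless you pass a second argument.

```shell
# Round a byte count down to a whole number of memory pages.
# Integer division discards the partial page; multiplying restores bytes.
round_down_to_page() {
    bytes="$1"
    page_size="${2:-4096}"
    echo "$(( $bytes / $page_size * $page_size ))"
}

round_down_to_page 200000000   # prints: 199999488

# Repeat the calculation with the page size of the machine you are on.
page="$(getconf PAGE_SIZE 2>/dev/null || echo 4096)"
echo "with a page size of ${page}: $(round_down_to_page 200000000 "$page")"
```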

Testing and Managing Cgroup Processes

Now that our resource limits are strictly defined in our production cgroup, it’s time to put them to the test. We will observe how cgroups throttle CPU usage, how they handle memory exhaustion, and how we can use built-in tools to manage these processes.

Stressing the CPU

Let's start by establishing a baseline. We will run a command that is notorious for hogging 100% of a CPU core: copying an infinite stream of zeros into the void. Run this command in a normal shell on your test VM (outside our restricted cgroup):

dd if=/dev/zero of=/dev/null &
sleep 2
ps -p $! -o %cpu

Because this process has no bounds, the output will show it consuming nearly 100% of the CPU:

%CPU
98.0

Run kill $! to stop the process before moving on.

Now, let's run that exact same command, but this time we will use cgexec to launch it directly inside our restricted child cgroup:

cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} dd if=/dev/zero of=/dev/null &
sleep 2
ps -p $! -o %cpu

Check the output now:

%CPU
15.3

Run kill $! to stop the process.

The CPU usage hovers right around the 15% limit we defined earlier! If you watch this process in a live monitor like htop, you will see it consistently stay at or below that threshold. The kernel is aggressively pausing and resuming the process to enforce our quota.
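
The kernel also keeps its own record of this throttling in the cgroup's cpu.stat file, which includes the counters nr_periods, nr_throttled, and throttled_usec. The sketch below reads them with a hypothetical helper, cgroup_stat_field. It assumes the scripts/production group from this chapter and prints nothing if that group does not exist.

```shell
# Print the value of one "key value" line from a flat-keyed cgroup file
# such as cpu.stat.
cgroup_stat_field() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Path of the example group; the variable defaults mirror this chapter.
stat_file="/sys/fs/cgroup/${PARENT_CGROUP:-scripts}/${CHILD_CGROUP:-production}/cpu.stat"
if [ -r "$stat_file" ]; then
    echo "periods elapsed:   $(cgroup_stat_field "$stat_file" nr_periods)"
    echo "periods throttled: $(cgroup_stat_field "$stat_file" nr_throttled)"
    echo "time throttled:    $(cgroup_stat_field "$stat_file" throttled_usec) microseconds"
fi
```

If nr_throttled keeps climbing while the dd process runs, the group is hitting its quota in almost every period.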

Filling Up the Memory (Triggering an OOM Kill)

Let's see what happens when a process refuses to stay within its memory limits. We are going to launch a bash process inside our cgroup that appends 10MB of filler data to an array every second until it crashes. After about 20 iterations, this script will breach the roughly 200MB limit we imposed. Because we also capped swap space, the kernel cannot keep the process alive indefinitely by paging the data out to disk.

cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} \
bash -c 'a=(); while true; do a+=("$(head -c 10M /dev/zero | tr "\0" "A")"); sleep 1; done' &

You can watch the memory footprint (RSS - Resident Set Size) grow rapidly in real-time using the watch command:

watch ps -p $! -o rss,sz

Within a minute or so, the cgroup will run completely out of memory, and the kernel's Out-Of-Memory (OOM) killer will intervene to protect the host. You will see an output like this:

[1]+  Killed                  cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} bash -c 'a=(); while true; do a+=("$(head -c 10M /dev/zero | tr "\0" "A")"); sleep 1; done'

In this setup, memory.max was used, which acts as a hard limit and triggers the OOM killer when exceeded. A softer and safer approach is to use memory.high instead. When a process reaches memory.high, the kernel heavily throttles the process and applies strong memory reclaim pressure. This forces the process to slow down and release memory, acting more like a “speed bump” than a hard stop. This behavior provides monitoring systems and administrators time to react and take action before the application is terminated by the OOM killer.
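
To combine both limits, you can place memory.high somewhat below memory.max so that throttling starts before the hard stop. The sketch below is one possible way to do that. The helper set_memory_limits is hypothetical, and the default of 80% is an arbitrary illustration rather than a recommendation.

```shell
# Write a soft limit (memory.high) and a hard limit (memory.max) into a cgroup
# directory. The soft limit is derived as a percentage of the hard limit.
set_memory_limits() {
    dir="$1"
    hard_bytes="$2"
    soft_percent="${3:-80}"
    soft_bytes="$(( $hard_bytes * $soft_percent / 100 ))"
    echo "$soft_bytes" > "$dir/memory.high" || return 1
    echo "$hard_bytes" > "$dir/memory.max"
}

# As root, inside a disposable VM (path assumes this chapter's example group):
#   set_memory_limits /sys/fs/cgroup/scripts/production 200000000
# This would request memory.high=160000000 and memory.max=200000000.
```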

Monitoring with `systemd-cgtop` and `systemd-cgls`

Just as you use top and ls to view standard processes and files, systemd provides systemd-cgtop and systemd-cgls specifically for monitoring cgroups.

First, let's populate our cgroup with a few sleeping background processes so we have something to look at:

for p in {1..5} ; do cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000 & done
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} dd if=/dev/zero of=/dev/null &

Now, run: systemd-cgtop

You will get a clean, live-updating table showing the resource consumption aggregated by cgroup.

If you want a hierarchical tree view of exactly which PIDs belong to which groups, use systemd-cgls:

root@container-security:/home/user# systemd-cgls /scripts
CGroup /scripts:
└─production
  ├─2142 sleep 2000
  ├─2143 sleep 2000
  ├─2144 sleep 2000
  ├─2145 sleep 2000
  ├─2146 sleep 2000
  └─2147 dd if=/dev/zero of=/dev/null
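
Under the hood, this kind of view can be built from the cgroup.procs files alone. As a rough sketch (not how systemd-cgls is actually implemented), the hypothetical helper below walks a cgroup directory and prints one "group PID" pair per line.

```shell
# List "cgroup-path pid" pairs for every cgroup below a directory by reading
# each cgroup.procs file (one PID per line).
list_cgroup_pids() {
    top="$1"
    find "$top" -name cgroup.procs 2>/dev/null | sort | while IFS= read -r procs_file; do
        group="$(dirname "$procs_file")"
        while IFS= read -r pid; do
            echo "$group $pid"
        done < "$procs_file"
    done
}

# Prints nothing if the example hierarchy from this chapter does not exist.
list_cgroup_pids "/sys/fs/cgroup/${PARENT_CGROUP:-scripts}"
```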

Killing All Processes in a Cgroup

One of the best new features in cgroups v2 is the cgroup.kill file. Instead of hunting down individual PIDs, you can instantly terminate everything inside a cgroup by writing a 1 to this file:

echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill

If you press enter a couple of times, you will see the terminal report that all the sleep processes we spawned earlier have been instantly killed. Checking systemd-cgls /scripts will now show an empty group.
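
The cgroup.kill file only exists on newer kernels (it was added in Linux 5.14), so a script may want a fallback. The sketch below is one hedged way to handle both cases. The helper name kill_cgroup is hypothetical, and the fallback loop is inherently racy because processes can fork while it runs.

```shell
# Kill every process in a cgroup directory. Prefer cgroup.kill; otherwise fall
# back to sending SIGKILL to each PID listed in cgroup.procs.
kill_cgroup() {
    dir="$1"
    if [ -w "$dir/cgroup.kill" ]; then
        echo 1 > "$dir/cgroup.kill"
    else
        # Fallback for kernels without cgroup.kill (may miss new forks).
        while IFS= read -r pid; do
            kill -KILL "$pid" 2>/dev/null || true
        done < "$dir/cgroup.procs"
    fi
}

# As root, inside a disposable VM (path assumes this chapter's example group):
#   kill_cgroup /sys/fs/cgroup/scripts/production
```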

Moving an Already-Running Process (`cgclassify`)

So far, we have been launching new processes directly into our cgroup using cgexec. But what if a runaway process is already running on the host, and you want to lock it down on the fly?

We can use the cgclassify command for this. Let's start our CPU hog on the host system without limits:

dd if=/dev/zero of=/dev/null &

It is currently consuming 100% of a core. Time to cage it. We use cgclassify and pass it the PID (using $! for the last background process):

cgclassify -g cpu,memory:${PARENT_CGROUP}/${CHILD_CGROUP} $!

If you run ps -p $! -o %cpu right after classifying the process, you might notice something strange. It might report a CPU usage of 75% or 50%, slowly ticking down, rather than an instant 15%. Why? The ps command does not show instantaneous CPU usage. It calculates the average CPU usage over the entire lifetime of the process. Because the process ran at 100% for a few seconds before we caged it, that lifetime average takes a while to drop! If you look at the process in htop or systemd-cgtop instead, you will see that its real-time usage fell to roughly 15% almost immediately after you ran the cgclassify command.
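
A little arithmetic shows how slowly that lifetime average moves. The sketch below models a process that ran unthrottled for some seconds and was then capped; the function name and the whole-second inputs are assumptions made for illustration.

```shell
# Lifetime CPU average as reported by ps: total CPU time divided by elapsed
# time. Arguments: seconds spent unthrottled (at 100%), seconds spent
# throttled, and the throttle limit in percent.
lifetime_cpu_percent() {
    free_s="$1"
    capped_s="$2"
    limit="$3"
    echo "$(( ($free_s * 100 + $capped_s * $limit) / ($free_s + $capped_s) ))"
}

lifetime_cpu_percent 4 1 15    # prints: 83
lifetime_cpu_percent 4 6 15    # prints: 49
lifetime_cpu_percent 4 60 15   # prints: 20
```

Even a full minute after being caged, a process that ran freely for four seconds still shows about 20% in ps, although it has been held at 15% the whole time.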

Kill the process with: echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill

Viewing Configuration with `cgget`

If you ever need to audit a cgroup to see exactly how it is configured and what its current stats are, cgget is your go-to command:

cgget ${PARENT_CGROUP}/${CHILD_CGROUP}

This dumps the contents of all the controller files into an easy-to-read list, showing you your max limits, current usage metrics, and even how many times the OOM killer has been triggered (oom_kill).
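
If you only care about the OOM counter, you can read it straight from the group's memory.events file, which is where oom_kill lives. This is a minimal sketch with a hypothetical helper, oom_kill_count, and it assumes the scripts/production group from this chapter.

```shell
# Print the oom_kill counter from a cgroup's memory.events file.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}

group_dir="/sys/fs/cgroup/${PARENT_CGROUP:-scripts}/${CHILD_CGROUP:-production}"
if [ -r "$group_dir/memory.events" ]; then
    kills="$(oom_kill_count "$group_dir")"
    if [ "${kills:-0}" -gt 0 ]; then
        echo "processes in this group were OOM-killed ${kills} time(s)"
    else
        echo "no OOM kills recorded for this group"
    fi
fi
```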

Cleaning Up

To keep your system clean, you can recursively delete the cgroups we just created:

cgdelete -r -g cpu:/${PARENT_CGROUP}

You might wonder why commands like cgexec and cgdelete require you to specify a controller (like -g cpu:) even though cgroups v2 uses a unified hierarchy. This is simply a quirk for backward compatibility with cgroups v1 syntax. The command requires it to run, but in a v2 environment, the process is applied to the unified group regardless of which specific controller you type here.

Containers and Cgroups

Throughout this chapter, we manually created cgroups, configured resource limits, and assigned processes to them. While this is the best way to learn how the Linux kernel enforces resource distribution, you rarely have to do this by hand in the real world. You don't have to be using containers to take advantage of cgroups, but modern container runtimes provide an incredibly convenient abstraction layer over them.

When you run a containerized application, runtimes like Docker or containerd automatically interact with the cgroups filesystem on your behalf. Behind the scenes, the runtime creates a dedicated cgroup hierarchy specifically for that container (typically using the long container ID as the directory name).

When you pass a flag like --memory 100M to docker run, or define a CPU limit in a Kubernetes Pod specification, the container engine translates those human-readable requests directly into the memory.max and cpu.max files we explored earlier.
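
To make that translation concrete, here is a rough sketch of the arithmetic. It rests on two assumptions you should verify against your runtime's documentation: that the runtime uses a CPU period of 100000 microseconds, and that it interprets the m memory suffix as MiB. Both helper names are hypothetical.

```shell
# Approximate how runtime flags map onto cgroup v2 values.
# Assumption: a fixed CPU period of 100000 microseconds.
cpus_to_cpu_max() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# Assumption: the "m" suffix means MiB (1024 * 1024 bytes).
mib_to_bytes() {
    echo "$(( $1 * 1024 * 1024 ))"
}

cpus_to_cpu_max 0.5    # prints: 50000 100000
cpus_to_cpu_max 1.5    # prints: 150000 100000
mib_to_bytes 100       # prints: 104857600
```

You can compare these numbers with the cpu.max and memory.max files inside a running container's cgroup directory.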

From a security standpoint, understanding this underlying mechanism is critical. Constraining resources provides a powerful layer of protection against resource exhaustion.

Whether an attacker deliberately exploits an application to consume excess memory, or a simple bug causes an accidental CPU spike, an unbounded container can easily starve legitimate applications running on the same host. By setting explicit memory and CPU limits on your container deployments, you ensure that the kernel's cgroups will throttle or kill the offending process before it can bring down your entire infrastructure.

This article is one piece of the Ultimate Container Security Series, an ongoing effort to organize and explain container security concepts in a practical way. If you want to explore related topics or see what’s coming next, the series introduction post provides the complete roadmap.

Chapter 4: Linux Capabilities

0xAlphaSecurity — Fri, 06 Mar 2026 15:05:03 +0000

Understanding Linux capabilities is a fundamental step in mastering container security, as it allows us to move beyond the "all-or-nothing" approach of the traditional root user. By breaking down the monolithic power of root into granular privileges, we can grant a container exactly what it needs to function while significantly reducing the potential blast radius of an exploit.

Introduction: Understanding capabilities

To understand how to secure a container, we first need to understand how the Linux kernel handles privileges. The security model of containers is built directly on top of a kernel feature called Capabilities.

The "All or Nothing" Problem

Traditionally, UNIX-like systems operated on a binary permission model. For the purpose of permission checks, the kernel distinguished between only two categories of processes:

Privileged processes (Root): Processes with an effective User ID (UID) of 0.
Unprivileged processes (Standard User): Processes with a non-zero UID.

This created a significant security gap known as the "All or Nothing" problem. A privileged process (UID 0) bypasses almost all kernel permission checks, allowing it to modify system files, install software, and reconfigure the network stack. A standard user, conversely, is strictly bound by permission checks.

The problem arises when a standard user needs to perform a specific action that requires elevated privileges, such as opening a raw network socket (like ping using ICMP) or binding to a restricted port (like a web server on port 80). In the old model, the only solution was to give the process full root privileges, usually via the SUID (Set User ID) bit.

As discussed in previous chapters, the SUID bit is a security risk. It effectively grants a program full superuser powers just to perform one minor task. If an attacker exploits a bug in a SUID-root binary, they don't just compromise that specific application; they gain full control over the entire system.

What are Capabilities?

To solve this "all or nothing" problem, kernel developers introduced a more nuanced solution called Capabilities. Starting with Linux Kernel 2.2 (in 1999), the privileges traditionally associated with the superuser were broken down into distinct, independent units. These units are called capabilities.

The concept is straightforward: instead of checking "Is this user root?" the kernel checks "Does this thread have the specific capability to perform this action?"

For example:

Instead of being "Root," a process might only have CAP_NET_BIND_SERVICE (to bind to ports < 1024).
Instead of being "Root," a process might only have CAP_CHOWN (to change file ownership).

While this feature was originally scoped only to processes, support for assigning capabilities directly to files was added in 2008. This evolution allows us to assign fine-grained permissions to executables so that processes that previously required UID 0/root permissions no longer need them to function.

Capabilities are the technical implementation of the Principle of Least Privilege. This security principle dictates that a process should possess only the bare minimum privileges necessary to perform its function and nothing more.

By using capabilities, we can drastically reduce the attack surface. If a web server runs as a non-root user with only the minimal required capabilities (e.g., CAP_NET_BIND_SERVICE), then the impact of a compromise can be reduced.
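
You can see which capabilities a process currently holds by decoding the CapEff bitmask in /proc/<pid>/status. The sketch below tests single bits of that mask. The helper name has_cap is hypothetical; the bit numbers come from <linux/capability.h>, where CAP_CHOWN is bit 0 and CAP_NET_BIND_SERVICE is bit 10.

```shell
# Test whether a capability bit is set in a hexadecimal mask such as the
# CapEff line of /proc/<pid>/status.
has_cap() {
    [ "$(( (0x$1 >> $2) & 1 ))" -eq 1 ]
}

# Read the effective capability mask of the current shell, if available.
mask=""
if [ -r /proc/self/status ]; then
    while read -r key value; do
        if [ "$key" = "CapEff:" ]; then
            mask="$value"
        fi
    done < /proc/self/status
fi

if [ -n "$mask" ]; then
    if has_cap "$mask" 10; then
        echo "CapEff=$mask: this shell may bind to ports below 1024"
    else
        echo "CapEff=$mask: this shell lacks CAP_NET_BIND_SERVICE"
    fi
fi
```

An ordinary user shell typically reports a mask of all zeros, which matches the idea that most processes need no capabilities at all.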

The Capability Sets

Up to this point, capabilities sound simple: break root privileges into smaller pieces and assign only what is necessary. The real complexity begins when we look at how capabilities are stored, inherited, and transformed between processes and files. If you read the capabilities(7) man page (man 7 capabilities), you might find it terse and difficult to map to real-world scenarios.

The confusion often stems from two sources:

Naming Collisions: The kernel uses the same names (like "Effective" or "Inheritable") for both processes and files, but they function quite differently depending on where they are applied.
Counter-Intuitive Behavior: Capabilities don't behave like the simple "SUID Root" model we are used to. Just because a parent process has a capability doesn't automatically mean the child process gets it.

To demystify this, we first need to distinguish between the Process (the active entity) and the File (the passive storage).

Process vs. File Capabilities

execve() is a Linux system call that replaces the current running process with a new program. It loads the new executable into memory and starts it, keeping the same process ID but with new code and data.

Process Capabilities: When we talk about a "process" having capabilities, we are technically talking about a thread. In Linux, capability sets are maintained per thread.
- Role: These determine what the running task is actually allowed to do right now.
- Lifecycle: Thread capability sets are copied unchanged during a fork() (creating a new process) and are transformed by a fixed set of rules during an execve() (running a new program). This makes execve() the point where a task's privileges can actually change.
- Note: Most normal processes (like your text editor or shell) have and need zero capabilities. They rely on standard file permissions. Capabilities are generally only needed for system-level administration tasks.
File Capabilities: Binaries on the disk can also have capabilities associated with them.
- Role: These are not "active" permissions. Instead, they are a set of instructions that tell the kernel: "When this file is executed, grant the process these specific privileges."
- Storage: These are stored in the file's Extended Attributes (xattrs), specifically within security.capability. File capabilities depend on filesystem support for extended attributes (most modern filesystems support this). For example, in ext3/ext4, extended attributes are stored in the inode or in additional disk blocks. Many backup tools do not preserve extended attributes by default. Without preserving xattrs, file capabilities will be silently lost.
- Copying: A plain cp does not copy extended attributes, so the copied binary loses its capabilities. To keep them, copy the file with the --preserve=all option. Example: cp --preserve=all /origin/path /dest/path
- Constraint: Writing to this extended attribute requires the CAP_SETFCAP capability. This ensures that standard users cannot simply grant themselves superpowers by editing a binary's attributes.
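
To make the storage format less abstract, here is a minimal Python sketch that decodes the raw bytes of a security.capability attribute. It assumes the little-endian vfs_cap_data layout from the kernel headers (a 32-bit magic word carrying the revision and the effective flag, followed by permitted/inheritable 32-bit pairs). The function name parse_file_caps and the sample bytes are illustrative, not part of any tool.

```python
import struct

# Constants from the kernel's vfs_cap_data layout (linux/capability.h).
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_1 = 0x01000000  # one 32-bit word per set
VFS_CAP_REVISION_2 = 0x02000000  # two 32-bit words per set
VFS_CAP_REVISION_3 = 0x03000000  # like revision 2, plus a trailing root ID


def parse_file_caps(raw: bytes) -> dict:
    """Decode the raw value of the security.capability xattr (hypothetical helper)."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    if revision == VFS_CAP_REVISION_1:
        words = 1
    elif revision in (VFS_CAP_REVISION_2, VFS_CAP_REVISION_3):
        words = 2
    else:
        raise ValueError(f"unknown capability revision: {revision:#x}")

    permitted = inheritable = 0
    for i in range(words):
        # Each entry is a (permitted, inheritable) pair of 32-bit words.
        perm_word, inh_word = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= perm_word << (32 * i)
        inheritable |= inh_word << (32 * i)

    return {
        "permitted": permitted,
        "inheritable": inheritable,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
    }


# Synthetic bytes equivalent to cap_net_bind_service=ep (bit 10, effective flag set).
sample = struct.pack("<5I", VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE, 1 << 10, 0, 0, 0)
caps = parse_file_caps(sample)
print(caps)
```

On a real Linux system, the raw bytes could be obtained with os.getxattr(path, "security.capability"), which raises OSError when the file has no capabilities assigned.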

The 5 Capability Sets

To manage how privileges are granted, inherited, and limited, Linux uses five distinct "sets" of capabilities (which are represented as bit masks). Think of these as five different buckets that a process carries.

Set	Purpose	Process Capabilities	File Capabilities
Permitted (P)	The superset of what a process can do. A process can move capabilities from here to the Effective set, but it cannot add new ones that aren't already here.	✅	✅
Effective (E)	The Active set. This is the only set the kernel actually checks when a process tries to do something (like open a port). If a capability is in Permitted but not Effective, the action fails.	✅	❌
Inheritable (I)	Capabilities that can be passed down to a child process. However, simply having a capability here isn't enough; the child executable must also be "willing" to receive it (via File Inheritable sets).	✅	✅
Bounding (B)	The hard limit. No capability can ever be added to the Permitted or Inheritable sets if it doesn't exist in the Bounding set.	✅	❌
Ambient (A)	Added in newer kernels to fix the "Inheritance Problem." It allows non-SUID binaries (which aren't capability-aware) to blindly inherit capabilities from their parent.	✅	❌

Linux defines five capability sets for each thread:

Thread Permitted Set (P): The permitted set is the thread's upper bound of capabilities. It defines the maximum privilege scope the thread can ever exercise. A thread may call capset() to raise capabilities from Permitted into the Effective set (the capabilities that are actually checked by the kernel), and it may also use capset() to place permitted capabilities into the Inheritable set (capabilities it is allowed to pass across an execve() when combined with the executed file's inheritable capabilities). A thread cannot use capset() to add capabilities to its own permitted set: once a capability is dropped from Permitted, it can only be regained by executing a program whose file capabilities (or set-user-ID-root status) grant it. CAP_SETPCAP does not change this; it instead lets a thread add any capability from its bounding set to its inheritable set and drop capabilities from the bounding set.
Thread Effective Set (E): This is the set that the kernel actually checks during permission evaluation. If a capability is not in the effective set, the kernel behaves as if the process does not have it. The effective set is what truly matters during system calls.
Thread Inheritable Set (I): The inheritable set controls what capabilities may be passed across execve() to a different binary. A capability in the thread inheritable set is not automatically granted to child processes. It only influences what may become permitted in the new program. Both the thread inheritable set and the file inheritable set must agree. The thread inheritable set and file inheritable set are different things (This is where many people get confused - more on that later).
Bounding Set (B): The bounding set acts as a hard ceiling on what capabilities a process can ever gain through execve(). Even if a file has a capability marked as permitted, if that capability isn't in the bounding set, the process can never acquire it. It also limits which capabilities can be added to the inheritable set.
Ambient Set (A): The ambient set was introduced in Linux 4.3 to solve the problem of passing capabilities to ordinary binaries that have no file capabilities set. Any capability in the ambient set is automatically added to both the permitted and effective sets of the new process after execve(), even for plain, unmodified binaries. (Executing a set-user-ID, set-group-ID, or file-capability binary clears the ambient set instead.) To add a capability to the ambient set, it must already be in both the thread's permitted and inheritable sets, and dropping it from either one automatically removes it from the ambient set as well.

Files only have:

File Permitted Set: The file permitted set defines the capabilities a binary is allowed to gain when executed, regardless of what the thread already has. These capabilities are added to the new process's permitted set after execve(), but only if they are also allowed by the bounding set.
File Inheritable Set: The file inheritable set specifies which capabilities the binary is willing to accept from the thread's inheritable set during execve(). Only capabilities present in both the thread's inheritable set and the file's inheritable set will be carried over into the new process's permitted set.
File Effective Flag: Unlike the other file sets, the effective field is just a single bit, not a set. When set, it tells the kernel to automatically move all of the new process's permitted capabilities into its effective set after execve(), which is needed for older binaries that don't explicitly call capset() to raise their own capabilities.

How Capabilities are Calculated

When a process executes a binary (via execve()), the kernel calculates the new capabilities for the process based on a specific formula. This formula combines what the parent thread had and what the file allows.

When a thread executes a new binary the logic can be simplified as follows:

Permitted Set Calculation: The new permitted set is the union of two sources: capabilities that exist in both the thread's inheritable set and the file's inheritable set, plus capabilities that exist in the file's permitted set filtered through the bounding set.
- Formula: New Permitted = (Old Inheritable AND File Inheritable) OR (File Permitted AND Bounding Set)
Effective Set Calculation: If the file's effective bit is set, the new effective set equals the full new permitted set, meaning all capabilities are immediately active. Otherwise the effective set starts empty and the process must raise them manually.
- Formula: New Effective = New Permitted if File Effective Flag is set, else 0
Inheritable Set Calculation: The inheritable set is simply carried over unchanged from the old thread; execve() does not modify it.
- Formula: New Inheritable = Old Inheritable

The following diagram shows the relationship between the different capability sets and how they interact during process creation and execution.

You might notice a gap in the logic described above. If you wanted to run an ordinary binary or script with capabilities, say a plain Python script, you were stuck. Putting a capability in the inheritable set had no effect unless the target binary also had that capability in its file inheritable set, which meant you couldn't pass privileges down to unmodified binaries without touching the files themselves.

The ambient set solves this. Any capability in the ambient set is automatically added to the new process's permitted and effective sets after execve(), even if the binary has no file capabilities set at all. This is how modern container runtimes can run standard, unmodified applications with specific privileges without needing to alter the binaries inside the container image.
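
The rules above, including the ambient set, can be sketched as a small Python model. This is a simplified illustration based on the transformation formulas in the capabilities man page, not the kernel's implementation: it ignores the special handling of UID 0 and the securebits flags, and the class and function names (ThreadCaps, FileCaps, caps_after_execve) are made up for this example.

```python
from dataclasses import dataclass

CAP_NET_BIND_SERVICE = 1 << 10
ALL_CAPS = (1 << 41) - 1  # capability bits 0..40


@dataclass(frozen=True)
class ThreadCaps:
    permitted: int = 0
    effective: int = 0
    inheritable: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False  # single bit, not a set
    setid: bool = False      # set-user-ID or set-group-ID bit on the file


def caps_after_execve(t: ThreadCaps, f: FileCaps) -> ThreadCaps:
    """Model the capability transformation applied during execve()."""
    # A "privileged" file (file capabilities or set-ID bits) clears the ambient set.
    privileged = bool(f.permitted or f.inheritable or f.effective or f.setid)
    ambient = 0 if privileged else t.ambient
    permitted = (t.inheritable & f.inheritable) | (f.permitted & t.bounding) | ambient
    effective = permitted if f.effective else ambient
    return ThreadCaps(permitted, effective, t.inheritable, t.bounding, ambient)


# 1) Unprivileged shell runs a binary tagged cap_net_bind_service=ep.
shell = ThreadCaps(bounding=ALL_CAPS)
tagged = FileCaps(permitted=CAP_NET_BIND_SERVICE, effective=True)
print(hex(caps_after_execve(shell, tagged).effective))

# 2) The "gap": inheritable alone does nothing for a plain binary.
parent = ThreadCaps(permitted=CAP_NET_BIND_SERVICE,
                    inheritable=CAP_NET_BIND_SERVICE, bounding=ALL_CAPS)
print(hex(caps_after_execve(parent, FileCaps()).permitted))

# 3) With the capability also in the ambient set, the plain binary receives it.
parent_amb = ThreadCaps(permitted=CAP_NET_BIND_SERVICE, effective=CAP_NET_BIND_SERVICE,
                        inheritable=CAP_NET_BIND_SERVICE, bounding=ALL_CAPS,
                        ambient=CAP_NET_BIND_SERVICE)
print(hex(caps_after_execve(parent_amb, FileCaps()).effective))
```

The three cases mirror the scenarios discussed in this section: a file capability granting a privilege, an inheritable capability being lost on a plain binary, and the ambient set carrying it through.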

Inspecting & Manipulating Capabilities

Common Linux Capabilities

Before we can meaningfully inspect anything, it helps to have a mental map of the most important capabilities and what they actually allow. Linux defines over 40 capabilities, but a handful appear constantly in security-relevant contexts. The full list can be found in the documentation.

Capability	Short Description
`CAP_CHOWN`	Allows a process to make arbitrary changes to file UIDs and GIDs.
`CAP_SETUID` / `CAP_SETGID`	Allow a process to change its own UID and GID to arbitrary values, the same operation su and sudo perform when switching users. A process with `CAP_SETUID` can effectively become any user on the system, including root, making it nearly as dangerous as having full root.
`CAP_DAC_OVERRIDE`	Stands for "Discretionary Access Control Override." A process with this capability can bypass standard file read, write, and execute permission checks. In practical terms, this means it can read or write any file on the system regardless of its ownership or permissions. It does not bypass MAC (Mandatory Access Control) systems like SELinux or AppArmor, but it completely defeats the traditional UNIX permission model.
`CAP_NET_BIND_SERVICE`	Allows a process to bind to privileged ports, those below 1024, without needing full root access. This is the correct and minimal capability to assign to a web server that needs to listen on port 80 or 443. Without it, only root processes can bind to these ports.
`CAP_NET_RAW`	Grants the ability to use raw and packet sockets, and to bind to any address for transparent proxying. This is what the `ping` command historically needed to craft ICMP packets. It is also what an attacker needs to perform packet sniffing or craft arbitrary network packets, making it a capability worth watching closely.
`CAP_NET_ADMIN`	This is one of the most powerful networking capabilities. It grants permission to perform a broad range of network configuration tasks: configuring network interfaces, managing routing tables, setting firewall rules with `iptables`, enabling promiscuous mode on a network interface, and modifying network namespaces. Because it covers so much ground, it's a frequent target during container escapes. A container that has `CAP_NET_ADMIN` can potentially reconfigure the host's network stack if it escapes its namespace.
`CAP_SYS_ADMIN`	This is often described as the "new root." It is by far the broadest capability in Linux, covering an enormous range of system administration operations: mounting and unmounting filesystems, creating namespaces via `clone()` and `unshare()` and joining them via `setns()`, calling `pivot_root()`, setting the hostname, enabling and disabling swap, and dozens of other privileged operations. (Loading kernel modules and calling `chroot()` are governed by their own capabilities, `CAP_SYS_MODULE` and `CAP_SYS_CHROOT`.) The Linux man page lists so many permissions under `CAP_SYS_ADMIN` that security practitioners generally treat its presence in a container as equivalent to running the container as root. If you see it, treat it as a red flag.
`CAP_SYS_PTRACE`	Allows a process to trace arbitrary processes using `ptrace()`, the system call that debuggers like `gdb` rely on. In a container context, this is particularly dangerous because `ptrace()` can be used to inspect and modify the memory of other processes, potentially leading to container escape if the target process runs in a different namespace or with higher privileges.
`CAP_SYS_MODULE`	Allows loading and unloading kernel modules. This is an extremely high-risk capability because a kernel module runs in kernel space with no restrictions whatsoever. A process with this capability can load a malicious module that does anything the kernel itself can do.

Inspecting Capabilities

Linux provides several tools for examining what capabilities are assigned, whether to a running process or to a file on disk. Using them together gives you a complete picture of the privilege landscape on any system.

The examples in the following sections are run on a standard Ubuntu Server 24.04 VM. Always run these exercises on a disposable test environment, as you may encounter binaries with capabilities that can be dangerous if misused.

Inspecting File Capabilities with `getcap`

The getcap command reads the security.capability extended attribute from a file and displays it in a human-readable format. This is the primary tool for checking what privileges an executable binary has been granted.

$ getcap /usr/bin/ping

You would typically see output like:

/usr/bin/ping cap_net_raw=ep

This tells you that the ping binary has the cap_net_raw capability, and the =ep suffix tells you which sets it's in. The letter e means the Effective flag is set, and p means the capability is in the Permitted file set. Referring back to our capability calculation formula, this means that when ping is executed, cap_net_raw will be added to the new process's permitted set, and because the effective flag is set, it will also be immediately active in the effective set.

Historical Note: Due to an update where "ping sockets" were added directly to the kernel, the ping command technically no longer requires any additional Linux capabilities to work (though this is gated by a config setting disabled by some distros). The CAP_NET_RAW capability is still commonly assigned to the binary for backward compatibility with older kernels and configurations where raw sockets are still required.

You'll commonly see these suffixes in capability strings:

p - The capability is in the file's Permitted set.
e - The file's Effective bit is set (applies all permitted caps to the effective set immediately).
i - The capability is in the file's Inheritable set.
ep - Both Effective and Permitted; the most common combination for binaries that need to self-elevate.

One particularly useful flag for getcap is -r, which enables recursive searching. To scan an entire filesystem for any binary that has capabilities assigned, run:

$ getcap -r / 2>/dev/null

The 2>/dev/null part discards permission errors from directories you can't read. This one-liner is a standard step in security audits and CTF (Capture the Flag) challenges alike, since a misconfigured binary with an overly broad capability is a common privilege escalation vector.
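
As a rough illustration of how such an audit could be automated, the Python sketch below scans getcap-style output for capabilities that deserve a closer look. It assumes the single-clause "PATH NAMES=FLAGS" format shown above (older libcap versions print a slightly different layout), the HIGH_RISK selection is an illustrative choice rather than an official list, and /usr/local/bin/backup is a hypothetical path.

```python
# Illustrative selection of capabilities to flag; adjust to your own policy.
HIGH_RISK = {
    "cap_setuid", "cap_setgid", "cap_dac_override",
    "cap_sys_admin", "cap_sys_module", "cap_sys_ptrace",
}


def parse_getcap_line(line):
    """Split 'PATH NAMES=FLAGS' into (path, set_of_names, flags)."""
    path, _, clause = line.strip().rpartition(" ")
    names, sep, flags = clause.partition("=")
    if not path or not sep:
        raise ValueError(f"unrecognized getcap line: {line!r}")
    # An empty name list before '=' stands for every capability.
    caps = set(names.split(",")) if names else {"all"}
    return path, caps, flags


def audit(getcap_output):
    """Return {path: [risky capabilities]} for files whose Permitted set is affected."""
    findings = {}
    for line in getcap_output.splitlines():
        if not line.strip():
            continue
        path, caps, flags = parse_getcap_line(line)
        if "p" not in flags:
            continue  # nothing is placed in the file's Permitted set
        if "all" in caps:
            findings[path] = ["all"]
        else:
            risky = sorted(caps & HIGH_RISK)
            if risky:
                findings[path] = risky
    return findings


# Sample input; the last path is hypothetical.
SAMPLE = """\
/usr/bin/ping cap_net_raw=ep
/usr/bin/python3.12 cap_setuid=ep
/usr/local/bin/backup =ep
"""

for path, caps in audit(SAMPLE).items():
    print(f"{path}: {', '.join(caps)}")
```

In practice the input would come from running getcap -r and capturing its standard output; the parsing is kept separate so it can be checked without touching a live system.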

Inspecting Process Capabilities with `getpcaps`

While getcap deals with files, getpcaps shows you the capabilities of a running process, identified by its PID. Let's look at the difference between a normal user process and a root process.

First, find the PID of your current shell and inspect it:

user@container-security:~$ ps
    PID TTY          TIME CMD
   2736 pts/0    00:00:00 bash
   2755 pts/0    00:00:00 ps

user@container-security:~$ getpcaps 2736
2736: =

The = output means the process's Effective, Permitted, and Inheritable sets are all empty. This is exactly what you'd expect for an ordinary shell running as a non-root user. It doesn't need any capabilities because it relies entirely on standard file permission checks for everything it does.

Now compare that to a shell running as root:

user@container-security:~$ sudo bash
[sudo] password for user:
root@container-security:/home/user#
root@container-security:/home/user# ps
    PID TTY          TIME CMD
   2761 pts/1    00:00:00 sudo
   2762 pts/1    00:00:00 bash
   2769 pts/1    00:00:00 ps
root@container-security:/home/user# getpcaps 2762
2762: =ep

A root shell carries the full complement of capabilities (=ep) in both its Permitted and Effective sets, giving it unconstrained access to virtually every privileged operation on the system. This is exactly the scenario that the Principle of Least Privilege is designed to avoid.

One subtle and dangerous pitfall is a capability string with an empty name list. When you inspect such a process with getpcaps, you'll see something like this (the same output we got in the root shell example above):

<PROCESS_PID>: =ep

This looks like the process has no specific capabilities, and one might assume it's harmless. It is the exact opposite. In the capability text format, an empty name list before =ep is shorthand for "all capabilities", making this the equivalent of <PROCESS_PID>: all=ep. The same is true for files: a getcap result of =ep means the binary is granted every capability.

Inspecting the Current Shell with `capsh`

capsh (Capability Shell) is a versatile tool for both inspecting and launching processes with specific capability sets. Its --print flag dumps a comprehensive view of the current shell's capability state:

user@container-security:~$ capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB:
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1000(user) euid=1000(user)
gid=1000(user)
groups=27(sudo),1000(user)
Guessed mode: HYBRID (4)

The output tells you several things at once. Current shows the thread's capability sets in text form (Effective, Permitted, and Inheritable); here it is empty. Bounding set shows the hard ceiling. Notice that even for a non-root user, the bounding set may contain many capabilities, but they won't appear in the current set unless explicitly granted. Ambient set is empty here, meaning no capabilities will be passed to child processes automatically.

This is much richer than getpcaps for understanding the full capability context of your current process.

Reading Raw Bitmasks from `/proc`

For low-level inspection or scripting, you can read capability information directly from the kernel's process filesystem. Every running process has a status file under /proc/<pid>/status that contains raw hexadecimal bitmask values for each capability set:

user@container-security:~$ ps
    PID TTY          TIME CMD
   2736 pts/0    00:00:00 bash
   3184 pts/0    00:00:00 ps
user@container-security:~$ cat /proc/2736/status | grep -i cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

Each line corresponds to a capability set: CapInh (Inheritable), CapPrm (Permitted), CapEff (Effective), CapBnd (Bounding), and CapAmb (Ambient). The values are 64-bit hexadecimal bitmasks where each bit position corresponds to a specific capability number.

Reading these raw masks directly isn't very human-friendly, but capsh can decode them for you with the --decode flag:

user@container-security:~$ capsh --decode=000001ffffffffff
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,...

This is especially useful in automated scripts or when you need to understand the capabilities of a process that getpcaps can't reach, such as inside a container's namespace.
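
For scripts that cannot rely on capsh being installed, the decoding can be approximated in a few lines of Python. The sketch below takes the capability names in the same order as the capsh --print output earlier (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore) and decodes the Cap* lines of a status file. The function names are illustrative, and bits beyond 40 are simply ignored.

```python
# Capability names in bit order, as listed in the bounding set printed by capsh.
CAP_NAMES = (
    "cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,"
    "cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,"
    "cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,"
    "cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,"
    "cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,"
    "cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,"
    "cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,"
    "cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,"
    "cap_perfmon,cap_bpf,cap_checkpoint_restore"
).split(",")


def decode_caps(mask_hex):
    """Turn a hexadecimal bitmask into the list of capability names it contains."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def parse_status_caps(status_text):
    """Decode every Cap* line found in the text of /proc/<pid>/status."""
    result = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            result[key] = decode_caps(value.strip())
    return result


# Sample taken from the unprivileged shell shown above.
STATUS = """\
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
"""

decoded = parse_status_caps(STATUS)
print(decoded["CapEff"])
print(len(decoded["CapBnd"]))
print(decode_caps("0000000000000400"))
```

To inspect a live process, the same function could be fed the contents of /proc/<pid>/status read with open().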

Assigning Capabilities with `setcap`

setcap writes a capability set directly into the security.capability extended attribute of a file. The general syntax is:

$ sudo setcap <capability>+<sets> /path/to/binary

For example, to grant a binary the CAP_SETUID capability in both the Permitted and Effective sets:

$ sudo setcap cap_setuid+ep /path/to/file

Note that running setcap itself requires the CAP_SETFCAP capability. This privilege is automatically granted to root, which is why the sudo prefix is needed when running as a normal user.

An important subtlety: setcap is not additive. Each invocation of setcap completely replaces the capability set of the file. If you want to assign multiple capabilities, you must specify all of them in a single command:

$ sudo setcap cap_net_bind_service,cap_net_raw+ep /path/to/binary

Running setcap twice with different capabilities will result in only the second set being stored.

Removing Capabilities with `setcap -r`

To strip all capabilities from a file, use the -r (remove) flag:

$ sudo setcap -r /path/to/program

After this, getcap on that file will return no output, and the binary will run with whatever privileges the executing user's process has, just like any other ordinary binary.

A Practical Example: Assigning `CAP_NET_BIND_SERVICE` to a Custom Binary

On Linux, ports below 1024 are called privileged ports. Binding to them is restricted by the kernel to prevent unprivileged users from impersonating well-known services like HTTP (port 80) or HTTPS (port 443). Traditionally, the only way to bind to these ports was to run your process as root. With CAP_NET_BIND_SERVICE we can grant exactly that one permission to a specific binary, and nothing else.

Try to start a Python HTTP server on port 80 as a non-root user:

user@container-security:~$ python3 -m http.server 80
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/lib/python3.12/http/server.py", line 1314, in <module>
    test(
  File "/usr/lib/python3.12/http/server.py", line 1261, in test
    with ServerClass(addr, HandlerClass) as httpd:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/socketserver.py", line 457, in __init__
    self.server_bind()
  File "/usr/lib/python3.12/http/server.py", line 1308, in server_bind
    return super().server_bind()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.12/socketserver.py", line 473, in server_bind
    self.socket.bind(self.server_address)
PermissionError: [Errno 13] Permission denied

The kernel blocks it immediately. Checking the capabilities of the Python binary confirms that it has no special permissions:

user@container-security:~$ getcap /usr/bin/python3.12 # empty output, no capabilities assigned, so binding to a privileged port is forbidden

Ports 1024 and above work fine without any capabilities:

user@container-security:~$ python3 -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/) ...

This confirms the problem is specifically about privileged ports, not Python itself.
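
The same check can be reproduced without the HTTP server. The following Python sketch simply tries to bind a TCP socket and reports what the kernel says. The helper name try_bind is made up for this example, and no expected output is shown for ports 80 and 8080 because the result depends on who runs it, on which ports are already in use, and on the host's privileged-port configuration.

```python
import socket


def try_bind(port, host="0.0.0.0"):
    """Attempt to bind a TCP socket and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the kernel refused because the port is privileged.
            return "permission denied"
        except OSError as exc:
            # Anything else, for example the port already being in use.
            return f"other error: {exc.strerror}"
        return "bound"


for port in (80, 8080):
    print(port, try_bind(port))
```

Running it before and after assigning CAP_NET_BIND_SERVICE to the interpreter should show the result for port 80 change while the result for 8080 stays the same.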

Assign CAP_NET_BIND_SERVICE to the Python Binary:

user@container-security:~$ which python3
/usr/bin/python3
user@container-security:~$ readlink -f /usr/bin/python3
/usr/bin/python3.12
user@container-security:~$ sudo setcap cap_net_bind_service+ep /usr/bin/python3.12
user@container-security:~$ getcap /usr/bin/python3.12
/usr/bin/python3.12 cap_net_bind_service=ep

And confirm the file permissions are completely unchanged:

user@container-security:~$ ls -l /usr/bin/python3.12
-rwxr-xr-x 1 root root 8020928 Jan 22 20:57 /usr/bin/python3.12

No SUID bit. No ownership change. Nothing visible to an ls check.

Confirm It Works:

user@container-security:~$ python3 -m http.server 80
Serving HTTP on 0.0.0.0 port 80 (http://0.0.0.0:80/) ...

It binds successfully. From another terminal, verify the running process has exactly one capability and nothing more:

user@container-security:~$ pgrep python3
3257
user@container-security:~$ getpcaps 3257
3257: cap_net_bind_service=ep
user@container-security:~$ cat /proc/3257/status | grep -i cap
CapInh: 0000000000000000
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
user@container-security:~$
user@container-security:~$ capsh --decode=0000000000000400 # Decode the bitmask to confirm
0x0000000000000400=cap_net_bind_service

Bit 10 (0x400) is CAP_NET_BIND_SERVICE and nothing else. The process cannot read arbitrary files, cannot change file ownership, cannot kill other processes. It can only bind to privileged ports.

To remove the capability and confirm it no longer works:

user@container-security:~$ sudo setcap -r /usr/bin/python3.12
user@container-security:~$ python3 -m http.server 80
Traceback (most recent call last):
  ...
  File "/usr/lib/python3.12/socketserver.py", line 473, in server_bind
    self.socket.bind(self.server_address)
PermissionError: [Errno 13] Permission denied

Capabilities Security Implications

While capabilities were designed to implement the Principle of Least Privilege and secure your system, they can become a massive liability if misconfigured. Assigning the wrong capability to the wrong binary effectively hands an attacker a clean, built-in mechanism for privilege escalation.

As we saw earlier in this chapter, an empty capability list assigned with the Effective and Permitted flags (=ep) is actually shorthand for granting all available capabilities. If a system administrator mistakenly applies this, or even just a specific capability like CAP_SETUID, to a script interpreter or a common binary, the entire security model collapses.

Let's look at how easily an attacker can exploit this using Python. Assume an administrator accidentally ran sudo setcap =ep /usr/bin/python3.12 (or sudo setcap cap_setuid+ep /usr/bin/python3.12) while trying to fix a permissions issue. In that scenario, escalating privileges to root becomes trivial. All an attacker needs to do is write a one-liner to change their User ID (UID) to root and spawn a shell.

user@container-security:~$ sudo setcap cap_setuid+ep /usr/bin/python3.12
user@container-security:~$ getcap /usr/bin/python3.12
/usr/bin/python3.12 cap_setuid=ep
user@container-security:~$
user@container-security:~$ python3 -c 'import os; os.setuid(0); os.execl("/bin/bash", "bash")'
root@container-security:~# # notice the prompt changed to root, we are now running a root shell
root@container-security:~# exit
exit
user@container-security:~$
# remove the capability to prevent this from happening again
user@container-security:~$ sudo setcap -r /usr/bin/python3.12

Let's break down exactly what is happening here:

python3 -c: Tells the Python interpreter to execute the following inline code string.
import os: Imports the standard OS module required to make system calls.
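os.setuid(0): Leverages the CAP_SETUID capability to change the process's UID to 0, which is the root user. Because the caller holds CAP_SETUID, setuid() sets the real, effective, and saved UIDs together, not just the effective UID.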
os.execl("/bin/bash", "bash"): Replaces the current Python process with a brand-new bash shell. Because the UID was just changed to 0, this new shell runs entirely as root.

Just like that, a standard user account is transformed into a superuser, bypassing all traditional access controls.

The privilege escalation scenarios above become significantly more consequential in a containerized environment, and this is a topic we will explore in depth in a dedicated chapter.

This article is one piece of the Ultimate Container Security Series, an ongoing effort to organize and explain container security concepts in a practical way. If you want to explore related topics or see what’s coming next, the series introduction post provides the complete roadmap.

Chapter 3: Linux File Permissions

0xAlphaSecurity — Thu, 05 Feb 2026 14:45:21 +0000

Next, we will explore how Linux manages file permissions to ensure security and proper access control. File permissions are a fundamental aspect of Linux security, determining who can read, write, or execute files and directories, whether on a local system or within a containerized environment. In Linux everything is a file (program code, configuration, hardware devices, etc.), so understanding file permissions is crucial for managing system security effectively.

Understanding Linux file permissions

Linux is built as a multi-user environment, where the security of user data and system integrity is very important. The permission model is effective, but it can cause problems for users and administrators who are not familiar with how it works.

File permissions have 3 basic components:

User: The owner of the file.
Group: A set of users who share access permissions. Groups are used for better administration control. Each user will belong to at least one default group.
Others: Everyone else who is not the owner or in the group.

When we create a file and check its permissions using the ls -l command, we see something like this:

drwxrwxr-x 2 max max 4096 Jan 13 23:04 documents
-rw-rw-r-- 1 max administrator 0 Jan 13 22:49 myfile
-rwxrwxr-x 1 max max 0 Jan 13 23:05 script.sh

If we break down the output column by column:

The first column shows the file type and permissions. The first character indicates the file type (- for regular files, d for directories, etc.). The next nine characters are the file permissions. The permissions are divided into three sets of three characters. First set is for the owner, second set is for the group, and the third set is for others. There are 3 possible attributes that make up file access permissions.
- r - Read permission. Whether the file may be read. In the case of a directory, this would mean the ability to list the contents of the directory.
- w - Write permission. Whether the file may be written to or modified. For a directory, this defines whether you can make any changes to the contents of the directory. If write permission is not set on a directory, you will not be able to create, delete, or rename files inside it.
- x - Execute permission. Whether the file may be executed. In the case of a directory, this attribute decides whether you have permission to enter, run a search through that directory or execute some program from that directory.
The second column indicates the number of hard links to the file.
The third column shows the owner of the file.
The fourth column shows the group associated with the file.

Let's look at our 3 basic examples from above:

drwxrwxr-x 2 max max 4096 Jan 13 23:04 documents: This is a directory (d at the start). The owner max has read, write, and execute permissions (rwx). The group max also has read, write, and execute permissions (rwx). Others have read and execute permissions (r-x), but not write permission.
-rw-rw-r-- 1 max administrator 0 Jan 13 22:49 myfile: This is a regular file (- at the start). The owner max has read and write permissions (rw-). The group administrator also has read and write permissions (rw-). Others have only read permission (r--).
-rwxrwxr-x 1 max max 0 Jan 13 23:05 script.sh: This is a regular file (- at the start). The owner max has read, write, and execute permissions (rwx). The group max also has read, write, and execute permissions (rwx). Others have read and execute permissions (r-x), but not write permission.
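
The same breakdown can be done programmatically. As a small sketch, Python's standard stat module can render a numeric mode as the familiar ten-character string, which is then split into the file type and the three permission classes. The describe helper is a made-up name, and the modes below are constructed by hand to match the examples above rather than read from real files.

```python
import stat


def describe(mode):
    """Split an ls-style mode string into file type and permission classes."""
    text = stat.filemode(mode)  # e.g. '-rw-rw-r--'
    return {
        "type": text[0],
        "owner": text[1:4],
        "group": text[4:7],
        "others": text[7:10],
    }


# Modes built by hand to mirror the three examples above.
documents = describe(stat.S_IFDIR | 0o775)
myfile = describe(stat.S_IFREG | 0o664)
script = describe(stat.S_IFREG | 0o775)

print(documents)
print(myfile)
print(script)
```

For a file on disk, the mode would come from os.stat(path).st_mode instead of a hand-built constant.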

Changing File Permissions

Now that we understand how to read file permissions, let's look at how to change them. The main command used to modify permissions in Linux is:

chmod

To change file permissions, you must be:

the owner of the file, or
the root (superuser).

Permissions can be changed in two main ways:

Symbolic mode (letters and operators): more readable
Numeric mode (octal numbers): faster and widely used in scripts

Symbolic Mode (Using Letters)

Permissions can be defined for:

u: user (owner)
g: group
o: others
a: all (user + group + others)

Operators:

+: add permission
-: remove permission
=: set exactly (overwrite existing permissions)

Permission bits:

r: read
w: write
x: execute

Let's look at a basic symbolic mode example where we remove the execute permission from the user, add write permission to the group, and add read and write permission for others:

# BEFORE: -rwxr--r-- 1 max max 0 Jan 13 23:05 script.sh

chmod u-x,g+w,o+rw script.sh

# AFTER: -rw-rw-rw- 1 max max 0 Jan 13 23:05 script.sh

Here are a few more practical examples of using symbolic mode:

# 1) EXAMPLE: Removes write and execute from group.

# BEFORE: -rwxrwxrwx
chmod g-wx somefile
# AFTER: -rwxr--rwx

# 2) EXAMPLE: Give execute permission to everyone.
chmod a+x somefile # (With a typical umask such as 022, chmod +x somefile has the same effect)

# 3) EXAMPLE: Apply same change to group and others together.
chmod go-rx somefile

# 4) EXAMPLE: Set user and group permissions exactly to rwx, replacing whatever they had before. Others are left unchanged.
chmod ug=rwx somefile

# 5) EXAMPLE: Copy permissions from another class. Others will receive the same permissions that group currently has.
chmod o=g somefile

Using chmod +x is a common way to make a script executable. With a typical umask, however, this gives execute permission to everyone. If you want to give execute permission only to the owner, use chmod u+x instead.
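
To try symbolic changes without touching real files, a small helper can apply them to a scratch file and print the result. This is a sketch under a few assumptions: the name mode_after is made up for this example, the starting mode is given numerically (numeric mode is covered in the next section), and stat -c comes from GNU coreutils as shipped on Ubuntu.

```shell
# mode_after is a hypothetical helper: it creates a scratch file, sets a
# starting numeric mode, applies a symbolic change, and prints the result.
# Assumes GNU coreutils (stat -c), as found on Ubuntu.
mode_after() {
    scratch=$(mktemp)
    chmod "$1" "$scratch"      # starting permissions
    chmod "$2" "$scratch"      # symbolic change under test
    stat -c '%A' "$scratch"    # print the mode string, e.g. -rw-rw-rw-
    rm -f "$scratch"
}

mode_after 744 u-x,g+w,o+rw   # starts as -rwxr--r--
mode_after 777 g-wx           # starts as -rwxrwxrwx
mode_after 640 o=g            # starts as -rw-r-----
```

The first two calls reproduce the BEFORE/AFTER pairs shown earlier (-rw-rw-rw- and -rwxr--rwx). The third copies the group's r-- to others, giving -rw-r--r--.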

Numeric Mode (Using Octal Numbers)

Linux also allows permissions to be set using numbers. This is called octal mode and is very common in administration and scripting.

Each permission has a numeric value:

r = 4
w = 2
x = 1

Add the values together to get the permission number. For example:

Permissions	Value
rwx	7
rw-	6
r-x	5
r--	4
-wx	3
-w-	2
--x	1
---	0

The syntax for using numeric mode is:

chmod XYZ filename

Where X is the permission for the user, Y is the permission for the group, and Z is the permission for others.
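
The addition of 4, 2, and 1 described above can be sketched as a tiny converter from a nine-character permission string to its octal form. The function names triplet_to_octal and mode_to_octal are made up for this illustration.

```shell
# Hypothetical helpers, for illustration only.
# triplet_to_octal adds r=4, w=2, x=1 for one three-character set.
triplet_to_octal() {
    value=0
    case $1 in r??) value=$((value + 4)) ;; esac
    case $1 in ?w?) value=$((value + 2)) ;; esac
    case $1 in ??x) value=$((value + 1)) ;; esac
    printf '%s' "$value"
}

# mode_to_octal converts nine characters (owner, group, others) to XYZ.
mode_to_octal() {
    owner=$(printf '%s' "$1" | cut -c1-3)
    group=$(printf '%s' "$1" | cut -c4-6)
    others=$(printf '%s' "$1" | cut -c7-9)
    printf '%s%s%s\n' \
        "$(triplet_to_octal "$owner")" \
        "$(triplet_to_octal "$group")" \
        "$(triplet_to_octal "$others")"
}

mode_to_octal rwxr-xr-x   # prints 755
mode_to_octal rw-r--r--   # prints 644
```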

A few common examples of using numeric mode:

# 1) EXAMPLE: Owner can read/write, everyone else can read only.
chmod 644 somefile

# 2) EXAMPLE: Owner can read/write/execute; others can read and execute. This is very common for executable scripts and programs.
chmod 755 somefile

# 3) EXAMPLE: Private file: only owner can read/write, no permissions for group and others.
chmod 600 secret.txt

⚠️ Avoid setting permissions like 777 unless absolutely necessary. Giving everyone full access is convenient but unsafe. Even on a personal system, good permission habits prevent accidental damage and security issues.

Changing File Ownership

In addition to permissions, Linux also allows you to change the ownership of files and directories. This is done using the chown command. On most systems, only root (for example through sudo) can change a file's owner.

The syntax for chown is:

chown newuser somefile

You can also change the group ownership using the chgrp command:

chgrp newgroup somefile
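
To see the effect of these commands, it helps to print a file's owner and group before and after. The following is a minimal sketch that works without root because it only sets the group to one you already belong to. The helper name owner_ids is made up for this example, and stat -c assumes GNU coreutils.

```shell
# owner_ids is a hypothetical helper that prints "uid:gid" for a file.
# Assumes GNU stat (-c), as on Ubuntu.
owner_ids() {
    stat -c '%u:%g' "$1"
}

scratch=$(mktemp)
owner_ids "$scratch"           # a new file belongs to the user who created it
chgrp "$(id -g)" "$scratch"    # no root needed: you own the file and are in this group
owner_ids "$scratch"
rm -f "$scratch"

# Owner and group can also be set in one step (this normally needs root):
#   sudo chown newuser:newgroup somefile
```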

Understanding the setuid bit

In addition to the standard read (r), write (w), and execute (x) permissions, Linux supports three special permission bits that modify how files and directories behave:

setuid (Set User ID): The setuid bit applies to executable files.
setgid (Set Group ID): On an executable, the setgid bit makes the process run with the file's group. On a directory, new files inherit the directory's group.
sticky bit: The sticky bit applies mainly to directories. It controls who can delete files inside a writable directory.

These special bits are powerful and commonly used on multi-user systems. They enable controlled privilege elevation and safer collaboration, but when misused, they can create serious security risks.
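
In numeric mode, the special bits are set with an extra leading digit: 4 for setuid, 2 for setgid, and 1 for the sticky bit. The sketch below shows how each one appears in the mode string. The helper name mode_with_bits is made up for this example, and stat -c assumes GNU coreutils.

```shell
# mode_with_bits is a hypothetical helper: create a scratch file or directory,
# apply a four-digit numeric mode, and print the resulting mode string.
# Assumes GNU coreutils (stat -c), as on Ubuntu.
mode_with_bits() {
    if [ "$1" = dir ]; then
        scratch=$(mktemp -d)
    else
        scratch=$(mktemp)
    fi
    chmod "$2" "$scratch"
    stat -c '%A' "$scratch"
    rm -rf -- "$scratch"
}

mode_with_bits file 4755   # setuid: s replaces the owner's x
mode_with_bits file 2755   # setgid: s replaces the group's x
mode_with_bits dir  1777   # sticky: t replaces the others' x
```

On a typical Linux filesystem this should print -rwsr-xr-x, -rwxr-sr-x, and drwxrwxrwt. The last one is the same mode commonly used for /tmp.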

The focus of this section will be on the setuid bit, as it is the most relevant to understanding privilege escalation risks in Linux. When you run an executable file, the process that gets created inherits your user ID (the current user of the shell).

Some programs need temporary elevated privileges to perform specific tasks. This is where the setuid bit comes into play. When the setuid bit is set on an executable file, it causes a program to run with the effective user ID of the file’s owner. This allows a regular user to run a specific program with elevated privileges, if the program owner is root.

We advise running this example in a disposable Ubuntu 24.04 Docker container. Start the container with: docker run --rm -it ubuntu:24.04 bash -lc "set -e; apt update; apt install -y sudo nano build-essential; useradd -m -s /bin/bash test; usermod -aG sudo test; echo 'test ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/test; chmod 440 /etc/sudoers.d/test; exec su - test"

Let's look at a common example of a SetUID program: the passwd command. The passwd command allows users to change their passwords, but updating the password file requires root privileges. To allow regular users to change their passwords, the passwd executable has the setuid bit set and is owned by root.

test@d4f8d29c759d:~$ ls -l `which passwd`
-rwsr-xr-x 1 root root 64152 May 30  2024 /usr/bin/passwd
test@d4f8d29c759d:~$
test@d4f8d29c759d:~$ cp /usr/bin/passwd ./mypasswd
test@d4f8d29c759d:~$
test@d4f8d29c759d:~$ ls -l mypasswd
-rwxr-xr-x 1 test test 64152 Feb  5 13:55 mypasswd
test@d4f8d29c759d:~$

When you copy the passwd command to your home directory, the setuid bit is not preserved, and the copy is owned by your user (test). Therefore, when you run ./mypasswd, it runs with your user's privileges rather than root's, and the command will not work as intended.

To demonstrate both “SetUID changes privileges” and why SetUID programs are risky, we will create a simple SetUID demo program. This tiny program will:

print real UID vs effective UID
try a root-only action (write into /root/...)
sleep so we can inspect the running process with ps

Follow these steps:

Create a root-only target file: This is the file our demo program will try to write into. Only root should have access to it.
- Run: sudo bash -lc 'echo "TOP SECRET" > /root/secret.txt && chmod 600 /root/secret.txt && ls -l /root/secret.txt'
- Expected output:
```
-rw------- 1 root root 11 Feb  5 14:02 /root/secret.txt
```

Create a SetUID demo program: Open the file in the nano text editor: nano suid_demo.c and paste the following code:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
    int seconds = 30;
    if (argc >= 2) {
        seconds = atoi(argv[1]);
        if (seconds <= 0) seconds = 30;
    }

    uid_t ruid = getuid();
    uid_t euid = geteuid();

    printf("Real UID: %d\n", ruid);
    printf("Effective UID: %d\n", euid);

    // Attempt a root-only action: append to a file inside /root
    const char *path = "/root/secret.txt";
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd == -1) {
        printf("open(%s) failed: %s\n", path, strerror(errno));
    } else {
        const char *msg = "Appended by suid_demo\n";
        if (write(fd, msg, strlen(msg)) == -1) {
            printf("write() failed: %s\n", strerror(errno));
        } else {
            printf("SUCCESS: appended to %s\n", path);
        }
        close(fd);
    }

    printf("Sleeping %d seconds so you can inspect me with ps...\n", seconds);
    fflush(stdout);
    sleep(seconds);

    return 0;
}

Compile it: gcc suid_demo.c -o suid_demo
Run it normally: ./suid_demo 60

You should see output similar to this:

test@d4f8d29c759d:~$ id
uid=1001(test) gid=1001(test) groups=1001(test),27(sudo)

test@d4f8d29c759d:~$ ./suid_demo 60
Real UID: 1001
Effective UID: 1001
open(/root/secret.txt) failed: Permission denied
Sleeping 60 seconds so you can inspect me with ps...

We can see that both Real UID and Effective UID are the same (1001, our user), and the attempt to open /root/secret.txt failed due to permission denied.

If we run the ps ajf command from another terminal while the program is running, we see the following output. The last line shows our suid_demo process running with UID 1001:

root@d4f8d29c759d:/# ps ajf
PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    0  1774  1774  1774 pts/1     1783 Ss       0   0:00 bash
1774  1783  1783  1774 pts/1     1783 R+       0   0:00  \_ ps ajf
    0     1     1     1 pts/0     1782 Ss       0   0:00 su - test
    1  1746  1746     1 pts/0     1782 S     1001   0:00 -bash
1746  1782  1782     1 pts/0     1782 S+    1001   0:00  \_ ./suid_demo 60

Turn it into a SetUID-root binary

Run these commands:

sudo chown root:root suid_demo
sudo chmod 4755 suid_demo # set the setuid bit
ls -l suid_demo

We should now see that the owner is root and the permissions show an s in place of the user execute bit:
```
test@d4f8d29c759d:~$ ls -l suid_demo
-rwsr-xr-x 1 root root 16488 Feb  5 14:06 suid_demo
```
Run it again as a normal user with ./suid_demo 60. The program runs again, but this time it successfully appends to /root/secret.txt. The effective UID is now 0 (root), while the real UID is still our user (1001):
```
test@d4f8d29c759d:~$ ./suid_demo 60
Real UID: 1001
Effective UID: 0
SUCCESS: appended to /root/secret.txt
Sleeping 60 seconds so you can inspect me with ps...
test@d4f8d29c759d:~$
```

Running ps ajf from another terminal confirms the effective UID is 0:

root@d4f8d29c759d:/# ps ajf
PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    0  1774  1774  1774 pts/1     1793 Ss       0   0:00 bash
1774  1793  1793  1774 pts/1     1793 R+       0   0:00  \_ ps ajf
    0     1     1     1 pts/0     1792 Ss       0   0:00 su - test
    1  1746  1746     1 pts/0     1792 S     1001   0:00 -bash
1746  1792  1792     1 pts/0     1792 S+       0   0:00  \_ ./suid_demo 60
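
The UID column in ps ajf shows only a single value. As a complementary check, and assuming a Linux /proc filesystem, the Uid line in /proc/<pid>/status lists the real, effective, saved, and filesystem UIDs in that order. The helper name proc_uids is made up for this sketch.

```shell
# proc_uids is a hypothetical helper: print the real and effective UID of a
# process by reading the Uid line of /proc/<pid>/status (Linux only).
proc_uids() {
    awk '/^Uid:/ { printf "real=%s effective=%s\n", $2, $3 }' "/proc/$1/status"
}

proc_uids $$   # the current shell: both values are the same

# For the demo, pass the PID that ps reports for ./suid_demo instead of $$.
```

For the SetUID-root run of suid_demo, this should report the same pair the program prints itself: real UID 1001 and effective UID 0.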

This example confirms that the setuid bit allowed our program to run with the privileges of the file owner (root), even though we executed it as a normal user. Run sudo cat /root/secret.txt to verify that the append worked:
```
test@d4f8d29c759d:~$ sudo cat /root/secret.txt
TOP SECRET
Appended by suid_demo
```

This simple demo shows why SetUID-root binaries are treated as high-risk and must be written extremely carefully. A SetUID program runs with the effective user ID of the file owner rather than the calling user. When the file owner is root, this results in privilege elevation. If such a program contains a bug, for example unsafe input handling, path confusion, or command execution issues, it can become a privilege-escalation vector.
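
Because every SetUID binary is a potential escalation path, it is common to audit which ones exist on a system. The following is a minimal sketch using find. The helper name find_setuid is made up for this example, while -perm -4000 is the standard find test for "setuid bit is set".

```shell
# find_setuid is a hypothetical helper: list regular files under a directory
# that have the setuid bit set. -xdev keeps find on a single filesystem.
find_setuid() {
    find "$1" -xdev -type f -perm -4000 2>/dev/null
}

find_setuid /usr/bin
```

In the demo container, the list should include /usr/bin/passwd, which we saw earlier with the s bit set. Any entry you do not recognize is worth investigating.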

The setuid bit comes from a time when Linux privilege management was simpler and more coarse-grained. The basic model was:

root → full privileges
non-root → very limited privileges

SetUID was introduced as a mechanism to let non-root users perform specific privileged operations through carefully controlled programs. Starting with Linux kernel 2.2, more advanced security mechanisms were introduced, most notably Linux capabilities.

Capabilities break the all-powerful root privilege into many smaller, specific privileges that can be granted independently. This follows the principle of least privilege: give a program only the exact permissions it needs, nothing more.

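Historically, the ping command required the setuid bit because it needed to open raw network sockets, which is a privileged operation. With capabilities, the same need can be met by granting only CAP_NET_RAW instead of full root privileges.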

SetUID is still used for some core system tools, but it should be considered a legacy elevation mechanism and applied only when absolutely necessary.

This article is one piece of the Ultimate Container Security Series, an ongoing effort to organize and explain container security concepts in a practical way. If you want to explore related topics or see what’s coming next, the series introduction post provides the complete roadmap.

Chapter 2: Linux System Calls

0xAlphaSecurity — Wed, 07 Jan 2026 21:54:02 +0000


To understand how containers work, and how to secure them, it helps to know a few Linux fundamentals. One of these fundamentals is Linux system calls. Later chapters will build on this to explain how containers provide isolation, resource management, and security boundaries.

What are Linux System Calls?

Linux splits execution into two main "worlds":

Userspace: Userspace is where user-facing applications run: web servers, Chrome, text editors, command-line tools, background services, etc. It's a restricted zone: applications cannot directly access hardware or manage critical system resources on their own. This restriction improves stability: if an application crashes, it usually doesn't crash the whole OS.
Kernel Space: Kernel space is where the Linux kernel, the core of the operating system, runs. The kernel controls everything: memory, processes, scheduling, hardware and drivers, filesystems, networking, security, and more. It also interacts directly with the CPU, RAM, disk, and other hardware with full privileges.

So where do system calls fit in?

Applications run in userspace with lower privileges. If an application wants to do something that requires kernel privileges, like:

opening a file
reading/writing data
creating a process
allocating memory
sending network traffic
getting the current time

…it must ask the kernel to do it.

That request is made through the system call interface, also called the syscall interface.

Definition (in plain terms): A system call is a programmatic way for a user-space application to request a service from the Linux kernel, safely and in a controlled way.

This distinction exists for security and stability:

User programs can't directly touch hardware or kernel memory because that would be dangerous.
System calls provide controlled entry points into the kernel.

Also, not everything needs the kernel. For example:

Tokenizing a string happens entirely in userspace.
But anything involving files, devices, networking, or process management requires syscalls.

Linux has 300+ system calls (the exact number varies by kernel version and CPU architecture). A few examples of common system calls:

What the program wants	System call
Read a file	`read()`
Write a file	`write()`
Open a file	`open()`
Start a new program	`execve()`
Create a process	`fork()`
Allocate memory	`mmap()`
Send network data	`send()`
Get current time	`clock_gettime()`

You can browse the full list via the man page: syscalls(2)

How do System Calls Work?

At a high level, a syscall looks like a normal function call from the programmer's perspective, but under the hood it performs a controlled transition into kernel mode.

Typical flow:

The user application calls a standard library function (for example read()).
That function triggers a system call using a system call number.
The CPU switches from user mode to kernel mode.
The Linux kernel executes the requested operation.
Control returns to the application with a result (or an error).

Example idea: calling read(fd, buffer, size) triggers the kernel's read implementation for that file descriptor and returns the number of bytes read (or -1 on error, with details stored in errno).

Small Example in C

As an application developer, you rarely need to invoke syscalls "raw." Usually you use higher-level abstractions:

In C/C++: glibc provides wrapper functions (like read(), write(), open(), etc.)
In Go: you may encounter the syscall package

These wrappers:

validate and arrange arguments,
perform the transition to kernel mode,
return the result in a familiar way.

Here's a minimal C example that uses write() to print to standard output (file descriptor 1):

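#include <unistd.h>

int main(void) {
    // sizeof(msg) counts the trailing '\0', so subtract 1 to skip it.
    const char msg[] = "Hello, World!\n";
    // File descriptor 1 is standard output; write() returns the bytes written or -1.
    if (write(1, msg, sizeof(msg) - 1) == -1) {
        return 1;
    }
    return 0;
}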

What's happening step-by-step?

write(1, msg, sizeof(msg) - 1) is called from userspace.
write() (from glibc) is a wrapper that prepares the syscall.
The process enters the kernel through the syscall interface.
The kernel validates:
- that file descriptor 1 is valid,
- that the process is allowed to write to it,
- that the buffer points to accessible memory.
The kernel writes the bytes to stdout (often your terminal).
The kernel returns the number of bytes written, and execution continues in userspace.

Even though the code looks simple, the important takeaway is this:
any time you interact with files, processes, networking, memory mapping, etc., you're going through system calls.
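
One way to observe this for any command is the strace tool, which records the system calls a process makes. The sketch below assumes strace is available (it may need to be installed first, for example with apt install -y strace) and that tracing is permitted in your environment. The helper name syscall_summary is made up for this example.

```shell
# syscall_summary is a hypothetical helper: run a command under strace and
# print a per-syscall count. The command's own output is discarded.
syscall_summary() {
    if ! command -v strace >/dev/null 2>&1; then
        echo "strace is not installed"
        return 0
    fi
    report=$(mktemp)
    # -c counts calls per syscall, -o writes the report to a file
    if strace -c -o "$report" "$@" >/dev/null 2>&1; then
        cat "$report"
    else
        echo "strace could not trace the command (ptrace may be restricted)"
    fi
    rm -f "$report"
}

syscall_summary ls /
```

Even a simple ls typically shows dozens of calls across many different syscalls. To see individual calls instead of a summary, strace -e trace=write limits the trace to write calls.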

Containers and System Calls

A key point that many people miss early on: Containers are just processes running on the host Linux kernel.

That means containers don't have a separate kernel. They share the host kernel, and system calls are the only way container processes interact with that kernel.

So everything a container does, reading files, opening sockets, creating processes, flows through syscalls.

The application code uses syscalls the same way whether it runs on the host or inside a container. But containers introduce security implications, because:

The container still depends on the host kernel.
If a process can access powerful syscalls, it may be able to do powerful things.

This is where least privilege matters: Not all applications need all system calls. By restricting which syscalls a containerized application can use, you reduce the attack surface.

Conclusion

System calls are the "front door" into the kernel. Since containers are just Linux processes sharing the host kernel, every action a container takes ultimately becomes a syscall. That makes syscalls a powerful security control point: if an attacker compromises a containerized app, the damage they can do depends heavily on which syscalls and privileges that process is allowed to use.

This is why container hardening often focuses on reducing kernel exposure, using least privilege and Linux controls like seccomp (restricting syscalls), capabilities (dropping unnecessary privileges), and namespaces/cgroups (isolation and resource limits). In later chapters, we'll build directly on this idea to show how containers create boundaries, and how to tighten them.



Chapter 1: Container Security Threat Model

0xAlphaSecurity — Sun, 04 Jan 2026 17:45:51 +0000


Before talking about tools, configurations, or best practices, it is important to understand what we are actually trying to protect and from whom. This is where a threat model comes in. A threat model provides a structured way to reason about security risks.

What Is a Threat Model?

To understand threat modeling, it helps to clearly distinguish a few related concepts.

A risk is a potential problem and the impact it would have if it occurred.
A threat is a possible path that could lead to that risk becoming real.
A mitigation is a countermeasure that reduces the likelihood of a threat succeeding or limits its impact.

A simple example can help illustrate this: Imagine you work from home and rely on a laptop that contains sensitive work data. The risk is that this data could be stolen. The threats are the different ways this might happen: someone breaking into your house, stealing the laptop from your car, or tricking you into installing malware. Mitigations could include locking your doors, encrypting the disk, or using strong authentication.

The key point is that a single risk can have many different threats, and each threat may require different mitigations.

Why Threat Models Differ

Risks vary significantly depending on the context.

A bank holding customer funds will focus heavily on preventing financial theft. An e-commerce platform may prioritize fraud and availability. A personal blog might be most concerned with account takeover or defacement.

Regulatory environments also affect risk. For example, leaking personal data may be primarily a reputational issue in some regions, while in others, such as the European Union, regulations like GDPR can result in substantial financial penalties.

Because risks differ, the importance of specific threats and the appropriate mitigations will also differ. This is why threat modeling is not about finding a single “correct” list of threats, but about systematically identifying and prioritizing the threats that matter in a given environment.

Threat modeling is the process of identifying and enumerating potential threats to a system by examining its components, interfaces, and modes of operation. Done well, it highlights where a system is most exposed and where security efforts will have the greatest effect.

The goal of this chapter is to establish a shared mental model that will be used throughout the rest of the series. We will look at different ways container threats are commonly structured and explain which approach this series will follow.

Common Ways to Structure a Container Threat Model

There is no single comprehensive threat model that fits all environments. However, several well-established approaches are commonly used in container security. Each emphasizes a different perspective.

Component-Based (Data-Centric) Threat Model

One common approach is to model threats around the core components of a containerized environment. This is the approach taken by NIST Special Publication 800-190: Application Container Security Guide, which identifies major risks associated with the following components:

Image risks
Registry risks
Orchestrator risks
Container risks
Host OS risks

NIST Special Publication 800-190 was published in 2017 and has not been updated since, so it does not cover some of the newer threats and technologies that have emerged.

This type of threat model is often called component-based, because each component represents a distinct surface an attacker might target.

By examining each component independently, this model helps architects and operators understand where controls must exist and how failures in one area can affect the rest of the system.

NIST SP 800-190 uses this component-based structure to remain vendor-neutral and applicable across different container platforms.

In this series, we use the same underlying risks identified by NIST, but reorganize them along the container lifecycle to make them easier to learn and apply in practice.

Attacker-Centric Threat Model

Another way to structure a threat model is to focus on who the attacker is, rather than where they attack.

In the book Container Security, Liz Rice describes a threat model based on the different actors that may interact with or compromise a containerized system, including:

External attackers attempting to access a deployment from outside.
Internal attackers who have gained some level of access.
Malicious insiders, such as developers or administrators with legitimate privileges.
Inadvertent insiders who accidentally introduce security issues.
Application processes that may misuse their programmatic access.

This attacker-centric approach is particularly useful for understanding intent, privilege levels, and realistic attack paths. It is often used in incident response, threat hunting, and security reviews.

MITRE ATT&CK for Containers

A more technique-driven approach is provided by MITRE ATT&CK for Containers.

This framework categorizes adversary behavior into tactics and techniques across the stages of an attack lifecycle, such as:

initial access,
execution,
persistence,
privilege escalation,
defense evasion,
credential access,
discovery,
lateral movement,
and impact.

Image source: MITRE ATT&CK for Containers

MITRE ATT&CK is especially useful for detection and response, as it helps security teams understand how attacks progress over time and which behaviors to monitor at runtime. While powerful, it is often too detailed to serve as an introductory threat model on its own.

Lifecycle-Based Threat Model

The final approach is to structure threats along the container lifecycle. This model focuses on when threats occur rather than where or who is involved. It aligns closely with how containerized systems are built and operated in practice.

In this series, we use the following lifecycle stages:

Build - PART 2: Secure Container Image Building
Distribution - PART 3: Registries & Supply Chain Security
Deployment - PART 4: Host & Container Platform Security
Runtime - PART 5: Container Runtime Security

This approach allows us to reason about threats in the same order containers move through the system, while still incorporating insights from component-based and attacker-centric models.

There is no single threat model that fits every environment, but many of the threats discussed in this series are common to most container deployments, regardless of scale or platform.

Identifying Attack Vectors

Once a threat model is defined, the next step is to identify attack vectors, the concrete entry points an attacker may use to exploit the system.

In containerized environments, attack vectors can appear at every stage of the container lifecycle and across multiple components. Common examples include:

Vulnerable application code running inside containers
Insecure container image build configurations
Compromised or untrusted container image supply chains
Insecure image storage and retrieval mechanisms
Weak host machine and kernel security
Exposed or over-privileged credentials and tokens
Flat or poorly segmented container networking
Container escape vulnerabilities

General Security Principles

Regardless of the specific threat model or deployment architecture, certain security principles consistently reduce risk in containerized environments.

These principles do not replace a threat model. Instead, they guide how mitigations are selected and applied, and they will be revisited throughout the rest of this series:

Regular audits and timely updates
Applying the principle of least privilege
Network segmentation and isolation
Runtime visibility and enforcement
Continuous image scanning
Defense in depth rather than single controls
Reducing the exposed attack surface
Limiting the blast radius of a compromise
Clear segregation of duties between roles and systems


Ultimate Container Security Series

0xAlphaSecurity — Sun, 04 Jan 2026 17:30:57 +0000

Welcome to the Ultimate Container Security Series.

Over the past few years, I’ve been working extensively with containers and their security aspects. I’ve read many great books, blogs, and tutorials, and I’ve also run containerized workloads in production environments. During this time, I felt the need for a well-organized, practical series that covers the most important container security topics in one place.

This series is my attempt to bring together the key concepts, real-world scenarios, and practical recipes needed to understand and apply container security effectively. The goal is to help readers learn the topics faster, with examples that can be easily applied in real production environments. Whenever possible, I’ll include working examples to make the concepts easier to understand.

Series Structure

The series will be divided into five main parts, and each part will consist of multiple chapters. Each chapter will be published as a separate blog post.

You can use this post as the main reference to see:

what has already been written,
when a chapter was last updated,
and which topics are coming next.

I recommend bookmarking this page, as it will also be used for future announcements and updates related to the series.

Main Outline

PART 1: Foundations

Chapter 1: Container Security Threat Model (updated: 4.1.2026)
Chapter 2: Linux System Calls (updated: 7.1.2026)
Chapter 3: Linux File Permissions (updated: 5.2.2026)
Chapter 4: Linux Capabilities (updated: 6.3.2026)
Chapter 5: Linux Control Groups (cgroups) (updated: 22.3.2026)
Chapter 6: Linux Namespaces (writing in progress)
Chapter 7: Understanding Container Isolation (writing in progress)
Chapter 8: Container Related Vulnerabilities and Attacks

PART 2: Secure Container Image Building

PART 3: Registries & Supply Chain Security

PART 4: Host & Container Platform Security

PART 5: Container Runtime Security

Goals of the Series

The goal of this series is to provide an up-to-date overview of the most important container security topics, supported by real examples and best-practice solutions.

Container technologies evolve very quickly, so this series is not static. Chapters may be:

updated,
expanded,
reorganized,
or extended with new topics over time.

The dates listed next to each topic in this post will serve as a reference point to indicate when a resource was last updated.

Release Plan

Writing a full course takes time. My goal is to publish most of the planned topics by July 2026.

I plan to update the series weekly, and in some cases even daily, depending on the topic and complexity.

The full course content will also be available on GitHub, including examples and supporting materials.

If there is a specific container security topic you are interested in, feel free to leave it in the comments. I’ll do my best to cover it as part of this series.
