<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vincent Du</title>
    <description>The latest articles on Forem by Vincent Du (@vincentdu2021).</description>
    <link>https://forem.com/vincentdu2021</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3698944%2F6b66cc36-5a2b-4017-b9bf-8955e35fc31b.jpeg</url>
      <title>Forem: Vincent Du</title>
      <link>https://forem.com/vincentdu2021</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vincentdu2021"/>
    <language>en</language>
    <item>
      <title>How to Run MLPerf Llama 2 70B Training on AMD MI325X Without SLURM</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sat, 17 Jan 2026 18:18:33 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/how-to-run-mlperf-llama-2-70b-training-on-amd-mi325x-without-slurm-1ho4</link>
      <guid>https://forem.com/vincentdu2021/how-to-run-mlperf-llama-2-70b-training-on-amd-mi325x-without-slurm-1ho4</guid>
      <description>&lt;p&gt;This guide covers running the MLPerf Training v5.1 Llama 2 70B LoRA fine-tuning benchmark on a multi-node AMD Instinct MI325X cluster without a SLURM scheduler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;AMD provides an official MLPerf Training Docker image (&lt;code&gt;rocm/amd-mlperf:llama2_70b_training_5.1&lt;/code&gt;) designed primarily for SLURM-managed clusters. However, many environments use simpler SSH-based orchestration. This post demonstrates how to run multi-node distributed training using PyTorch's rendezvous mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; 4× MI325X nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPUs:&lt;/strong&gt; 8× AMD Instinct MI325X per node (32 total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network:&lt;/strong&gt; High-speed interconnect for RCCL communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Shared NFS mount at &lt;code&gt;/mnt/shared&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Software Dependencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ROCm&lt;/td&gt;
&lt;td&gt;6.2+&lt;/td&gt;
&lt;td&gt;Host driver and runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;24.0+&lt;/td&gt;
&lt;td&gt;With GPU support configured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RCCL&lt;/td&gt;
&lt;td&gt;Included in container&lt;/td&gt;
&lt;td&gt;ROCm Collective Communications Library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch&lt;/td&gt;
&lt;td&gt;2.4+ (ROCm)&lt;/td&gt;
&lt;td&gt;Included in container&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Host Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ROCm Installation:&lt;/strong&gt; Follow &lt;a href="https://rocm.docs.amd.com/en/latest/deploy/linux/index.html" rel="noopener noreferrer"&gt;ROCm installation guide&lt;/a&gt; for your Linux distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker GPU Access:&lt;/strong&gt; Ensure Docker can access AMD GPUs:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/dri &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/kfd rocm/pytorch:latest rocm-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Node Networking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passwordless SSH between all nodes&lt;/li&gt;
&lt;li&gt;High-speed network interface (InfiniBand/RoCE recommended)&lt;/li&gt;
&lt;li&gt;Shared filesystem accessible from all nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pull the MLPerf Container&lt;/strong&gt; on all nodes:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker pull rocm/amd-mlperf:llama2_70b_training_5.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;p&gt;The benchmark requires roughly 270 GB of storage for the Llama 2 70B model and the GovReport dataset. A Hugging Face token from an account that has accepted the Llama 2 license is required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_token_here
./finetune_llama.sh prepare
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Two Approaches for Multi-Node Training
&lt;/h2&gt;

&lt;p&gt;AMD's container supports two launch methods:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SLURM-Based (AMD Default)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires SLURM scheduler&lt;/span&gt;
sbatch run_with_docker_slurm.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Manual Multi-Node with Rendezvous
&lt;/h3&gt;

&lt;p&gt;For non-SLURM environments, PyTorch's &lt;code&gt;torchrun&lt;/code&gt; supports a rendezvous backend that handles rank assignment automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;torchrun &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nnodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;c10d &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MASTER_IP:29500 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mlperf_run &lt;span class="se"&gt;\&lt;/span&gt;
  train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command runs identically on all nodes; the c10d backend coordinates rank assignment, so no per-node arguments are needed.&lt;/p&gt;
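&lt;p&gt;The rank bookkeeping that &lt;code&gt;torchrun&lt;/code&gt; handles can be sanity-checked with simple arithmetic. A minimal sketch of the rank math for this post's 4×8 layout (illustrative only; &lt;code&gt;torchrun&lt;/code&gt; computes these values internally during rendezvous):&lt;br&gt;
&lt;/p&gt;

```shell
# Rendezvous arithmetic for the 4-node MI325X run: every node launches
# the same torchrun command, and the c10d backend assigns each process
# a unique rank in [0, WORLD_SIZE).
NNODES=4
NPROC_PER_NODE=8          # one process per MI325X GPU

WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
echo "world size: $WORLD_SIZE"    # 32 ranks across the cluster

# Global rank of a local process on a given node (a sketch of what
# the launcher derives after rendezvous):
NODE_RANK=2
LOCAL_RANK=5
GLOBAL_RANK=$((NODE_RANK * NPROC_PER_NODE + LOCAL_RANK))
echo "global rank: $GLOBAL_RANK"  # 21
```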

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Our approach uses SSH to launch training on each node, passing the distributed configuration via environment variables:&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Launch Pattern
&lt;/h3&gt;

&lt;p&gt;Each node runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start container with data mounts&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--init&lt;/span&gt; &lt;span class="nt"&gt;--detach&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--net&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/dri &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/kfd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mlperf_llama2sft &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$DATADIR&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$DATADIR&lt;/span&gt;/model:/ckpt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$RESULTS&lt;/span&gt;:/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$CODE_DIR&lt;/span&gt;:/workspace/code &lt;span class="se"&gt;\&lt;/span&gt;
  rocm/amd-mlperf:llama2_70b_training_5.1 &lt;span class="nb"&gt;sleep &lt;/span&gt;infinity

&lt;span class="c"&gt;# Execute training with distributed config&lt;/span&gt;
docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MASTER_ADDR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$MASTER_IP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MASTER_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;29500 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SLURM_NNODES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NUM_NODES&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SLURM_NODEID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NODE_RANK&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;NCCL_SOCKET_IFNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NET_IF&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  mlperf_llama2sft &lt;span class="se"&gt;\&lt;/span&gt;
  bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'cd /workspace/code &amp;amp;&amp;amp; source config_MI325X_4x8x1.sh &amp;amp;&amp;amp; bash ./run_and_time_slurm.sh'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Orchestration Script
&lt;/h3&gt;

&lt;p&gt;The main script SSHs to each node in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;node_idx &lt;span class="k"&gt;in &lt;/span&gt;0 1 2 3&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;ssh node-&lt;span class="nv"&gt;$node_idx&lt;/span&gt; &lt;span class="s2"&gt;"launch_training.sh &lt;/span&gt;&lt;span class="nv"&gt;$node_idx&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$NUM_NODES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Configuration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;config_MI325X_4x8x1.sh&lt;/code&gt; script sets the critical parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DGXNNODES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DGXNGPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FP8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0004
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1  &lt;span class="c"&gt;# micro batch size&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MAX_STEPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
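&lt;p&gt;These values imply the effective global batch size. Assuming pure data parallelism (one micro-batch per GPU per step; the real config may add gradient accumulation or model-parallel settings not shown here), the global batch is nodes × GPUs × micro-batch:&lt;br&gt;
&lt;/p&gt;

```shell
# Derive the effective global batch size from the config values above.
# Assumes pure data parallelism; this is a sketch, not the container's
# exact batch computation.
DGXNNODES=4
DGXNGPU=8
MBS=1    # micro batch size per GPU

GLOBAL_BATCH=$((DGXNNODES * DGXNGPU * MBS))
echo "global batch size: $GLOBAL_BATCH"   # 32 for the 4-node run
```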



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Node (8 GPUs)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;2.79 samples/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to Converge&lt;/td&gt;
&lt;td&gt;20.57 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Loss&lt;/td&gt;
&lt;td&gt;0.921 (target: ≤0.925)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Four Nodes (32 GPUs)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;11.15 samples/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to Converge&lt;/td&gt;
&lt;td&gt;12.40 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Loss&lt;/td&gt;
&lt;td&gt;0.924 (target: ≤0.925)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scaling Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1-Node&lt;/th&gt;
&lt;th&gt;4-Node&lt;/th&gt;
&lt;th&gt;Scaling Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;2.79&lt;/td&gt;
&lt;td&gt;11.15&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.0×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Near-linear throughput scaling&lt;/strong&gt; validates that the network interconnect is not a bottleneck.&lt;/p&gt;
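&lt;p&gt;The 4.0× factor in the table follows directly from the measured throughputs; a quick check with &lt;code&gt;awk&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

```shell
# Compute the 1-node to 4-node scaling efficiency from the measured
# throughputs in the tables above (samples/sec).
SINGLE=2.79
QUAD=11.15
GPUS_RATIO=4

awk -v s="$SINGLE" -v q="$QUAD" -v r="$GPUS_RATIO" 'BEGIN {
  speedup = q / s
  printf "speedup: %.2fx, efficiency: %.1f%%\n", speedup, 100 * speedup / r
}'
# speedup: 4.00x, efficiency: 99.9%
```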

&lt;h3&gt;
  
  
  Comparison with Official Results
&lt;/h3&gt;

&lt;p&gt;Our single-node result (20.57 min) matches AMD's official MLPerf v5.1 submission (~21 min) for MI325X, confirming correct configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container Design:&lt;/strong&gt; AMD's container expects training scripts at &lt;code&gt;/workspace/code&lt;/code&gt;; mount custom configs there rather than extracting files from the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Interface:&lt;/strong&gt; Set &lt;code&gt;NCCL_SOCKET_IFNAME&lt;/code&gt; to your high-speed network interface for optimal RCCL performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLURM Variables:&lt;/strong&gt; The container's &lt;code&gt;run_and_time_slurm.sh&lt;/code&gt; reads &lt;code&gt;SLURM_NNODES&lt;/code&gt; and &lt;code&gt;SLURM_NODEID&lt;/code&gt;; these can be set manually in non-SLURM environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling:&lt;/strong&gt; Expect near-linear throughput scaling on properly configured clusters. Time-to-convergence scaling may differ due to batch size effects on convergence dynamics.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/mlperf-training-v5.1/README.html" rel="noopener noreferrer"&gt;AMD MLPerf Training v5.1 Technical Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/mlperf-training5.1-repro/README.html" rel="noopener noreferrer"&gt;Reproducing AMD MLPerf Training Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/system-setup/multi-node-setup.html" rel="noopener noreferrer"&gt;ROCm Multi-Node Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/ddp-training-pytorch/README.html" rel="noopener noreferrer"&gt;PyTorch Distributed Training on AMD GPUs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full Script
&lt;/h2&gt;

&lt;p&gt;The complete &lt;code&gt;finetune_llama.sh&lt;/code&gt; script supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single and multi-node runs&lt;/li&gt;
&lt;li&gt;Configurable NEXP for MLPerf-compliant submissions&lt;/li&gt;
&lt;li&gt;Automatic config selection based on node count
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./finetune_llama.sh run 4      &lt;span class="c"&gt;# 4-node, single run&lt;/span&gt;
./finetune_llama.sh run 4 10   &lt;span class="c"&gt;# 4-node, 10 runs (MLPerf submission)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interested in the full script?&lt;/strong&gt; Reach out via &lt;a href="https://www.linkedin.com/in/vincent-du-3a0b8928/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and I'll be happy to share.&lt;/p&gt;

</description>
      <category>amd</category>
      <category>rocm</category>
      <category>machinelearning</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 3: Controllers &amp; Resilience — Why Kubernetes Self-Heals</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:42:32 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How application controllers (NGINX Ingress, cert-manager) persist through evictions&lt;/li&gt;
&lt;li&gt;Why controllers are stateless and can restart anywhere&lt;/li&gt;
&lt;li&gt;The complete persistence chain from hardware to application&lt;/li&gt;
&lt;li&gt;What survives pod evictions vs. what doesn't&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Previously
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1&lt;/a&gt;, we debugged a missing ingress after GKE node upgrades. In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2&lt;/a&gt;, we explored how systemd supervises kubelet, and how kubelet bootstraps the control plane through static pods.&lt;/p&gt;

&lt;p&gt;Now we reach the final layer: &lt;strong&gt;your application controllers&lt;/strong&gt;—and the elegant insight that makes Kubernetes truly resilient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Application Controllers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Application Controllers Persist
&lt;/h3&gt;

&lt;p&gt;Controllers like NGINX Ingress, cert-manager, and Prometheus Operator are deployed as &lt;strong&gt;Deployments&lt;/strong&gt; or &lt;strong&gt;StatefulSets&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx-controller&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;controller&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/ingress-nginx/controller:v1.9.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this pod is evicted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;kubelet stops reporting the pod → control plane marks it terminated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ReplicaSet controller&lt;/strong&gt; notices: current replicas (0) &amp;lt; desired (1)&lt;/li&gt;
&lt;li&gt;ReplicaSet creates a new pod specification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; assigns the pod to a healthy node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelet&lt;/strong&gt; on that node starts the container&lt;/li&gt;
&lt;li&gt;NGINX controller reconnects to API server and resumes watching ingresses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The controller itself doesn't store state—it reads everything from the API server (backed by etcd).&lt;/p&gt;
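&lt;p&gt;The eviction sequence above is the ReplicaSet reconciliation loop at work. A toy simulation of that loop (purely illustrative; the real controller watches the API server rather than polling a local variable):&lt;br&gt;
&lt;/p&gt;

```shell
# Toy ReplicaSet reconciliation loop: converge the current replica
# count toward the desired count, whether pods were evicted (scale up)
# or the spec was reduced (scale down). Purely illustrative.
desired=1
current=0   # the controller pod was just evicted

reconcile() {
  while [ "$current" -ne "$desired" ]; do
    if [ "$current" -lt "$desired" ]; then
      current=$((current + 1))
      echo "created pod (replicas: $current/$desired)"
    else
      current=$((current - 1))
      echo "deleted pod (replicas: $current/$desired)"
    fi
  done
}

reconcile
echo "converged at $current replica(s)"
```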

&lt;h3&gt;
  
  
  Helm Release Persistence
&lt;/h3&gt;

&lt;p&gt;Helm stores release information in Kubernetes secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;helm &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sh.helm.release.v1.prometheus.v3&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm.sh/release.v1&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;H4sIAAAAAAAAA...&lt;/span&gt; &lt;span class="c1"&gt;# Base64 encoded release manifest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This secret contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart that was installed&lt;/li&gt;
&lt;li&gt;The values that were used&lt;/li&gt;
&lt;li&gt;The computed manifest of all resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because this is stored in etcd via the API server, Helm releases survive any pod eviction.&lt;/p&gt;
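&lt;p&gt;The &lt;code&gt;release&lt;/code&gt; payload is gzip-compressed JSON that Helm base64-encodes, and the Kubernetes API base64-encodes secret data once more. The same double-decode pipeline can be exercised locally with a fake payload (standard &lt;code&gt;base64&lt;/code&gt;/&lt;code&gt;gzip&lt;/code&gt; tools assumed; decoding a real secret needs a live cluster):&lt;br&gt;
&lt;/p&gt;

```shell
# Round-trip the encoding Helm uses for release secrets: gzip the
# release JSON, base64 it (Helm), base64 again (Kubernetes secret
# data). Decoding a real secret looks like:
#   kubectl get secret sh.helm.release.v1.prometheus.v3 -n monitoring \
#     -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
payload='{"name":"prometheus","version":3}'

encoded=$(printf '%s' "$payload" | gzip | base64 | base64)
decoded=$(printf '%s' "$encoded" | base64 -d | base64 -d | gunzip)

echo "$decoded"   # the original release JSON
```

&lt;p&gt;Simpler still, &lt;code&gt;helm get manifest prometheus -n monitoring&lt;/code&gt; performs this decoding for you.&lt;/p&gt;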




&lt;h2&gt;
  
  
  The Complete Persistence Chain
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│                     Linux Host (Physical/VM)                        │
├─────────────────────────────────────────────────────────────────────┤
│  systemd (PID 1)                                                    │
│  ├── Supervises all system services                                 │
│  ├── Restarts failed services automatically                         │
│  └── Config: /etc/systemd/system/                                   │
│      │                                                              │
│      └── kubelet.service                                            │
│          ├── Started and supervised by systemd                      │
│          ├── Watches /etc/kubernetes/manifests/ for static pods     │
│          ├── Watches API server for scheduled pods                  │
│          └── Ensures containers match pod specs                     │
│              │                                                      │
│              ├── Static Pods (/etc/kubernetes/manifests/)           │
│              │   ├── etcd ──────────────────┐                       │
│              │   ├── kube-apiserver ◄───────┤ Persistent            │
│              │   ├── kube-controller-manager│ State Store           │
│              │   └── kube-scheduler         │                       │
│              │                              │                       │
│              └── Regular Pods ◄─────────────┘                       │
│                  │                 (scheduled via API server)       │
│                  │                                                  │
│                  ├── kube-system namespace                          │
│                  │   ├── CoreDNS                                    │
│                  │   ├── kube-proxy                                 │
│                  │   └── CNI plugins                                │
│                  │                                                  │
│                  ├── ingress-nginx namespace                        │
│                  │   └── NGINX Ingress Controller                   │
│                  │       └── Watches Ingress resources              │
│                  │                                                  │
│                  └── Application namespaces                         │
│                      ├── cert-manager                               │
│                      ├── Prometheus Operator                        │
│                      └── Your applications                          │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Critical Insight: Controllers Are Stateless
&lt;/h2&gt;

&lt;p&gt;This is the elegant core of the design: &lt;strong&gt;controllers don't store state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every controller:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; desired state from the API server (backed by etcd)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watches&lt;/strong&gt; for changes via the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes changes&lt;/strong&gt; through the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can be restarted anywhere&lt;/strong&gt;, anytime, without losing information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The API server + etcd is the &lt;strong&gt;single source of truth&lt;/strong&gt;, not the controllers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBMUlxuICAgIHN1YmdyYXBoIFNvdXJjZVtcIlNvdXJjZSBvZiBUcnV0aFwiXVxuICAgICAgICBldGNkWyhldGNkKV1cbiAgICAgICAgYXBpW0FQSSBTZXJ2ZXJdXG4gICAgZW5kXG5cbiAgICBzdWJncmFwaCBTdGF0ZWxlc3NbXCJTdGF0ZWxlc3MgQ29udHJvbGxlcnNcIl1cbiAgICAgICAgZGVwbG95W0RlcGxveW1lbnQgQ29udHJvbGxlcl1cbiAgICAgICAgcnNbUmVwbGljYVNldCBDb250cm9sbGVyXVxuICAgICAgICBuZ2lueFtOR0lOWCBJbmdyZXNzXVxuICAgICAgICBjZXJ0W2NlcnQtbWFuYWdlcl1cbiAgICBlbmRcblxuICAgIGV0Y2QgLS0tIGFwaVxuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgZGVwbG95XG4gICAgYXBpIC0tLXxyZWFkL3dyaXRlfCByc1xuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgbmdpbnhcbiAgICBhcGkgLS0tfHJlYWQvd3JpdGV8IGNlcnRcblxuICAgIHN0eWxlIGV0Y2QgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiAgICBzdHlsZSBhcGkgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBMUlxuICAgIHN1YmdyYXBoIFNvdXJjZVtcIlNvdXJjZSBvZiBUcnV0aFwiXVxuICAgICAgICBldGNkWyhldGNkKV1cbiAgICAgICAgYXBpW0FQSSBTZXJ2ZXJdXG4gICAgZW5kXG5cbiAgICBzdWJncmFwaCBTdGF0ZWxlc3NbXCJTdGF0ZWxlc3MgQ29udHJvbGxlcnNcIl1cbiAgICAgICAgZGVwbG95W0RlcGxveW1lbnQgQ29udHJvbGxlcl1cbiAgICAgICAgcnNbUmVwbGljYVNldCBDb250cm9sbGVyXVxuICAgICAgICBuZ2lueFtOR0lOWCBJbmdyZXNzXVxuICAgICAgICBjZXJ0W2NlcnQtbWFuYWdlcl1cbiAgICBlbmRcblxuICAgIGV0Y2QgLS0tIGFwaVxuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgZGVwbG95XG4gICAgYXBpIC0tLXxyZWFkL3dyaXRlfCByc1xuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgbmdpbnhcbiAgICBhcGkgLS0tfHJlYWQvd3JpdGV8IGNlcnRcblxuICAgIHN0eWxlIGV0Y2QgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiAgICBzdHlsZSBhcGkgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" alt="Stateless Controllers Architecture" width="688" 
height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete any controller pod → it restarts and catches up&lt;/li&gt;
&lt;li&gt;Move controllers between nodes → they just reconnect&lt;/li&gt;
&lt;li&gt;Scale controllers to multiple replicas → they coordinate via the API server&lt;/li&gt;
&lt;li&gt;Upgrade controllers → new version reads the same state&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Survives vs. What Doesn't
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Survives Any Pod Eviction
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Why It Survives&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes objects in etcd&lt;/td&gt;
&lt;td&gt;Stored independently of pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm releases&lt;/td&gt;
&lt;td&gt;Stored as secrets in etcd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operator-managed CRDs&lt;/td&gt;
&lt;td&gt;Reconciled by operator continuously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PersistentVolumes&lt;/td&gt;
&lt;td&gt;Storage exists outside the cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ConfigMaps/Secrets&lt;/td&gt;
&lt;td&gt;Stored in etcd&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Doesn't Survive Without Help
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Why It Doesn't Survive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod-local EmptyDir volumes&lt;/td&gt;
&lt;td&gt;Deleted with the pod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manually applied resources with missing dependencies&lt;/td&gt;
&lt;td&gt;Validation webhooks reject on recreation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In-memory caches&lt;/td&gt;
&lt;td&gt;Process restarts lose memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node-local state&lt;/td&gt;
&lt;td&gt;Tied to a single node and lost with it unless explicitly persisted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Elegance of the Design
&lt;/h2&gt;

&lt;p&gt;The Kubernetes architecture embodies several design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declarative over imperative&lt;/strong&gt; — Describe desired state, not steps to get there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconciliation over transactions&lt;/strong&gt; — Continuously converge to desired state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless controllers&lt;/strong&gt; — State lives in etcd, not in components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical supervision&lt;/strong&gt; — Every layer watches the layer above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure is normal&lt;/strong&gt; — Design for recovery, not prevention&lt;/li&gt;
&lt;/ol&gt;
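&lt;p&gt;The "reconciliation over transactions" principle can be sketched in a few lines of Go. This is a minimal illustration of level-triggered convergence, not code from any Kubernetes component; the &lt;code&gt;converge&lt;/code&gt; function and its replica counts are invented for the example.&lt;/p&gt;

```go
package main

import "fmt"

// converge moves the observed replica count one step toward the
// desired count, the way a reconciliation loop does: no transaction,
// no rollback, just repeated correction until the states match.
func converge(current, desired int) int {
	if current < desired {
		return current + 1 // scale up by one replica
	}
	if current > desired {
		return current - 1 // scale down by one replica
	}
	return current // already converged
}

func main() {
	current, desired := 1, 3
	for current != desired {
		current = converge(current, desired)
		fmt.Println("replicas:", current)
	}
}
```

&lt;p&gt;Because every iteration re-reads both states and makes only a small correction, a crash mid-loop costs nothing: the next run starts from the same persisted desired state and converges again.&lt;/p&gt;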

&lt;p&gt;This is why Kubernetes clusters can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lose nodes unexpectedly&lt;/li&gt;
&lt;li&gt;Have pods evicted for resource pressure&lt;/li&gt;
&lt;li&gt;Experience network partitions&lt;/li&gt;
&lt;li&gt;Undergo rolling upgrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and still maintain application availability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The journey from debugging a missing ingress to understanding the complete supervision hierarchy revealed the sophisticated machinery that makes Kubernetes resilient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd → kubelet → static pods → control plane → controllers → your apps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer supervises the next, with etcd as the persistent memory that survives any component failure.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Kubernetes doesn't prevent failures—it recovers from them automatically&lt;/strong&gt; through layers of supervision, persistent state in etcd, and continuous reconciliation loops.&lt;/p&gt;

&lt;p&gt;This is the true power of Kubernetes: not that things don't fail, but that when they do, the system knows how to restore itself to the desired state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1: When Our Ingress Vanished&lt;/a&gt;&lt;/strong&gt; — The incident that started it all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2: The Foundation&lt;/a&gt;&lt;/strong&gt; — systemd → kubelet → control plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b"&gt;Part 3: Controllers &amp;amp; Resilience&lt;/a&gt;&lt;/strong&gt; — Why Kubernetes self-heals&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/overview/components/" rel="noopener noreferrer"&gt;Kubernetes Components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/" rel="noopener noreferrer"&gt;Static Pods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/" rel="noopener noreferrer"&gt;Controller Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html" rel="noopener noreferrer"&gt;systemd Service Files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Found this series useful? Follow for more Kubernetes internals content!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>architecture</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 2: The Foundation — From systemd to Control Plane</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:37:43 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How Linux systemd supervises the kubelet process&lt;/li&gt;
&lt;li&gt;The role of static pods in bootstrapping the control plane&lt;/li&gt;
&lt;li&gt;How the controller manager implements reconciliation loops&lt;/li&gt;
&lt;li&gt;The complete 4-layer supervision model&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Previously
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1&lt;/a&gt;, we investigated why a Grafana ingress disappeared after GKE node upgrades. The fix was straightforward: use Helm-managed resources instead of manual &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But that raised a deeper question: &lt;strong&gt;How do controllers themselves survive pod evictions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is a hierarchical supervision model—each layer watches the layer above it, ensuring continuous operation despite failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Layers of Kubernetes Supervision
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBUQlxuICAgIHN1YmdyYXBoIExheWVyMVtcIkxheWVyIDE6IExpbnV4IEZvdW5kYXRpb25cIl1cbiAgICAgICAgc3lzdGVtZFtcInN5c3RlbWQgKFBJRCAxKVwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIyW1wiTGF5ZXIgMjogTm9kZSBBZ2VudFwiXVxuICAgICAgICBrdWJlbGV0W1wia3ViZWxldFwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIzW1wiTGF5ZXIgMzogQ29udHJvbCBQbGFuZVwiXVxuICAgICAgICBhcGlbXCJrdWJlLWFwaXNlcnZlclwiXVxuICAgICAgICBjb250cm9sbGVyW1wia3ViZS1jb250cm9sbGVyLW1hbmFnZXJcIl1cbiAgICAgICAgc2NoZWR1bGVyW1wia3ViZS1zY2hlZHVsZXJcIl1cbiAgICAgICAgZXRjZFtcImV0Y2RcIl1cbiAgICBlbmRcblxuICAgIHN1YmdyYXBoIExheWVyNFtcIkxheWVyIDQ6IEFwcCBDb250cm9sbGVyc1wiXVxuICAgICAgICBuZ2lueFtcIk5HSU5YIEluZ3Jlc3NcIl1cbiAgICAgICAgY2VydG1ncltcImNlcnQtbWFuYWdlclwiXVxuICAgICAgICBwcm9tZXRoZXVzW1wiUHJvbWV0aGV1cyBPcGVyYXRvclwiXVxuICAgICAgICBoZWxtW1wiSGVsbSBSZWxlYXNlc1wiXVxuICAgIGVuZFxuXG4gICAgc3lzdGVtZCAtLT58c3VwZXJ2aXNlc3wga3ViZWxldFxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGFwaVxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGNvbnRyb2xsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBzY2hlZHVsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBldGNkXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgbmdpbnhcbiAgICBjb250cm9sbGVyIC0tPnxyZWNvbmNpbGVzfCBjZXJ0bWdyXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgcHJvbWV0aGV1c1xuXG4gICAgc3R5bGUgTGF5ZXIxIGZpbGw6IzE4NjRhYixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIyIGZpbGw6IzE5NzFjMixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIzIGZpbGw6IzIyOGJlNixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXI0IGZpbGw6IzMzOWFmMCxjb2xvcjojZmZmXG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBUQlxuICAgIHN1YmdyYXBoIExheWVyMVtcIkxheWVyIDE6IExpbnV4IEZvdW5kYXRpb25cIl1cbiAgICAgICAgc3lzdGVtZFtcInN5c3RlbWQgKFBJRCAxKVwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIyW1wiTGF5ZXIgMjogTm9kZSBBZ2VudFwiXVxuICAgICAgICBrdWJlbGV0W1wia3ViZWxldFwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIzW1wiTGF5ZXIgMzogQ29udHJvbCBQbGFuZVwiXVxuICAgICAgICBhcGlbXCJrdWJlLWFwaXNlcnZlclwiXVxuICAgICAgICBjb250cm9sbGVyW1wia3ViZS1jb250cm9sbGVyLW1hbmFnZXJcIl1cbiAgICAgICAgc2NoZWR1bGVyW1wia3ViZS1zY2hlZHVsZXJcIl1cbiAgICAgICAgZXRjZFtcImV0Y2RcIl1cbiAgICBlbmRcblxuICAgIHN1YmdyYXBoIExheWVyNFtcIkxheWVyIDQ6IEFwcCBDb250cm9sbGVyc1wiXVxuICAgICAgICBuZ2lueFtcIk5HSU5YIEluZ3Jlc3NcIl1cbiAgICAgICAgY2VydG1ncltcImNlcnQtbWFuYWdlclwiXVxuICAgICAgICBwcm9tZXRoZXVzW1wiUHJvbWV0aGV1cyBPcGVyYXRvclwiXVxuICAgICAgICBoZWxtW1wiSGVsbSBSZWxlYXNlc1wiXVxuICAgIGVuZFxuXG4gICAgc3lzdGVtZCAtLT58c3VwZXJ2aXNlc3wga3ViZWxldFxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGFwaVxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGNvbnRyb2xsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBzY2hlZHVsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBldGNkXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgbmdpbnhcbiAgICBjb250cm9sbGVyIC0tPnxyZWNvbmNpbGVzfCBjZXJ0bWdyXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgcHJvbWV0aGV1c1xuXG4gICAgc3R5bGUgTGF5ZXIxIGZpbGw6IzE4NjRhYixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIyIGZpbGw6IzE5NzFjMixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIzIGZpbGw6IzIyOGJlNixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXI0IGZpbGw6IzMzOWFmMCxjb2xvcjojZmZmXG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" alt="The Four Layers of Kubernetes Supervision" width="979" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we'll explore &lt;strong&gt;Layers 1-3&lt;/strong&gt;. Part 3 covers Layer 4 and the complete resilience model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: The Linux Foundation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  systemd — The Root Supervisor
&lt;/h3&gt;

&lt;p&gt;At the very bottom of the stack is &lt;strong&gt;systemd&lt;/strong&gt;, the init system running as PID 1 on most modern Linux distributions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On a Kubernetes node&lt;/span&gt;
ps aux | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;span class="c"&gt;# USER  PID  COMMAND&lt;/span&gt;
&lt;span class="c"&gt;# root    1  /sbin/init (systemd)&lt;/span&gt;
&lt;span class="c"&gt;# root  ...  /usr/bin/kubelet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;systemd's job is simple but critical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start services in the correct order at boot&lt;/li&gt;
&lt;li&gt;Monitor services and restart them if they crash&lt;/li&gt;
&lt;li&gt;Provide dependency management between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The kubelet runs as a systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/kubelet.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;kubelet: The Kubernetes Node Agent&lt;/span&gt;
&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://kubernetes.io/docs/&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/kubelet &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--config=/var/lib/kubelet/config.yaml &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--kubeconfig=/etc/kubernetes/kubelet.conf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--container-runtime-endpoint=unix:///run/containerd/containerd.sock&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line: &lt;strong&gt;&lt;code&gt;Restart=always&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If kubelet crashes, systemd restarts it after a 10-second delay (&lt;code&gt;RestartSec=10&lt;/code&gt;). This is the foundation of Kubernetes resilience—the node agent is supervised by the operating system itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View kubelet status&lt;/span&gt;
systemctl status kubelet

&lt;span class="c"&gt;# Watch kubelet restart after killing it (don't do this in production!)&lt;/span&gt;
&lt;span class="nb"&gt;sudo kill&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep kubelet&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# systemd will restart it automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layer 2: The Node Agent
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kubelet — The Pod Supervisor
&lt;/h3&gt;

&lt;p&gt;kubelet is the Kubernetes agent running on every node. It has two critical responsibilities:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Running Static Pods
&lt;/h4&gt;

&lt;p&gt;kubelet watches a directory (typically &lt;code&gt;/etc/kubernetes/manifests/&lt;/code&gt;) for pod manifests and runs them directly—no API server required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; /etc/kubernetes/manifests/
&lt;span class="c"&gt;# etcd.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-apiserver.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-controller-manager.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-scheduler.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how the control plane bootstraps itself. Pods can't be scheduled through the API server before the API server itself exists, so kubelet runs these components directly from manifest files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/kubernetes/manifests/kube-apiserver.yaml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hostNetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/kube-apiserver:v1.28.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--etcd-servers=https://127.0.0.1:2379&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--service-cluster-ip-range=10.96.0.0/12&lt;/span&gt;
    &lt;span class="c1"&gt;# ... more flags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Running API-Scheduled Pods
&lt;/h4&gt;

&lt;p&gt;Once the control plane is running, kubelet also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches the API server for pods scheduled to its node&lt;/li&gt;
&lt;li&gt;Starts containers via the container runtime (containerd)&lt;/li&gt;
&lt;li&gt;Reports pod status back to the API server&lt;/li&gt;
&lt;li&gt;Restarts failed containers based on &lt;code&gt;restartPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
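&lt;p&gt;The last bullet is worth making concrete. Below is a hedged Go sketch of the documented &lt;code&gt;restartPolicy&lt;/code&gt; semantics; the &lt;code&gt;shouldRestart&lt;/code&gt; helper is hypothetical, and the real kubelet additionally applies exponential backoff between restarts.&lt;/p&gt;

```go
package main

import "fmt"

// shouldRestart mirrors the documented restartPolicy semantics:
// Always restarts any exited container, OnFailure restarts only
// containers that exited non-zero, and Never restarts nothing.
// Illustrative only; the real kubelet also tracks restart backoff.
func shouldRestart(policy string, exitCode int) bool {
	switch policy {
	case "Always":
		return true
	case "OnFailure":
		return exitCode != 0
	default: // "Never"
		return false
	}
}

func main() {
	fmt.Println(shouldRestart("Always", 0))    // true
	fmt.Println(shouldRestart("OnFailure", 0)) // false
	fmt.Println(shouldRestart("OnFailure", 1)) // true
}
```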

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogInNlcXVlbmNlRGlhZ3JhbVxuICAgIHBhcnRpY2lwYW50IFMgYXMgU2NoZWR1bGVyXG4gICAgcGFydGljaXBhbnQgQVBJIGFzIEFQSSBTZXJ2ZXJcbiAgICBwYXJ0aWNpcGFudCBLIGFzIGt1YmVsZXRcbiAgICBwYXJ0aWNpcGFudCBDIGFzIGNvbnRhaW5lcmRcblxuICAgIFMtPj5BUEk6IEJpbmQgcG9kIHRvIG5vZGUtMVxuICAgIEFQSS0%2BPks6IFdhdGNoIGV2ZW50OiBuZXcgcG9kXG4gICAgSy0%2BPkM6IENyZWF0ZSBjb250YWluZXJcbiAgICBDLS0%2BPks6IENvbnRhaW5lciBydW5uaW5nXG4gICAgSy0%2BPkFQSTogVXBkYXRlIHBvZCBzdGF0dXM6IFJ1bm5pbmdcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogInNlcXVlbmNlRGlhZ3JhbVxuICAgIHBhcnRpY2lwYW50IFMgYXMgU2NoZWR1bGVyXG4gICAgcGFydGljaXBhbnQgQVBJIGFzIEFQSSBTZXJ2ZXJcbiAgICBwYXJ0aWNpcGFudCBLIGFzIGt1YmVsZXRcbiAgICBwYXJ0aWNpcGFudCBDIGFzIGNvbnRhaW5lcmRcblxuICAgIFMtPj5BUEk6IEJpbmQgcG9kIHRvIG5vZGUtMVxuICAgIEFQSS0%2BPks6IFdhdGNoIGV2ZW50OiBuZXcgcG9kXG4gICAgSy0%2BPkM6IENyZWF0ZSBjb250YWluZXJcbiAgICBDLS0%2BPks6IENvbnRhaW5lciBydW5uaW5nXG4gICAgSy0%2BPkFQSTogVXBkYXRlIHBvZCBzdGF0dXM6IFJ1bm5pbmdcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" alt="Pod Scheduling Sequence" width="898" height="399"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: The Control Plane
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Static Pods — The Bootstrap Layer
&lt;/h3&gt;

&lt;p&gt;The control plane runs as static pods managed directly by kubelet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;etcd&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed key-value store; holds all cluster state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-apiserver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API frontend; all components communicate through it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-controller-manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs built-in controllers (Deployment, ReplicaSet, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Assigns pods to nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These components form a supervision loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubelet ensures static pods are running&lt;/li&gt;
&lt;li&gt;Control plane components use etcd for persistence&lt;/li&gt;
&lt;li&gt;If a component crashes, kubelet restarts it&lt;/li&gt;
&lt;li&gt;State is never lost because it's in etcd&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  kube-controller-manager — The Reconciliation Engine
&lt;/h3&gt;

&lt;p&gt;The controller manager runs dozens of controllers, each implementing the &lt;strong&gt;reconciliation pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Simplified reconciliation loop&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DeploymentController&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// 1. Get desired state from API server (backed by etcd)&lt;/span&gt;
        &lt;span class="n"&gt;deployment&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetDeployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Replicas&lt;/span&gt;

        &lt;span class="c"&gt;// 2. Get current state&lt;/span&gt;
        &lt;span class="n"&gt;replicaSets&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListReplicaSets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;countReadyReplicas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replicaSets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// 3. Reconcile&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaleUp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaleDown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// 4. Repeat&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciliationInterval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key controllers and what they manage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Controller&lt;/th&gt;
&lt;th&gt;Watches&lt;/th&gt;
&lt;th&gt;Ensures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;Correct ReplicaSets exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReplicaSet&lt;/td&gt;
&lt;td&gt;ReplicaSets&lt;/td&gt;
&lt;td&gt;Correct number of pods exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;StatefulSets&lt;/td&gt;
&lt;td&gt;Pods with stable identities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DaemonSet&lt;/td&gt;
&lt;td&gt;DaemonSets&lt;/td&gt;
&lt;td&gt;One pod per matching node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job&lt;/td&gt;
&lt;td&gt;Jobs&lt;/td&gt;
&lt;td&gt;Pods run to completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;Services + Pods&lt;/td&gt;
&lt;td&gt;Endpoints are updated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Foundation is Set
&lt;/h2&gt;

&lt;p&gt;We've now covered the first three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;systemd&lt;/strong&gt; supervises kubelet (Restart=always)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelet&lt;/strong&gt; runs static pods from &lt;code&gt;/etc/kubernetes/manifests/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt; components persist state in etcd and reconcile continuously&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But what about &lt;strong&gt;your&lt;/strong&gt; controllers—NGINX Ingress, cert-manager, Prometheus Operator? How do they survive pod evictions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Part 3&lt;/strong&gt;, we'll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How application controllers persist through evictions&lt;/li&gt;
&lt;li&gt;The complete persistence chain from hardware to application&lt;/li&gt;
&lt;li&gt;Why controllers are stateless (and why that matters)&lt;/li&gt;
&lt;li&gt;What survives pod evictions vs. what doesn't&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Next in this series:&lt;/strong&gt; &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b"&gt;Part 3: Controllers &amp;amp; Resilience — Why Kubernetes Self-Heals&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>linux</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 1: When Our Ingress Vanished After a Node Upgrade</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:32:00 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why manually-applied Kubernetes resources can disappear after pod evictions&lt;/li&gt;
&lt;li&gt;How NGINX Ingress admission webhooks validate resources&lt;/li&gt;
&lt;li&gt;The difference between controller-managed and manually-applied resources&lt;/li&gt;
&lt;li&gt;Why Helm-managed resources survive node disruptions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem That Started This Journey
&lt;/h2&gt;

&lt;p&gt;It was a regular Monday morning until the alerts fired: &lt;strong&gt;Grafana was unreachable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When GKE performed automatic node upgrades, our monitoring dashboard disappeared. The investigation that followed revealed a fascinating chain of dependencies—and ultimately led to understanding the elegant hierarchical supervision model that keeps Kubernetes running.&lt;/p&gt;

&lt;p&gt;But first, let's solve the immediate problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Incident: Why Ingress Disappeared
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;The sequence of events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GKE automatically upgraded nodes&lt;/strong&gt; (routine security patches)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nodes were drained&lt;/strong&gt;, causing pod evictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NGINX Ingress Controller pod was evicted&lt;/strong&gt; and restarted on a new node&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grafana ingress resource disappeared&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service became inaccessible&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The puzzling part: why would an &lt;em&gt;Ingress resource&lt;/em&gt; disappear when only &lt;em&gt;pods&lt;/em&gt; were evicted? Ingress is a Kubernetes object stored in etcd—it shouldn't just vanish.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if the ingress exists&lt;/span&gt;
kubectl get ingress &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring
&lt;span class="c"&gt;# No resources found&lt;/span&gt;

&lt;span class="c"&gt;# Check the NGINX controller logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx deploy/ingress-nginx-controller | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logs revealed admission webhook failures during the controller restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Discovery
&lt;/h3&gt;

&lt;p&gt;The ingress disappeared because of a &lt;strong&gt;perfect storm&lt;/strong&gt; of issues:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBURFxuICAgIEFbTm9kZSBVcGdyYWRlIFRyaWdnZXJlZF0gLS0%2BIEJbUG9kcyBFdmljdGVkXVxuICAgIEIgLS0%2BIENbTkdJTlggQ29udHJvbGxlciBSZXN0YXJ0c11cbiAgICBDIC0tPiBEW0NvbnRyb2xsZXIgVmFsaWRhdGVzIEV4aXN0aW5nIEluZ3Jlc3Nlc11cbiAgICBEIC0tPiBFe1RMUyBTZWNyZXQgRXhpc3RzP31cbiAgICBFIC0tPnxOb3wgRltWYWxpZGF0aW9uIEZhaWxzXVxuICAgIEYgLS0%2BIEdbSW5ncmVzcyBSZWplY3RlZC9SZW1vdmVkXVxuICAgIEUgLS0%2BfFllc3wgSFtJbmdyZXNzIEhlYWx0aHldXG4gICAgc3R5bGUgRiBmaWxsOiNmZjZiNmJcbiAgICBzdHlsZSBHIGZpbGw6I2ZmNmI2YlxuICAgIHN0eWxlIEggZmlsbDojNTFjZjY2XG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBURFxuICAgIEFbTm9kZSBVcGdyYWRlIFRyaWdnZXJlZF0gLS0%2BIEJbUG9kcyBFdmljdGVkXVxuICAgIEIgLS0%2BIENbTkdJTlggQ29udHJvbGxlciBSZXN0YXJ0c11cbiAgICBDIC0tPiBEW0NvbnRyb2xsZXIgVmFsaWRhdGVzIEV4aXN0aW5nIEluZ3Jlc3Nlc11cbiAgICBEIC0tPiBFe1RMUyBTZWNyZXQgRXhpc3RzP31cbiAgICBFIC0tPnxOb3wgRltWYWxpZGF0aW9uIEZhaWxzXVxuICAgIEYgLS0%2BIEdbSW5ncmVzcyBSZWplY3RlZC9SZW1vdmVkXVxuICAgIEUgLS0%2BfFllc3wgSFtJbmdyZXNzIEhlYWx0aHldXG4gICAgc3R5bGUgRiBmaWxsOiNmZjZiNmJcbiAgICBzdHlsZSBHIGZpbGw6I2ZmNmI2YlxuICAgIHN0eWxlIEggZmlsbDojNTFjZjY2XG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" alt="Root Cause Flowchart" width="447" height="878"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chain of failures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TLS Secret was missing&lt;/strong&gt; — It was manually copied to the cluster months ago, not managed by any controller. When the namespace was recreated during troubleshooting, the secret didn't come back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NGINX Admission Webhook&lt;/strong&gt; — The NGINX Ingress Controller includes a validating webhook that checks ingress resources on creation and updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation Failed&lt;/strong&gt; — Without the TLS secret referenced in the ingress spec, the webhook rejected the ingress as invalid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Reconciliation&lt;/strong&gt; — The ingress was created via &lt;code&gt;kubectl apply&lt;/code&gt; (not Helm or an operator), so nothing knew to recreate it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
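&lt;p&gt;To make failure #3 concrete, here is a hedged sketch of the kind of check the webhook performs. &lt;code&gt;validateIngressTLS&lt;/code&gt; is a hypothetical function written for this post; the actual NGINX admission webhook validates the fully rendered configuration, not just secret references.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// validateIngressTLS models the check that bit us: every TLS
// secretName referenced by an ingress must resolve to a secret
// that actually exists in the namespace.
func validateIngressTLS(referencedSecrets []string, existing map[string]bool) error {
	for _, name := range referencedSecrets {
		if !existing[name] {
			return errors.New("TLS secret not found: " + name)
		}
	}
	return nil
}

func main() {
	existing := map[string]bool{} // grafana-tls was never recreated
	err := validateIngressTLS([]string{"grafana-tls"}, existing)
	fmt.Println(err) // TLS secret not found: grafana-tls
}
```

&lt;p&gt;With the secret missing and no controller re-applying the ingress after rejection, the resource simply stayed gone.&lt;/p&gt;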
&lt;h3&gt;
  
  
  The "Aha" Moment
&lt;/h3&gt;

&lt;p&gt;The real issue wasn't the node upgrade—it was our &lt;strong&gt;resource management approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Our original ingress (manually applied)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
  &lt;span class="c1"&gt;# No owner reference&lt;/span&gt;
  &lt;span class="c1"&gt;# No Helm labels&lt;/span&gt;
  &lt;span class="c1"&gt;# No operator management&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;  &lt;span class="c1"&gt;# This secret was also manually created!&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this ingress needed to be recreated, nothing knew it should exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Helm-Managed Resources
&lt;/h2&gt;

&lt;p&gt;We solved this by migrating to Helm charts with native ingress support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before: manually applied resources scattered across yaml files&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; grafana-ingress.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; grafana-tls-secret.yaml

&lt;span class="c"&gt;# After: Helm manages everything as a single release&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; monitoring prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.hosts[0]&lt;span class="o"&gt;=&lt;/span&gt;grafana.prod.example.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.tls[0].secretName&lt;span class="o"&gt;=&lt;/span&gt;grafana-tls &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.tls[0].hosts[0]&lt;span class="o"&gt;=&lt;/span&gt;grafana.prod.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;Helm stores release state in Kubernetes secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secrets &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;helm
&lt;span class="c"&gt;# NAME                                    TYPE                 DATA&lt;/span&gt;
&lt;span class="c"&gt;# sh.helm.release.v1.monitoring.v1       helm.sh/release.v1   1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Helm knows what resources should exist&lt;/li&gt;
&lt;li&gt;✅ &lt;code&gt;helm upgrade&lt;/code&gt; recreates missing resources&lt;/li&gt;
&lt;li&gt;✅ Resources are versioned and can be rolled back&lt;/li&gt;
&lt;li&gt;✅ Dependencies (like TLS secrets) are managed together&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For the TLS Secret
&lt;/h3&gt;

&lt;p&gt;We also moved TLS management to cert-manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cert-manager.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Certificate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;
  &lt;span class="na"&gt;issuerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;letsencrypt-prod&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIssuer&lt;/span&gt;
  &lt;span class="na"&gt;dnsNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now cert-manager (an operator) ensures the TLS secret always exists and stays renewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Survives Pod Evictions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource Type&lt;/th&gt;
&lt;th&gt;Survives?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Helm-managed resources&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;State stored in release secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operator-managed CRs&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Operator reconciles continuously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources with owner references&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Parent controller recreates them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manually &lt;code&gt;kubectl apply&lt;/code&gt;'d resources&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;Survives in etcd, but won't be recreated if deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources referencing missing dependencies&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Validation webhooks may reject them&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Best Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never manually apply production resources&lt;/strong&gt; — Use Helm, Kustomize, or GitOps tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage secrets with operators&lt;/strong&gt; — External Secrets, cert-manager, Sealed Secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand admission webhooks&lt;/strong&gt; — They validate resources on every create/update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test node disruptions&lt;/strong&gt; — Use &lt;code&gt;kubectl drain&lt;/code&gt; in staging regularly&lt;/li&gt;
&lt;/ol&gt;
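&lt;p&gt;A quick way to act on practice #1 is to audit what is currently unmanaged. Below is a minimal Python sketch that flags resources carrying no Helm labels, no operator annotations, and no owner references; the inline &lt;code&gt;sample&lt;/code&gt; stands in for real &lt;code&gt;kubectl get ingress -A -o json&lt;/code&gt; output, and the marker list is illustrative, not exhaustive:&lt;/p&gt;

```python
import json

# Heuristic markers of managed resources (illustrative, not exhaustive)
MANAGED_MARKERS = [
    ("labels", "app.kubernetes.io/managed-by"),    # set by Helm and many operators
    ("annotations", "meta.helm.sh/release-name"),  # Helm ownership annotation
]

def unmanaged(items):
    """Return namespace/name of resources with no management markers and no ownerReferences."""
    flagged = []
    for item in items:
        meta = item.get("metadata", {})
        if meta.get("ownerReferences"):
            continue  # a parent controller will recreate this
        if any(key in meta.get(field, {}) for field, key in MANAGED_MARKERS):
            continue  # Helm or an operator owns this
        flagged.append(f'{meta.get("namespace", "default")}/{meta.get("name", "?")}')
    return flagged

# Inline sample standing in for `kubectl get ingress -A -o json`
sample = {"items": [
    {"metadata": {"name": "grafana", "namespace": "monitoring"}},  # manually applied
    {"metadata": {"name": "api", "namespace": "prod",
                  "labels": {"app.kubernetes.io/managed-by": "Helm"}}},
]}
print(unmanaged(sample["items"]))  # flags only the manually applied ingress
```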

&lt;h2&gt;
  
  
  The Deeper Question
&lt;/h2&gt;

&lt;p&gt;This incident was resolved, but it raised a fundamental question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How do controllers like Helm, NGINX Ingress, and cert-manager survive pod evictions themselves? What ensures THEY come back?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer involves a beautiful hierarchical supervision model that goes all the way down to Linux PID 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Part 2&lt;/strong&gt;, we'll explore the complete Kubernetes persistence chain—from Linux systemd to application controllers—and understand why Kubernetes is designed to assume failure is normal.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experienced similar "ghost" resources disappearing in Kubernetes? Share your war stories in the comments!&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next in this series:&lt;/strong&gt; &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2: The Foundation — From systemd to Control Plane&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>gke</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Wed, 07 Jan 2026 21:19:00 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob</link>
      <guid>https://forem.com/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob</guid>
      <description>&lt;h1&gt;
  
  
  Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS
&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n"&gt;Part 1&lt;/a&gt;, I built &lt;strong&gt;uring-sync&lt;/strong&gt;—a file copier that's 4.2x faster than &lt;code&gt;cp&lt;/code&gt; for local copies using io_uring. Now I've added &lt;strong&gt;network transfer&lt;/strong&gt; with kernel TLS encryption, achieving &lt;strong&gt;58% faster transfers than rsync&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: SSH is the Bottleneck
&lt;/h2&gt;

&lt;p&gt;When transferring ML datasets between machines, rsync over SSH is the go-to tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rsync &lt;span class="nt"&gt;-az&lt;/span&gt; /data/ml_dataset user@server:/backup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works, but it's slow. For a 9.7GB dataset (100K files), rsync took &lt;strong&gt;390 seconds&lt;/strong&gt;—a throughput of just 25 MB/s.&lt;/p&gt;
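&lt;p&gt;That throughput figure falls straight out of the raw numbers (using 1 GB = 1024 MB):&lt;/p&gt;

```python
# Throughput of the rsync baseline for the 9.7 GB, 100K-file dataset
size_mb = 9.7 * 1024       # dataset size in MB
rsync_seconds = 390        # measured rsync wall time
throughput = size_mb / rsync_seconds
print(f"{throughput:.1f} MB/s")  # prints: 25.5 MB/s
```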

&lt;p&gt;The bottleneck isn't the network. It's &lt;strong&gt;encryption in userspace&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  File   │────▶│ rsync   │────▶│  SSH    │────▶│ Network │
│  Read   │     │ (delta) │     │ encrypt │     │  Send   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
                                     │
                              Context switches,
                              userspace copies,
                              CPU-bound AES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every byte passes through the SSH process, which encrypts it using OpenSSL in userspace. This involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple context switches between kernel and userspace&lt;/li&gt;
&lt;li&gt;Copying data between kernel buffers and userspace buffers&lt;/li&gt;
&lt;li&gt;CPU time for AES encryption (even with AES-NI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: kTLS (Kernel TLS)
&lt;/h2&gt;

&lt;p&gt;Linux 4.13+ supports &lt;strong&gt;kTLS&lt;/strong&gt;—TLS encryption handled directly in the kernel. Once you set up the TLS session, the kernel encrypts data as it flows through the socket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌─────────┐     ┌──────────────────┐
│  File   │────▶│  read   │────▶│ Socket (kTLS)    │
│         │     │         │     │ encrypt + send   │
└─────────┘     └─────────┘     └──────────────────┘
                                        │
                                 One kernel operation,
                                 no userspace copies,
                                 AES-NI in kernel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No userspace encryption process&lt;/strong&gt; - kernel handles it directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer copies&lt;/strong&gt; - data doesn't bounce through userspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AES-NI in kernel&lt;/strong&gt; - hardware acceleration without context switches&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Setting up kTLS requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS handshake&lt;/strong&gt; - Exchange keys (we use a pre-shared secret + HKDF)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure kernel&lt;/strong&gt; - &lt;code&gt;setsockopt(SOL_TLS, TLS_TX, ...)&lt;/code&gt; with cipher keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send data&lt;/strong&gt; - Regular &lt;code&gt;send()&lt;/code&gt; calls, kernel encrypts automatically
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// After deriving keys from shared secret...&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;tls12_crypto_info_aes_gcm_128&lt;/span&gt; &lt;span class="n"&gt;crypto_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TLS_1_2_VERSION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cipher_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TLS_CIPHER_AES_GCM_128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_TLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TLS_TX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// Now all send() calls are automatically encrypted!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Testing on real network: Laptop → GCP VM (public internet)&lt;/p&gt;

&lt;h3&gt;
  
  
  The Headline Number
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;uring-sync + kTLS&lt;/th&gt;
&lt;th&gt;rsync (SSH)&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml_small (60MB, 10K files)&lt;/td&gt;
&lt;td&gt;2.98s&lt;/td&gt;
&lt;td&gt;2.63s&lt;/td&gt;
&lt;td&gt;~equal (rsync slightly ahead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_large (589MB, 100K files)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24.8s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_images (9.7GB, 100K files)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;165s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;390s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data size:    60MB  →  589MB  →   9.7GB
Improvement:   0%   →   34%   →    58%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The larger the transfer, the bigger the kTLS advantage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Per-connection overhead (handshake, key derivation) is amortized over more data. And SSH's userspace encryption overhead grows linearly with data size.&lt;/p&gt;
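&lt;p&gt;The trend is easy to verify from the benchmark table; note that the smallest dataset actually pays a slight penalty for connection setup:&lt;/p&gt;

```python
# Relative improvement over rsync: 1 - t_ktls / t_rsync, from the table above
results = {
    "ml_small":  (2.98, 2.63),   # (kTLS seconds, rsync seconds)
    "ml_large":  (16.4, 24.8),
    "ml_images": (165, 390),
}
for name, (t_ktls, t_rsync) in results.items():
    improvement = 1 - t_ktls / t_rsync
    print(f"{name}: {improvement:+.0%}")
```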

&lt;h3&gt;
  
  
  Throughput Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;CPU Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;rsync (SSH)&lt;/td&gt;
&lt;td&gt;25 MB/s&lt;/td&gt;
&lt;td&gt;High (userspace encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uring-sync + kTLS&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (kernel encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;kTLS achieves &lt;strong&gt;2.4x the throughput&lt;/strong&gt; of rsync while using less CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Zero-Copy Splice?
&lt;/h2&gt;

&lt;p&gt;In theory, kTLS supports splice() for true zero-copy transfers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File → Pipe → kTLS Socket (no userspace copies!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I implemented this and expected it to be fastest. Instead, it was &lt;strong&gt;2.9x slower&lt;/strong&gt;.&lt;/p&gt;
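&lt;p&gt;The 2.9x figure uses the fastest configuration, plaintext read/send (146s, see the appendix), as the baseline:&lt;/p&gt;

```python
# kTLS + splice vs the fastest configuration (plaintext + read/send)
baseline_seconds = 146     # plaintext + read/send, from the appendix
ktls_splice_seconds = 428  # kTLS + splice, same dataset
print(f"{ktls_splice_seconds / baseline_seconds:.1f}x slower")  # prints: 2.9x slower
```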

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;Using strace, I found the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;splice(file→pipe):   27μs    ← instant
splice(pipe→socket): 33ms    ← 1000x slower!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;splice(pipe → kTLS socket)&lt;/code&gt; call &lt;strong&gt;blocks&lt;/strong&gt; waiting for TCP ACKs. The kernel can't buffer aggressively like it does with regular &lt;code&gt;send()&lt;/code&gt; calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Lesson
&lt;/h3&gt;

&lt;p&gt;Zero-copy isn't always faster. For many-file workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;read/send&lt;/strong&gt;: Kernel manages buffering efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;splice&lt;/strong&gt;: Blocks on each chunk, killing throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splice might help for single huge files, but for ML datasets (many small files), stick with read/send.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use kTLS file transfer when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transferring large datasets (&amp;gt;500MB)&lt;/li&gt;
&lt;li&gt;Network has bandwidth to spare&lt;/li&gt;
&lt;li&gt;You control both endpoints&lt;/li&gt;
&lt;li&gt;Security is required (not just over VPN)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stick with rsync when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need delta sync (only changed bytes)&lt;/li&gt;
&lt;li&gt;Destination already has partial data&lt;/li&gt;
&lt;li&gt;SSH infrastructure is already in place&lt;/li&gt;
&lt;li&gt;Simplicity matters more than speed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Protocol
&lt;/h2&gt;

&lt;p&gt;Our wire protocol is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HELLO (secret hash) ──────────────────▶ Verify
                    ◀────────────────── HELLO_OK (+ enable kTLS)

FILE_HDR (path, size, mode) ──────────▶ Create file
FILE_DATA (chunks) ────────────────────▶ Write data
FILE_END ──────────────────────────────▶ Close file

(repeat for all files)

ALL_DONE ──────────────────────────────▶ Complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No delta encoding, no checksums (kTLS provides integrity via GCM). Just raw file transfer with authentication and encryption.&lt;/p&gt;
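&lt;p&gt;As a sketch of what such framing can look like (an illustrative layout, not uring-sync's exact wire format), a &lt;code&gt;FILE_HDR&lt;/code&gt; frame can be packed and unpacked with fixed-width fields:&lt;/p&gt;

```python
import struct

# Hypothetical framing: u8 type | u32 path length | u64 file size | u32 mode | path bytes
FILE_HDR = 2
HDR_FMT = "!BIQI"  # network byte order

def pack_file_hdr(path: str, size: int, mode: int) -> bytes:
    """Serialize a FILE_HDR frame for one file."""
    raw = path.encode()
    return struct.pack(HDR_FMT, FILE_HDR, len(raw), size, mode) + raw

def unpack_file_hdr(frame: bytes):
    """Parse a FILE_HDR frame back into (type, path, size, mode)."""
    msg_type, path_len, size, mode = struct.unpack_from(HDR_FMT, frame)
    offset = struct.calcsize(HDR_FMT)
    path = frame[offset:offset + path_len].decode()
    return msg_type, path, size, mode

frame = pack_file_hdr("data/train/img_000.jpg", 131072, 0o644)
print(unpack_file_hdr(frame))
```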

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Receiver (on remote host)&lt;/span&gt;
uring-sync recv /backup &lt;span class="nt"&gt;--listen&lt;/span&gt; 9999 &lt;span class="nt"&gt;--secret&lt;/span&gt; mykey &lt;span class="nt"&gt;--tls&lt;/span&gt;

&lt;span class="c"&gt;# Sender (on local host)&lt;/span&gt;
uring-sync send /data remote-host:9999 &lt;span class="nt"&gt;--secret&lt;/span&gt; mykey &lt;span class="nt"&gt;--tls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HKDF for key derivation from shared secret&lt;/li&gt;
&lt;li&gt;AES-128-GCM via kTLS&lt;/li&gt;
&lt;li&gt;Simple TCP protocol (no HTTP, no gRPC)&lt;/li&gt;
&lt;/ul&gt;
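&lt;p&gt;For reference, HKDF (RFC 5869) is small enough to sketch with just the standard library. The output sizes below match what &lt;code&gt;TLS_CIPHER_AES_GCM_128&lt;/code&gt; expects (16-byte key, 8-byte IV, 4-byte salt); the salt and info labels are placeholders, not the tool's actual values:&lt;/p&gt;

```python
import hashlib
import hmac

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step: PRK = HMAC-SHA256(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: stretch PRK to `length` bytes."""
    okm, block = b"", b""
    for counter in range(1, -(-length // 32) + 1):  # ceil(length / hash_len) rounds
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

# Derive the material TLS_CIPHER_AES_GCM_128 needs from a shared secret.
# The salt and info labels here are illustrative placeholders.
prk = hkdf_extract(b"uring-sync-v1", b"mykey")
key = hkdf_expand(prk, b"tls-key", 16)    # AES-128-GCM key
iv = hkdf_expand(prk, b"tls-iv", 8)       # per-record IV
salt = hkdf_expand(prk, b"tls-salt", 4)   # implicit nonce salt
print(len(key), len(iv), len(salt))  # prints: 16 8 4
```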

&lt;p&gt;Full source: &lt;a href="https://github.com/VincentDu2021/uring_sync" rel="noopener noreferrer"&gt;github.com/VincentDu2021/uring_sync&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By moving encryption from userspace SSH to kernel kTLS, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;58% faster&lt;/strong&gt; than rsync for large transfers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.4x throughput&lt;/strong&gt; (60 MB/s vs 25 MB/s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower CPU usage&lt;/strong&gt; (kernel AES-NI vs userspace OpenSSL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: for bulk data transfer, SSH's flexibility is overhead. A purpose-built tool with kernel encryption wins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Full Benchmark Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Environment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sender: Ubuntu laptop, local NVMe&lt;/li&gt;
&lt;li&gt;Receiver: GCP VM (us-central1-a)&lt;/li&gt;
&lt;li&gt;Network: Public internet&lt;/li&gt;
&lt;li&gt;All tests with cold cache (&lt;code&gt;echo 3 &amp;gt; /proc/sys/vm/drop_caches&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Raw Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;kTLS Time&lt;/th&gt;
&lt;th&gt;kTLS Speed&lt;/th&gt;
&lt;th&gt;rsync Time&lt;/th&gt;
&lt;th&gt;rsync Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml_small&lt;/td&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;60MB&lt;/td&gt;
&lt;td&gt;2.98s&lt;/td&gt;
&lt;td&gt;20 MB/s&lt;/td&gt;
&lt;td&gt;2.63s&lt;/td&gt;
&lt;td&gt;23 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_large&lt;/td&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;589MB&lt;/td&gt;
&lt;td&gt;16.4s&lt;/td&gt;
&lt;td&gt;36 MB/s&lt;/td&gt;
&lt;td&gt;24.8s&lt;/td&gt;
&lt;td&gt;24 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_images&lt;/td&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;9.7GB&lt;/td&gt;
&lt;td&gt;165s&lt;/td&gt;
&lt;td&gt;60 MB/s&lt;/td&gt;
&lt;td&gt;390s&lt;/td&gt;
&lt;td&gt;25 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Splice Investigation (ml_images)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Plaintext + read/send&lt;/td&gt;
&lt;td&gt;146s&lt;/td&gt;
&lt;td&gt;68 MB/s&lt;/td&gt;
&lt;td&gt;Fastest (no encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plaintext + splice&lt;/td&gt;
&lt;td&gt;157s&lt;/td&gt;
&lt;td&gt;63 MB/s&lt;/td&gt;
&lt;td&gt;+8% overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kTLS + read/send&lt;/td&gt;
&lt;td&gt;165s&lt;/td&gt;
&lt;td&gt;60 MB/s&lt;/td&gt;
&lt;td&gt;+13% (encryption cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kTLS + splice&lt;/td&gt;
&lt;td&gt;428s&lt;/td&gt;
&lt;td&gt;23 MB/s&lt;/td&gt;
&lt;td&gt;2.9x slower (broken)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run January 2026. Your mileage may vary depending on network conditions and hardware.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #linux #ktls #tls #rsync #performance #networking #encryption&lt;/p&gt;

</description>
      <category>linux</category>
      <category>networking</category>
      <category>security</category>
      <category>performance</category>
    </item>
    <item>
      <title>Building a File Copier 4x Faster Than cp Using io_uring</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Wed, 07 Jan 2026 17:45:26 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n</link>
      <guid>https://forem.com/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n</guid>
      <description>&lt;h1&gt;
  
  
  Building a File Copier That's 4x Faster Than &lt;code&gt;cp&lt;/code&gt; Using io_uring
&lt;/h1&gt;

&lt;p&gt;I built a high-performance file copier for ML datasets using Linux io_uring. On the right workload, it's &lt;strong&gt;4.2x faster than &lt;code&gt;cp -r&lt;/code&gt;&lt;/strong&gt;. Here's what I learned about when async I/O helps—and when it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Millions of Small Files
&lt;/h2&gt;

&lt;p&gt;ML training datasets often contain millions of small files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Typical Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ImageNet&lt;/td&gt;
&lt;td&gt;1.28M&lt;/td&gt;
&lt;td&gt;100-200KB JPEG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COCO&lt;/td&gt;
&lt;td&gt;330K&lt;/td&gt;
&lt;td&gt;50-500KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MNIST&lt;/td&gt;
&lt;td&gt;70K&lt;/td&gt;
&lt;td&gt;784 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIFAR-10&lt;/td&gt;
&lt;td&gt;60K&lt;/td&gt;
&lt;td&gt;3KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Copying these with &lt;code&gt;cp -r&lt;/code&gt; is painfully slow. Each file requires multiple syscalls (open, read, write, close), and the kernel processes them one at a time. For 100,000 files, that's 400,000+ syscalls executed sequentially.&lt;/p&gt;
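&lt;p&gt;A back-of-envelope comparison of submission counts (best case, assuming every operation batches perfectly):&lt;/p&gt;

```python
import math

files = 100_000
syscalls_per_file = 4  # open, read, write, close (minimum; stat and friends add more)

# cp -r issues one syscall at a time
sequential_syscalls = files * syscalls_per_file

# io_uring: up to queue_depth operations queued per io_uring_enter() call,
# so this counts submission syscalls needed for the same work, best case
queue_depth = 64
batched_submissions = math.ceil(sequential_syscalls / queue_depth)

print(sequential_syscalls, batched_submissions)  # prints: 400000 6250
```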

&lt;h2&gt;
  
  
  The Solution: io_uring
&lt;/h2&gt;

&lt;p&gt;io_uring is a Linux async I/O interface (kernel 5.1+) that enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batched submission&lt;/strong&gt; - Queue dozens of operations, submit with one syscall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async completion&lt;/strong&gt; - Operations complete out of order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-copy&lt;/strong&gt; - Splice data directly between file descriptors via kernel pipes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of: open → read → write → close → repeat&lt;/p&gt;

&lt;p&gt;We do: submit 64 opens → process completions → submit reads/writes → batch everything&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│  WorkQueue&amp;lt;T&amp;gt;   │────▶│  Worker Threads     │
│ (scanner)    │     │  (thread-safe)  │     │  (per-thread uring) │
└──────────────┘     └─────────────────┘     └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file progresses through a state machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;64 files in-flight&lt;/strong&gt; per worker simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-thread io_uring instances&lt;/strong&gt; (avoids lock contention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inode sorting&lt;/strong&gt; for sequential disk access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splice zero-copy&lt;/strong&gt; for data transfer (source → pipe → destination)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer pool&lt;/strong&gt; with 4KB-aligned allocations (O_DIRECT compatible)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local NVMe (Cold Cache)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;cp -r&lt;/th&gt;
&lt;th&gt;uring-sync&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K × 4KB files (400MB)&lt;/td&gt;
&lt;td&gt;7.67s&lt;/td&gt;
&lt;td&gt;5.14s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K × 100KB files (10GB)&lt;/td&gt;
&lt;td&gt;22.7s&lt;/td&gt;
&lt;td&gt;5.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Larger files benefit MORE from io_uring on fast storage. The 100KB test shows 4.2x improvement because we're overlapping many large reads/writes.&lt;/p&gt;

&lt;h3&gt;
  
  
  GCP pd-balanced (SSD-backed, 100GB)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;cp -r&lt;/th&gt;
&lt;th&gt;uring-sync&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K × 4KB files&lt;/td&gt;
&lt;td&gt;67.7s&lt;/td&gt;
&lt;td&gt;31.5s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.15x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K × 100KB files&lt;/td&gt;
&lt;td&gt;139.6s&lt;/td&gt;
&lt;td&gt;64.7s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.16x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Consistent 2x improvement on cloud SSD storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why io_uring Helps
&lt;/h2&gt;

&lt;p&gt;On fast storage (NVMe, SSD), the bottleneck is &lt;strong&gt;CPU and syscall overhead&lt;/strong&gt;, not the disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cp -r&lt;/strong&gt;: Processes files sequentially, 12+ syscalls per file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io_uring&lt;/strong&gt;: 64 files in-flight, batched syscalls, async completion&lt;/li&gt;
&lt;/ul&gt;
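&lt;p&gt;To make the &lt;code&gt;cp&lt;/code&gt; side concrete, here is a minimal blocking copy instrumented to count its own syscalls. Even this stripped-down version issues eight for a one-chunk file; a real &lt;code&gt;cp&lt;/code&gt; adds metadata and attribute calls on top, which is roughly where the 12+ figure comes from:&lt;/p&gt;

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Naive blocking copy, roughly what a cp-style tool does per file.
// Returns the number of syscalls issued, to make the per-file
// overhead visible. Error handling omitted for brevity.
int copy_file_counting_syscalls(const std::string& src, const std::string& dst) {
    int syscalls = 0;
    int in = open(src.c_str(), O_RDONLY);
    ++syscalls;
    struct stat st {};
    fstat(in, &st);
    ++syscalls;
    int out = open(dst.c_str(), O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    ++syscalls;
    char buf[65536];
    while (true) {
        ssize_t n = read(in, buf, sizeof buf);
        ++syscalls;
        if (n <= 0) break;  // EOF (or error)
        write(out, buf, static_cast<size_t>(n));
        ++syscalls;
    }
    close(in);
    ++syscalls;
    close(out);
    ++syscalls;
    return syscalls;  // >= 8 even for a file copied in one chunk
}
```

&lt;p&gt;Every one of those calls blocks the thread. io_uring turns the same work into entries queued on a ring, submitted in batches and completed asynchronously.&lt;/p&gt;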

&lt;p&gt;The larger the files, the more time each copy spends waiting for I/O to complete, and the more overlap io_uring's async approach can exploit. That's why we see a 4.2x speedup for 100KB files versus 1.5x for 4KB files on NVMe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The State Machine
&lt;/h3&gt;

&lt;p&gt;Each file copy is a state machine with these transitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FileState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;OPENING_SRC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Opening source file&lt;/span&gt;
    &lt;span class="n"&gt;STATING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Getting file size&lt;/span&gt;
    &lt;span class="n"&gt;OPENING_DST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Creating destination&lt;/span&gt;
    &lt;span class="n"&gt;SPLICE_IN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Reading into kernel pipe&lt;/span&gt;
    &lt;span class="n"&gt;SPLICE_OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Writing from pipe to dest&lt;/span&gt;
    &lt;span class="n"&gt;CLOSING_SRC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Closing source&lt;/span&gt;
    &lt;span class="n"&gt;CLOSING_DST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Closing destination&lt;/span&gt;
    &lt;span class="n"&gt;DONE&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completions drive state transitions. When a completion arrives, we look up the file context and advance its state.&lt;/p&gt;
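&lt;p&gt;The advance step itself is a small switch. A condensed, hypothetical version (the real code also submits the SQE for the new state and handles errors):&lt;/p&gt;

```cpp
#include <cstdint>

enum class FileState {
    OPENING_SRC, STATING, OPENING_DST,
    SPLICE_IN, SPLICE_OUT,
    CLOSING_SRC, CLOSING_DST, DONE
};

// Per-file context, reduced to what the transition logic needs.
struct FileContext {
    FileState state = FileState::OPENING_SRC;
    uint64_t bytes_copied = 0;
    uint64_t file_size = 0;
};

// Advance one file's state machine given the result of the operation
// that just completed. Assumes success; error paths omitted.
void advance(FileContext& ctx, int64_t cqe_res) {
    switch (ctx.state) {
    case FileState::OPENING_SRC: ctx.state = FileState::STATING; break;
    case FileState::STATING:     ctx.state = FileState::OPENING_DST; break;
    case FileState::OPENING_DST: ctx.state = FileState::SPLICE_IN; break;
    case FileState::SPLICE_IN:   ctx.state = FileState::SPLICE_OUT; break;
    case FileState::SPLICE_OUT:
        ctx.bytes_copied += static_cast<uint64_t>(cqe_res);
        // Loop back for the next chunk until the whole file is through.
        ctx.state = (ctx.bytes_copied < ctx.file_size)
                        ? FileState::SPLICE_IN
                        : FileState::CLOSING_SRC;
        break;
    case FileState::CLOSING_SRC: ctx.state = FileState::CLOSING_DST; break;
    case FileState::CLOSING_DST: ctx.state = FileState::DONE; break;
    case FileState::DONE: break;
    }
}
```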

&lt;h3&gt;
  
  
  Splice Zero-Copy
&lt;/h3&gt;

&lt;p&gt;Instead of read() → userspace buffer → write(), we use splice():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source FD → Kernel Pipe → Destination FD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data never touches userspace. The kernel moves pages directly between file descriptors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Splice from source into pipe&lt;/span&gt;
&lt;span class="n"&gt;io_uring_prep_splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe_write_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Splice from pipe to destination&lt;/span&gt;
&lt;span class="n"&gt;io_uring_prep_splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe_read_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inode Sorting
&lt;/h3&gt;

&lt;p&gt;Before copying, we sort files by inode number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inode&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This encourages sequential disk access since inodes are typically allocated sequentially for files created together.&lt;/p&gt;
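&lt;p&gt;Collecting the inode costs one &lt;code&gt;stat&lt;/code&gt; per file during the scan. A sketch of the scan-then-sort step (struct and function names are illustrative):&lt;/p&gt;

```cpp
#include <sys/stat.h>
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

struct FileEntry {
    std::string path;
    ino_t inode;
};

// Walk a directory tree, record each regular file's inode, then sort
// by inode so workers open files in rough on-disk order.
std::vector<FileEntry> scan_sorted(const std::string& root) {
    std::vector<FileEntry> files;
    for (const auto& e : std::filesystem::recursive_directory_iterator(root)) {
        if (!e.is_regular_file()) continue;
        struct stat st {};
        if (stat(e.path().c_str(), &st) == 0)
            files.push_back({e.path().string(), st.st_ino});
    }
    std::sort(files.begin(), files.end(),
              [](const FileEntry& a, const FileEntry& b) { return a.inode < b.inode; });
    return files;
}
```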

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A single worker beats multi-threading&lt;/strong&gt; for local NVMe. When the drive can absorb the full queue depth on its own, extra threads add lock contention without adding throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Queue depth matters more than thread count&lt;/strong&gt;. 64 files in-flight per worker is the sweet spot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile your actual workload&lt;/strong&gt;. Synthetic benchmarks lie. Test with your real data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;io_uring shines on fast storage&lt;/strong&gt;. When the disk can keep up, reducing syscall overhead yields big gains.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next: Network Transfer
&lt;/h2&gt;

&lt;p&gt;This tool now also supports &lt;strong&gt;network file transfer&lt;/strong&gt; with kTLS encryption, achieving 58% faster transfers than rsync. See the companion post: &lt;a href="https://dev.to/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob"&gt;Beating rsync by 58% with Kernel TLS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The full implementation is ~1,400 lines of C++20. Key components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RingManager&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;io_uring wrapper with SQE/CQE management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BufferPool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4KB-aligned buffer allocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PipePool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reusable kernel pipes for splice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WorkQueue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Thread-safe file queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FileContext&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file state machine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Build requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux kernel 5.1+ (5.7+ for io_uring splice support)&lt;/li&gt;
&lt;li&gt;liburing&lt;/li&gt;
&lt;li&gt;C++20&lt;/li&gt;
&lt;/ul&gt;
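&lt;p&gt;With those in place, a single-binary build is one command (flags are a plausible guess; check the repository for the actual build instructions):&lt;/p&gt;

```shell
# Hypothetical build line: C++20, optimized, linked against liburing.
g++ -std=c++20 -O2 -o uring_sync uring_sync.cpp -luring
```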

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;io_uring can dramatically speed up small-file workloads—&lt;strong&gt;4.2x faster on NVMe&lt;/strong&gt; and &lt;strong&gt;2x faster on cloud SSD&lt;/strong&gt;. The key is reducing syscall overhead through batching and async I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use io_uring for file copying:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many small files (ML datasets, source trees)&lt;/li&gt;
&lt;li&gt;Fast storage (NVMe, SSD)&lt;/li&gt;
&lt;li&gt;CPU-bound on syscall overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When &lt;code&gt;cp -r&lt;/code&gt; is fine:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single large files (already efficient)&lt;/li&gt;
&lt;li&gt;One-off copies where complexity isn't worth it&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The code is available at &lt;a href="https://github.com/VincentDu2021/uring_sync" rel="noopener noreferrer"&gt;github.com/VincentDu2021/uring_sync&lt;/a&gt;. Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on local NVMe and GCP Compute Engine VMs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>cpp</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
