<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vincent Du</title>
    <description>The latest articles on Forem by Vincent Du (@vincentdu2021).</description>
    <link>https://forem.com/vincentdu2021</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3698944%2F6b66cc36-5a2b-4017-b9bf-8955e35fc31b.jpeg</url>
      <title>Forem: Vincent Du</title>
      <link>https://forem.com/vincentdu2021</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vincentdu2021"/>
    <language>en</language>
    <item>
      <title>How to Run MLPerf Llama 2 70B Training on AMD MI325X Without SLURM</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sat, 17 Jan 2026 18:18:33 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/how-to-run-mlperf-llama-2-70b-training-on-amd-mi325x-without-slurm-1ho4</link>
      <guid>https://forem.com/vincentdu2021/how-to-run-mlperf-llama-2-70b-training-on-amd-mi325x-without-slurm-1ho4</guid>
      <description>&lt;p&gt;This guide covers running the MLPerf Training v5.1 Llama 2 70B LoRA fine-tuning benchmark on a multi-node AMD Instinct MI325X cluster without a SLURM scheduler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;AMD provides an official MLPerf Training Docker image (&lt;code&gt;rocm/amd-mlperf:llama2_70b_training_5.1&lt;/code&gt;) designed primarily for SLURM-managed clusters. However, many environments use simpler SSH-based orchestration. This post demonstrates how to run multi-node distributed training using PyTorch's rendezvous mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; 4× MI325X nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPUs:&lt;/strong&gt; 8× AMD Instinct MI325X per node (32 total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network:&lt;/strong&gt; High-speed interconnect for RCCL communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Shared NFS mount at &lt;code&gt;/mnt/shared&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Software Dependencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ROCm&lt;/td&gt;
&lt;td&gt;6.2+&lt;/td&gt;
&lt;td&gt;Host driver and runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;24.0+&lt;/td&gt;
&lt;td&gt;With GPU support configured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RCCL&lt;/td&gt;
&lt;td&gt;Included in container&lt;/td&gt;
&lt;td&gt;ROCm Collective Communications Library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch&lt;/td&gt;
&lt;td&gt;2.4+ (ROCm)&lt;/td&gt;
&lt;td&gt;Included in container&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Host Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ROCm Installation:&lt;/strong&gt; Follow &lt;a href="https://rocm.docs.amd.com/en/latest/deploy/linux/index.html" rel="noopener noreferrer"&gt;ROCm installation guide&lt;/a&gt; for your Linux distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker GPU Access:&lt;/strong&gt; Ensure Docker can access AMD GPUs:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/dri &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/kfd rocm/pytorch:latest rocm-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Node Networking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passwordless SSH between all nodes&lt;/li&gt;
&lt;li&gt;High-speed network interface (InfiniBand/RoCE recommended)&lt;/li&gt;
&lt;li&gt;Shared filesystem accessible from all nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pull the MLPerf Container&lt;/strong&gt; on all nodes:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker pull rocm/amd-mlperf:llama2_70b_training_5.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;p&gt;The benchmark requires roughly 270 GB of storage for the Llama 2 70B model and the GovReport dataset. A Hugging Face token from an account that has accepted the Llama 2 license is required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_token_here
./finetune_llama.sh prepare
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Two Approaches for Multi-Node Training
&lt;/h2&gt;

&lt;p&gt;AMD's container supports two launch methods:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SLURM-Based (AMD Default)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires SLURM scheduler&lt;/span&gt;
sbatch run_with_docker_slurm.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Manual Multi-Node with Rendezvous
&lt;/h3&gt;

&lt;p&gt;For non-SLURM environments, PyTorch's &lt;code&gt;torchrun&lt;/code&gt; supports a rendezvous backend that handles rank assignment automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;torchrun &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nnodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;c10d &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MASTER_IP:29500 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rdzv_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mlperf_run &lt;span class="se"&gt;\&lt;/span&gt;
  train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command runs identically on all nodes; the c10d backend coordinates rank assignment, so no per-node arguments are needed.&lt;/p&gt;
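&lt;p&gt;The rank bookkeeping that &lt;code&gt;torchrun&lt;/code&gt; handles can be sanity-checked with simple arithmetic. A minimal sketch of the rank math for this post's 4×8 layout (illustrative only; &lt;code&gt;torchrun&lt;/code&gt; computes these values internally during rendezvous):&lt;br&gt;
&lt;/p&gt;

```shell
# Rendezvous arithmetic for the 4-node MI325X run: every node launches
# the same torchrun command, and the c10d backend assigns each process
# a unique rank in [0, WORLD_SIZE).
NNODES=4
NPROC_PER_NODE=8          # one process per MI325X GPU

WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
echo "world size: $WORLD_SIZE"    # 32 ranks across the cluster

# Global rank of a local process on a given node (a sketch of what
# the launcher derives after rendezvous):
NODE_RANK=2
LOCAL_RANK=5
GLOBAL_RANK=$((NODE_RANK * NPROC_PER_NODE + LOCAL_RANK))
echo "global rank: $GLOBAL_RANK"  # 21
```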

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Our approach uses SSH to launch training on each node, passing the distributed configuration via environment variables:&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Launch Pattern
&lt;/h3&gt;

&lt;p&gt;Each node runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start container with data mounts&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--init&lt;/span&gt; &lt;span class="nt"&gt;--detach&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--net&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/dri &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/kfd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mlperf_llama2sft &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$DATADIR&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$DATADIR&lt;/span&gt;/model:/ckpt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$RESULTS&lt;/span&gt;:/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$CODE_DIR&lt;/span&gt;:/workspace/code &lt;span class="se"&gt;\&lt;/span&gt;
  rocm/amd-mlperf:llama2_70b_training_5.1 &lt;span class="nb"&gt;sleep &lt;/span&gt;infinity

&lt;span class="c"&gt;# Execute training with distributed config&lt;/span&gt;
docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MASTER_ADDR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$MASTER_IP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MASTER_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;29500 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SLURM_NNODES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NUM_NODES&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SLURM_NODEID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NODE_RANK&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;NCCL_SOCKET_IFNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$NET_IF&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  mlperf_llama2sft &lt;span class="se"&gt;\&lt;/span&gt;
  bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'cd /workspace/code &amp;amp;&amp;amp; source config_MI325X_4x8x1.sh &amp;amp;&amp;amp; bash ./run_and_time_slurm.sh'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Orchestration Script
&lt;/h3&gt;

&lt;p&gt;The main script SSHs to each node in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;node_idx &lt;span class="k"&gt;in &lt;/span&gt;0 1 2 3&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;ssh node-&lt;span class="nv"&gt;$node_idx&lt;/span&gt; &lt;span class="s2"&gt;"launch_training.sh &lt;/span&gt;&lt;span class="nv"&gt;$node_idx&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$NUM_NODES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Configuration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;config_MI325X_4x8x1.sh&lt;/code&gt; script sets the critical parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DGXNNODES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DGXNGPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FP8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0004
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1  &lt;span class="c"&gt;# micro batch size&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MAX_STEPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
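&lt;p&gt;These values imply the effective global batch size. Assuming pure data parallelism (one micro-batch per GPU per step; the real config may add gradient accumulation or model-parallel settings not shown here), the global batch is nodes × GPUs × micro-batch:&lt;br&gt;
&lt;/p&gt;

```shell
# Derive the effective global batch size from the config values above.
# Assumes pure data parallelism; this is a sketch, not the container's
# exact batch computation.
DGXNNODES=4
DGXNGPU=8
MBS=1    # micro batch size per GPU

GLOBAL_BATCH=$((DGXNNODES * DGXNGPU * MBS))
echo "global batch size: $GLOBAL_BATCH"   # 32 for the 4-node run
```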



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Node (8 GPUs)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;2.79 samples/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to Converge&lt;/td&gt;
&lt;td&gt;20.57 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Loss&lt;/td&gt;
&lt;td&gt;0.921 (target: ≤0.925)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Four Nodes (32 GPUs)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;11.15 samples/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to Converge&lt;/td&gt;
&lt;td&gt;12.40 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Loss&lt;/td&gt;
&lt;td&gt;0.924 (target: ≤0.925)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scaling Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1-Node&lt;/th&gt;
&lt;th&gt;4-Node&lt;/th&gt;
&lt;th&gt;Scaling Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;2.79&lt;/td&gt;
&lt;td&gt;11.15&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.0×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Near-linear throughput scaling&lt;/strong&gt; validates that the network interconnect is not a bottleneck.&lt;/p&gt;
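&lt;p&gt;The 4.0× factor in the table follows directly from the measured throughputs; a quick check with &lt;code&gt;awk&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

```shell
# Compute the 1-node to 4-node scaling efficiency from the measured
# throughputs in the tables above (samples/sec).
SINGLE=2.79
QUAD=11.15
GPUS_RATIO=4

awk -v s="$SINGLE" -v q="$QUAD" -v r="$GPUS_RATIO" 'BEGIN {
  speedup = q / s
  printf "speedup: %.2fx, efficiency: %.1f%%\n", speedup, 100 * speedup / r
}'
# speedup: 4.00x, efficiency: 99.9%
```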

&lt;h3&gt;
  
  
  Comparison with Official Results
&lt;/h3&gt;

&lt;p&gt;Our single-node result (20.57 min) matches AMD's official MLPerf v5.1 submission (~21 min) for MI325X, confirming correct configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container Design:&lt;/strong&gt; AMD's container expects training scripts at &lt;code&gt;/workspace/code&lt;/code&gt;; mount custom configs there rather than extracting files from the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Interface:&lt;/strong&gt; Set &lt;code&gt;NCCL_SOCKET_IFNAME&lt;/code&gt; to your high-speed network interface for optimal RCCL performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLURM Variables:&lt;/strong&gt; The container's &lt;code&gt;run_and_time_slurm.sh&lt;/code&gt; reads &lt;code&gt;SLURM_NNODES&lt;/code&gt; and &lt;code&gt;SLURM_NODEID&lt;/code&gt;; these can be set manually in non-SLURM environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling:&lt;/strong&gt; Expect near-linear throughput scaling on properly configured clusters. Time-to-convergence scaling may differ due to batch size effects on convergence dynamics.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/mlperf-training-v5.1/README.html" rel="noopener noreferrer"&gt;AMD MLPerf Training v5.1 Technical Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/mlperf-training5.1-repro/README.html" rel="noopener noreferrer"&gt;Reproducing AMD MLPerf Training Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/system-setup/multi-node-setup.html" rel="noopener noreferrer"&gt;ROCm Multi-Node Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/ddp-training-pytorch/README.html" rel="noopener noreferrer"&gt;PyTorch Distributed Training on AMD GPUs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full Script
&lt;/h2&gt;

&lt;p&gt;The complete &lt;code&gt;finetune_llama.sh&lt;/code&gt; script supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single and multi-node runs&lt;/li&gt;
&lt;li&gt;Configurable NEXP for MLPerf-compliant submissions&lt;/li&gt;
&lt;li&gt;Automatic config selection based on node count
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./finetune_llama.sh run 4      &lt;span class="c"&gt;# 4-node, single run&lt;/span&gt;
./finetune_llama.sh run 4 10   &lt;span class="c"&gt;# 4-node, 10 runs (MLPerf submission)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interested in the full script?&lt;/strong&gt; Reach out via &lt;a href="https://www.linkedin.com/in/vincent-du-3a0b8928/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and I'll be happy to share.&lt;/p&gt;

</description>
      <category>amd</category>
      <category>rocm</category>
      <category>machinelearning</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 3: Controllers &amp; Resilience — Why Kubernetes Self-Heals</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:42:32 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How application controllers (NGINX Ingress, cert-manager) persist through evictions&lt;/li&gt;
&lt;li&gt;Why controllers are stateless and can restart anywhere&lt;/li&gt;
&lt;li&gt;The complete persistence chain from hardware to application&lt;/li&gt;
&lt;li&gt;What survives pod evictions vs. what doesn't&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Previously
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1&lt;/a&gt;, we debugged a missing ingress after GKE node upgrades. In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2&lt;/a&gt;, we explored how systemd supervises kubelet, and how kubelet bootstraps the control plane through static pods.&lt;/p&gt;

&lt;p&gt;Now we reach the final layer: &lt;strong&gt;your application controllers&lt;/strong&gt;—and the elegant insight that makes Kubernetes truly resilient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Application Controllers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Application Controllers Persist
&lt;/h3&gt;

&lt;p&gt;Controllers like NGINX Ingress, cert-manager, and Prometheus Operator are deployed as &lt;strong&gt;Deployments&lt;/strong&gt; or &lt;strong&gt;StatefulSets&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx-controller&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;controller&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/ingress-nginx/controller:v1.9.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this pod is evicted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;kubelet stops reporting the pod → control plane marks it terminated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ReplicaSet controller&lt;/strong&gt; notices: current replicas (0) &amp;lt; desired (1)&lt;/li&gt;
&lt;li&gt;ReplicaSet creates a new pod specification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; assigns the pod to a healthy node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelet&lt;/strong&gt; on that node starts the container&lt;/li&gt;
&lt;li&gt;NGINX controller reconnects to API server and resumes watching ingresses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The controller itself doesn't store state—it reads everything from the API server (backed by etcd).&lt;/p&gt;
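&lt;p&gt;The eviction sequence above is the ReplicaSet reconciliation loop at work. A toy simulation of that loop (purely illustrative; the real controller watches the API server rather than polling a local variable):&lt;br&gt;
&lt;/p&gt;

```shell
# Toy ReplicaSet reconciliation loop: converge the current replica
# count toward the desired count, whether pods were evicted (scale up)
# or the spec was reduced (scale down). Purely illustrative.
desired=1
current=0   # the controller pod was just evicted

reconcile() {
  while [ "$current" -ne "$desired" ]; do
    if [ "$current" -lt "$desired" ]; then
      current=$((current + 1))
      echo "created pod (replicas: $current/$desired)"
    else
      current=$((current - 1))
      echo "deleted pod (replicas: $current/$desired)"
    fi
  done
}

reconcile
echo "converged at $current replica(s)"
```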

&lt;h3&gt;
  
  
  Helm Release Persistence
&lt;/h3&gt;

&lt;p&gt;Helm stores release information in Kubernetes secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;helm &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sh.helm.release.v1.prometheus.v3&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm.sh/release.v1&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;H4sIAAAAAAAAA...&lt;/span&gt; &lt;span class="c1"&gt;# Base64 encoded release manifest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This secret contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart that was installed&lt;/li&gt;
&lt;li&gt;The values that were used&lt;/li&gt;
&lt;li&gt;The computed manifest of all resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because this is stored in etcd via the API server, Helm releases survive any pod eviction.&lt;/p&gt;
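&lt;p&gt;The &lt;code&gt;release&lt;/code&gt; payload is gzip-compressed JSON that Helm base64-encodes, and the Kubernetes API base64-encodes secret data once more. The same double-decode pipeline can be exercised locally with a fake payload (standard &lt;code&gt;base64&lt;/code&gt;/&lt;code&gt;gzip&lt;/code&gt; tools assumed; decoding a real secret needs a live cluster):&lt;br&gt;
&lt;/p&gt;

```shell
# Round-trip the encoding Helm uses for release secrets: gzip the
# release JSON, base64 it (Helm), base64 again (Kubernetes secret
# data). Decoding a real secret looks like:
#   kubectl get secret sh.helm.release.v1.prometheus.v3 -n monitoring \
#     -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
payload='{"name":"prometheus","version":3}'

encoded=$(printf '%s' "$payload" | gzip | base64 | base64)
decoded=$(printf '%s' "$encoded" | base64 -d | base64 -d | gunzip)

echo "$decoded"   # the original release JSON
```

&lt;p&gt;Simpler still, &lt;code&gt;helm get manifest prometheus -n monitoring&lt;/code&gt; performs this decoding for you.&lt;/p&gt;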




&lt;h2&gt;
  
  
  The Complete Persistence Chain
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│                     Linux Host (Physical/VM)                        │
├─────────────────────────────────────────────────────────────────────┤
│  systemd (PID 1)                                                    │
│  ├── Supervises all system services                                 │
│  ├── Restarts failed services automatically                         │
│  └── Config: /etc/systemd/system/                                   │
│      │                                                              │
│      └── kubelet.service                                            │
│          ├── Started and supervised by systemd                      │
│          ├── Watches /etc/kubernetes/manifests/ for static pods     │
│          ├── Watches API server for scheduled pods                  │
│          └── Ensures containers match pod specs                     │
│              │                                                      │
│              ├── Static Pods (/etc/kubernetes/manifests/)           │
│              │   ├── etcd ──────────────────┐                       │
│              │   ├── kube-apiserver ◄───────┤ Persistent            │
│              │   ├── kube-controller-manager│ State Store           │
│              │   └── kube-scheduler         │                       │
│              │                              │                       │
│              └── Regular Pods ◄─────────────┘                       │
│                  │                 (scheduled via API server)       │
│                  │                                                  │
│                  ├── kube-system namespace                          │
│                  │   ├── CoreDNS                                    │
│                  │   ├── kube-proxy                                 │
│                  │   └── CNI plugins                                │
│                  │                                                  │
│                  ├── ingress-nginx namespace                        │
│                  │   └── NGINX Ingress Controller                   │
│                  │       └── Watches Ingress resources              │
│                  │                                                  │
│                  └── Application namespaces                         │
│                      ├── cert-manager                               │
│                      ├── Prometheus Operator                        │
│                      └── Your applications                          │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Critical Insight: Controllers Are Stateless
&lt;/h2&gt;

&lt;p&gt;This is the elegant core of the design: &lt;strong&gt;controllers don't store state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every controller:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; desired state from the API server (backed by etcd)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watches&lt;/strong&gt; for changes via the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes changes&lt;/strong&gt; through the API server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can be restarted anywhere&lt;/strong&gt;, anytime, without losing information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The API server + etcd is the &lt;strong&gt;single source of truth&lt;/strong&gt;, not the controllers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBMUlxuICAgIHN1YmdyYXBoIFNvdXJjZVtcIlNvdXJjZSBvZiBUcnV0aFwiXVxuICAgICAgICBldGNkWyhldGNkKV1cbiAgICAgICAgYXBpW0FQSSBTZXJ2ZXJdXG4gICAgZW5kXG5cbiAgICBzdWJncmFwaCBTdGF0ZWxlc3NbXCJTdGF0ZWxlc3MgQ29udHJvbGxlcnNcIl1cbiAgICAgICAgZGVwbG95W0RlcGxveW1lbnQgQ29udHJvbGxlcl1cbiAgICAgICAgcnNbUmVwbGljYVNldCBDb250cm9sbGVyXVxuICAgICAgICBuZ2lueFtOR0lOWCBJbmdyZXNzXVxuICAgICAgICBjZXJ0W2NlcnQtbWFuYWdlcl1cbiAgICBlbmRcblxuICAgIGV0Y2QgLS0tIGFwaVxuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgZGVwbG95XG4gICAgYXBpIC0tLXxyZWFkL3dyaXRlfCByc1xuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgbmdpbnhcbiAgICBhcGkgLS0tfHJlYWQvd3JpdGV8IGNlcnRcblxuICAgIHN0eWxlIGV0Y2QgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiAgICBzdHlsZSBhcGkgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBMUlxuICAgIHN1YmdyYXBoIFNvdXJjZVtcIlNvdXJjZSBvZiBUcnV0aFwiXVxuICAgICAgICBldGNkWyhldGNkKV1cbiAgICAgICAgYXBpW0FQSSBTZXJ2ZXJdXG4gICAgZW5kXG5cbiAgICBzdWJncmFwaCBTdGF0ZWxlc3NbXCJTdGF0ZWxlc3MgQ29udHJvbGxlcnNcIl1cbiAgICAgICAgZGVwbG95W0RlcGxveW1lbnQgQ29udHJvbGxlcl1cbiAgICAgICAgcnNbUmVwbGljYVNldCBDb250cm9sbGVyXVxuICAgICAgICBuZ2lueFtOR0lOWCBJbmdyZXNzXVxuICAgICAgICBjZXJ0W2NlcnQtbWFuYWdlcl1cbiAgICBlbmRcblxuICAgIGV0Y2QgLS0tIGFwaVxuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgZGVwbG95XG4gICAgYXBpIC0tLXxyZWFkL3dyaXRlfCByc1xuICAgIGFwaSAtLS18cmVhZC93cml0ZXwgbmdpbnhcbiAgICBhcGkgLS0tfHJlYWQvd3JpdGV8IGNlcnRcblxuICAgIHN0eWxlIGV0Y2QgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiAgICBzdHlsZSBhcGkgZmlsbDojZTY3NzAwLGNvbG9yOiNmZmZcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" alt="Stateless Controllers Architecture" width="688" 
height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete any controller pod → it restarts and catches up&lt;/li&gt;
&lt;li&gt;Move controllers between nodes → they just reconnect&lt;/li&gt;
&lt;li&gt;Scale controllers to multiple replicas → they coordinate via the API server&lt;/li&gt;
&lt;li&gt;Upgrade controllers → new version reads the same state&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Survives vs. What Doesn't
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Survives Any Pod Eviction
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Why It Survives&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes objects in etcd&lt;/td&gt;
&lt;td&gt;Stored independently of pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm releases&lt;/td&gt;
&lt;td&gt;Stored as secrets in etcd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operator-managed CRDs&lt;/td&gt;
&lt;td&gt;Reconciled by operator continuously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PersistentVolumes&lt;/td&gt;
&lt;td&gt;Storage exists outside the cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ConfigMaps/Secrets&lt;/td&gt;
&lt;td&gt;Stored in etcd&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Doesn't Survive Without Help
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Why It Doesn't Survive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod-local EmptyDir volumes&lt;/td&gt;
&lt;td&gt;Deleted with the pod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manually applied resources with missing dependencies&lt;/td&gt;
&lt;td&gt;Validation webhooks reject on recreation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In-memory caches&lt;/td&gt;
&lt;td&gt;Process restarts lose memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node-local state&lt;/td&gt;
&lt;td&gt;Tied to a single node and lost with it unless explicitly persisted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Elegance of the Design
&lt;/h2&gt;

&lt;p&gt;The Kubernetes architecture embodies several design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declarative over imperative&lt;/strong&gt; — Describe desired state, not steps to get there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconciliation over transactions&lt;/strong&gt; — Continuously converge to desired state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless controllers&lt;/strong&gt; — State lives in etcd, not in components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical supervision&lt;/strong&gt; — Every layer watches the layer above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure is normal&lt;/strong&gt; — Design for recovery, not prevention&lt;/li&gt;
&lt;/ol&gt;
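&lt;p&gt;The "reconciliation over transactions" principle can be sketched in a few lines of Go. This is a minimal illustration of level-triggered convergence, not code from any Kubernetes component; the &lt;code&gt;converge&lt;/code&gt; function and its replica counts are invented for the example.&lt;/p&gt;

```go
package main

import "fmt"

// converge moves the observed replica count one step toward the
// desired count, the way a reconciliation loop does: no transaction,
// no rollback, just repeated correction until the states match.
func converge(current, desired int) int {
	if current < desired {
		return current + 1 // scale up by one replica
	}
	if current > desired {
		return current - 1 // scale down by one replica
	}
	return current // already converged
}

func main() {
	current, desired := 1, 3
	for current != desired {
		current = converge(current, desired)
		fmt.Println("replicas:", current)
	}
}
```

&lt;p&gt;Because every iteration re-reads both states and makes only a small correction, a crash mid-loop costs nothing: the next run starts from the same persisted desired state and converges again.&lt;/p&gt;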

&lt;p&gt;This is why Kubernetes clusters can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lose nodes unexpectedly&lt;/li&gt;
&lt;li&gt;Have pods evicted for resource pressure&lt;/li&gt;
&lt;li&gt;Experience network partitions&lt;/li&gt;
&lt;li&gt;Undergo rolling upgrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and still maintain application availability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The journey from debugging a missing ingress to understanding the complete supervision hierarchy revealed the sophisticated machinery that makes Kubernetes resilient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd → kubelet → static pods → control plane → controllers → your apps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer supervises the next, with etcd as the persistent memory that survives any component failure.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Kubernetes doesn't prevent failures—it recovers from them automatically&lt;/strong&gt; through layers of supervision, persistent state in etcd, and continuous reconciliation loops.&lt;/p&gt;

&lt;p&gt;This is the true power of Kubernetes: not that things don't fail, but that when they do, the system knows how to restore itself to the desired state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1: When Our Ingress Vanished&lt;/a&gt;&lt;/strong&gt; — The incident that started it all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2: The Foundation&lt;/a&gt;&lt;/strong&gt; — systemd → kubelet → control plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b"&gt;Part 3: Controllers &amp;amp; Resilience&lt;/a&gt;&lt;/strong&gt; — Why Kubernetes self-heals&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/overview/components/" rel="noopener noreferrer"&gt;Kubernetes Components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/" rel="noopener noreferrer"&gt;Static Pods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/" rel="noopener noreferrer"&gt;Controller Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html" rel="noopener noreferrer"&gt;systemd Service Files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Found this series useful? Follow for more Kubernetes internals content!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>architecture</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 2: The Foundation — From systemd to Control Plane</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:37:43 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How Linux systemd supervises the kubelet process&lt;/li&gt;
&lt;li&gt;The role of static pods in bootstrapping the control plane&lt;/li&gt;
&lt;li&gt;How the controller manager implements reconciliation loops&lt;/li&gt;
&lt;li&gt;The complete 4-layer supervision model&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Previously
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li"&gt;Part 1&lt;/a&gt;, we investigated why a Grafana ingress disappeared after GKE node upgrades. The fix was straightforward: use Helm-managed resources instead of manual &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But that raised a deeper question: &lt;strong&gt;How do controllers themselves survive pod evictions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is a hierarchical supervision model—each layer watches the layer above it, ensuring continuous operation despite failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Layers of Kubernetes Supervision
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBUQlxuICAgIHN1YmdyYXBoIExheWVyMVtcIkxheWVyIDE6IExpbnV4IEZvdW5kYXRpb25cIl1cbiAgICAgICAgc3lzdGVtZFtcInN5c3RlbWQgKFBJRCAxKVwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIyW1wiTGF5ZXIgMjogTm9kZSBBZ2VudFwiXVxuICAgICAgICBrdWJlbGV0W1wia3ViZWxldFwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIzW1wiTGF5ZXIgMzogQ29udHJvbCBQbGFuZVwiXVxuICAgICAgICBhcGlbXCJrdWJlLWFwaXNlcnZlclwiXVxuICAgICAgICBjb250cm9sbGVyW1wia3ViZS1jb250cm9sbGVyLW1hbmFnZXJcIl1cbiAgICAgICAgc2NoZWR1bGVyW1wia3ViZS1zY2hlZHVsZXJcIl1cbiAgICAgICAgZXRjZFtcImV0Y2RcIl1cbiAgICBlbmRcblxuICAgIHN1YmdyYXBoIExheWVyNFtcIkxheWVyIDQ6IEFwcCBDb250cm9sbGVyc1wiXVxuICAgICAgICBuZ2lueFtcIk5HSU5YIEluZ3Jlc3NcIl1cbiAgICAgICAgY2VydG1ncltcImNlcnQtbWFuYWdlclwiXVxuICAgICAgICBwcm9tZXRoZXVzW1wiUHJvbWV0aGV1cyBPcGVyYXRvclwiXVxuICAgICAgICBoZWxtW1wiSGVsbSBSZWxlYXNlc1wiXVxuICAgIGVuZFxuXG4gICAgc3lzdGVtZCAtLT58c3VwZXJ2aXNlc3wga3ViZWxldFxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGFwaVxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGNvbnRyb2xsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBzY2hlZHVsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBldGNkXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgbmdpbnhcbiAgICBjb250cm9sbGVyIC0tPnxyZWNvbmNpbGVzfCBjZXJ0bWdyXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgcHJvbWV0aGV1c1xuXG4gICAgc3R5bGUgTGF5ZXIxIGZpbGw6IzE4NjRhYixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIyIGZpbGw6IzE5NzFjMixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIzIGZpbGw6IzIyOGJlNixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXI0IGZpbGw6IzMzOWFmMCxjb2xvcjojZmZmXG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBUQlxuICAgIHN1YmdyYXBoIExheWVyMVtcIkxheWVyIDE6IExpbnV4IEZvdW5kYXRpb25cIl1cbiAgICAgICAgc3lzdGVtZFtcInN5c3RlbWQgKFBJRCAxKVwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIyW1wiTGF5ZXIgMjogTm9kZSBBZ2VudFwiXVxuICAgICAgICBrdWJlbGV0W1wia3ViZWxldFwiXVxuICAgIGVuZFxuXG4gICAgc3ViZ3JhcGggTGF5ZXIzW1wiTGF5ZXIgMzogQ29udHJvbCBQbGFuZVwiXVxuICAgICAgICBhcGlbXCJrdWJlLWFwaXNlcnZlclwiXVxuICAgICAgICBjb250cm9sbGVyW1wia3ViZS1jb250cm9sbGVyLW1hbmFnZXJcIl1cbiAgICAgICAgc2NoZWR1bGVyW1wia3ViZS1zY2hlZHVsZXJcIl1cbiAgICAgICAgZXRjZFtcImV0Y2RcIl1cbiAgICBlbmRcblxuICAgIHN1YmdyYXBoIExheWVyNFtcIkxheWVyIDQ6IEFwcCBDb250cm9sbGVyc1wiXVxuICAgICAgICBuZ2lueFtcIk5HSU5YIEluZ3Jlc3NcIl1cbiAgICAgICAgY2VydG1ncltcImNlcnQtbWFuYWdlclwiXVxuICAgICAgICBwcm9tZXRoZXVzW1wiUHJvbWV0aGV1cyBPcGVyYXRvclwiXVxuICAgICAgICBoZWxtW1wiSGVsbSBSZWxlYXNlc1wiXVxuICAgIGVuZFxuXG4gICAgc3lzdGVtZCAtLT58c3VwZXJ2aXNlc3wga3ViZWxldFxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGFwaVxuICAgIGt1YmVsZXQgLS0%2BfG1hbmFnZXMgc3RhdGljIHBvZHN8IGNvbnRyb2xsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBzY2hlZHVsZXJcbiAgICBrdWJlbGV0IC0tPnxtYW5hZ2VzIHN0YXRpYyBwb2RzfCBldGNkXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgbmdpbnhcbiAgICBjb250cm9sbGVyIC0tPnxyZWNvbmNpbGVzfCBjZXJ0bWdyXG4gICAgY29udHJvbGxlciAtLT58cmVjb25jaWxlc3wgcHJvbWV0aGV1c1xuXG4gICAgc3R5bGUgTGF5ZXIxIGZpbGw6IzE4NjRhYixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIyIGZpbGw6IzE5NzFjMixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXIzIGZpbGw6IzIyOGJlNixjb2xvcjojZmZmXG4gICAgc3R5bGUgTGF5ZXI0IGZpbGw6IzMzOWFmMCxjb2xvcjojZmZmXG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" alt="The Four Layers of Kubernetes Supervision" width="979" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we'll explore &lt;strong&gt;Layers 1-3&lt;/strong&gt;. Part 3 covers Layer 4 and the complete resilience model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: The Linux Foundation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  systemd — The Root Supervisor
&lt;/h3&gt;

&lt;p&gt;At the very bottom of the stack is &lt;strong&gt;systemd&lt;/strong&gt;, the init system running as PID 1 on most modern Linux distributions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On a Kubernetes node&lt;/span&gt;
ps aux | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;span class="c"&gt;# USER  PID  COMMAND&lt;/span&gt;
&lt;span class="c"&gt;# root    1  /sbin/init (systemd)&lt;/span&gt;
&lt;span class="c"&gt;# root  ...  /usr/bin/kubelet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;systemd's job is simple but critical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start services in the correct order at boot&lt;/li&gt;
&lt;li&gt;Monitor services and restart them if they crash&lt;/li&gt;
&lt;li&gt;Provide dependency management between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The kubelet runs as a systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/kubelet.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;kubelet: The Kubernetes Node Agent&lt;/span&gt;
&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://kubernetes.io/docs/&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/kubelet &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--config=/var/lib/kubelet/config.yaml &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--kubeconfig=/etc/kubernetes/kubelet.conf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--container-runtime-endpoint=unix:///run/containerd/containerd.sock&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line: &lt;strong&gt;&lt;code&gt;Restart=always&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If kubelet crashes, systemd restarts it after a 10-second delay (&lt;code&gt;RestartSec=10&lt;/code&gt;). This is the foundation of Kubernetes resilience—the node agent is supervised by the operating system itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View kubelet status&lt;/span&gt;
systemctl status kubelet

&lt;span class="c"&gt;# Watch kubelet restart after killing it (don't do this in production!)&lt;/span&gt;
&lt;span class="nb"&gt;sudo kill&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep kubelet&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# systemd will restart it automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layer 2: The Node Agent
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kubelet — The Pod Supervisor
&lt;/h3&gt;

&lt;p&gt;kubelet is the Kubernetes agent running on every node. It has two critical responsibilities:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Running Static Pods
&lt;/h4&gt;

&lt;p&gt;kubelet watches a directory (typically &lt;code&gt;/etc/kubernetes/manifests/&lt;/code&gt;) for pod manifests and runs them directly—no API server required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; /etc/kubernetes/manifests/
&lt;span class="c"&gt;# etcd.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-apiserver.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-controller-manager.yaml&lt;/span&gt;
&lt;span class="c"&gt;# kube-scheduler.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how the control plane bootstraps itself. Pods can't be scheduled through the API server before the API server itself exists, so kubelet runs these components directly from manifest files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/kubernetes/manifests/kube-apiserver.yaml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hostNetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/kube-apiserver:v1.28.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--etcd-servers=https://127.0.0.1:2379&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--service-cluster-ip-range=10.96.0.0/12&lt;/span&gt;
    &lt;span class="c1"&gt;# ... more flags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Running API-Scheduled Pods
&lt;/h4&gt;

&lt;p&gt;Once the control plane is running, kubelet also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches the API server for pods scheduled to its node&lt;/li&gt;
&lt;li&gt;Starts containers via the container runtime (containerd)&lt;/li&gt;
&lt;li&gt;Reports pod status back to the API server&lt;/li&gt;
&lt;li&gt;Restarts failed containers based on &lt;code&gt;restartPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
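&lt;p&gt;The last bullet is worth making concrete. Below is a hedged Go sketch of the documented &lt;code&gt;restartPolicy&lt;/code&gt; semantics; the &lt;code&gt;shouldRestart&lt;/code&gt; helper is hypothetical, and the real kubelet additionally applies exponential backoff between restarts.&lt;/p&gt;

```go
package main

import "fmt"

// shouldRestart mirrors the documented restartPolicy semantics:
// Always restarts any exited container, OnFailure restarts only
// containers that exited non-zero, and Never restarts nothing.
// Illustrative only; the real kubelet also tracks restart backoff.
func shouldRestart(policy string, exitCode int) bool {
	switch policy {
	case "Always":
		return true
	case "OnFailure":
		return exitCode != 0
	default: // "Never"
		return false
	}
}

func main() {
	fmt.Println(shouldRestart("Always", 0))    // true
	fmt.Println(shouldRestart("OnFailure", 0)) // false
	fmt.Println(shouldRestart("OnFailure", 1)) // true
}
```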

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogInNlcXVlbmNlRGlhZ3JhbVxuICAgIHBhcnRpY2lwYW50IFMgYXMgU2NoZWR1bGVyXG4gICAgcGFydGljaXBhbnQgQVBJIGFzIEFQSSBTZXJ2ZXJcbiAgICBwYXJ0aWNpcGFudCBLIGFzIGt1YmVsZXRcbiAgICBwYXJ0aWNpcGFudCBDIGFzIGNvbnRhaW5lcmRcblxuICAgIFMtPj5BUEk6IEJpbmQgcG9kIHRvIG5vZGUtMVxuICAgIEFQSS0%2BPks6IFdhdGNoIGV2ZW50OiBuZXcgcG9kXG4gICAgSy0%2BPkM6IENyZWF0ZSBjb250YWluZXJcbiAgICBDLS0%2BPks6IENvbnRhaW5lciBydW5uaW5nXG4gICAgSy0%2BPkFQSTogVXBkYXRlIHBvZCBzdGF0dXM6IFJ1bm5pbmdcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogInNlcXVlbmNlRGlhZ3JhbVxuICAgIHBhcnRpY2lwYW50IFMgYXMgU2NoZWR1bGVyXG4gICAgcGFydGljaXBhbnQgQVBJIGFzIEFQSSBTZXJ2ZXJcbiAgICBwYXJ0aWNpcGFudCBLIGFzIGt1YmVsZXRcbiAgICBwYXJ0aWNpcGFudCBDIGFzIGNvbnRhaW5lcmRcblxuICAgIFMtPj5BUEk6IEJpbmQgcG9kIHRvIG5vZGUtMVxuICAgIEFQSS0%2BPks6IFdhdGNoIGV2ZW50OiBuZXcgcG9kXG4gICAgSy0%2BPkM6IENyZWF0ZSBjb250YWluZXJcbiAgICBDLS0%2BPks6IENvbnRhaW5lciBydW5uaW5nXG4gICAgSy0%2BPkFQSTogVXBkYXRlIHBvZCBzdGF0dXM6IFJ1bm5pbmdcbiIsCiAgIm1lcm1haWQiOiB7CiAgICAidGhlbWUiOiAiZGVmYXVsdCIKICB9Cn0K" alt="Pod Scheduling Sequence" width="898" height="399"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: The Control Plane
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Static Pods — The Bootstrap Layer
&lt;/h3&gt;

&lt;p&gt;The control plane runs as static pods managed directly by kubelet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;etcd&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed key-value store; holds all cluster state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-apiserver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API frontend; all components communicate through it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-controller-manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs built-in controllers (Deployment, ReplicaSet, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kube-scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Assigns pods to nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These components form a supervision loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubelet ensures static pods are running&lt;/li&gt;
&lt;li&gt;Control plane components use etcd for persistence&lt;/li&gt;
&lt;li&gt;If a component crashes, kubelet restarts it&lt;/li&gt;
&lt;li&gt;State is never lost because it's in etcd&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  kube-controller-manager — The Reconciliation Engine
&lt;/h3&gt;

&lt;p&gt;The controller manager runs dozens of controllers, each implementing the &lt;strong&gt;reconciliation pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Simplified reconciliation loop&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DeploymentController&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// 1. Get desired state from API server (backed by etcd)&lt;/span&gt;
        &lt;span class="n"&gt;deployment&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetDeployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Replicas&lt;/span&gt;

        &lt;span class="c"&gt;// 2. Get current state&lt;/span&gt;
        &lt;span class="n"&gt;replicaSets&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListReplicaSets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;countReadyReplicas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replicaSets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// 3. Reconcile&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaleUp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentReplicas&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;desiredReplicas&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaleDown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// 4. Repeat&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reconciliationInterval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key controllers and what they manage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Controller&lt;/th&gt;
&lt;th&gt;Watches&lt;/th&gt;
&lt;th&gt;Ensures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;Correct ReplicaSets exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReplicaSet&lt;/td&gt;
&lt;td&gt;ReplicaSets&lt;/td&gt;
&lt;td&gt;Correct number of pods exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;StatefulSets&lt;/td&gt;
&lt;td&gt;Pods with stable identities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DaemonSet&lt;/td&gt;
&lt;td&gt;DaemonSets&lt;/td&gt;
&lt;td&gt;One pod per matching node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job&lt;/td&gt;
&lt;td&gt;Jobs&lt;/td&gt;
&lt;td&gt;Pods run to completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;Services + Pods&lt;/td&gt;
&lt;td&gt;Endpoints are updated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Foundation is Set
&lt;/h2&gt;

&lt;p&gt;We've now covered the first three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;systemd&lt;/strong&gt; supervises kubelet (Restart=always)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelet&lt;/strong&gt; runs static pods from &lt;code&gt;/etc/kubernetes/manifests/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt; components persist state in etcd and reconcile continuously&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But what about &lt;strong&gt;your&lt;/strong&gt; controllers—NGINX Ingress, cert-manager, Prometheus Operator? How do they survive pod evictions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Part 3&lt;/strong&gt;, we'll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How application controllers persist through evictions&lt;/li&gt;
&lt;li&gt;The complete persistence chain from hardware to application&lt;/li&gt;
&lt;li&gt;Why controllers are stateless (and why that matters)&lt;/li&gt;
&lt;li&gt;What survives pod evictions vs. what doesn't&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Next in this series:&lt;/strong&gt; &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-3-controllers-resilience-why-kubernetes-self-heals-392b"&gt;Part 3: Controllers &amp;amp; Resilience — Why Kubernetes Self-Heals&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>linux</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Kubernetes Persistence Series Part 1: When Our Ingress Vanished After a Node Upgrade</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Sun, 11 Jan 2026 00:32:00 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li</link>
      <guid>https://forem.com/vincentdu2021/kubernetes-persistence-series-part-1-when-our-ingress-vanished-after-a-node-upgrade-17li</guid>
      <description>&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why manually-applied Kubernetes resources can disappear after pod evictions&lt;/li&gt;
&lt;li&gt;How NGINX Ingress admission webhooks validate resources&lt;/li&gt;
&lt;li&gt;The difference between controller-managed and manually-applied resources&lt;/li&gt;
&lt;li&gt;Why Helm-managed resources survive node disruptions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem That Started This Journey
&lt;/h2&gt;

&lt;p&gt;It was a regular Monday morning until the alerts fired: &lt;strong&gt;Grafana was unreachable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When GKE performed automatic node upgrades, our monitoring dashboard disappeared. The investigation that followed revealed a fascinating chain of dependencies—and ultimately led to understanding the elegant hierarchical supervision model that keeps Kubernetes running.&lt;/p&gt;

&lt;p&gt;But first, let's solve the immediate problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Incident: Why Ingress Disappeared
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;The sequence of events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GKE automatically upgraded nodes&lt;/strong&gt; (routine security patches)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nodes were drained&lt;/strong&gt;, causing pod evictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NGINX Ingress Controller pod was evicted&lt;/strong&gt; and restarted on a new node&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grafana ingress resource disappeared&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service became inaccessible&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The puzzling part: why would an &lt;em&gt;Ingress resource&lt;/em&gt; disappear when only &lt;em&gt;pods&lt;/em&gt; were evicted? Ingress is a Kubernetes object stored in etcd—it shouldn't just vanish.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if the ingress exists&lt;/span&gt;
kubectl get ingress &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring
&lt;span class="c"&gt;# No resources found&lt;/span&gt;

&lt;span class="c"&gt;# Check the NGINX controller logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx deploy/ingress-nginx-controller | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logs revealed admission webhook failures during the controller restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Discovery
&lt;/h3&gt;

&lt;p&gt;The ingress disappeared because of a &lt;strong&gt;perfect storm&lt;/strong&gt; of issues:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBURFxuICAgIEFbTm9kZSBVcGdyYWRlIFRyaWdnZXJlZF0gLS0%2BIEJbUG9kcyBFdmljdGVkXVxuICAgIEIgLS0%2BIENbTkdJTlggQ29udHJvbGxlciBSZXN0YXJ0c11cbiAgICBDIC0tPiBEW0NvbnRyb2xsZXIgVmFsaWRhdGVzIEV4aXN0aW5nIEluZ3Jlc3Nlc11cbiAgICBEIC0tPiBFe1RMUyBTZWNyZXQgRXhpc3RzP31cbiAgICBFIC0tPnxOb3wgRltWYWxpZGF0aW9uIEZhaWxzXVxuICAgIEYgLS0%2BIEdbSW5ncmVzcyBSZWplY3RlZC9SZW1vdmVkXVxuICAgIEUgLS0%2BfFllc3wgSFtJbmdyZXNzIEhlYWx0aHldXG4gICAgc3R5bGUgRiBmaWxsOiNmZjZiNmJcbiAgICBzdHlsZSBHIGZpbGw6I2ZmNmI2YlxuICAgIHN0eWxlIEggZmlsbDojNTFjZjY2XG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fbase64%3AewogICJjb2RlIjogImZsb3djaGFydCBURFxuICAgIEFbTm9kZSBVcGdyYWRlIFRyaWdnZXJlZF0gLS0%2BIEJbUG9kcyBFdmljdGVkXVxuICAgIEIgLS0%2BIENbTkdJTlggQ29udHJvbGxlciBSZXN0YXJ0c11cbiAgICBDIC0tPiBEW0NvbnRyb2xsZXIgVmFsaWRhdGVzIEV4aXN0aW5nIEluZ3Jlc3Nlc11cbiAgICBEIC0tPiBFe1RMUyBTZWNyZXQgRXhpc3RzP31cbiAgICBFIC0tPnxOb3wgRltWYWxpZGF0aW9uIEZhaWxzXVxuICAgIEYgLS0%2BIEdbSW5ncmVzcyBSZWplY3RlZC9SZW1vdmVkXVxuICAgIEUgLS0%2BfFllc3wgSFtJbmdyZXNzIEhlYWx0aHldXG4gICAgc3R5bGUgRiBmaWxsOiNmZjZiNmJcbiAgICBzdHlsZSBHIGZpbGw6I2ZmNmI2YlxuICAgIHN0eWxlIEggZmlsbDojNTFjZjY2XG4iLAogICJtZXJtYWlkIjogewogICAgInRoZW1lIjogImRlZmF1bHQiCiAgfQp9Cg%3D%3D" alt="Root Cause Flowchart" width="447" height="878"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chain of failures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TLS Secret was missing&lt;/strong&gt; — It was manually copied to the cluster months ago, not managed by any controller. When the namespace was recreated during troubleshooting, the secret didn't come back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NGINX Admission Webhook&lt;/strong&gt; — The NGINX Ingress Controller includes a validating webhook that checks ingress resources on creation and updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation Failed&lt;/strong&gt; — Without the TLS secret referenced in the ingress spec, the webhook rejected the ingress as invalid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Reconciliation&lt;/strong&gt; — The ingress was created via &lt;code&gt;kubectl apply&lt;/code&gt; (not Helm or an operator), so nothing knew to recreate it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
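&lt;p&gt;To make failure #3 concrete, here is a hedged sketch of the kind of check the webhook performs. &lt;code&gt;validateIngressTLS&lt;/code&gt; is a hypothetical function written for this post; the actual NGINX admission webhook validates the fully rendered configuration, not just secret references.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// validateIngressTLS models the check that bit us: every TLS
// secretName referenced by an ingress must resolve to a secret
// that actually exists in the namespace.
func validateIngressTLS(referencedSecrets []string, existing map[string]bool) error {
	for _, name := range referencedSecrets {
		if !existing[name] {
			return errors.New("TLS secret not found: " + name)
		}
	}
	return nil
}

func main() {
	existing := map[string]bool{} // grafana-tls was never recreated
	err := validateIngressTLS([]string{"grafana-tls"}, existing)
	fmt.Println(err) // TLS secret not found: grafana-tls
}
```

&lt;p&gt;With the secret missing and no controller re-applying the ingress after rejection, the resource simply stayed gone.&lt;/p&gt;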
&lt;h3&gt;
  
  
  The "Aha" Moment
&lt;/h3&gt;

&lt;p&gt;The real issue wasn't the node upgrade—it was our &lt;strong&gt;resource management approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Our original ingress (manually applied)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
  &lt;span class="c1"&gt;# No owner reference&lt;/span&gt;
  &lt;span class="c1"&gt;# No Helm labels&lt;/span&gt;
  &lt;span class="c1"&gt;# No operator management&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;  &lt;span class="c1"&gt;# This secret was also manually created!&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this ingress needed to be recreated, nothing knew it should exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Helm-Managed Resources
&lt;/h2&gt;

&lt;p&gt;We solved this by migrating to Helm charts with native ingress support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before: manually applied resources scattered across yaml files&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; grafana-ingress.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; grafana-tls-secret.yaml

&lt;span class="c"&gt;# After: Helm manages everything as a single release&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; monitoring prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.hosts[0]&lt;span class="o"&gt;=&lt;/span&gt;grafana.prod.example.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.tls[0].secretName&lt;span class="o"&gt;=&lt;/span&gt;grafana-tls &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.ingress.tls[0].hosts[0]&lt;span class="o"&gt;=&lt;/span&gt;grafana.prod.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;Helm stores release state in Kubernetes secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secrets &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;helm
&lt;span class="c"&gt;# NAME                                    TYPE                 DATA&lt;/span&gt;
&lt;span class="c"&gt;# sh.helm.release.v1.monitoring.v1       helm.sh/release.v1   1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Helm knows what resources should exist&lt;/li&gt;
&lt;li&gt;✅ &lt;code&gt;helm upgrade&lt;/code&gt; recreates missing resources&lt;/li&gt;
&lt;li&gt;✅ Resources are versioned and can be rolled back&lt;/li&gt;
&lt;li&gt;✅ Dependencies (like TLS secrets) are managed together&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For the TLS Secret
&lt;/h3&gt;

&lt;p&gt;We also moved TLS management to cert-manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cert-manager.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Certificate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-tls&lt;/span&gt;
  &lt;span class="na"&gt;issuerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;letsencrypt-prod&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIssuer&lt;/span&gt;
  &lt;span class="na"&gt;dnsNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana.prod.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now cert-manager (an operator) ensures the TLS secret always exists and stays renewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Survives Pod Evictions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource Type&lt;/th&gt;
&lt;th&gt;Survives?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Helm-managed resources&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;State stored in release secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operator-managed CRs&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Operator reconciles continuously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources with owner references&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Parent controller recreates them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manually &lt;code&gt;kubectl apply&lt;/code&gt;'d resources&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;Survives in etcd, but won't be recreated if deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources referencing missing dependencies&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Validation webhooks may reject them&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Best Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never manually apply production resources&lt;/strong&gt; — Use Helm, Kustomize, or GitOps tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage secrets with operators&lt;/strong&gt; — External Secrets, cert-manager, Sealed Secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand admission webhooks&lt;/strong&gt; — They validate resources on every create/update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test node disruptions&lt;/strong&gt; — Use &lt;code&gt;kubectl drain&lt;/code&gt; in staging regularly&lt;/li&gt;
&lt;/ol&gt;
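&lt;p&gt;A quick way to act on practice #1 is to audit what is currently unmanaged. Below is a minimal Python sketch that flags resources carrying no Helm labels, no operator annotations, and no owner references; the inline &lt;code&gt;sample&lt;/code&gt; stands in for real &lt;code&gt;kubectl get ingress -A -o json&lt;/code&gt; output, and the marker list is illustrative, not exhaustive:&lt;/p&gt;

```python
import json

# Heuristic markers of managed resources (illustrative, not exhaustive)
MANAGED_MARKERS = [
    ("labels", "app.kubernetes.io/managed-by"),    # set by Helm and many operators
    ("annotations", "meta.helm.sh/release-name"),  # Helm ownership annotation
]

def unmanaged(items):
    """Return namespace/name of resources with no management markers and no ownerReferences."""
    flagged = []
    for item in items:
        meta = item.get("metadata", {})
        if meta.get("ownerReferences"):
            continue  # a parent controller will recreate this
        if any(key in meta.get(field, {}) for field, key in MANAGED_MARKERS):
            continue  # Helm or an operator owns this
        flagged.append(f'{meta.get("namespace", "default")}/{meta.get("name", "?")}')
    return flagged

# Inline sample standing in for `kubectl get ingress -A -o json`
sample = {"items": [
    {"metadata": {"name": "grafana", "namespace": "monitoring"}},  # manually applied
    {"metadata": {"name": "api", "namespace": "prod",
                  "labels": {"app.kubernetes.io/managed-by": "Helm"}}},
]}
print(unmanaged(sample["items"]))  # flags only the manually applied ingress
```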

&lt;h2&gt;
  
  
  The Deeper Question
&lt;/h2&gt;

&lt;p&gt;This incident was resolved, but it raised a fundamental question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How do controllers like Helm, NGINX Ingress, and cert-manager survive pod evictions themselves? What ensures THEY come back?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer involves a beautiful hierarchical supervision model that goes all the way down to Linux PID 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Part 2&lt;/strong&gt;, we'll explore the complete Kubernetes persistence chain—from Linux systemd to application controllers—and understand why Kubernetes is designed to assume failure is normal.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experienced similar "ghost" resources disappearing in Kubernetes? Share your war stories in the comments!&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next in this series:&lt;/strong&gt; &lt;a href="https://dev.to/vincentdu2021/kubernetes-persistence-series-part-2-the-foundation-from-systemd-to-control-plane-2464"&gt;Part 2: The Foundation — From systemd to Control Plane&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>gke</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Wed, 07 Jan 2026 21:19:00 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob</link>
      <guid>https://forem.com/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob</guid>
      <description>&lt;h1&gt;
  
  
  Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS
&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://dev.to/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n"&gt;Part 1&lt;/a&gt;, I built &lt;strong&gt;uring-sync&lt;/strong&gt;—a file copier that's 4.2x faster than &lt;code&gt;cp&lt;/code&gt; for local copies using io_uring. Now I've added &lt;strong&gt;network transfer&lt;/strong&gt; with kernel TLS encryption, achieving &lt;strong&gt;58% faster transfers than rsync&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: SSH is the Bottleneck
&lt;/h2&gt;

&lt;p&gt;When transferring ML datasets between machines, rsync over SSH is the go-to tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rsync &lt;span class="nt"&gt;-az&lt;/span&gt; /data/ml_dataset user@server:/backup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works, but it's slow. For a 9.7GB dataset (100K files), rsync took &lt;strong&gt;390 seconds&lt;/strong&gt;—a throughput of just 25 MB/s.&lt;/p&gt;
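&lt;p&gt;That throughput figure falls straight out of the raw numbers (using 1 GB = 1024 MB):&lt;/p&gt;

```python
# Throughput of the rsync baseline for the 9.7 GB, 100K-file dataset
size_mb = 9.7 * 1024       # dataset size in MB
rsync_seconds = 390        # measured rsync wall time
throughput = size_mb / rsync_seconds
print(f"{throughput:.1f} MB/s")  # prints: 25.5 MB/s
```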

&lt;p&gt;The bottleneck isn't the network. It's &lt;strong&gt;encryption in userspace&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  File   │────▶│ rsync   │────▶│  SSH    │────▶│ Network │
│  Read   │     │ (delta) │     │ encrypt │     │  Send   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
                                     │
                              Context switches,
                              userspace copies,
                              CPU-bound AES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every byte passes through the SSH process, which encrypts it using OpenSSL in userspace. This involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple context switches between kernel and userspace&lt;/li&gt;
&lt;li&gt;Copying data between kernel buffers and userspace buffers&lt;/li&gt;
&lt;li&gt;CPU time for AES encryption (even with AES-NI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: kTLS (Kernel TLS)
&lt;/h2&gt;

&lt;p&gt;Linux 4.13+ supports &lt;strong&gt;kTLS&lt;/strong&gt;—TLS encryption handled directly in the kernel. Once you set up the TLS session, the kernel encrypts data as it flows through the socket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌─────────┐     ┌──────────────────┐
│  File   │────▶│  read   │────▶│ Socket (kTLS)    │
│         │     │         │     │ encrypt + send   │
└─────────┘     └─────────┘     └──────────────────┘
                                        │
                                 One kernel operation,
                                 no userspace copies,
                                 AES-NI in kernel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No userspace encryption process&lt;/strong&gt; - kernel handles it directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer copies&lt;/strong&gt; - data doesn't bounce through userspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AES-NI in kernel&lt;/strong&gt; - hardware acceleration without context switches&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Setting up kTLS requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS handshake&lt;/strong&gt; - Exchange keys (we use a pre-shared secret + HKDF)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure kernel&lt;/strong&gt; - &lt;code&gt;setsockopt(SOL_TLS, TLS_TX, ...)&lt;/code&gt; with cipher keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send data&lt;/strong&gt; - Regular &lt;code&gt;send()&lt;/code&gt; calls, kernel encrypts automatically
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// After deriving keys from shared secret...&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;tls12_crypto_info_aes_gcm_128&lt;/span&gt; &lt;span class="n"&gt;crypto_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TLS_1_2_VERSION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cipher_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TLS_CIPHER_AES_GCM_128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_TLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TLS_TX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crypto_info&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// Now all send() calls are automatically encrypted!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Testing on real network: Laptop → GCP VM (public internet)&lt;/p&gt;

&lt;h3&gt;
  
  
  The Headline Number
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;uring-sync + kTLS&lt;/th&gt;
&lt;th&gt;rsync (SSH)&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml_small (60MB, 10K files)&lt;/td&gt;
&lt;td&gt;2.98s&lt;/td&gt;
&lt;td&gt;2.63s&lt;/td&gt;
&lt;td&gt;~equal (rsync slightly ahead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_large (589MB, 100K files)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24.8s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_images (9.7GB, 100K files)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;165s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;390s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data size:    60MB  →  589MB  →   9.7GB
Improvement:   0%   →   34%   →    58%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The larger the transfer, the bigger the kTLS advantage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Per-connection overhead (handshake, key derivation) is amortized over more data. And SSH's userspace encryption overhead grows linearly with data size.&lt;/p&gt;
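&lt;p&gt;The trend is easy to verify from the benchmark table; note that the smallest dataset actually pays a slight penalty for connection setup:&lt;/p&gt;

```python
# Relative improvement over rsync: 1 - t_ktls / t_rsync, from the table above
results = {
    "ml_small":  (2.98, 2.63),   # (kTLS seconds, rsync seconds)
    "ml_large":  (16.4, 24.8),
    "ml_images": (165, 390),
}
for name, (t_ktls, t_rsync) in results.items():
    improvement = 1 - t_ktls / t_rsync
    print(f"{name}: {improvement:+.0%}")
```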

&lt;h3&gt;
  
  
  Throughput Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;CPU Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;rsync (SSH)&lt;/td&gt;
&lt;td&gt;25 MB/s&lt;/td&gt;
&lt;td&gt;High (userspace encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uring-sync + kTLS&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (kernel encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;kTLS achieves &lt;strong&gt;2.4x the throughput&lt;/strong&gt; of rsync while using less CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Zero-Copy Splice?
&lt;/h2&gt;

&lt;p&gt;In theory, kTLS supports splice() for true zero-copy transfers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File → Pipe → kTLS Socket (no userspace copies!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I implemented this and expected it to be fastest. Instead, it was &lt;strong&gt;2.9x slower&lt;/strong&gt;.&lt;/p&gt;
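&lt;p&gt;The 2.9x figure uses the fastest configuration, plaintext read/send (146s, see the appendix), as the baseline:&lt;/p&gt;

```python
# kTLS + splice vs the fastest configuration (plaintext + read/send)
baseline_seconds = 146     # plaintext + read/send, from the appendix
ktls_splice_seconds = 428  # kTLS + splice, same dataset
print(f"{ktls_splice_seconds / baseline_seconds:.1f}x slower")  # prints: 2.9x slower
```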

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;Using strace, I found the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;splice(file→pipe):   27μs    ← instant
splice(pipe→socket): 33ms    ← 1000x slower!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;splice(pipe → kTLS socket)&lt;/code&gt; call &lt;strong&gt;blocks&lt;/strong&gt; waiting for TCP ACKs. The kernel can't buffer aggressively like it does with regular &lt;code&gt;send()&lt;/code&gt; calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Lesson
&lt;/h3&gt;

&lt;p&gt;Zero-copy isn't always faster. For many-file workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;read/send&lt;/strong&gt;: Kernel manages buffering efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;splice&lt;/strong&gt;: Blocks on each chunk, killing throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splice might help for single huge files, but for ML datasets (many small files), stick with read/send.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use kTLS file transfer when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transferring large datasets (&amp;gt;500MB)&lt;/li&gt;
&lt;li&gt;Network has bandwidth to spare&lt;/li&gt;
&lt;li&gt;You control both endpoints&lt;/li&gt;
&lt;li&gt;Security is required (not just over VPN)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stick with rsync when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need delta sync (only changed bytes)&lt;/li&gt;
&lt;li&gt;Destination already has partial data&lt;/li&gt;
&lt;li&gt;SSH infrastructure is already in place&lt;/li&gt;
&lt;li&gt;Simplicity matters more than speed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Protocol
&lt;/h2&gt;

&lt;p&gt;Our wire protocol is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HELLO (secret hash) ──────────────────▶ Verify
                    ◀────────────────── HELLO_OK (+ enable kTLS)

FILE_HDR (path, size, mode) ──────────▶ Create file
FILE_DATA (chunks) ────────────────────▶ Write data
FILE_END ──────────────────────────────▶ Close file

(repeat for all files)

ALL_DONE ──────────────────────────────▶ Complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No delta encoding, no checksums (kTLS provides integrity via GCM). Just raw file transfer with authentication and encryption.&lt;/p&gt;
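&lt;p&gt;As a sketch of what such framing can look like (an illustrative layout, not uring-sync's exact wire format), a &lt;code&gt;FILE_HDR&lt;/code&gt; frame can be packed and unpacked with fixed-width fields:&lt;/p&gt;

```python
import struct

# Hypothetical framing: u8 type | u32 path length | u64 file size | u32 mode | path bytes
FILE_HDR = 2
HDR_FMT = "!BIQI"  # network byte order

def pack_file_hdr(path: str, size: int, mode: int) -> bytes:
    """Serialize a FILE_HDR frame for one file."""
    raw = path.encode()
    return struct.pack(HDR_FMT, FILE_HDR, len(raw), size, mode) + raw

def unpack_file_hdr(frame: bytes):
    """Parse a FILE_HDR frame back into (type, path, size, mode)."""
    msg_type, path_len, size, mode = struct.unpack_from(HDR_FMT, frame)
    offset = struct.calcsize(HDR_FMT)
    path = frame[offset:offset + path_len].decode()
    return msg_type, path, size, mode

frame = pack_file_hdr("data/train/img_000.jpg", 131072, 0o644)
print(unpack_file_hdr(frame))
```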

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Receiver (on remote host)&lt;/span&gt;
uring-sync recv /backup &lt;span class="nt"&gt;--listen&lt;/span&gt; 9999 &lt;span class="nt"&gt;--secret&lt;/span&gt; mykey &lt;span class="nt"&gt;--tls&lt;/span&gt;

&lt;span class="c"&gt;# Sender (on local host)&lt;/span&gt;
uring-sync send /data remote-host:9999 &lt;span class="nt"&gt;--secret&lt;/span&gt; mykey &lt;span class="nt"&gt;--tls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HKDF for key derivation from shared secret&lt;/li&gt;
&lt;li&gt;AES-128-GCM via kTLS&lt;/li&gt;
&lt;li&gt;Simple TCP protocol (no HTTP, no gRPC)&lt;/li&gt;
&lt;/ul&gt;
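&lt;p&gt;For reference, HKDF (RFC 5869) is small enough to sketch with just the standard library. The output sizes below match what &lt;code&gt;TLS_CIPHER_AES_GCM_128&lt;/code&gt; expects (16-byte key, 8-byte IV, 4-byte salt); the salt and info labels are placeholders, not the tool's actual values:&lt;/p&gt;

```python
import hashlib
import hmac

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step: PRK = HMAC-SHA256(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: stretch PRK to `length` bytes."""
    okm, block = b"", b""
    for counter in range(1, -(-length // 32) + 1):  # ceil(length / hash_len) rounds
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

# Derive the material TLS_CIPHER_AES_GCM_128 needs from a shared secret.
# The salt and info labels here are illustrative placeholders.
prk = hkdf_extract(b"uring-sync-v1", b"mykey")
key = hkdf_expand(prk, b"tls-key", 16)    # AES-128-GCM key
iv = hkdf_expand(prk, b"tls-iv", 8)       # per-record IV
salt = hkdf_expand(prk, b"tls-salt", 4)   # implicit nonce salt
print(len(key), len(iv), len(salt))  # prints: 16 8 4
```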

&lt;p&gt;Full source: &lt;a href="https://github.com/VincentDu2021/uring_sync" rel="noopener noreferrer"&gt;github.com/VincentDu2021/uring_sync&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By moving encryption from userspace SSH to kernel kTLS, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;58% faster&lt;/strong&gt; than rsync for large transfers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.4x throughput&lt;/strong&gt; (60 MB/s vs 25 MB/s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower CPU usage&lt;/strong&gt; (kernel AES-NI vs userspace OpenSSL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: for bulk data transfer, SSH's flexibility is overhead. A purpose-built tool with kernel encryption wins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Full Benchmark Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Environment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sender: Ubuntu laptop, local NVMe&lt;/li&gt;
&lt;li&gt;Receiver: GCP VM (us-central1-a)&lt;/li&gt;
&lt;li&gt;Network: Public internet&lt;/li&gt;
&lt;li&gt;All tests with cold cache (&lt;code&gt;echo 3 &amp;gt; /proc/sys/vm/drop_caches&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Raw Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;kTLS Time&lt;/th&gt;
&lt;th&gt;kTLS Speed&lt;/th&gt;
&lt;th&gt;rsync Time&lt;/th&gt;
&lt;th&gt;rsync Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ml_small&lt;/td&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;60MB&lt;/td&gt;
&lt;td&gt;2.98s&lt;/td&gt;
&lt;td&gt;20 MB/s&lt;/td&gt;
&lt;td&gt;2.63s&lt;/td&gt;
&lt;td&gt;23 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_large&lt;/td&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;589MB&lt;/td&gt;
&lt;td&gt;16.4s&lt;/td&gt;
&lt;td&gt;36 MB/s&lt;/td&gt;
&lt;td&gt;24.8s&lt;/td&gt;
&lt;td&gt;24 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ml_images&lt;/td&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;9.7GB&lt;/td&gt;
&lt;td&gt;165s&lt;/td&gt;
&lt;td&gt;60 MB/s&lt;/td&gt;
&lt;td&gt;390s&lt;/td&gt;
&lt;td&gt;25 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Splice Investigation (ml_images)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Plaintext + read/send&lt;/td&gt;
&lt;td&gt;146s&lt;/td&gt;
&lt;td&gt;68 MB/s&lt;/td&gt;
&lt;td&gt;Fastest (no encryption)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plaintext + splice&lt;/td&gt;
&lt;td&gt;157s&lt;/td&gt;
&lt;td&gt;63 MB/s&lt;/td&gt;
&lt;td&gt;+8% overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kTLS + read/send&lt;/td&gt;
&lt;td&gt;165s&lt;/td&gt;
&lt;td&gt;60 MB/s&lt;/td&gt;
&lt;td&gt;+13% (encryption cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kTLS + splice&lt;/td&gt;
&lt;td&gt;428s&lt;/td&gt;
&lt;td&gt;23 MB/s&lt;/td&gt;
&lt;td&gt;2.9x slower (broken)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run January 2026. Your mileage may vary depending on network conditions and hardware.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #linux #ktls #tls #rsync #performance #networking #encryption&lt;/p&gt;

</description>
      <category>linux</category>
      <category>networking</category>
      <category>security</category>
      <category>performance</category>
    </item>
    <item>
      <title>Building a File Copier 4x Faster Than cp Using io_uring</title>
      <dc:creator>Vincent Du</dc:creator>
      <pubDate>Wed, 07 Jan 2026 17:45:26 +0000</pubDate>
      <link>https://forem.com/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n</link>
      <guid>https://forem.com/vincentdu2021/building-a-file-copier-4x-faster-than-cp-using-iouring-4b5n</guid>
      <description>&lt;h1&gt;
  
  
  Building a File Copier That's 4x Faster Than &lt;code&gt;cp&lt;/code&gt; Using io_uring
&lt;/h1&gt;

&lt;p&gt;I built a high-performance file copier for ML datasets using Linux io_uring. On the right workload, it's &lt;strong&gt;4.2x faster than &lt;code&gt;cp -r&lt;/code&gt;&lt;/strong&gt;. Here's what I learned about when async I/O helps—and when it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Millions of Small Files
&lt;/h2&gt;

&lt;p&gt;ML training datasets often contain millions of small files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Typical Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ImageNet&lt;/td&gt;
&lt;td&gt;1.28M&lt;/td&gt;
&lt;td&gt;100-200KB JPEG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COCO&lt;/td&gt;
&lt;td&gt;330K&lt;/td&gt;
&lt;td&gt;50-500KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MNIST&lt;/td&gt;
&lt;td&gt;70K&lt;/td&gt;
&lt;td&gt;784 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIFAR-10&lt;/td&gt;
&lt;td&gt;60K&lt;/td&gt;
&lt;td&gt;3KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Copying these with &lt;code&gt;cp -r&lt;/code&gt; is painfully slow. Each file requires multiple syscalls (open, read, write, close), and the kernel processes them one at a time. For 100,000 files, that's 400,000+ syscalls executed sequentially.&lt;/p&gt;
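&lt;p&gt;A back-of-envelope comparison of submission counts (best case, assuming every operation batches perfectly):&lt;/p&gt;

```python
import math

files = 100_000
syscalls_per_file = 4  # open, read, write, close (minimum; stat and friends add more)

# cp -r issues one syscall at a time
sequential_syscalls = files * syscalls_per_file

# io_uring: up to queue_depth operations queued per io_uring_enter() call,
# so this counts submission syscalls needed for the same work, best case
queue_depth = 64
batched_submissions = math.ceil(sequential_syscalls / queue_depth)

print(sequential_syscalls, batched_submissions)  # prints: 400000 6250
```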

&lt;h2&gt;
  
  
  The Solution: io_uring
&lt;/h2&gt;

&lt;p&gt;io_uring is a Linux async I/O interface (kernel 5.1+) that enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batched submission&lt;/strong&gt; - Queue dozens of operations, submit with one syscall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async completion&lt;/strong&gt; - Operations complete out of order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-copy&lt;/strong&gt; - Splice data directly between file descriptors via kernel pipes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of: open → read → write → close → repeat&lt;/p&gt;

&lt;p&gt;We do: submit 64 opens → process completions → submit reads/writes → batch everything&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│  WorkQueue&amp;lt;T&amp;gt;   │────▶│  Worker Threads     │
│ (scanner)    │     │  (thread-safe)  │     │  (per-thread uring) │
└──────────────┘     └─────────────────┘     └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file progresses through a state machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;64 files in-flight&lt;/strong&gt; per worker simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-thread io_uring instances&lt;/strong&gt; (avoids lock contention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inode sorting&lt;/strong&gt; for sequential disk access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splice zero-copy&lt;/strong&gt; for data transfer (source → pipe → destination)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer pool&lt;/strong&gt; with 4KB-aligned allocations (O_DIRECT compatible)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local NVMe (Cold Cache)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;cp -r&lt;/th&gt;
&lt;th&gt;uring-sync&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K × 4KB files (400MB)&lt;/td&gt;
&lt;td&gt;7.67s&lt;/td&gt;
&lt;td&gt;5.14s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K × 100KB files (10GB)&lt;/td&gt;
&lt;td&gt;22.7s&lt;/td&gt;
&lt;td&gt;5.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Larger files benefit MORE from io_uring on fast storage. The 100KB test shows 4.2x improvement because we're overlapping many large reads/writes.&lt;/p&gt;

&lt;h3&gt;
  
  
  GCP pd-balanced (SSD-backed, 100GB)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;cp -r&lt;/th&gt;
&lt;th&gt;uring-sync&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K × 4KB files&lt;/td&gt;
&lt;td&gt;67.7s&lt;/td&gt;
&lt;td&gt;31.5s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.15x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K × 100KB files&lt;/td&gt;
&lt;td&gt;139.6s&lt;/td&gt;
&lt;td&gt;64.7s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.16x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Consistent 2x improvement on cloud SSD storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why io_uring Helps
&lt;/h2&gt;

&lt;p&gt;On fast storage (NVMe, SSD), the bottleneck is &lt;strong&gt;CPU and syscall overhead&lt;/strong&gt;, not the disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cp -r&lt;/strong&gt;: Processes files sequentially, 12+ syscalls per file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io_uring&lt;/strong&gt;: 64 files in-flight, batched syscalls, async completion&lt;/li&gt;
&lt;/ul&gt;
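&lt;p&gt;To make the &lt;code&gt;cp&lt;/code&gt; side concrete, here is a minimal blocking copy instrumented to count its own syscalls. Even this stripped-down version issues eight for a one-chunk file; a real &lt;code&gt;cp&lt;/code&gt; adds metadata and attribute calls on top, which is roughly where the 12+ figure comes from:&lt;/p&gt;

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Naive blocking copy, roughly what a cp-style tool does per file.
// Returns the number of syscalls issued, to make the per-file
// overhead visible. Error handling omitted for brevity.
int copy_file_counting_syscalls(const std::string& src, const std::string& dst) {
    int syscalls = 0;
    int in = open(src.c_str(), O_RDONLY);
    ++syscalls;
    struct stat st {};
    fstat(in, &st);
    ++syscalls;
    int out = open(dst.c_str(), O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    ++syscalls;
    char buf[65536];
    while (true) {
        ssize_t n = read(in, buf, sizeof buf);
        ++syscalls;
        if (n <= 0) break;  // EOF (or error)
        write(out, buf, static_cast<size_t>(n));
        ++syscalls;
    }
    close(in);
    ++syscalls;
    close(out);
    ++syscalls;
    return syscalls;  // >= 8 even for a file copied in one chunk
}
```

&lt;p&gt;Every one of those calls blocks the thread. io_uring turns the same work into entries queued on a ring, submitted in batches and completed asynchronously.&lt;/p&gt;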

&lt;p&gt;The larger the files, the more time each copy spends waiting for I/O to complete, and the more overlap io_uring's async approach can exploit. That's why we see a 4.2x speedup for 100KB files versus 1.5x for 4KB files on NVMe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The State Machine
&lt;/h3&gt;

&lt;p&gt;Each file copy is a state machine with these transitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FileState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;OPENING_SRC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Opening source file&lt;/span&gt;
    &lt;span class="n"&gt;STATING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Getting file size&lt;/span&gt;
    &lt;span class="n"&gt;OPENING_DST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Creating destination&lt;/span&gt;
    &lt;span class="n"&gt;SPLICE_IN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Reading into kernel pipe&lt;/span&gt;
    &lt;span class="n"&gt;SPLICE_OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Writing from pipe to dest&lt;/span&gt;
    &lt;span class="n"&gt;CLOSING_SRC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Closing source&lt;/span&gt;
    &lt;span class="n"&gt;CLOSING_DST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Closing destination&lt;/span&gt;
    &lt;span class="n"&gt;DONE&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completions drive state transitions. When a completion arrives, we look up the file context and advance its state.&lt;/p&gt;
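&lt;p&gt;The advance step itself is a small switch. A condensed, hypothetical version (the real code also submits the SQE for the new state and handles errors):&lt;/p&gt;

```cpp
#include <cstdint>

enum class FileState {
    OPENING_SRC, STATING, OPENING_DST,
    SPLICE_IN, SPLICE_OUT,
    CLOSING_SRC, CLOSING_DST, DONE
};

// Per-file context, reduced to what the transition logic needs.
struct FileContext {
    FileState state = FileState::OPENING_SRC;
    uint64_t bytes_copied = 0;
    uint64_t file_size = 0;
};

// Advance one file's state machine given the result of the operation
// that just completed. Assumes success; error paths omitted.
void advance(FileContext& ctx, int64_t cqe_res) {
    switch (ctx.state) {
    case FileState::OPENING_SRC: ctx.state = FileState::STATING; break;
    case FileState::STATING:     ctx.state = FileState::OPENING_DST; break;
    case FileState::OPENING_DST: ctx.state = FileState::SPLICE_IN; break;
    case FileState::SPLICE_IN:   ctx.state = FileState::SPLICE_OUT; break;
    case FileState::SPLICE_OUT:
        ctx.bytes_copied += static_cast<uint64_t>(cqe_res);
        // Loop back for the next chunk until the whole file is through.
        ctx.state = (ctx.bytes_copied < ctx.file_size)
                        ? FileState::SPLICE_IN
                        : FileState::CLOSING_SRC;
        break;
    case FileState::CLOSING_SRC: ctx.state = FileState::CLOSING_DST; break;
    case FileState::CLOSING_DST: ctx.state = FileState::DONE; break;
    case FileState::DONE: break;
    }
}
```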

&lt;h3&gt;
  
  
  Splice Zero-Copy
&lt;/h3&gt;

&lt;p&gt;Instead of read() → userspace buffer → write(), we use splice():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source FD → Kernel Pipe → Destination FD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data never touches userspace. The kernel moves pages directly between file descriptors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Splice from source into pipe&lt;/span&gt;
&lt;span class="n"&gt;io_uring_prep_splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe_write_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Splice from pipe to destination&lt;/span&gt;
&lt;span class="n"&gt;io_uring_prep_splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe_read_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inode Sorting
&lt;/h3&gt;

&lt;p&gt;Before copying, we sort files by inode number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inode&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This encourages sequential disk access since inodes are typically allocated sequentially for files created together.&lt;/p&gt;
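&lt;p&gt;Collecting the inode costs one &lt;code&gt;stat&lt;/code&gt; per file during the scan. A sketch of the scan-then-sort step (struct and function names are illustrative):&lt;/p&gt;

```cpp
#include <sys/stat.h>
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

struct FileEntry {
    std::string path;
    ino_t inode;
};

// Walk a directory tree, record each regular file's inode, then sort
// by inode so workers open files in rough on-disk order.
std::vector<FileEntry> scan_sorted(const std::string& root) {
    std::vector<FileEntry> files;
    for (const auto& e : std::filesystem::recursive_directory_iterator(root)) {
        if (!e.is_regular_file()) continue;
        struct stat st {};
        if (stat(e.path().c_str(), &st) == 0)
            files.push_back({e.path().string(), st.st_ino});
    }
    std::sort(files.begin(), files.end(),
              [](const FileEntry& a, const FileEntry& b) { return a.inode < b.inode; });
    return files;
}
```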

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A single worker beats multi-threading&lt;/strong&gt; for local NVMe. When the drive can absorb the full queue depth on its own, extra threads add lock contention without adding throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Queue depth matters more than thread count&lt;/strong&gt;. 64 files in-flight per worker is the sweet spot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile your actual workload&lt;/strong&gt;. Synthetic benchmarks lie. Test with your real data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;io_uring shines on fast storage&lt;/strong&gt;. When the disk can keep up, reducing syscall overhead yields big gains.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next: Network Transfer
&lt;/h2&gt;

&lt;p&gt;This tool now also supports &lt;strong&gt;network file transfer&lt;/strong&gt; with kTLS encryption, achieving 58% faster transfers than rsync. See the companion post: &lt;a href="https://dev.to/vincentdu2021/building-a-fast-file-transfer-tool-part-2-beating-rsync-by-58-with-ktls-1hob"&gt;Beating rsync by 58% with Kernel TLS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The full implementation is ~1,400 lines of C++20. Key components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RingManager&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;io_uring wrapper with SQE/CQE management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BufferPool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4KB-aligned buffer allocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PipePool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reusable kernel pipes for splice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WorkQueue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Thread-safe file queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FileContext&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file state machine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Build requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux kernel 5.1+ (5.7+ for io_uring splice support)&lt;/li&gt;
&lt;li&gt;liburing&lt;/li&gt;
&lt;li&gt;C++20&lt;/li&gt;
&lt;/ul&gt;
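&lt;p&gt;With those in place, a single-binary build is one command (flags are a plausible guess; check the repository for the actual build instructions):&lt;/p&gt;

```shell
# Hypothetical build line: C++20, optimized, linked against liburing.
g++ -std=c++20 -O2 -o uring_sync uring_sync.cpp -luring
```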

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;io_uring can dramatically speed up small-file workloads—&lt;strong&gt;4.2x faster on NVMe&lt;/strong&gt; and &lt;strong&gt;2x faster on cloud SSD&lt;/strong&gt;. The key is reducing syscall overhead through batching and async I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use io_uring for file copying:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many small files (ML datasets, source trees)&lt;/li&gt;
&lt;li&gt;Fast storage (NVMe, SSD)&lt;/li&gt;
&lt;li&gt;CPU-bound on syscall overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When &lt;code&gt;cp -r&lt;/code&gt; is fine:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single large files (already efficient)&lt;/li&gt;
&lt;li&gt;One-off copies where complexity isn't worth it&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The code is available at &lt;a href="https://github.com/VincentDu2021/uring_sync" rel="noopener noreferrer"&gt;github.com/VincentDu2021/uring_sync&lt;/a&gt;. Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on local NVMe and GCP Compute Engine VMs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>cpp</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
