<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pendela BhargavaSai</title>
    <description>The latest articles on Forem by Pendela BhargavaSai (@pendelabhargavasai).</description>
    <link>https://forem.com/pendelabhargavasai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F862755%2Fdccb0f1a-a7eb-46c5-a5c7-c0d4514eaae6.png</url>
      <title>Forem: Pendela BhargavaSai</title>
      <link>https://forem.com/pendelabhargavasai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pendelabhargavasai"/>
    <language>en</language>
    <item>
      <title>The Definitive Guide to Lightweight Kubernetes: KIND, Minikube, MicroK8s, K3s, Vcluster, k0s, and RKE2 Compared</title>
      <dc:creator>Pendela BhargavaSai</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:18:00 +0000</pubDate>
      <link>https://forem.com/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3be1</link>
      <guid>https://forem.com/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3be1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — There is no single "best" lightweight Kubernetes. KIND wins CI/CD, Minikube wins local dev UX, MicroK8s wins on Ubuntu, K3s wins edge and production, Vcluster wins multi-tenancy, k0s wins zero-dependency ops, and RKE2 wins enterprise compliance. This post explains why — with architecture diagrams, feature tables, and real-world guidance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ixolu1jgi9xokd9cw9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ixolu1jgi9xokd9cw9k.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Lightweight Kubernetes Matters&lt;/li&gt;
&lt;li&gt;The Contenders at a Glance&lt;/li&gt;
&lt;li&gt;KIND — Kubernetes IN Docker&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=2.%20Minikube%20%2D%20The%20Developer%27s%20Workhorse"&gt;Minikube — The Developer's Workhorse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=3.%20MicroK8s%20%2D%20Zero%2DOps%20by%20Canonical"&gt;MicroK8s — Zero-Ops by Canonical&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=4.%20K3s%20%2D%20Production%2DGrade%20at%20the%20Edge"&gt;K3s — Production-Grade at the Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Vcluster — Kubernetes Inside Kubernetes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=6.%20k0s%20%E2%80%94%20Zero%20Dependencies%2C%20Zero%20Friction"&gt;k0s — Zero Dependencies, Zero Friction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=7.%20RKE2%20%E2%80%94%20Security%2DFirst%20Enterprise%20K8s"&gt;RKE2 — Security-First Enterprise K8s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#KIND-Kubernetes-IN-Docker:~:text=are%20non%2Dnegotiable.-,Scoring%20Across%208%20Dimensions,-Scores%20are%20relative"&gt;Scoring Across 8 Dimensions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pendelabhargavasai/the-definitive-guide-to-lightweight-kubernetes-kind-minikube-microk8s-k3s-vcluster-k0s-and-3o5e-temp-slug-8924697/edit#final-verdict:~:text=K3s-,The%20Decision%20Tree,-Do%20you%20need"&gt;Use Case Decision Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Final Verdict&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why Lightweight Kubernetes Matters
&lt;/h2&gt;

&lt;p&gt;Full-fat Kubernetes — the kind you run on a 3-master, 6-worker production cluster — is extraordinary infrastructure. It is also deeply impractical when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spin up a throwaway cluster in a GitHub Actions runner in under 30 seconds&lt;/li&gt;
&lt;li&gt;Run Kubernetes on a Raspberry Pi with 1 GB of RAM&lt;/li&gt;
&lt;li&gt;Give every developer on your team their own isolated cluster without buying new hardware&lt;/li&gt;
&lt;li&gt;Deploy to a factory floor where the "server" is an ARM SBC with no internet access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Kubernetes ecosystem responded by producing a rich family of lightweight distributions, each making different trade-offs. By 2025, the major players are &lt;strong&gt;KIND&lt;/strong&gt;, &lt;strong&gt;Minikube&lt;/strong&gt;, &lt;strong&gt;MicroK8s&lt;/strong&gt;, &lt;strong&gt;K3s&lt;/strong&gt;, &lt;strong&gt;Vcluster&lt;/strong&gt;, &lt;strong&gt;k0s&lt;/strong&gt;, and &lt;strong&gt;RKE2&lt;/strong&gt; — and choosing between them is genuinely consequential.&lt;/p&gt;

&lt;p&gt;This guide gives you the full picture: architecture, components, features, limitations, scoring, and concrete use-case guidance, all in one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Contenders at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Creator&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Primary Use Case&lt;/th&gt;
&lt;th&gt;Min RAM&lt;/th&gt;
&lt;th&gt;Binary Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KIND&lt;/td&gt;
&lt;td&gt;Kubernetes SIG Testing&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;CI/CD testing&lt;/td&gt;
&lt;td&gt;2 GB&lt;/td&gt;
&lt;td&gt;N/A (uses Docker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minikube&lt;/td&gt;
&lt;td&gt;Kubernetes Community&lt;/td&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;Local development&lt;/td&gt;
&lt;td&gt;2 GB&lt;/td&gt;
&lt;td&gt;~100 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;td&gt;Canonical (Ubuntu)&lt;/td&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;Ubuntu / Edge&lt;/td&gt;
&lt;td&gt;540 MB&lt;/td&gt;
&lt;td&gt;Snap package&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K3s&lt;/td&gt;
&lt;td&gt;Rancher Labs (SUSE)&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;Edge / Production&lt;/td&gt;
&lt;td&gt;512 MB&lt;/td&gt;
&lt;td&gt;&amp;lt; 100 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vcluster&lt;/td&gt;
&lt;td&gt;Loft Labs&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Multi-tenancy&lt;/td&gt;
&lt;td&gt;Host-dependent&lt;/td&gt;
&lt;td&gt;Helm chart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;k0s&lt;/td&gt;
&lt;td&gt;Mirantis&lt;/td&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;Zero-dependency ops&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;td&gt;~230 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RKE2&lt;/td&gt;
&lt;td&gt;Rancher (SUSE)&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Enterprise / Compliance&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;~300 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each of these is CNCF-compatible and capable of running real Kubernetes workloads. The differences are in &lt;em&gt;where&lt;/em&gt;, &lt;em&gt;how&lt;/em&gt;, and &lt;em&gt;at what cost&lt;/em&gt; they do it.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. KIND — Kubernetes IN Docker
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;KIND (Kubernetes IN Docker) was built by the Kubernetes SIG Testing team for one purpose: to test Kubernetes itself. Every node in a KIND cluster is a Docker container. The control plane runs in one container, worker nodes in others, and they communicate over a Docker bridge network called &lt;code&gt;kindnet&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KIND&lt;/strong&gt; runs every Kubernetes node as a Docker container. There is no VM, no hypervisor, no separate OS. The &lt;code&gt;kindnet&lt;/code&gt; CNI is a purpose-built bridge that understands this container-as-node topology. The practical effect is that KIND clusters are disposable, fast, and completely ephemeral — perfect for testing, unsuited to anything that must outlive the cluster.&lt;/p&gt;

&lt;p&gt;Because there is no VM involved, KIND clusters start in about 30 seconds and use only Docker's existing networking and storage. You can run a dozen isolated clusters on a single laptop.&lt;/p&gt;
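&lt;p&gt;As a quick illustration of that workflow, the sketch below spins up two independent, version-pinned clusters side by side (the cluster names and the node-image tags are illustrative; match the tags to your KIND release):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Two isolated clusters on one laptop, each pinned to a Kubernetes version
kind create cluster --name pr-1234 --image kindest/node:v1.29.2
kind create cluster --name pr-1235 --image kindest/node:v1.28.7

# List them, then load a locally built image without any registry
kind get clusters
kind load docker-image my-app:dev --name pr-1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;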

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
┌────────────────────────────────────────────────────────┐
│                   Docker Host                          │
│                                                        │
│  ┌─────────────────────┐   ┌──────────────────────┐    │
│  │   Control Plane     │   │     Worker 1         │    │
│  │   (container)       │──▶│     (container)      │    │
│  │                     │   │                      │    │
│  │  • API Server       │   │  • kubelet           │    │
│  │  • etcd             │   │  • kube-proxy        │    │
│  │  • Scheduler        │──▶│  • Pod A  • Pod B    │    │
│  │  • Controller Mgr   │   └──────────────────────┘    │
│  │  • kindnet CNI      │   ┌──────────────────────┐    │
│  └─────────────────────┘   │     Worker 2         │    │
│                            │     (container)      │    │
│  ┌──────────────────┐      │  • kubelet + pods    │    │
│  │  Port-forwarding │      └──────────────────────┘    │
│  │  localhost:6443  │                                  │
│  └──────────────────┘                                  │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kindnet&lt;/strong&gt; — Custom CNI using a kernel bridge, purpose-built for KIND's container-as-node model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd&lt;/strong&gt; — Full etcd running inside the control-plane container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;containerd&lt;/strong&gt; — Container runtime inside each node-container (Docker-in-Docker)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubeadm&lt;/strong&gt; — KIND uses kubeadm internally to bootstrap the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;True multi-node clusters (control plane + N workers) on a single host&lt;/li&gt;
&lt;li&gt;Custom node images — test against any Kubernetes version&lt;/li&gt;
&lt;li&gt;Rootless mode via rootless Docker/Podman&lt;/li&gt;
&lt;li&gt;IPv6 and dual-stack support&lt;/li&gt;
&lt;li&gt;Create multiple isolated clusters simultaneously&lt;/li&gt;
&lt;li&gt;Parallel cluster creation&lt;/li&gt;
&lt;li&gt;KUBECONFIG auto-export&lt;/li&gt;
&lt;li&gt;Optimised for GitHub Actions, GitLab CI, and Jenkins&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install&lt;/span&gt;

curl  &lt;span class="nt"&gt;-Lo&lt;/span&gt;  ./kind  &amp;lt;https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64&amp;gt;

&lt;span class="nb"&gt;chmod&lt;/span&gt;  +x  ./kind &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo  mv&lt;/span&gt;  ./kind  /usr/local/bin/kind

&lt;span class="c"&gt;# Create a single-node cluster&lt;/span&gt;

kind  create  cluster

&lt;span class="c"&gt;# Create a multi-node cluster&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kind  create  cluster  --config=-

kind: Cluster

apiVersion: kind.x-k8s.io/v1alpha4

nodes:

- role: control-plane

- role: worker

- role: worker
&lt;/span&gt;&lt;span class="no"&gt;
EOF

&lt;/span&gt;&lt;span class="c"&gt;# Delete cluster&lt;/span&gt;

kind  delete  cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Blazing fast — 30-second cluster creation, no hypervisor boot time&lt;/li&gt;
&lt;li&gt;Zero VM overhead — runs entirely inside Docker containers&lt;/li&gt;
&lt;li&gt;True multi-node topology on one host&lt;/li&gt;
&lt;li&gt;Exact Kubernetes version control via node images&lt;/li&gt;
&lt;li&gt;Perfect for ephemeral CI environments&lt;/li&gt;
&lt;li&gt;NodePort services cover most testing needs, no LoadBalancer hacks required&lt;/li&gt;
&lt;li&gt;Widely supported in CI platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires Docker or Podman to be running&lt;/li&gt;
&lt;li&gt;Not production-ready under any circumstances&lt;/li&gt;
&lt;li&gt;No GPU passthrough&lt;/li&gt;
&lt;li&gt;LoadBalancer type services need MetalLB or similar&lt;/li&gt;
&lt;li&gt;Volumes are lost when the cluster is deleted&lt;/li&gt;
&lt;li&gt;No addon ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CI/CD pipelines&lt;/strong&gt; — specifically integration testing that needs a real multi-node Kubernetes topology without the boot time of a VM-based solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Minikube — The Developer's Workhorse
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;Minikube is the original local Kubernetes project, released in 2016 and still the most feature-rich local development option. It runs a Kubernetes cluster inside a VM, a container, or directly on the host, and brings an unmatched addon ecosystem of 30+ pre-packaged integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minikube&lt;/strong&gt; is the only distribution that abstracts over &lt;em&gt;drivers&lt;/em&gt; — it runs identically whether the underlying host is a VM (VirtualBox, HyperKit, KVM), a container (Docker, Podman), or bare metal. This flexibility comes at the cost of startup time and memory, but it means Minikube works for every developer on every operating system.&lt;/p&gt;

&lt;p&gt;If you've ever run &lt;code&gt;kubectl apply -f&lt;/code&gt; on your laptop, you've probably used Minikube.&lt;/p&gt;
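&lt;p&gt;A minimal sketch of that driver and profile flexibility (the profile names, version, and sizes here are arbitrary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Same CLI, different isolation mechanisms
minikube start -p dev --driver=docker --kubernetes-version=v1.29.2
minikube start -p vm-lab --driver=virtualbox --cpus=4 --memory=4g

# Each profile is an independent cluster with its own kubeconfig context
minikube profile list
kubectl config use-context vm-lab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;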

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────-┐
│             VM / Docker / Podman Driver                   │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │            Single Node (All-in-One)                 │  │
│  │                                                     │  │
│  │  Control Plane                  Data Plane          │  │
│  │  ┌──────────┐  ┌────────────┐   ┌─────────────────┐ │  │
│  │  │API Server│  │etcd        │   │kubelet          │ │  │
│  │  └──────────┘  └────────────┘   │kube-proxy       │ │  │
│  │  ┌──────────┐  ┌────────────┐   │Pod A • Pod B    │ │  │
│  │  │Scheduler │  │Ctrl Manager│   └─────────────────┘ │  │
│  │  └──────────┘  └────────────┘                       │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                   Addons Layer                      │  │
│  │  Dashboard │ Ingress │ Metrics │ Registry │ Istio   │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple drivers&lt;/strong&gt; — HyperKit, VirtualBox, KVM2, Docker, Podman, SSH&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd&lt;/strong&gt; — Full etcd as the backing store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calico or Flannel&lt;/strong&gt; — CNI (configurable per driver)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addon controller&lt;/strong&gt; — Manages the 30+ available addon services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;30+ addons including Istio, Knative, Linkerd, GPU operator, registry, and more&lt;/li&gt;
&lt;li&gt;Built-in Kubernetes dashboard (&lt;code&gt;minikube dashboard&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;GPU passthrough in VM mode&lt;/li&gt;
&lt;li&gt;LoadBalancer via &lt;code&gt;minikube tunnel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Multiple profile management (run several clusters simultaneously)&lt;/li&gt;
&lt;li&gt;Image caching to speed up repeated pulls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minikube service&lt;/code&gt; command for easy port access&lt;/li&gt;
&lt;li&gt;Built-in image loading (&lt;code&gt;minikube image load&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install (Linux)&lt;/span&gt;

curl  &lt;span class="nt"&gt;-LO&lt;/span&gt;  &amp;lt;https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64&amp;gt;

&lt;span class="nb"&gt;sudo  install  &lt;/span&gt;minikube-linux-amd64  /usr/local/bin/minikube

&lt;span class="c"&gt;# Start with Docker driver&lt;/span&gt;

minikube  start  &lt;span class="nt"&gt;--driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker

&lt;span class="c"&gt;# Enable addons&lt;/span&gt;

minikube  addons  &lt;span class="nb"&gt;enable  &lt;/span&gt;ingress

minikube  addons  &lt;span class="nb"&gt;enable  &lt;/span&gt;metrics-server

minikube  addons  &lt;span class="nb"&gt;enable  &lt;/span&gt;dashboard

&lt;span class="c"&gt;# Open dashboard&lt;/span&gt;

minikube  dashboard

&lt;span class="c"&gt;# LoadBalancer support&lt;/span&gt;

minikube  tunnel  &lt;span class="c"&gt;# Run in separate terminal&lt;/span&gt;

&lt;span class="c"&gt;# Delete&lt;/span&gt;

minikube  delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easiest getting-started experience of any K8s tool&lt;/li&gt;
&lt;li&gt;Unmatched addon ecosystem (30+ addons)&lt;/li&gt;
&lt;li&gt;GPU passthrough support (VirtualBox/KVM drivers)&lt;/li&gt;
&lt;li&gt;Built-in dashboard requires zero configuration&lt;/li&gt;
&lt;li&gt;Works on macOS, Linux, and Windows&lt;/li&gt;
&lt;li&gt;Multiple profiles = multiple clusters&lt;/li&gt;
&lt;li&gt;Best documentation and community support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Slow startup in VM mode (~2 minutes)&lt;/li&gt;
&lt;li&gt;High memory consumption, especially with VM driver&lt;/li&gt;
&lt;li&gt;Primarily a single-node environment&lt;/li&gt;
&lt;li&gt;Not production-ready&lt;/li&gt;
&lt;li&gt;LoadBalancer requires keeping &lt;code&gt;minikube tunnel&lt;/code&gt; running separately&lt;/li&gt;
&lt;li&gt;Battery-intensive on laptops&lt;/li&gt;
&lt;li&gt;Multi-node support exists but is limited and buggy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Local development&lt;/strong&gt; — especially developers who want a full Kubernetes experience with addons, dashboards, and GPU support without deep infrastructure expertise.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. MicroK8s — Zero-Ops by Canonical
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;MicroK8s is Canonical's packaging of Kubernetes as a snap. It installs as a single command, self-heals via systemd, updates automatically through snap channels, and runs in as little as 540 MB of RAM, one of the smallest footprints of any full-featured Kubernetes distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MicroK8s&lt;/strong&gt; is unique in using &lt;strong&gt;dqlite&lt;/strong&gt; — a distributed SQLite engine developed by Canonical — as an alternative to etcd for HA mode. This dramatically simplifies the operational burden of running a multi-master cluster: no external etcd cluster needed, just &lt;code&gt;microk8s add-node&lt;/code&gt; on each machine.&lt;/p&gt;

&lt;p&gt;Unlike KIND and Minikube, MicroK8s is designed for both development &lt;em&gt;and&lt;/em&gt; light production workloads.&lt;/p&gt;
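&lt;p&gt;Forming an HA cluster is a two-command affair per node, sketched below (the IP and token are placeholders for the values that &lt;code&gt;add-node&lt;/code&gt; prints):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On an existing node: print a one-time join command
microk8s add-node

# On the joining node: paste the printed command, e.g.
microk8s join &amp;lt;NODE_IP&amp;gt;:25000/&amp;lt;TOKEN&amp;gt;

# With three or more nodes, dqlite HA turns on automatically
microk8s status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;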

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
┌───────────────────────────────────────────────────────────┐
│                  Snap Package (systemd)                   │
│                                                           │
│  ┌───────────────────────┐   ┌────────────────────────┐   │
│  │    Node 1 (Master)    │   │       Node 2           │   │
│  │                       │   │                        │   │
│  │  • API Server         │──▶│  • kubelet             │   │
│  │  • dqlite (HA store)  │   │  • kube-proxy          │   │
│  │  • Scheduler          │   │  • Calico CNI          │   │
│  │  • Controller Manager │   │  • Pods                │   │
│  │  • Calico CNI         │   └────────────────────────┘   │
│  │  • Auto-updater       │                                │
│  └───────────────────────┘                                │
│                                                           │
│  ┌──────────────────────────────────────────────────────┐ │
│  │         Addon Engine (microk8s enable &amp;lt;addon&amp;gt;)       │ │
│  │  Istio │ Knative │ GPU │ Registry │ Dashboard │ More │ │
│  └──────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dqlite&lt;/strong&gt; — Distributed SQLite for HA without the operational burden of etcd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calico CNI&lt;/strong&gt; — Production-grade networking with network policy support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snap daemon&lt;/strong&gt; — Manages the entire lifecycle including automatic updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addon engine&lt;/strong&gt; — &lt;code&gt;microk8s enable &amp;lt;name&amp;gt;&lt;/code&gt; installs curated addons&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Small memory footprint: 540 MB minimum&lt;/li&gt;
&lt;li&gt;HA clustering via &lt;code&gt;microk8s add-node&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Automatic channel-based updates with rollback&lt;/li&gt;
&lt;li&gt;GPU operator addon for ML/AI workloads&lt;/li&gt;
&lt;li&gt;Strict snap confinement for security&lt;/li&gt;
&lt;li&gt;ARM64 and x86 native support&lt;/li&gt;
&lt;li&gt;Observability stack addon (Prometheus, Grafana)&lt;/li&gt;
&lt;li&gt;Built-in image registry&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install via snap&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;snap  &lt;span class="nb"&gt;install  &lt;/span&gt;microk8s  &lt;span class="nt"&gt;--classic&lt;/span&gt;

&lt;span class="c"&gt;# Add your user to the microk8s group&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;usermod  &lt;span class="nt"&gt;-aG&lt;/span&gt;  microk8s  &lt;span class="nv"&gt;$USER&lt;/span&gt;

newgrp  microk8s

&lt;span class="c"&gt;# Check status&lt;/span&gt;

microk8s  status  &lt;span class="nt"&gt;--wait-ready&lt;/span&gt;

&lt;span class="c"&gt;# Enable core addons&lt;/span&gt;

microk8s  &lt;span class="nb"&gt;enable  &lt;/span&gt;dns  ingress  metrics-server  dashboard

&lt;span class="c"&gt;# Use kubectl&lt;/span&gt;

microk8s  kubectl  get  nodes

&lt;span class="c"&gt;# Add worker node (run on master, then copy join command to worker)&lt;/span&gt;

microk8s  add-node

&lt;span class="c"&gt;# Uninstall&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;snap  remove  microk8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Very low RAM usage for a full-featured distribution (540 MB minimum)&lt;/li&gt;
&lt;li&gt;Best Ubuntu and Linux integration through the snap ecosystem&lt;/li&gt;
&lt;li&gt;Self-healing via systemd — restarts automatically on failure&lt;/li&gt;
&lt;li&gt;HA multi-node with a simple &lt;code&gt;add-node&lt;/code&gt; workflow&lt;/li&gt;
&lt;li&gt;Automatic updates through snap channels (stable, candidate, beta)&lt;/li&gt;
&lt;li&gt;Production-capable for light workloads&lt;/li&gt;
&lt;li&gt;ARM64 support for Raspberry Pi and ARM servers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Snap packaging limits portability to non-Ubuntu systems&lt;/li&gt;
&lt;li&gt;Ubuntu-centric design — snap is not available everywhere&lt;/li&gt;
&lt;li&gt;Addon conflicts can occur (Istio + other service meshes, for example)&lt;/li&gt;
&lt;li&gt;Strict snap confinement can block some host filesystem operations&lt;/li&gt;
&lt;li&gt;dqlite is still maturing compared to battle-tested etcd&lt;/li&gt;
&lt;li&gt;Automatic updates can cause unplanned restarts without configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ubuntu workstations and edge servers&lt;/strong&gt; — if you're on Ubuntu, MicroK8s is the most native Kubernetes experience available.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. K3s — Production-Grade at the Edge
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;K3s is the single most consequential lightweight Kubernetes project of the past five years. Released by Rancher Labs (now SUSE) in 2019, it packs a complete, CNCF-certified Kubernetes distribution into a single binary under 100 MB. It runs on 512 MB of RAM, boots in 30 seconds, and runs identically on a Raspberry Pi, a factory floor ARM controller, and a cloud VM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K3s&lt;/strong&gt; achieves its sub-100 MB size by bundling everything into a single Go binary with no external dependencies, using SQLite as a default backing store (which requires no cluster management), and removing upstream K8s features that aren't needed in its target environments (Windows nodes, cloud-provider integrations, certain alpha features).&lt;/p&gt;
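&lt;p&gt;The datastore choice is a single install-time flag; a minimal sketch (the endpoint values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Default: embedded SQLite, nothing to configure
curl -sfL https://get.k3s.io | sh -

# HA with embedded etcd
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# External datastore (PostgreSQL shown; MySQL and etcd also work)
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="postgres://user:pass@&amp;lt;DB_HOST&amp;gt;:5432/k3s"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;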

&lt;p&gt;K3s is not a toy. It is used in production by thousands of organisations worldwide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
┌────────────────────────────────────────────────────────────────┐
│                      k3s binary (&amp;lt; 100 MB)                     │
│                                                                │
│  ┌─────────────────────────────────┐                           │
│  │          k3s Server             │                           │
│  │  (Control Plane + Optional DP)  │──────────┐                │
│  │                                 │          │                │
│  │  • API Server                   │          ▼                │
│  │  • SQLite (default) / etcd / PG │   ┌─────────────────┐     │
│  │  • Scheduler                    │   │   k3s Agent 1   │     │
│  │  • Controller Manager           │   │   (Worker Node) │     │
│  │  • Flannel CNI (built-in)       │   │  • kubelet      │     │
│  │  • Traefik Ingress              │   │  • kube-proxy   │     │
│  │  • CoreDNS                      │──▶│  • Flannel      │     │
│  │  • local-path-provisioner       │   │  • Pods         │     │
│  │  • Helm controller              │   └─────────────────┘     │
│  └─────────────────────────────────┘          │                │
│                                               ▼                │
│                                        ┌─────────────────┐     │
│                                        │ k3s Agent 2     │     │
│                                        │ (ARM / IoT)     │     │
│                                        └─────────────────┘     │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single binary&lt;/strong&gt; — Packages containerd, CNI plugins, CoreDNS, Traefik, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt; — Default data store, ideal for single-server or small clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded etcd&lt;/strong&gt; — Available for HA clusters (3+ servers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External DB&lt;/strong&gt; — PostgreSQL, MySQL, or etcd for larger deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flannel CNI&lt;/strong&gt; — Built-in overlay networking, zero extra configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt; — Ingress controller included out of the box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm controller&lt;/strong&gt; — Manage Helm charts via CRDs (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local-path-provisioner&lt;/strong&gt; — Dynamic PVC provisioning on local disk&lt;/li&gt;
&lt;/ul&gt;
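&lt;p&gt;The Helm controller deserves a concrete example: a chart install becomes just another manifest, so it can live in git alongside everything else. A minimal sketch (the repo, chart, and namespace values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Declare a Helm release as a Kubernetes resource
kubectl apply -f - &amp;lt;&amp;lt;EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: grafana
  namespace: kube-system
spec:
  repo: https://grafana.github.io/helm-charts
  chart: grafana
  targetNamespace: monitoring
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;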

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CNCF-certified — passes full Kubernetes conformance tests&lt;/li&gt;
&lt;li&gt;Single binary &amp;lt; 100 MB with everything bundled&lt;/li&gt;
&lt;li&gt;Multiple storage backends: SQLite, etcd, PostgreSQL, MySQL&lt;/li&gt;
&lt;li&gt;ARM64 and ARMv7 first-class support&lt;/li&gt;
&lt;li&gt;Air-gap / offline install support (critical for edge deployments)&lt;/li&gt;
&lt;li&gt;Auto TLS with Let's Encrypt for Traefik&lt;/li&gt;
&lt;li&gt;Server + Agent role split for control/data plane separation&lt;/li&gt;
&lt;li&gt;Automatic certificate rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install server (master) — one command&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.k3s.io&amp;gt; | sh  -

&lt;span class="c"&gt;# Check status&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;systemctl  status  k3s

&lt;span class="nb"&gt;sudo  &lt;/span&gt;kubectl  get  nodes

&lt;span class="c"&gt;# Get the node join token&lt;/span&gt;

&lt;span class="nb"&gt;sudo  cat&lt;/span&gt;  /var/lib/rancher/k3s/server/node-token

&lt;span class="c"&gt;# Join a worker node (run on the worker)&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.k3s.io&amp;gt; | &lt;span class="se"&gt;\\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://&amp;lt;SERVER_IP&amp;gt;:6443  &lt;span class="se"&gt;\\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;NODE_TOKEN&amp;gt; &lt;span class="se"&gt;\\&lt;/span&gt;

sh  -

&lt;span class="c"&gt;# Use kubectl without sudo&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt;  &lt;span class="nt"&gt;-p&lt;/span&gt;  ~/.kube

&lt;span class="nb"&gt;sudo  cp&lt;/span&gt;  /etc/rancher/k3s/k3s.yaml  ~/.kube/config

&lt;span class="nb"&gt;sudo  chown&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;  &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;  &lt;span class="nt"&gt;-g&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; ~/.kube/config

&lt;span class="c"&gt;# Uninstall&lt;/span&gt;

/usr/local/bin/k3s-uninstall.sh  &lt;span class="c"&gt;# server&lt;/span&gt;

/usr/local/bin/k3s-agent-uninstall.sh  &lt;span class="c"&gt;# agent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  HA Setup (Embedded etcd)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# First server node&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.k3s.io&amp;gt; | sh  &lt;span class="nt"&gt;-s&lt;/span&gt;  -  server  &lt;span class="se"&gt;\\&lt;/span&gt;

&lt;span class="nt"&gt;--cluster-init&lt;/span&gt;

&lt;span class="c"&gt;# Additional server nodes&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.k3s.io&amp;gt; | sh  &lt;span class="nt"&gt;-s&lt;/span&gt;  -  server  &lt;span class="se"&gt;\\&lt;/span&gt;

&lt;span class="nt"&gt;--server&lt;/span&gt; https://&amp;lt;FIRST_SERVER_IP&amp;gt;:6443 &lt;span class="se"&gt;\\&lt;/span&gt;

&lt;span class="nt"&gt;--token&lt;/span&gt; &amp;lt;NODE_TOKEN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CNCF-certified — genuine, conformant Kubernetes, not a cut-down imitation&lt;/li&gt;
&lt;li&gt;Single binary under 100 MB — deploy to anything&lt;/li&gt;
&lt;li&gt;512 MB RAM minimum — runs on Raspberry Pi 3&lt;/li&gt;
&lt;li&gt;30-second cold start&lt;/li&gt;
&lt;li&gt;SQLite for small clusters, etcd for HA — right tool for every scale&lt;/li&gt;
&lt;li&gt;Traefik ingress out of the box — production workloads with zero extra config&lt;/li&gt;
&lt;li&gt;ARM64 and ARMv7 native — best IoT Kubernetes support in the market&lt;/li&gt;
&lt;li&gt;Air-gap install — works in completely offline environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SQLite backend not suitable for clusters exceeding ~50 nodes&lt;/li&gt;
&lt;li&gt;Some upstream Kubernetes features are stripped (Alpha features, some cloud integrations)&lt;/li&gt;
&lt;li&gt;Default CNI is Flannel only (using Calico requires additional configuration)&lt;/li&gt;
&lt;li&gt;No built-in dashboard&lt;/li&gt;
&lt;li&gt;Less rich addon ecosystem than Minikube or MicroK8s&lt;/li&gt;
&lt;li&gt;Limited Windows node support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Edge computing, IoT, production on resource-constrained hardware, and any environment where the binary size and startup time of a traditional Kubernetes distribution is prohibitive.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Vcluster — Kubernetes Inside Kubernetes
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;Vcluster takes a completely different approach to "lightweight Kubernetes." Rather than running directly on a host operating system, it runs &lt;em&gt;inside&lt;/em&gt; an existing Kubernetes cluster. Each virtual cluster is a set of pods in a namespace, but from the user's perspective it is a completely isolated Kubernetes cluster with its own API server, etcd, and full Kubernetes API.&lt;/p&gt;

&lt;p&gt;This makes Vcluster the definitive answer to the multi-tenancy problem: instead of giving teams namespace isolation (which shares the API server and exposes blast radius), you give each team their own cluster for the cost of a few pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vcluster&lt;/strong&gt; is architecturally unique in the field. Its virtual control plane (API server + etcd + scheduler + controller manager) runs as pods &lt;em&gt;inside&lt;/em&gt; a host cluster namespace. A component called the &lt;strong&gt;Syncer&lt;/strong&gt; watches the virtual cluster's API and translates virtual resources into real host resources — a virtual Pod becomes a real Pod in the host namespace with a remapped name.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────┐
│               Host Kubernetes Cluster (any provider)         │
│                                                              │
│  ┌────────────────────┐  ┌────────────────────┐              │
│  │    vcluster 1      │  │    vcluster 2       │             │
│  │  (Team A ns)       │  │  (Team B ns)        │             │
│  │                    │  │                     │             │
│  │  Virtual API Srv   │  │  Virtual API Srv    │             │
│  │  In-process etcd   │  │  In-process etcd    │             │
│  │  Syncer pod        │  │  Syncer pod         │             │
│  │  ┌────┐  ┌────┐    │  │  ┌────┐  ┌────┐    │              │
│  │  │PodA│  │PodB│    │  │  │PodC│  │PodD│    │              │
│  │  └─┬──┘  └─┬──┘    │  │  └─┬──┘  └─┬──┘    │              │
│  │    │sync   │sync   │  │    │sync   │sync    │             │
│  └────┼───────┼───────┘  └────┼───────┼────────┘             │
│       ▼       ▼               ▼       ▼                      │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  Shared Worker Nodes — Host CNI, Storage, Hardware   │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Syncer&lt;/strong&gt; is the key innovation: it translates virtual cluster resources into real host cluster resources. A Pod created in vcluster 1 becomes a real Pod in the host cluster's namespace, but with a remapped name that prevents conflicts.&lt;/p&gt;
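&lt;p&gt;You can watch the Syncer at work from both sides; a sketch, assuming vcluster's default name-remapping pattern (exact names vary by version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Inside the vcluster: an ordinary deployment
vcluster connect my-vcluster --namespace team-a
kubectl create deployment nginx --image=nginx
kubectl get pods          # nginx-xxxxxxxxxx-xxxxx
vcluster disconnect

# On the host cluster: the same pod, renamed to avoid collisions
kubectl get pods -n team-a
# e.g. nginx-xxxxxxxxxx-xxxxx-x-default-x-my-vcluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;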

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual API Server&lt;/strong&gt; — Full Kubernetes API, runs as a pod in the host cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-process etcd&lt;/strong&gt; — Embedded etcd for the virtual cluster's state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syncer&lt;/strong&gt; — Reconciles virtual resources to host cluster resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vcluster CLI&lt;/strong&gt; — Manages lifecycle: create, connect, delete, list&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full Kubernetes API isolation per virtual cluster&lt;/li&gt;
&lt;li&gt;Works on top of any Kubernetes (EKS, GKE, AKS, K3s, RKE2, etc.)&lt;/li&gt;
&lt;li&gt;~10 second spin-up time — fastest of all solutions&lt;/li&gt;
&lt;li&gt;No extra hardware — uses existing cluster nodes&lt;/li&gt;
&lt;li&gt;CRD isolation — each vcluster has its own CRDs&lt;/li&gt;
&lt;li&gt;RBAC isolation — separate RBAC per vcluster&lt;/li&gt;
&lt;li&gt;Helm chart deployment — deploy via standard Helm&lt;/li&gt;
&lt;li&gt;On-demand creation and deletion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install vcluster CLI&lt;/span&gt;

curl  &lt;span class="nt"&gt;-L&lt;/span&gt;  &lt;span class="nt"&gt;-o&lt;/span&gt;  vcluster  &lt;span class="s2"&gt;"&amp;lt;https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64&amp;gt;"&lt;/span&gt;

&lt;span class="nb"&gt;chmod&lt;/span&gt;  +x  vcluster &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo  mv  &lt;/span&gt;vcluster  /usr/local/bin

&lt;span class="c"&gt;# Create a virtual cluster&lt;/span&gt;

vcluster  create  my-vcluster  &lt;span class="nt"&gt;--namespace&lt;/span&gt;  team-a

&lt;span class="c"&gt;# Connect to it (sets KUBECONFIG automatically)&lt;/span&gt;

vcluster  connect  my-vcluster  &lt;span class="nt"&gt;--namespace&lt;/span&gt;  team-a

&lt;span class="c"&gt;# Now kubectl talks to the vcluster&lt;/span&gt;

kubectl  get  nodes

kubectl  create  deployment  nginx  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx

&lt;span class="c"&gt;# Disconnect&lt;/span&gt;

vcluster  disconnect

&lt;span class="c"&gt;# Delete&lt;/span&gt;

vcluster  delete  my-vcluster  &lt;span class="nt"&gt;--namespace&lt;/span&gt;  team-a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full Kubernetes API isolation per tenant — no shared API server blast radius&lt;/li&gt;
&lt;li&gt;10-second spin-up — fastest cluster creation of all solutions reviewed&lt;/li&gt;
&lt;li&gt;No extra hardware — reuses host cluster's nodes entirely&lt;/li&gt;
&lt;li&gt;Works on any cloud or on-premises Kubernetes&lt;/li&gt;
&lt;li&gt;Cost-efficient multi-tenancy at scale&lt;/li&gt;
&lt;li&gt;Each team gets the full &lt;code&gt;kubectl&lt;/code&gt; experience&lt;/li&gt;
&lt;li&gt;Easy to create and delete on demand for short-lived environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not standalone — requires a host Kubernetes cluster to exist first&lt;/li&gt;
&lt;li&gt;Cannot create real nodes — virtual only&lt;/li&gt;
&lt;li&gt;Advanced networking between vclusters is complex&lt;/li&gt;
&lt;li&gt;Some cluster-scoped resources (like ClusterRoles and CRDs) are not fully isolated&lt;/li&gt;
&lt;li&gt;Requires privileged pod access on the host cluster&lt;/li&gt;
&lt;li&gt;Newer project — less battle-tested than K3s or Minikube&lt;/li&gt;
&lt;li&gt;Node-level debugging is limited&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenant development environments, per-team isolated clusters, and CI/CD environments where many short-lived clusters need to be spun up and torn down rapidly on existing infrastructure.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. k0s — Zero Dependencies, Zero Friction
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;k0s (pronounced "kay-zero-ess") from Mirantis lives up to its name: zero host OS dependencies. It is a single binary that includes everything needed to run Kubernetes without requiring any specific kernel modules, swap configuration, or package manager. It works on any Linux distribution out of the box.&lt;/p&gt;

&lt;p&gt;k0s uses an eBPF-based CNI called kube-router, includes Autopilot for automated upgrades, and offers FIPS 140-2 compliance — a feature set that appeals strongly to regulated industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k0s&lt;/strong&gt; prioritises deployment universality. By bundling containerd and all CNI plugins into the binary itself and requiring no kernel module configuration from the host OS, it can be dropped onto virtually any Linux system and run. The eBPF-based kube-router CNI offers modern packet processing without iptables overhead.&lt;/p&gt;
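&lt;p&gt;Cluster lifecycle is usually driven declaratively through k0sctl; a minimal sketch of a two-node config (the addresses, user, and key path are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# k0sctl.yaml: declare the cluster, then apply it over SSH
cat &amp;gt; k0sctl.yaml &amp;lt;&amp;lt;EOF
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: demo
spec:
  hosts:
    - role: controller
      ssh:
        address: 10.0.0.1
        user: root
        keyPath: ~/.ssh/id_rsa
    - role: worker
      ssh:
        address: 10.0.0.2
        user: root
        keyPath: ~/.ssh/id_rsa
EOF
k0sctl apply --config k0sctl.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;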

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
┌──────────────────────────────────────────────────────┐
│             k0s binary (systemd / OpenRC)            │
│                                                      │
│  ┌──────────────────────────┐                        │
│  │     k0s controller       │                        │
│  │   (Control Plane)        │───────────────┐        │
│  │                          │               │        │
│  │  • API Server            │               ▼        │
│  │  • etcd (embedded)       │   ┌─────────────────┐  │
│  │  • Scheduler             │   │  k0s worker 1   │  │
│  │  • Controller Manager    │   │                 │  │
│  │  • containerd            │   │  • kubelet      │  │
│  │  • kube-router (eBPF)    │──▶│  • kube-router  │  │
│  │  • Autopilot updater     │   │  • containerd   │  │
│  └──────────────────────────┘   │  • Pods         │  │
│                                 └─────────────────┘  │
│  k0sctl tool → manages cluster lifecycle             │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Truly zero host OS dependencies — no kernel module requirements&lt;/li&gt;
&lt;li&gt;FIPS 140-2 compliance mode available&lt;/li&gt;
&lt;li&gt;eBPF-based networking via kube-router&lt;/li&gt;
&lt;li&gt;Autopilot automated upgrades&lt;/li&gt;
&lt;li&gt;k0sctl for full cluster lifecycle management&lt;/li&gt;
&lt;li&gt;ARM64 native support&lt;/li&gt;
&lt;li&gt;Air-gap install support&lt;/li&gt;
&lt;li&gt;Works on any Linux OS (Debian, RHEL, Alpine, CoreOS, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Download k0s&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sSLf&lt;/span&gt;  &amp;lt;https://get.k0s.sh&amp;gt; | &lt;span class="nb"&gt;sudo  &lt;/span&gt;sh

&lt;span class="c"&gt;# Install and start as a service&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  &lt;span class="nb"&gt;install  &lt;/span&gt;controller  &lt;span class="nt"&gt;--single&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  start

&lt;span class="c"&gt;# Get kubeconfig&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  kubeconfig  admin &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.kube/config

&lt;span class="c"&gt;# Check cluster&lt;/span&gt;

kubectl  get  nodes

&lt;span class="c"&gt;# Add a worker node — generate join token on controller&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  token  create  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;worker

&lt;span class="c"&gt;# On the worker node&lt;/span&gt;

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  &lt;span class="nb"&gt;install  &lt;/span&gt;worker  &lt;span class="nt"&gt;--token-file&lt;/span&gt;  /path/to/token

&lt;span class="nb"&gt;sudo  &lt;/span&gt;k0s  start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Truly zero host OS dependencies — works on any Linux, no special kernel configuration&lt;/li&gt;
&lt;li&gt;FIPS 140-2 compliance for regulated industries&lt;/li&gt;
&lt;li&gt;eBPF-based networking with kube-router is modern and efficient&lt;/li&gt;
&lt;li&gt;Autopilot handles automated upgrades safely&lt;/li&gt;
&lt;li&gt;k0sctl provides a proper cluster lifecycle management tool&lt;/li&gt;
&lt;li&gt;No swap or kernel module pre-requirements&lt;/li&gt;
&lt;li&gt;Air-gap support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller community than K3s or MicroK8s&lt;/li&gt;
&lt;li&gt;Less rich addon ecosystem&lt;/li&gt;
&lt;li&gt;k0sctl adds an additional tool to the workflow&lt;/li&gt;
&lt;li&gt;Some CNI plugins need manual configuration beyond kube-router&lt;/li&gt;
&lt;li&gt;Enterprise support is a paid product from Mirantis&lt;/li&gt;
&lt;li&gt;Fewer third-party integrations and tutorials&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Environments where host OS diversity is a challenge&lt;/strong&gt; — mixed Linux distributions, heavily locked-down servers, or compliance-driven deployments needing FIPS 140-2.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. RKE2 — Security-First Enterprise K8s
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;RKE2 (Rancher Kubernetes Engine 2) is the enterprise evolution of K3s. Where K3s optimises for minimal resource usage and edge deployability, RKE2 optimises for security hardening and compliance. It ships hardened by default with CIS Kubernetes Benchmark compliance, FIPS 140-2 support, automatic etcd snapshots, and deep Rancher integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RKE2&lt;/strong&gt; starts from K3s's architecture and adds a hardening layer: Pod Security Admission enforced by default, etcd encryption at rest, CIS-compliant API server flags, audit logging enabled, and Canal CNI with network policy enforcement. It is Kubernetes made appropriate for government and financial sector requirements.&lt;/p&gt;

&lt;p&gt;If K3s is the lightweight sports car, RKE2 is the armoured vehicle: heavier and more resource-intensive, but far harder to compromise.&lt;/p&gt;
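&lt;p&gt;Much of that hardening is toggled through one config file; a sketch of an illustrative server config (the CIS profile name tracks the benchmark version, so check your RKE2 release's documentation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# /etc/rancher/rke2/config.yaml, written before starting rke2-server
cat &amp;gt; /etc/rancher/rke2/config.yaml &amp;lt;&amp;lt;EOF
profile: cis-1.6              # enforce the CIS-hardened posture
cni: cilium                   # or canal (default) / calico
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 14
EOF
# NB: the cis profile expects an etcd system user to exist on the host
systemctl restart rke2-server.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;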

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────────────────────────────────────────┐
│              RKE2 Server (Hardened Control Plane)         │
│                                                           │
│  ┌──────────────────────────────────────────────────────┐ │
│  │  CIS-Hardened Kubernetes                             │ │
│  │                                                      │ │
│  │  • Hardened API Server (PSA enforced)                │ │
│  │  • etcd with automated snapshots                     │ │
│  │  • Hardened Scheduler &amp;amp; Controller Manager           │ │
│  │  • Canal / Calico / Cilium CNI (configurable)        │ │
│  │  • containerd runtime                                │ │
│  │  • Cert-manager + auto rotation                      │ │
│  └──────────────────────────────────────────────────────┘ │
│                    │                                      │
│          ┌─────────┴──────────┐                           │
│          ▼                    ▼                           │
│  ┌────────────────┐  ┌────────────────┐                   │
│  │  RKE2 Agent 1  │  │  RKE2 Agent 2  │                   │
│  │  (Worker)      │  │  (Worker)      │                   │
│  └────────────────┘  └────────────────┘                   │
│                                                           │
│  ┌──────────────────────────────────────────────────────┐ │
│  │  Rancher Management Plane (optional)                 │ │
│  └──────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CIS Kubernetes Benchmark v1.6 compliant by default&lt;/li&gt;
&lt;li&gt;FIPS 140-2 cryptographic compliance&lt;/li&gt;
&lt;li&gt;etcd with automated periodic snapshots and restoration&lt;/li&gt;
&lt;li&gt;Multiple CNI options: Canal (default), Calico, Cilium&lt;/li&gt;
&lt;li&gt;Automated certificate rotation&lt;/li&gt;
&lt;li&gt;Helm chart integration&lt;/li&gt;
&lt;li&gt;Air-gap install support&lt;/li&gt;
&lt;li&gt;Deep Rancher management platform integration&lt;/li&gt;
&lt;li&gt;Role-based node configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Install RKE2 server&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.rke2.io&amp;gt; | sh  -

systemctl  &lt;span class="nb"&gt;enable  &lt;/span&gt;rke2-server.service

systemctl  start  rke2-server.service

&lt;span class="c"&gt;# Get kubeconfig&lt;/span&gt;

&lt;span class="nb"&gt;export  &lt;/span&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/rancher/rke2/rke2.yaml

&lt;span class="c"&gt;# Get join token for workers&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt;  /var/lib/rancher/rke2/server/node-token

&lt;span class="c"&gt;# On worker nodes&lt;/span&gt;

curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  &amp;lt;https://get.rke2.io&amp;gt; | &lt;span class="nv"&gt;INSTALL_RKE2_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;  sh  -

&lt;span class="nb"&gt;mkdir&lt;/span&gt;  &lt;span class="nt"&gt;-p&lt;/span&gt;  /etc/rancher/rke2/

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/rancher/rke2/config.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;

server: https://&amp;lt;SERVER_IP&amp;gt;:9345

token: &amp;lt;NODE_TOKEN&amp;gt;
&lt;/span&gt;&lt;span class="no"&gt;
EOF

&lt;/span&gt;systemctl  &lt;span class="nb"&gt;enable  &lt;/span&gt;rke2-agent.service

systemctl  start  rke2-agent.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CIS Kubernetes Benchmark compliance out of the box — no manual hardening&lt;/li&gt;
&lt;li&gt;FIPS 140-2 for regulated environments (finance, government, healthcare)&lt;/li&gt;
&lt;li&gt;Automated etcd snapshots — point-in-time restore capability&lt;/li&gt;
&lt;li&gt;Multiple CNI choices (Canal, Calico, Cilium) for varied network requirements&lt;/li&gt;
&lt;li&gt;Excellent Rancher multi-cluster management integration&lt;/li&gt;
&lt;li&gt;Automated certificate rotation&lt;/li&gt;
&lt;li&gt;Strong air-gap support for isolated environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;4 GB RAM minimum makes it unsuitable for edge/IoT&lt;/li&gt;
&lt;li&gt;Longer startup time (~2 minutes)&lt;/li&gt;
&lt;li&gt;More operationally complex than K3s&lt;/li&gt;
&lt;li&gt;Overkill for non-compliance use cases&lt;/li&gt;
&lt;li&gt;Tightly coupled to the Rancher ecosystem&lt;/li&gt;
&lt;li&gt;Larger binary and resource footprint&lt;/li&gt;
&lt;li&gt;etcd only — no SQLite lightweight option&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best For
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Enterprise, compliance-driven, and government workloads&lt;/strong&gt; where security hardening and audit-readiness are non-negotiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scoring Across 8 Dimensions
&lt;/h2&gt;




&lt;p&gt;Scores are relative (1–10, higher is better):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;KIND&lt;/th&gt;
&lt;th&gt;Minikube&lt;/th&gt;
&lt;th&gt;MicroK8s&lt;/th&gt;
&lt;th&gt;K3s&lt;/th&gt;
&lt;th&gt;Vcluster&lt;/th&gt;
&lt;th&gt;k0s&lt;/th&gt;
&lt;th&gt;RKE2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ease of use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-node support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Addon ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge / IoT fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD suitability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Use Case Decision Guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Situation&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Runner-Up&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions / GitLab CI pipelines&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;KIND&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vcluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local development on macOS/Windows/Linux&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Minikube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer on Ubuntu workstation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MicroK8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raspberry Pi cluster at home&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Industrial IoT / factory floor&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;k0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARM-based edge server&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production workload on lightweight infra&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Government / regulated enterprise&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RKE2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;k0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FIPS 140-2 compliance required&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RKE2 or k0s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-tenant dev environments&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Vcluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Namespace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-team isolated clusters&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Vcluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;KIND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed Linux OS fleet&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;k0s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Air-gap / offline environment&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;k0s or RKE2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing Kubernetes itself&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;KIND&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HA on bare metal with minimal ops&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MicroK8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K3s embedded etcd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes with Rancher management&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RKE2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  The Decision Tree
&lt;/h3&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Do you need production-grade?
├── No → Is it for CI/CD testing?
│         ├── Yes → KIND
│         └── No  → Are you on Ubuntu?
│                   ├── Yes → MicroK8s
│                   └── No  → Minikube
└── Yes → Do you need compliance (FIPS/CIS)?
          ├── Yes → RKE2 (CIS+FIPS) or k0s (FIPS)
          └── No  → Is it edge/IoT/ARM?
                    ├── Yes → K3s
                    └── No  → Need multi-tenancy?
                              ├── Yes → Vcluster
                              └── No  → K3s or MicroK8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Verdict
&lt;/h2&gt;




&lt;p&gt;After a thorough review, the landscape shakes out clearly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K3s&lt;/strong&gt; is the most remarkable project in the lightweight Kubernetes space. It delivers a complete, CNCF-certified Kubernetes distribution in under 100 MB, runs on 512 MB of RAM, and works in air-gapped ARM environments. For the vast majority of production lightweight Kubernetes use cases, K3s is the correct answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vcluster&lt;/strong&gt; solves a problem no other distribution addresses: genuine Kubernetes API-level multi-tenancy without dedicated hardware. If you need to give 10 teams their own isolated clusters, Vcluster is the only sensible approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KIND&lt;/strong&gt; is indispensable for CI/CD. If you run Kubernetes integration tests in any CI system, KIND's 30-second, Docker-native, multi-node clusters are the right tool with no close competitor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minikube&lt;/strong&gt; remains the best onboarding experience for developers who are new to Kubernetes. The addon ecosystem and built-in dashboard lower the barrier to entry substantially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MicroK8s&lt;/strong&gt; is the best Kubernetes for Ubuntu. If your team lives on Ubuntu workstations and servers, snap-based installation, self-healing, and dqlite HA make it the most frictionless operational experience on that platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k0s&lt;/strong&gt; fills an important niche: mixed Linux fleets and environments where zero host OS dependencies matter more than community size or addon richness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RKE2&lt;/strong&gt; is the right answer when your compliance officer needs CIS Kubernetes Benchmark and FIPS 140-2. The resource overhead is the price of admission to heavily regulated sectors.&lt;/p&gt;




&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;KIND Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://minikube.sigs.k8s.io/docs/" rel="noopener noreferrer"&gt;Minikube Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://microk8s.io/docs" rel="noopener noreferrer"&gt;MicroK8s Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.k3s.io/" rel="noopener noreferrer"&gt;K3s Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vcluster.com/docs" rel="noopener noreferrer"&gt;Vcluster Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.k0sproject.io/" rel="noopener noreferrer"&gt;k0s Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.rke2.io/" rel="noopener noreferrer"&gt;RKE2 Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/certification/software-conformance/" rel="noopener noreferrer"&gt;CNCF Certified Kubernetes Conformance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This post was written in April 2026. Kubernetes moves fast — always check the official documentation for the latest version information.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt;  &lt;code&gt;kubernetes&lt;/code&gt;  &lt;code&gt;k8s&lt;/code&gt;  &lt;code&gt;k3s&lt;/code&gt;  &lt;code&gt;kind&lt;/code&gt;  &lt;code&gt;minikube&lt;/code&gt;  &lt;code&gt;microk8s&lt;/code&gt;  &lt;code&gt;vcluster&lt;/code&gt;  &lt;code&gt;k0s&lt;/code&gt;  &lt;code&gt;rke2&lt;/code&gt;  &lt;code&gt;devops&lt;/code&gt;  &lt;code&gt;infrastructure&lt;/code&gt;  &lt;code&gt;edge-computing&lt;/code&gt;  &lt;code&gt;cloud-native&lt;/code&gt;  &lt;code&gt;containers&lt;/code&gt;  &lt;code&gt;cncf&lt;/code&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running k3s on Proxmox: A Multi-Node Cluster with a VM and LXC Worker — The Hard Way and Back</title>
      <dc:creator>Pendela BhargavaSai</dc:creator>
      <pubDate>Tue, 21 Apr 2026 03:30:00 +0000</pubDate>
      <link>https://forem.com/pendelabhargavasai/running-k3s-on-proxmox-a-multi-node-cluster-with-a-vm-and-lxc-worker-the-hard-way-and-back-1cb4</link>
      <guid>https://forem.com/pendelabhargavasai/running-k3s-on-proxmox-a-multi-node-cluster-with-a-vm-and-lxc-worker-the-hard-way-and-back-1cb4</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide covering installation, troubleshooting, and the real story of getting k3s to run inside an LXC container&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo4jx0xy1tzo3m10e1pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo4jx0xy1tzo3m10e1pj.png" alt=" " width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;




&lt;p&gt;Kubernetes is powerful but notorious for being heavy. k3s, the lightweight Kubernetes distribution from Rancher, fixes that. It strips out legacy APIs, bundles containerd, and ships as a single binary under 100MB. It is perfect for homelabs, edge deployments, and resource-constrained environments.&lt;br&gt;
(more about k3s: &lt;a href="https://traefik.io/glossary/k3s-explained/?ref=adventuresintech.org" rel="noopener noreferrer"&gt;https://traefik.io/glossary/k3s-explained/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This is the first of a series of posts describing how to bootstrap a Kubernetes cluster on &lt;a href="https://proxmox.com/?ref=adventuresintech.org" rel="noopener noreferrer"&gt;Proxmox&lt;/a&gt; using an Ubuntu VM and LXC containers. By the end of the series, the aim is to have a fully working Kubernetes (&lt;a href="https://k3s.io/?ref=adventuresintech.org" rel="noopener noreferrer"&gt;K3S&lt;/a&gt;) install including the &lt;a href="https://metallb.universe.tf/?ref=adventuresintech.org" rel="noopener noreferrer"&gt;MetalLB&lt;/a&gt; load balancer, a &lt;a href="https://gateway-api.sigs.k8s.io/guides/getting-started/" rel="noopener noreferrer"&gt;Gateway API&lt;/a&gt; controller, and an Istio service mesh. I’ll also have some sample applications installed for good measure.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why do I need a Kubernetes cluster?
&lt;/h2&gt;



&lt;p&gt;At work, I’ve used large K8s clusters in production environments (AWS), but those clusters are abstracted away behind platform teams. That is efficient for delivery, but it leaves gaps in understanding how scheduling, networking, storage, and controllers really behave under the hood. Setting up your own cluster gives you that missing layer of operational intuition: you get to break things, debug them, and understand why they broke. For someone already running a fairly complex home setup, using Kubernetes as a unifying platform to experiment with, whether or not you fully migrate all your Docker Compose stacks, is less about necessity and more about building practical, transferable expertise.&lt;/p&gt;

&lt;p&gt;In this post I document how I built a three-node k3s cluster on Proxmox VE with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1 master node&lt;/strong&gt; — a Proxmox VM running Ubuntu&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1 VM worker node&lt;/strong&gt; — a standard Proxmox VM (worker1)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1 LXC worker node&lt;/strong&gt; — a Proxmox LXC container (worker2)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The VM setup was straightforward. The LXC setup was not. This post focuses heavily on the LXC journey — the errors, the fixes, the Linux internals involved, and what it finally took to make it work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkgf71akebq4knbgx4fi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkgf71akebq4knbgx4fi.png" width="476" height="106"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 1: Setting Up the Master Node
&lt;/h2&gt;


&lt;h3&gt;
  
  
  &lt;em&gt;Installing k3s Server&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;On the master VM, installing k3s is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  https://get.k3s.io | sh  -

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;k3s sets up a systemd service, installs containerd, and bootstraps a single-node Kubernetes cluster automatically.&lt;/p&gt;
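
&lt;p&gt;Before touching any kubeconfig, you can sanity-check the install with the bundled kubectl (run as root, it reads k3s's own kubeconfig automatically):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
sudo systemctl status k3s --no-pager

sudo k3s kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;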

&lt;h3&gt;
  
  
  &lt;u&gt;Fixing kubectl Access&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;After installation, running &lt;code&gt;kubectl get nodes&lt;/code&gt; immediately fails:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;The connection to the server localhost:8080 was refused&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;This happens because kubectl defaults to &lt;code&gt;localhost:8080&lt;/code&gt; when no kubeconfig is set. k3s stores its kubeconfig at &lt;code&gt;/etc/rancher/k3s/k3s.yaml&lt;/code&gt;. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt;  &lt;span class="nt"&gt;-p&lt;/span&gt;  ~/.kube

&lt;span class="nb"&gt;sudo  cp&lt;/span&gt;  /etc/rancher/k3s/k3s.yaml  ~/.kube/config

&lt;span class="nb"&gt;sudo  chown&lt;/span&gt;  &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt;  ~/.kube/config

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or export it permanently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt;  &lt;span class="s1"&gt;'export KUBECONFIG=/etc/rancher/k3s/k3s.yaml'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc

&lt;span class="nb"&gt;source&lt;/span&gt;  ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retrieve the Node Token
&lt;/h3&gt;

&lt;p&gt;Worker nodes need a token to join the cluster. Grab it from the master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;sudo  cat&lt;/span&gt;  /var/lib/rancher/k3s/server/node-token

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep this value — it is used in every worker join command.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Adding the VM Worker (worker1)
&lt;/h2&gt;




&lt;h3&gt;
  
  
  &lt;em&gt;Joining the Cluster&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;On the worker VM, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  https://get.k3s.io | &lt;span class="se"&gt;\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://192.168.1.44:6443  &lt;span class="se"&gt;\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;node-token&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;

sh  -

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;u&gt;Problem: Node Password Rejected&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;The agent started but immediately logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Node password rejected, duplicate hostname or contents of

'/etc/rancher/node/password' may not match server node-passwd entry

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happened because the worker VM had previously joined the cluster. k3s stores a node password on both the node (&lt;code&gt;/etc/rancher/node/password&lt;/code&gt;) and the master (as a Kubernetes secret). When they don't match, the server rejects the node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix — on the worker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;sudo  &lt;/span&gt;systemctl  stop  k3s-agent

&lt;span class="nb"&gt;sudo  rm&lt;/span&gt;  &lt;span class="nt"&gt;-f&lt;/span&gt;  /etc/rancher/node/password

&lt;span class="nb"&gt;sudo  rm&lt;/span&gt;  &lt;span class="nt"&gt;-rf&lt;/span&gt;  /var/lib/rancher/k3s/agent/

&lt;span class="nb"&gt;sudo  &lt;/span&gt;systemctl  start  k3s-agent

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix — on the master, delete the stale secret:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
kubectl  get  secrets  &lt;span class="nt"&gt;-n&lt;/span&gt;  kube-system | &lt;span class="nb"&gt;grep  &lt;/span&gt;node-password

kubectl  delete  secret  worker1.node-password.k3s  &lt;span class="nt"&gt;-n&lt;/span&gt;  kube-system

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;u&gt;Problem: Duplicate Hostname&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;Both the master and worker had the hostname &lt;code&gt;k3s&lt;/code&gt;. k3s uses the hostname as the node name, so the server rejected the second node as a duplicate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix — rename the worker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;sudo  &lt;/span&gt;hostnamectl  set-hostname  worker1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After renaming and cleaning up the stale secret, the worker joined successfully.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The LXC Worker — The Real Story
&lt;/h2&gt;




&lt;h3&gt;
  
  
  What is an LXC Container?
&lt;/h3&gt;

&lt;p&gt;LXC (Linux Containers) is a lightweight virtualisation technology. Unlike VMs which emulate full hardware, LXC containers share the host kernel directly. They use Linux namespaces for isolation and cgroups for resource control. They are faster and more efficient than VMs but have less isolation.&lt;/p&gt;

&lt;p&gt;Proxmox LXC containers can be &lt;strong&gt;privileged&lt;/strong&gt; (root inside = root on host) or &lt;strong&gt;unprivileged&lt;/strong&gt; (root inside maps to a regular user on host via UID namespacing). Unprivileged is the default and more secure option.&lt;/p&gt;
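
&lt;p&gt;The UID remapping is visible on the Proxmox host itself. The range below is the typical Proxmox default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# On the Proxmox host: the subordinate UID range unprivileged containers map into&lt;/span&gt;
cat /etc/subuid
&lt;span class="c"&gt;# root:100000:65536   (container UID 0 runs as host UID 100000)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;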

&lt;h3&gt;
  
  
  Creating the LXC Container
&lt;/h3&gt;

&lt;p&gt;In Proxmox, I created a Debian Trixie LXC container with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xq11x12p45pm40cx5l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xq11x12p45pm40cx5l9.png" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Joining the Cluster&lt;/em&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
curl  &lt;span class="nt"&gt;-sfL&lt;/span&gt;  https://get.k3s.io | &lt;span class="se"&gt;\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://192.168.1.44:6443  &lt;span class="se"&gt;\&lt;/span&gt;

&lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;node-token&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;

sh  -

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The install script ran and printed &lt;code&gt;[INFO] systemd: Starting k3s-agent&lt;/code&gt; — and then nothing. It just hung.&lt;/p&gt;

&lt;p&gt;Checking the journal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
journalctl  &lt;span class="nt"&gt;-u&lt;/span&gt;  k3s-agent  &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;u&gt;Error 1: &lt;code&gt;/dev/kmsg: no such file or directory&lt;/code&gt;&lt;/u&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;
Error: failed to run Kubelet: failed to create kubelet: open /dev/kmsg: no such file or directory

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is &lt;code&gt;/dev/kmsg&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/dev/kmsg&lt;/code&gt; is the kernel message buffer device. The Linux kernel uses it to log messages (this is what &lt;code&gt;dmesg&lt;/code&gt; reads). kubelet uses it to watch for OOM (Out of Memory) kill events via the &lt;code&gt;oomWatcher&lt;/code&gt;. Without it, kubelet refuses to start.&lt;/p&gt;

&lt;p&gt;In an unprivileged LXC container, &lt;code&gt;/dev/kmsg&lt;/code&gt; does not exist because the container does not have access to kernel devices.&lt;/p&gt;
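
&lt;p&gt;This is easy to confirm from inside the container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Inside the unprivileged container, the device simply is not there&lt;/span&gt;
ls -l /dev/kmsg
&lt;span class="c"&gt;# ls: cannot access '/dev/kmsg': No such file or directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;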

&lt;p&gt;&lt;strong&gt;Fix — bind mount from host:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;/etc/pve/lxc/209.conf&lt;/code&gt; on the Proxmox host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;
&lt;span class="n"&gt;lxc&lt;/span&gt;.&lt;span class="n"&gt;mount&lt;/span&gt;.&lt;span class="n"&gt;entry&lt;/span&gt;: /&lt;span class="n"&gt;dev&lt;/span&gt;/&lt;span class="n"&gt;kmsg&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;/&lt;span class="n"&gt;kmsg&lt;/span&gt; &lt;span class="n"&gt;none&lt;/span&gt; &lt;span class="n"&gt;bind&lt;/span&gt;,&lt;span class="n"&gt;create&lt;/span&gt;=&lt;span class="n"&gt;file&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This bind mounts the host's &lt;code&gt;/dev/kmsg&lt;/code&gt; into the container. Stop and start (not restart) the LXC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
pct  stop  209

pct  start  209

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;u&gt;Error 2: &lt;code&gt;/dev/kmsg: operation not permitted&lt;/code&gt;&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;After adding the bind mount, the error changed slightly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;
open /dev/kmsg: operation not permitted

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file now existed in the container but the process was not allowed to open it. The container was still running in user namespace mode (unprivileged), and AppArmor was blocking the access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix — disable AppArmor restriction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;
&lt;span class="n"&gt;lxc&lt;/span&gt;.&lt;span class="n"&gt;apparmor&lt;/span&gt;.&lt;span class="n"&gt;profile&lt;/span&gt;: &lt;span class="n"&gt;unconfined&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AppArmor is a Linux Security Module that applies mandatory access control policies. The default Proxmox LXC AppArmor profile blocks access to kernel devices like &lt;code&gt;/dev/kmsg&lt;/code&gt;. Setting it to &lt;code&gt;unconfined&lt;/code&gt; removes all AppArmor restrictions for this container.&lt;/p&gt;
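
&lt;p&gt;You can check which profile currently confines the container from inside it (the exact profile name varies by setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Inside the container: show the AppArmor label applied to this process&lt;/span&gt;
cat /proc/self/attr/current
&lt;span class="c"&gt;# e.g. lxc-container-default-cgns (enforce) before the change, unconfined after&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;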




&lt;h3&gt;
  
  
  &lt;u&gt;Error 3: &lt;code&gt;/proc/sys/kernel/panic: read-only file system&lt;/code&gt;&lt;/u&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;
Failed to start ContainerManager:

open /proc/sys/kernel/panic: read-only file system

open /proc/sys/kernel/panic_on_oops: read-only file system

open /proc/sys/vm/overcommit_memory: read-only file system

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is &lt;code&gt;/proc/sys&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/proc&lt;/code&gt; is a virtual filesystem the kernel exposes so userspace can read and write kernel parameters. &lt;code&gt;/proc/sys/&lt;/code&gt; specifically contains sysctl values — tuneable kernel settings.&lt;/p&gt;

&lt;p&gt;kubelet needs to write to these on startup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kernel/panic&lt;/code&gt; — configure kernel panic timeout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kernel/panic_on_oops&lt;/code&gt; — whether a kernel oops causes a panic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;vm/overcommit_memory&lt;/code&gt; — memory overcommit policy&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In an unprivileged LXC container, &lt;code&gt;/proc&lt;/code&gt; is mounted read-only for safety. Any process inside the container (even root inside) cannot modify these values.&lt;/p&gt;
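
&lt;p&gt;The asymmetry is easy to demonstrate: reads work, writes do not.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Inside the unprivileged container&lt;/span&gt;
cat /proc/sys/vm/overcommit_memory
&lt;span class="c"&gt;# 0   (reading works)&lt;/span&gt;
echo 1 &amp;gt; /proc/sys/vm/overcommit_memory
&lt;span class="c"&gt;# bash: /proc/sys/vm/overcommit_memory: Read-only file system&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;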

&lt;p&gt;&lt;strong&gt;Fix — mount proc and sys as read-write:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;
&lt;span class="n"&gt;lxc&lt;/span&gt;.&lt;span class="n"&gt;mount&lt;/span&gt;.&lt;span class="n"&gt;auto&lt;/span&gt;: &lt;span class="s2"&gt;"proc:rw sys:rw"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells LXC to mount &lt;code&gt;/proc&lt;/code&gt; and &lt;code&gt;/sys&lt;/code&gt; with read-write access instead of the default read-only.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;u&gt;Error 4: Various Permission Denied Errors&lt;/u&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;
write /proc/self/oom_score_adj: permission denied

Failed to set sysctl: open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These were caused by the container still running as unprivileged — the process was root inside the container but mapped to a normal user on the host, so many privileged operations were blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix — switch to privileged container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;
&lt;span class="py"&gt;unprivileged&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the most significant change. A privileged container maps root inside to actual root on the host. This removes the UID namespace remapping that caused most of the permission errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also needed:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;
&lt;span class="n"&gt;lxc&lt;/span&gt;.&lt;span class="n"&gt;cgroup2&lt;/span&gt;.&lt;span class="n"&gt;devices&lt;/span&gt;.&lt;span class="n"&gt;allow&lt;/span&gt;: &lt;span class="n"&gt;a&lt;/span&gt;

&lt;span class="n"&gt;lxc&lt;/span&gt;.&lt;span class="n"&gt;cap&lt;/span&gt;.&lt;span class="n"&gt;drop&lt;/span&gt;:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;cgroup2.devices.allow: a&lt;/code&gt; — allows the container access to all devices via the cgroup device controller&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;cap.drop:&lt;/code&gt; (empty) — prevents Proxmox from dropping any Linux capabilities. By default, Proxmox drops capabilities like &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt;, &lt;code&gt;CAP_NET_ADMIN&lt;/code&gt;, and &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt; from LXC containers. k3s needs these. A quick way to inspect the resulting capability set is shown after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
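
&lt;p&gt;To verify what the container actually retains, decode its effective capability set from inside it (&lt;code&gt;capsh&lt;/code&gt; is in Debian's &lt;code&gt;libcap2-bin&lt;/code&gt; package):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Inside the container: decode the effective capability bitmask&lt;/span&gt;
capsh --decode=$(awk '/CapEff/ {print $2}' /proc/self/status)
&lt;span class="c"&gt;# Look for cap_sys_admin and cap_net_admin in the output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;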

&lt;h3&gt;
  
  
  Also needed: &lt;code&gt;features: keyctl=1,nesting=1&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;keyctl=1&lt;/code&gt; — enables the Linux kernel keyring inside the container. containerd uses this to securely store credentials and keys for image pulls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;nesting=1&lt;/code&gt; — enables nested containerisation. k3s runs containerd inside the LXC container, and containerd runs pods (more containers) inside itself. Without nesting enabled, Proxmox blocks the inner container creation. (Both flags can be set with &lt;code&gt;pct&lt;/code&gt;, as shown after this list.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
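
&lt;p&gt;Both flags can be set from the Proxmox host without editing the config file by hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# On the Proxmox host&lt;/span&gt;
pct set 209 --features keyctl=1,nesting=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;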




&lt;h3&gt;
  
  
  &lt;u&gt;Final Working LXC Config&lt;/u&gt;
&lt;/h3&gt;
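
&lt;p&gt;Collected in one place, the additions to &lt;code&gt;/etc/pve/lxc/209.conf&lt;/code&gt; from all the fixes above are:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;
unprivileged: 0
features: keyctl=1,nesting=1
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: a
lxc.cap.drop:
lxc.mount.auto: "proc:rw sys:rw"
lxc.mount.entry: /dev/kmsg dev/kmsg none bind,create=file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;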

&lt;p&gt;After applying all these changes and doing a full &lt;code&gt;pct stop&lt;/code&gt; / &lt;code&gt;pct start&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
journalctl  &lt;span class="nt"&gt;-u&lt;/span&gt;  k3s-agent  &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;# ... containerd is now running&lt;/span&gt;

&lt;span class="c"&gt;# ... Server ACTIVE&lt;/span&gt;

&lt;span class="c"&gt;# ... Started kubelet&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary: What Each Modification Does
&lt;/h2&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjchnlx7oapagd55avh30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjchnlx7oapagd55avh30.png" width="800" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: LXC as a k3s Worker — Features and Limitations
&lt;/h2&gt;




&lt;h3&gt;
  
  
  &lt;u&gt;&lt;em&gt;Features / Advantages&lt;/em&gt;&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resource efficiency&lt;/strong&gt; — LXC containers consume significantly less memory and CPU than VMs. A VM needs a full OS kernel in memory. An LXC container shares the host kernel, so the overhead is minimal. worker2 running k3s uses around 250–300MB RAM idle versus a VM which would use 500MB+ for the OS alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast startup&lt;/strong&gt; — LXC containers start in 1–3 seconds versus 15–30 seconds for a VM. For ephemeral worker nodes or autoscaling scenarios this matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage efficiency&lt;/strong&gt; — LXC uses the host filesystem directly (with a root filesystem overlay). No separate virtual disk emulation layer. I/O is closer to bare metal performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple networking&lt;/strong&gt; — LXC containers participate in the same Proxmox bridge (&lt;code&gt;vmbr0&lt;/code&gt;) as VMs. No extra networking configuration is needed for k3s to communicate between the master VM and the LXC worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Density&lt;/strong&gt; — you can run more LXC containers on the same Proxmox host than VMs, making it ideal for testing multi-node cluster topologies on limited hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;&lt;em&gt;Limitations&lt;/em&gt;&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Shared kernel — no kernel version isolation&lt;/strong&gt; — all LXC containers on a host run the same kernel version as the host. You cannot run a different kernel inside an LXC container. This matters if you need a specific kernel feature or version for your workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privileged mode is a security trade-off&lt;/strong&gt; — to get k3s working we had to switch to a privileged container and disable AppArmor. In a privileged container, a root escape inside the container gives root on the host. For a homelab or trusted environment this is acceptable; for production or multi-tenant setups it is a significant risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No hardware virtualisation&lt;/strong&gt; — LXC containers cannot run nested VMs. If your workloads need hardware-level isolation or GPU passthrough in the container, a VM is required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel module limitations&lt;/strong&gt; — the LXC container cannot load kernel modules that aren't already loaded on the host. During setup we saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
modprobe: FATAL: Module br_netfilter not found

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These modules need to be loaded on the Proxmox host, not inside the container.&lt;/p&gt;
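
&lt;p&gt;Loading them on the host (and persisting across reboots) looks like this; the filename under &lt;code&gt;modules-load.d&lt;/code&gt; is our choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# On the Proxmox host, not inside the container&lt;/span&gt;
modprobe br_netfilter
lsmod | grep br_netfilter
echo br_netfilter &amp;gt;&amp;gt; /etc/modules-load.d/k3s.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;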

&lt;p&gt;&lt;strong&gt;Some syscalls are blocked&lt;/strong&gt; — even in privileged mode, certain syscalls that could affect the host are restricted. This can cause subtle compatibility issues with some container workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not suitable for untrusted workloads&lt;/strong&gt; — because the kernel is shared, a kernel exploit inside an LXC container could theoretically affect the host and all other containers. Never run untrusted code in a privileged LXC container.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;




&lt;p&gt;Getting k3s running on a Proxmox LXC container is absolutely possible, but it requires understanding why each restriction exists and selectively removing the ones that conflict with k3s's requirements. The journey from a blank LXC to a working cluster node touched on AppArmor, Linux capabilities, cgroups, kernel device access, namespace nesting, and virtual filesystem permissions.&lt;/p&gt;

&lt;p&gt;The key takeaway: LXC containers are not VMs. They share the host kernel, and every security restriction that makes them safe is also a potential blocker for complex software like k3s that expects a full OS environment. The solution is not to blindly disable everything — it is to understand each error, trace it to the underlying Linux feature, and make the minimal change required to unblock it.&lt;/p&gt;

&lt;p&gt;The final cluster — one control plane VM and two workers (one VM, one LXC) — runs stably with k3s managing scheduling, networking, and DNS across all three nodes via CoreDNS.&lt;/p&gt;

&lt;p&gt;I now have a vanilla multi-node Kubernetes cluster running across Ubuntu VMs and an LXC container, accessible from my machine. It’s got nothing deployed inside it yet, but that’s easily fixed... see you in Part 2.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on Proxmox VE with k3s v1.34.6+k3s1 — Debian Trixie LXC — Ubuntu VM nodes&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>"Why can’t I just mount S3 like a drive?” AWS finally answering that question in 2026</title>
      <dc:creator>Pendela BhargavaSai</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:35:35 +0000</pubDate>
      <link>https://forem.com/pendelabhargavasai/why-cant-i-just-mount-s3-like-a-drive-aws-finally-answering-that-question-in-2026-4g00</link>
      <guid>https://forem.com/pendelabhargavasai/why-cant-i-just-mount-s3-like-a-drive-aws-finally-answering-that-question-in-2026-4g00</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;From "why can't I just mount S3 like a drive?" to AWS finally answering that question in 2026.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;I've had that conversation more times than I can count.&lt;/p&gt;

&lt;p&gt;A developer joins a new AWS project, looks at the architecture, and asks: &lt;em&gt;"We're already storing everything in S3 — why do we also need EFS? Can't we just mount S3 directly?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And every time, the answer was the same patient explanation about object storage vs file systems, why they're fundamentally different, and why you need separate services for separate workloads. It was the right answer. It just wasn't a satisfying one.&lt;/p&gt;

&lt;p&gt;That changed in April 2026 when AWS launched &lt;strong&gt;S3 Files&lt;/strong&gt; — and suddenly that conversation got a lot shorter.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne1ezqqr8ls1axsuyqwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne1ezqqr8ls1axsuyqwh.png" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But before we get there, let's start from the beginning. Because understanding &lt;em&gt;why&lt;/em&gt; S3 Files matters requires understanding the problem it's solving. And that means understanding the full AWS storage landscape.&lt;/p&gt;


&lt;h2&gt;
  
  
  The AWS Storage Trinity (Before S3 Files)
&lt;/h2&gt;

&lt;p&gt;AWS has three primary storage services, each built for a completely different purpose. Engineers often get confused because on the surface they all seem to do the same thing: store data. But the &lt;em&gt;way&lt;/em&gt; they store it — and who can access it and how — is completely different.&lt;/p&gt;

&lt;p&gt;Here's the simplest way I know to think about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; is like a giant library. You can store billions of books (objects), and anyone with the right access can retrieve any book. But to fix a typo on page 47, you have to reprint the entire book.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EBS&lt;/strong&gt; is like a hard drive physically attached to your computer. Super fast, but only your computer can use it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EFS&lt;/strong&gt; is like a shared office filing cabinet on a network. Anyone in the office can open a drawer, pull out a folder, and edit a document — at the same time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go deeper on each one.&lt;/p&gt;


&lt;h2&gt;
  
  
  Amazon S3 — Object Storage Built for Scale
&lt;/h2&gt;

&lt;p&gt;S3 (Simple Storage Service) launched in 2006 and fundamentally changed how the world thinks about storing data. The core idea is simple: you have &lt;strong&gt;buckets&lt;/strong&gt;, and inside buckets you store &lt;strong&gt;objects&lt;/strong&gt;. Each object is just a file plus its metadata, stored at a unique key (think of it like a URL).&lt;/p&gt;
&lt;h3&gt;
  
  
  What makes S3 special
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtually unlimited scale.&lt;/strong&gt; S3 stores more than 500 trillion objects across hundreds of exabytes today.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;11 nines of durability (99.999999999%).&lt;/strong&gt; AWS automatically replicates your data across at least three Availability Zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pay only for what you use.&lt;/strong&gt; No minimum capacity, no infrastructure to manage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple storage classes.&lt;/strong&gt; From S3 Standard (~$0.023/GB) down to Glacier Deep Archive (~$0.00099/GB) for data you almost never touch.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  The one thing S3 cannot do
&lt;/h3&gt;

&lt;p&gt;Here's the catch that trips everyone up: &lt;strong&gt;S3 is not a file system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you store something in S3, it becomes an immutable object. If you want to change even a single character in a file, you have to download the entire object, make your change, and re-upload the whole thing as a new object. There's no such thing as "open this file and edit line 47." That's just not how object storage works.&lt;/p&gt;

&lt;p&gt;This isn't a bug — it's by design. The immutability of objects is part of what makes S3 so durable and scalable. But it creates real friction for any workload that needs to &lt;em&gt;work with&lt;/em&gt; data the way normal applications do: open a file, read some bytes, write some bytes, save.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you can do with S3&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;myfile.txt s3://my-bucket/myfile.txt    &lt;span class="c"&gt;# upload&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://my-bucket/myfile.txt ./myfile.txt  &lt;span class="c"&gt;# download&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;rm &lt;/span&gt;s3://my-bucket/myfile.txt               &lt;span class="c"&gt;# delete&lt;/span&gt;

&lt;span class="c"&gt;# What you CANNOT do&lt;/span&gt;
&lt;span class="c"&gt;# Open myfile.txt and append a line — impossible without full re-upload&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihtdjmavdki0i9x7xit0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihtdjmavdki0i9x7xit0.jpg" alt=" " width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Amazon EBS — The Fast Attached Drive
&lt;/h2&gt;

&lt;p&gt;EBS (Elastic Block Store) is block storage — the AWS equivalent of an SSD attached directly to your server. When you launch an EC2 instance, the root volume (where the operating system lives) is an EBS volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  What EBS is good at
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed.&lt;/strong&gt; EBS delivers single-digit millisecond latency because it behaves like a local disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSIX semantics.&lt;/strong&gt; You can open files, write individual bytes, seek to specific positions — everything a normal file system supports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency.&lt;/strong&gt; What you write is immediately readable. No eventual consistency concerns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The hard limit of EBS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EBS volumes can only be attached to one EC2 instance at a time&lt;/strong&gt; (with some multi-attach exceptions for specific use cases). &lt;/p&gt;

&lt;p&gt;This means if you have a cluster of 10 EC2 instances all running your application, each one needs its own EBS volume. They can't share data through EBS. If instance A writes a file, instance B can't see it without some kind of sync mechanism.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 Instance A  →  EBS Volume A  (can't share)
EC2 Instance B  →  EBS Volume B  (separate, isolated)
EC2 Instance C  →  EBS Volume C  (separate, isolated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For single-instance workloads — databases, operating system volumes, single-server applications — EBS is excellent. The moment you need shared storage across multiple servers, you hit a wall.&lt;/p&gt;




&lt;h2&gt;
  
  
  Amazon EFS — The Shared Network Drive
&lt;/h2&gt;

&lt;p&gt;EFS (Elastic File System) is AWS's managed Network File System (NFS). Think of it as a shared drive that any number of EC2 instances, containers, or Lambda functions can mount simultaneously and use like a local file system.&lt;/p&gt;
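
&lt;p&gt;Mounting it looks like any NFS mount. The file system ID below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Standard NFSv4.1 mount of an EFS file system (fs-12345678 is illustrative)&lt;/span&gt;
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/efs
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; nfs4 &lt;span class="nt"&gt;-o&lt;/span&gt; nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;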

&lt;h3&gt;
  
  
  What EFS solves
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent access.&lt;/strong&gt; Thousands of compute resources can mount and use the same EFS volume at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full POSIX semantics.&lt;/strong&gt; Open files, edit bytes in-place, file locking, directory operations — everything works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scales automatically.&lt;/strong&gt; The file system grows and shrinks as you add or remove files. No capacity planning required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-millisecond latency&lt;/strong&gt; on Standard tier.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 Instance A  ──┐
EC2 Instance B  ──┤──→  EFS Volume  (all share the same files)
EC2 Instance C  ──┘
Lambda Function ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F368aftu96o0epx2bty0g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F368aftu96o0epx2bty0g.jpg" alt=" " width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Where EFS falls short
&lt;/h3&gt;

&lt;p&gt;The pricing model. &lt;strong&gt;EFS charges you for every gigabyte stored, whether you touched it this month or not.&lt;/strong&gt; Standard tier is $0.30/GB-month — roughly 13x more expensive than S3 Standard per gigabyte.&lt;/p&gt;

&lt;p&gt;This is fine when your data is "hot" (actively accessed). It's painful when you have petabytes of data where only a fraction is actively used at any time. You end up paying full file system prices for data that's sitting idle.&lt;/p&gt;

&lt;p&gt;And the other problem: &lt;strong&gt;EFS has zero native integration with S3.&lt;/strong&gt; They're completely separate systems. Your data lake is in S3. Your compute needs EFS. So you write sync scripts to copy data back and forth — and now you have two copies of everything, two storage bills, and a manual process that breaks at the worst possible times.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Old Workflow Pain (The Problem All of This Creates)
&lt;/h2&gt;

&lt;p&gt;Before S3 Files, a typical ML or data engineering team's workflow looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 Data Lake
    ↓  (manual copy — takes time, costs money)
EFS Volume
    ↓  (mount on EC2)
EC2 Training Job
    ↓  (output back to EFS)
    ↓  (another manual copy)
S3 Data Lake  ← results stored here for analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every arrow in that diagram is a point of failure. Every copy step is a delay, a cost, and a potential for the two copies to drift out of sync. Engineers were spending real engineering hours maintaining these sync pipelines — hours that weren't building anything valuable.&lt;/p&gt;
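
&lt;p&gt;In practice, the "sync pipeline" was usually glue like this (illustrative; &lt;code&gt;train.py&lt;/code&gt; stands in for whatever your job runs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the working set out of the data lake before the job&lt;/span&gt;
aws s3 sync s3://my-data-lake/dataset/ /mnt/efs/dataset/

&lt;span class="c"&gt;# Run the job against EFS&lt;/span&gt;
python train.py --data /mnt/efs/dataset --out /mnt/efs/results

&lt;span class="c"&gt;# Push results back and hope nothing drifted in between&lt;/span&gt;
aws s3 sync /mnt/efs/results/ s3://my-data-lake/results/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;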

&lt;p&gt;This is the problem that s3fs tried to solve, years before AWS had an official answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  s3fs-fuse — The Community's Workaround
&lt;/h2&gt;

&lt;p&gt;If you've been working with AWS for a few years, you've probably encountered &lt;code&gt;s3fs-fuse&lt;/code&gt;. It's an open-source FUSE (Filesystem in Userspace) tool that lets you mount an S3 bucket as a local directory on Linux, macOS, or FreeBSD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;s3fs

&lt;span class="c"&gt;# Configure credentials&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ACCESS_KEY_ID:SECRET_ACCESS_KEY"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.passwd-s3fs
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.passwd-s3fs

&lt;span class="c"&gt;# Mount your bucket&lt;/span&gt;
s3fs my-bucket /mnt/s3-data &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;passwd_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/.passwd-s3fs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, you can run &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cp&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt; — your S3 bucket looks like a local folder. For a quick demo or a simple use case, it feels magical.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's actually happening under the hood
&lt;/h3&gt;

&lt;p&gt;Here's the thing nobody tells you upfront: s3fs isn't &lt;em&gt;really&lt;/em&gt; giving you file system access to S3. It's translating file commands into S3 API calls — and the translation has serious limitations.&lt;/p&gt;

&lt;p&gt;When you "edit" a file through s3fs, this is what actually happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: nano myfile.txt  (make a small change, save)
     ↓
s3fs: GET entire object from S3 → download to local temp cache
s3fs: You edit the local temp copy
s3fs: On file close → PUT entire object back to S3 (full re-upload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change one character in a 10GB file? s3fs downloads all 10GB, makes the change, and uploads all 10GB again. Every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real limitations you need to know
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No file locking.&lt;/strong&gt; If two processes try to write to the same file through s3fs at the same time, you get data corruption. Not an error message — silent data corruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No atomic renames.&lt;/strong&gt; Renaming a file in s3fs copies it to a new key and deletes the old one. Any application that relies on atomic renames (which includes most databases and many log processors) will break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slow directory listings.&lt;/strong&gt; Every &lt;code&gt;ls&lt;/code&gt; is a &lt;code&gt;ListObjects&lt;/code&gt; API call to S3. On a bucket with millions of objects, this is painfully slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No hard links or symbolic links.&lt;/strong&gt; S3 simply doesn't support them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Operation          | What s3fs does              | Problem
-------------------|-----------------------------|-----------------------
Read file          | GET entire object           | Slow for large files
Edit file          | Download → edit → full PUT  | Expensive re-upload
Append to file     | Rewrite entire object       | Very expensive
Rename file        | Copy + Delete               | Not atomic
File lock          | Not supported               | Data corruption risk
List directory     | ListObjects API call        | Slow on large buckets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;s3fs works well for lightweight, read-heavy, single-process use cases. But the moment you need multi-process access, in-place edits, or production reliability — it starts breaking down. The community built it because AWS didn't have a better answer. Eventually, AWS tried building their own version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mountpoint for S3 — AWS's Open-Source Attempt (2023)
&lt;/h2&gt;

&lt;p&gt;In 2023, AWS released &lt;strong&gt;Mountpoint for S3&lt;/strong&gt;, their own open-source FUSE client. It was faster than s3fs-fuse and better optimised for cloud-native read-heavy workloads.&lt;/p&gt;
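
&lt;p&gt;Usage is a one-liner (bucket name illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mountpoint for S3: great for large sequential reads, no in-place writes&lt;/span&gt;
mount-s3 my-bucket /mnt/my-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;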

&lt;p&gt;But it still couldn't do in-place edits, directory renames, or file locking. It was better than s3fs-fuse, but it still hit the same fundamental ceiling: you can't make S3's API behave like a real file system by pretending.&lt;/p&gt;

&lt;p&gt;AWS knew this. Internally, they'd been trying to solve it properly for years.&lt;/p&gt;




&lt;h2&gt;
  
  
  Amazon S3 Files — The Real Solution (April 2026)
&lt;/h2&gt;

&lt;p&gt;On April 7, 2026, AWS launched &lt;strong&gt;S3 Files&lt;/strong&gt; — and it's the most significant S3 update since the service launched.&lt;/p&gt;

&lt;p&gt;The internal project was even called "EFS3" at one point. One engineer on the team described the design process as &lt;em&gt;"a battle of unpalatable compromises."&lt;/em&gt; Getting object storage and file system semantics to truly coexist is genuinely hard engineering. Every design decision forced a tradeoff where either the file presentation or the object presentation had to give something up.&lt;/p&gt;

&lt;p&gt;What they landed on is clever: instead of trying to make the S3 API &lt;em&gt;behave&lt;/em&gt; like a file system (which is what s3fs does), they did the opposite — they took a real, production-grade file system (EFS) and connected it directly to S3 storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  How S3 Files actually works
&lt;/h3&gt;

&lt;p&gt;S3 Files uses a &lt;strong&gt;two-tier architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — EFS Cache Layer (hot data)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores your active working set: recently written files, recently read files, metadata&lt;/li&gt;
&lt;li&gt;Delivers ~1ms latency&lt;/li&gt;
&lt;li&gt;Serves small files (under 128KB by default) entirely from cache&lt;/li&gt;
&lt;li&gt;Handles all NFS file operations — open, read, write, rename, lock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — S3 Bucket (your full dataset)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Holds your complete data at normal S3 prices (~$0.023/GB)&lt;/li&gt;
&lt;li&gt;Large reads (1MB+) bypass the cache entirely and stream directly from S3 for free&lt;/li&gt;
&lt;li&gt;Changes made through the file system sync back to S3 automatically within minutes
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Application
      ↓  (NFS mount — standard Linux file operations)
EFS Cache Layer  ←→  Smart Router
      ↓                    ↓
   Hot data            Cold/large data
   (~1ms)              (streams from S3, free)
      ↓                    ↓
      └────────────────────┘
                  ↓
            S3 Bucket
       (your data, always here)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;your data never leaves S3.&lt;/strong&gt; The EFS cache is just a smart caching layer on top. You're not maintaining two copies — you have one copy in S3, accessible via both the S3 API and the file system mount simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0a728p2arun7vntopk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0a728p2arun7vntopk.png" alt=" " width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OLD way to New way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F662i5jvty3f4qfmx1swi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F662i5jvty3f4qfmx1swi.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started in 3 steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create an S3 file system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the AWS Console: S3 → File Systems → Create file system. Enter your bucket name and you're done.&lt;/p&gt;

&lt;p&gt;Or via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api create-file-system &lt;span class="nt"&gt;--bucket&lt;/span&gt; my-bucket
aws s3api create-mount-target &lt;span class="nt"&gt;--file-system-id&lt;/span&gt; fs-xxxx &lt;span class="nt"&gt;--subnet-id&lt;/span&gt; subnet-xxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Mount it on your EC2 instance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make sure the &lt;code&gt;amazon-efs-utils&lt;/code&gt; package is installed (it's available from the Amazon Linux package repositories), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; /mnt/s3files
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; s3files fs-0aa860d05df9afdfe:/ /mnt/s3files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
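
&lt;p&gt;If you want the mount to survive reboots, the usual move is an &lt;code&gt;/etc/fstab&lt;/code&gt; entry. The supported options for the &lt;code&gt;s3files&lt;/code&gt; type aren't covered above, so treat this as a sketch modeled on the EFS mount-helper convention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# /etc/fstab entry (hypothetical; check the S3 Files docs for the supported options)
fs-0aa860d05df9afdfe:/ /mnt/s3files s3files _netdev 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;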



&lt;p&gt;&lt;strong&gt;Step 3: Use it like any local directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a file&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Hello S3 Files"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/s3files/hello.txt

&lt;span class="c"&gt;# Edit it in place&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"New line added"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /mnt/s3files/hello.txt

&lt;span class="c"&gt;# List files&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /mnt/s3files/

&lt;span class="c"&gt;# The same data is accessible via S3 API too&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://my-bucket/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changes you make through the file system mount appear in S3 within minutes. Changes made directly to the S3 bucket appear in the file system within seconds.&lt;/p&gt;
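
&lt;p&gt;You can watch that bucket-to-mount direction yourself. Everything below is the standard AWS CLI plus the mount from Step 2; the object key is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Write an object directly through the S3 API...
echo "written via the API" | aws s3 cp - s3://my-bucket/from-api.txt

# ...and read it through the file system mount moments later
cat /mnt/s3files/from-api.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;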

&lt;h3&gt;
  
  
  Security — what you need to know
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IAM integration for access control at both file system and object level&lt;/li&gt;
&lt;li&gt;Data encrypted in transit using TLS 1.3&lt;/li&gt;
&lt;li&gt;Data encrypted at rest using SSE-S3 (or KMS if you prefer customer-managed keys)&lt;/li&gt;
&lt;li&gt;POSIX permissions (UID/GID) stored as S3 object metadata (see the quick check below)&lt;/li&gt;
&lt;li&gt;Monitor via CloudWatch metrics and CloudTrail logs&lt;/li&gt;
&lt;/ul&gt;
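
&lt;p&gt;Because the POSIX attributes live in object metadata, a standard &lt;code&gt;head-object&lt;/code&gt; call is a quick way to confirm what's stored. The exact metadata key names aren't documented here, so this only shows where to look:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Change ownership through the mount...
sudo chown 1000:1000 /mnt/s3files/hello.txt

# ...then inspect the object's metadata via the S3 API
aws s3api head-object --bucket my-bucket --key hello.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;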

&lt;h3&gt;
  
  
  Pricing — the part that actually makes sense
&lt;/h3&gt;

&lt;p&gt;S3 Files charges EFS-level rates, but &lt;strong&gt;only on the fraction of data you're actively working with&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you pay for&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High-performance storage (hot data)&lt;/td&gt;
&lt;td&gt;$0.30/GB-month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reads (small files served from cache)&lt;/td&gt;
&lt;td&gt;$0.03/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writes&lt;/td&gt;
&lt;td&gt;$0.06/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Everything else in your S3 bucket&lt;/td&gt;
&lt;td&gt;Standard S3 rates (~$0.023/GB-month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you have a 100TB dataset but only 1TB is actively used at any time — you pay EFS rates on 1TB and S3 rates on the other 99TB. AWS claims up to 90% cost savings compared to the old pattern of cycling data between S3 and a dedicated EFS volume.&lt;/p&gt;
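
&lt;p&gt;Back-of-the-envelope for that example, using the storage rates above and rounding 1TB to 1,000GB (per-GB read/write charges ignored):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hot tier:    1,000 GB × $0.30/GB-month  =    $300/month
Cold tier:  99,000 GB × $0.023/GB-month =  $2,277/month
Total:                                    ~$2,577/month

All-EFS:   100,000 GB × $0.30/GB-month  = $30,000/month
Savings:   1 - (2,577 / 30,000)         ≈ 91%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which is right in line with the "up to 90%" claim.&lt;/p&gt;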




&lt;h2&gt;
  
  
  Putting It All Together — Which Service Should You Use?
&lt;/h2&gt;

&lt;p&gt;Here's the honest answer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;th&gt;When you need&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bulk storage, backups, data lakes, analytics, static assets, anything accessed via API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EBS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OS volumes, databases, single-instance high-performance storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EFS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared file system for legacy NAS migration, on-premises workloads moving to cloud, apps that need pure NFS without S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML pipelines, agentic AI workflows, data engineering, any workload where both S3 API and file system access are needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;s3fs-fuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quick prototypes, read-heavy single-process scripts, legacy apps where you can't change the architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The quick comparison
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy897f8eanwniy70iw975.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy897f8eanwniy70iw975.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for ML and AI Workloads
&lt;/h2&gt;

&lt;p&gt;If you're building machine learning pipelines or agentic AI systems, S3 Files is worth paying close attention to.&lt;/p&gt;

&lt;p&gt;The old workflow was: data lives in S3 → copy to EFS before training → run training job → copy results back to S3. For large datasets, that copy step alone could take hours. You were also paying double storage costs during the transition.&lt;/p&gt;

&lt;p&gt;With S3 Files, your training job mounts the S3 bucket directly. The EFS cache warms up as your training reads data. No copy step. No sync script. No duplicate storage.&lt;/p&gt;
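
&lt;p&gt;In shell terms, the difference looks like this. The bucket, paths, and &lt;code&gt;train.py&lt;/code&gt; are placeholders; the mount is the one from the setup steps above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Old pattern: stage a copy onto EFS, train, copy results back
aws s3 sync s3://my-bucket/dataset /mnt/efs/dataset   # hours for large datasets
python train.py --data /mnt/efs/dataset
aws s3 sync /mnt/efs/results s3://my-bucket/results

# New pattern: train directly against the S3 Files mount
python train.py --data /mnt/s3files/dataset           # cache warms as you read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;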

&lt;p&gt;For agentic AI systems specifically — where multiple agents need to coordinate through shared files, read each other's outputs, and maintain shared state — S3 Files provides exactly the concurrent NFS access with close-to-open consistency that these workloads need. Standard Python file operations, standard shell tools, all working against data that lives in S3.&lt;/p&gt;
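
&lt;p&gt;A minimal sketch of that coordination with stock tools, assuming the NFS locking mentioned earlier behaves like an ordinary NFS mount (the agent roles and file names are invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Agent A appends a result under an advisory lock
flock /mnt/s3files/state.lock \
  sh -c 'echo "step 1 complete" &amp;gt;&amp;gt; /mnt/s3files/shared-state.log'

# Agent B, possibly on another instance, takes the same lock before reading
flock /mnt/s3files/state.lock cat /mnt/s3files/shared-state.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;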




&lt;h2&gt;
  
  
  The Short Version
&lt;/h2&gt;

&lt;p&gt;For a decade, AWS storage was a choice: pay S3 prices and lose file system semantics, or pay EFS prices and lose S3 integration. Teams wrote sync scripts, maintained duplicate data, and spent engineering time on storage plumbing instead of actual product work.&lt;/p&gt;

&lt;p&gt;s3fs-fuse was the community's best attempt at a workaround — and it worked, up to a point. But it was always emulating file system behavior on top of an API that wasn't designed for it.&lt;/p&gt;

&lt;p&gt;S3 Files is the first time AWS has genuinely solved this at the right layer. Real NFS semantics, real S3 storage, real production reliability. One bucket, two protocols, no compromises.&lt;/p&gt;

&lt;p&gt;If you've ever maintained a sync script between your data lake and your compute layer — you know exactly what problem this solves. And you know exactly how good it feels to delete that script.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/features/files/" rel="noopener noreferrer"&gt;Amazon S3 Files product page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/launching-s3-files-making-s3-buckets-accessible-as-file-systems/" rel="noopener noreferrer"&gt;AWS Blog: Launching S3 Files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-files.html" rel="noopener noreferrer"&gt;S3 Files documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/s3fs-fuse/s3fs-fuse" rel="noopener noreferrer"&gt;s3fs-fuse on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;Amazon S3 pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/efs/pricing/" rel="noopener noreferrer"&gt;Amazon EFS pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=zb8TdNJhZCk" rel="noopener noreferrer"&gt;Intro to S3 Files by Darko Mesaros&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Published April 2026. All pricing figures reflect us-east-1 as of the time of writing.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this helped you, drop a reaction or leave a comment — curious what storage patterns others are running into in the wild.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
