<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ilia Gusev</title>
    <description>The latest articles on Forem by Ilia Gusev (@persikbl).</description>
    <link>https://forem.com/persikbl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3723238%2F023f195c-a081-472f-9b8f-b68096ab1fe6.png</url>
      <title>Forem: Ilia Gusev</title>
      <link>https://forem.com/persikbl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/persikbl"/>
    <language>en</language>
    <item>
      <title>Signed Images, Runtime Watchtowers, and Why Docker Pull Is an Act of Faith</title>
      <dc:creator>Ilia Gusev</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:27:10 +0000</pubDate>
      <link>https://forem.com/persikbl/signed-images-runtime-watchtowers-and-why-docker-pull-is-an-act-of-faith-13il</link>
      <guid>https://forem.com/persikbl/signed-images-runtime-watchtowers-and-why-docker-pull-is-an-act-of-faith-13il</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://podostack.com/p/signed-images-runtime-watchtowers-docker-pull-act-of-faith" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you run &lt;code&gt;docker pull&lt;/code&gt;, you're trusting that nobody tampered with that image between the build and your cluster. npm has signatures. Go modules have checksums. Docker images? Most of us just... hope for the best.&lt;/p&gt;

&lt;p&gt;This week: supply chain security. The trust chain from build to runtime, and how to stop flying blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern: Supply Chain Trust
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem is invisible
&lt;/h3&gt;

&lt;p&gt;SolarWinds. Codecov. ua-parser-js. The pattern is always the same: attackers compromise the build or distribution pipeline, inject malicious code, and it flows downstream into production. Nobody notices because the artifact &lt;em&gt;looks&lt;/em&gt; legitimate.&lt;/p&gt;

&lt;p&gt;Container images have the same blind spot. You pull &lt;code&gt;nginx:1.25&lt;/code&gt;, but how do you know it wasn't modified after the maintainer pushed it? You don't. Not unless you verify.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three layers of defense
&lt;/h3&gt;

&lt;p&gt;Good supply chain security works in layers - multiple checks, each catching what the previous one missed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9ntkrklkp9oufpijs5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9ntkrklkp9oufpijs5q.png" alt="Three layers of defense" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Build time - scan in CI.&lt;/strong&gt; Tools like Trivy or Grype scan your images for known CVEs before they leave the pipeline. If something has a critical vulnerability, the build fails. You hear about it before it reaches a registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Registry - sign with cosign.&lt;/strong&gt; After building, sign the image with &lt;a href="https://docs.sigstore.dev/cosign/overview/" rel="noopener noreferrer"&gt;cosign&lt;/a&gt; from the Sigstore project. The signature proves who built it and that the content hasn't changed. Think of it like a wax seal on a letter - break the seal, and everyone knows.&lt;/p&gt;
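Mechanically, keyless signing is one command in CI and verification is one more. A sketch, assuming a hypothetical image on GHCR signed from GitHub Actions (the digest is a placeholder):

```shell
# Sign by digest, not tag - tags are mutable, digests are not.
# In CI, cosign picks up the ambient OIDC token; --yes skips the prompt.
cosign sign --yes ghcr.io/your-org/app@sha256:<digest>

# Anyone can then verify both the signature and the identity behind it:
cosign verify \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'https://github.com/your-org/.*' \
  ghcr.io/your-org/app@sha256:<digest>
```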

&lt;p&gt;&lt;strong&gt;Layer 3: Admission - verify at the gate.&lt;/strong&gt; Kyverno's &lt;code&gt;verifyImages&lt;/code&gt; rule checks that every image entering your cluster has a valid signature. No signature? Rejected. This is the last line of defense.&lt;/p&gt;

&lt;p&gt;Each layer alone has gaps. Together, they're solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.sigstore.dev/" rel="noopener noreferrer"&gt;Sigstore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://slsa.dev/" rel="noopener noreferrer"&gt;SLSA Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Hidden Gem: Falco
&lt;/h2&gt;

&lt;p&gt;Your IDS watches network traffic. Falco watches syscalls. Different universe.&lt;/p&gt;

&lt;p&gt;Falco is a CNCF Graduated project - the highest maturity level - that does runtime threat detection. Not "scan and report later." Real-time, at the syscall level, while your containers are running.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Falco hooks into Linux syscalls via eBPF. Every file open, every network connection, every process spawn - Falco sees it. Then it runs your rules against that stream. A rule says "if a shell is spawned inside a container, that's suspicious." Falco fires an alert within milliseconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terminal shell in container&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect a shell spawned in a container&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and container&lt;/span&gt;
    &lt;span class="s"&gt;and proc.name in (bash, sh, zsh)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Shell spawned in container&lt;/span&gt;
    &lt;span class="s"&gt;(user=%user.name container=%container.name&lt;/span&gt;
     &lt;span class="s"&gt;shell=%proc.name parent=%proc.pname)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WARNING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches things that scanning never will. A clean image can still be exploited at runtime. A zero-day doesn't show up in CVE databases. But someone opening a reverse shell inside your nginx container? Falco catches that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgqwxxkpfwp56209bpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgqwxxkpfwp56209bpn.png" alt="Falco" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why eBPF matters here
&lt;/h3&gt;

&lt;p&gt;eBPF lets Falco collect syscall events inside the kernel without modifying the kernel itself. No kernel modules to maintain, no recompilation. It hooks into syscall entry/exit points and streams events to userspace, where the rules engine evaluates them.&lt;/p&gt;

&lt;p&gt;The performance overhead is minimal - you're adding a few microseconds to syscall paths. For a security tool that watches everything in real time, that's a remarkable trade-off.&lt;/p&gt;
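Trying this yourself takes minutes with the official Helm chart. A sketch under stated assumptions - the `falco` namespace, the `modern_ebpf` driver value, and an existing nginx Deployment are all illustrative:

```shell
# Install Falco with the modern eBPF driver (no kernel module to build)
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set driver.kind=modern_ebpf

# Trigger the shell-in-container rule from any running pod...
kubectl exec -it deploy/nginx -- sh

# ...then watch the alert land in Falco's logs
kubectl logs -n falco -l app.kubernetes.io/name=falco
```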

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;falco.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/projects/falco/" rel="noopener noreferrer"&gt;CNCF Falco&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Showdown: Distroless vs Alpine
&lt;/h2&gt;

&lt;p&gt;Two approaches to minimal images. Very different trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alpine (the small one)
&lt;/h3&gt;

&lt;p&gt;5MB base. Uses musl libc instead of glibc. Ships with the &lt;code&gt;apk&lt;/code&gt; package manager. You can &lt;code&gt;sh&lt;/code&gt; into it, install debugging tools, poke around. About 260 packages in the base, which means roughly 150 CVEs per year to track. Small, but not empty.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distroless (the empty one)
&lt;/h3&gt;

&lt;p&gt;No package manager. No shell. No &lt;code&gt;ls&lt;/code&gt;, no &lt;code&gt;cat&lt;/code&gt;, no nothing. Just your binary and the runtime it needs. Google maintains the base images. Result: about 5 CVEs per year. There's almost nothing to exploit because there's almost nothing there.&lt;/p&gt;
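The usual way to get there is a multi-stage build: compile in a full image, ship only the binary. A minimal sketch for a hypothetical Go service (paths and tags are illustrative):

```dockerfile
# Build stage: full toolchain, never shipped
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: the static binary plus CA certs and tzdata, nothing else
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```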

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji5icfff6dbp48rckch4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji5icfff6dbp48rckch4.png" alt="Distroless vs Alpine" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When to choose what
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Alpine&lt;/strong&gt; - you need a shell for debugging, your app depends on C libraries that assume glibc (watch for musl compatibility issues), or you're in early development and need to iterate fast. It's the pragmatic choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distroless&lt;/strong&gt; - production workloads where security matters. Your Go or Rust binary is statically compiled anyway. You don't need a shell in production - that's what &lt;code&gt;kubectl debug&lt;/code&gt; with ephemeral containers is for.&lt;/p&gt;
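For the rare case where you do need to look inside a distroless pod, an ephemeral debug container attaches a toolbox without touching the image. A sketch - pod and container names are placeholders:

```shell
# Attach a busybox ephemeral container that shares the target container's
# process namespace, so the app's filesystem is reachable via /proc
kubectl debug -it my-pod --image=busybox:1.36 --target=app -- sh
ls /proc/1/root/
```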

&lt;p&gt;Worth mentioning: &lt;a href="https://www.chainguard.dev/chainguard-images" rel="noopener noreferrer"&gt;Chainguard Images&lt;/a&gt; offer a middle ground. Distroless-style images with better CVE tracking and daily rebuilds. If you haven't checked them out, they're worth a look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/GoogleContainerTools/distroless" rel="noopener noreferrer"&gt;GoogleContainerTools/distroless&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.docker.com/_/alpine" rel="noopener noreferrer"&gt;Alpine Docker Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Policy: Verify Image Signatures (Kyverno + cosign)
&lt;/h2&gt;

&lt;p&gt;Unsigned image gets deployed. Maybe it's fine. Maybe someone swapped the layers in your registry. You'd never know.&lt;/p&gt;

&lt;p&gt;This Kyverno policy verifies cosign signatures before admitting any image. No valid signature, no admission.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify-image-signatures&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify Image Signatures&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Supply Chain Security&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;webhookTimeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify-cosign-signature&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
    &lt;span class="na"&gt;verifyImages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;imageReferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ghcr.io/your-org/*"&lt;/span&gt;
      &lt;span class="na"&gt;attestors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;keyless&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;issuer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://token.actions.githubusercontent.com"&lt;/span&gt;
            &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/your-org/*"&lt;/span&gt;
            &lt;span class="na"&gt;rekor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://rekor.sigstore.dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;verifyImages&lt;/code&gt; is a dedicated Kyverno rule type - not a generic &lt;code&gt;validate&lt;/code&gt; block. It understands OCI signatures natively.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;keyless&lt;/code&gt; configuration works with GitHub Actions' OIDC tokens. Your CI signs the image automatically, no private keys to manage.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rekor&lt;/code&gt; is Sigstore's transparency log. It provides an audit trail of every signature - who signed what and when.&lt;/li&gt;
&lt;li&gt;Start with &lt;code&gt;validationFailureAction: Audit&lt;/code&gt;. Roll out to &lt;code&gt;Enforce&lt;/code&gt; once your signing pipeline is solid.&lt;/li&gt;
&lt;/ul&gt;
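On the CI side, keyless signing needs only an OIDC token and the cosign installer. A sketch of the relevant GitHub Actions fragment - the `build` step id and image name are assumptions:

```yaml
permissions:
  id-token: write   # lets cosign request an OIDC identity token
  packages: write

steps:
  - uses: sigstore/cosign-installer@v3
  - name: Sign the pushed image by digest
    run: cosign sign --yes "ghcr.io/your-org/app@${{ steps.build.outputs.digest }}"
```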

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kyverno.io/docs/writing-policies/verify-images/" rel="noopener noreferrer"&gt;Kyverno: Verify Images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.sigstore.dev/cosign/overview/" rel="noopener noreferrer"&gt;cosign Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The One-Liner: Trivy Image Scan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trivy image &lt;span class="nt"&gt;--severity&lt;/span&gt; CRITICAL nginx:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scans &lt;code&gt;nginx:latest&lt;/code&gt; for critical CVEs. No daemon, no config - Trivy is a single binary that downloads the vulnerability database on first run.&lt;/p&gt;

&lt;p&gt;This is layer 1 of the trust pattern above. Put it in your CI pipeline: &lt;code&gt;trivy image --exit-code 1 --severity CRITICAL your-image:tag&lt;/code&gt;. Build fails if anything critical shows up. Five minutes to set up, catches problems before they leave your laptop.&lt;/p&gt;

&lt;p&gt;Bookmark it. You'll use it more than you think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;Trivy GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;How does your team handle image signing? Are you using cosign, Notary, or something else? I'd love to hear what's working - drop a comment below.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For weekly Cloud Native tools that actually work in production, subscribe to &lt;a href="https://podostack.com" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Golden Paths, Guardrails, and Why Every Platform Needs a Catalog</title>
      <dc:creator>Ilia Gusev</dc:creator>
      <pubDate>Wed, 11 Feb 2026 10:57:58 +0000</pubDate>
      <link>https://forem.com/persikbl/golden-paths-guardrails-and-why-every-platform-needs-a-catalog-48g7</link>
      <guid>https://forem.com/persikbl/golden-paths-guardrails-and-why-every-platform-needs-a-catalog-48g7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://podostack.com/p/guardrails-backstage-crossplane" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The last few issues of this newsletter covered individual tools -- image pulling, autoscaling, eBPF networking. All useful on their own. But tools don't help much if your engineers can't find them, use them safely, or provision infrastructure without filing a ticket and waiting three days.&lt;/p&gt;

&lt;p&gt;This week we zoom out to the platform layer. The boring stuff that makes everything else work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern: Platform Engineering Guardrails
&lt;/h2&gt;

&lt;p&gt;Here's something I see a lot. A team builds a shiny Internal Developer Platform. Self-service. Kubernetes. The works. Then they write a 50-page "Platform Usage Guide" and email it to all engineers.&lt;/p&gt;

&lt;p&gt;Nobody reads it. Someone deploys a public S3 bucket. Chaos.&lt;/p&gt;

&lt;p&gt;Documentation is not a guardrail. A guardrail is code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gates vs Guardrails
&lt;/h3&gt;

&lt;p&gt;Think of a highway. The old model is a tollbooth -- you stop, show your papers, wait for approval. That's a Change Advisory Board. It works, but it kills velocity.&lt;/p&gt;

&lt;p&gt;Guardrails are the barriers on the sides of the road. You drive at full speed. If you try to go off the edge, something stops you. No human in the loop.&lt;/p&gt;

&lt;p&gt;In practice, this means automated policies that either warn or block -- but never require manual approval when rules are followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Layers of Defense
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nsrmd7wm3t4kd8awsw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nsrmd7wm3t4kd8awsw7.png" alt="Three Layers of Defense" width="800" height="1392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Good guardrails exist at every stage of the delivery pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design time&lt;/strong&gt; -- your IDE flags that you're using a banned instance type. Fix it before it even hits Git.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy time&lt;/strong&gt; -- OPA or Conftest checks your manifests in CI. No memory limits? Pipeline fails with a clear message. You don't find out in production at 2 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime&lt;/strong&gt; -- Kyverno or Gatekeeper intercepts the API call. Pod running as root? Rejected. The cluster itself says no.&lt;/p&gt;

&lt;p&gt;Each layer catches what the previous one missed. Defense in depth, but for platform safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Soft
&lt;/h3&gt;

&lt;p&gt;One mistake I've made (and seen others repeat): going full enforcement on day one. Engineers feel like a robot is slapping their hands every time they push code. Morale drops. People start looking for workarounds.&lt;/p&gt;

&lt;p&gt;Better approach: start with 80% of guardrails in &lt;code&gt;Audit&lt;/code&gt; mode. Let people see the warnings, understand the rules, ask questions. Give them a couple of weeks. Then gradually flip to &lt;code&gt;Enforce&lt;/code&gt; -- starting with the policies that matter most (security, cost).&lt;/p&gt;
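An Audit-mode guardrail is just a regular policy with the blocking switched off. A sketch in Kyverno requiring memory limits (names and the exact pattern are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-memory-limits
spec:
  validationFailureAction: Audit   # report violations, admit anyway; flip to Enforce later
  rules:
  - name: check-memory-limits
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "All containers must set a memory limit."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
```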

&lt;p&gt;You'll get buy-in instead of resentment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/" rel="noopener noreferrer"&gt;CNCF Platform Engineering Maturity Model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Unsexy Tool: Backstage Software Catalog
&lt;/h2&gt;

&lt;p&gt;Nobody gets excited about a catalog. There's no demo that makes the crowd gasp. But here's what happens without one: engineers Slack each other "who owns the payment service?" and nobody knows where the API docs live. Someone built a wiki page six months ago. It's already outdated.&lt;/p&gt;

&lt;p&gt;Backstage is a CNCF Incubating project, originally built at Spotify. It's been around since 2020. Not new, not flashy. But it solves the "where is everything?" problem better than anything else I've seen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs005gtyep28ebnv7k8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs005gtyep28ebnv7k8n.png" alt="Backstage Software Catalog" width="800" height="921"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The catalog-info.yaml Trick
&lt;/h3&gt;

&lt;p&gt;The key idea is &lt;code&gt;catalog-info.yaml&lt;/code&gt; -- a small file that lives next to your code. Developers own it. Backstage auto-discovers it from your Git repos. Here's what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-service&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github.com/project-slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acme/payment-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-alpha&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;providesApis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;payments-api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now Backstage knows this service exists, who owns it, what APIs it exposes, and what it depends on. No separate documentation to maintain. The catalog stays accurate because it lives with the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Entity Model
&lt;/h3&gt;

&lt;p&gt;Backstage organizes everything into entities: Components (services, libraries), APIs, Resources (databases, queues), Groups (teams), and Users. They connect to each other through ownership and dependency relationships.&lt;/p&gt;

&lt;p&gt;A team owns a component. That component provides an API. It depends on a database resource. You can trace the full graph in the UI. When something breaks at 3 AM, you know exactly who to page.&lt;/p&gt;
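The other side of those relationships is just more YAML in the same catalog. A sketch of the Group and API entities a Component like the one above would point at - field values are illustrative, the kinds follow the Backstage descriptor format:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
  name: team-alpha
spec:
  type: team
  children: []
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: payments-api
spec:
  type: openapi
  lifecycle: production
  owner: team-alpha
  definition:
    $text: ./openapi.yaml
```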

&lt;h3&gt;
  
  
  Golden Paths via the Scaffolder
&lt;/h3&gt;

&lt;p&gt;Here's where it gets really useful. Backstage's Scaffolder lets you define templates for new services. Need a new microservice? Click a button, fill out a form, and get a repo with CI/CD pipeline, Dockerfile, monitoring dashboards, and &lt;code&gt;catalog-info.yaml&lt;/code&gt; -- all pre-configured. Three minutes instead of three days.&lt;/p&gt;

&lt;p&gt;The platform team controls the templates, not individual developers. Want to enforce a new security standard? Update the template. Every new service created from that point forward gets it automatically.&lt;/p&gt;

&lt;p&gt;That's a golden path. You're not blocking engineers from doing things their own way. You're just making the right way the easiest way.&lt;/p&gt;
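A golden path is itself declared in YAML. A trimmed sketch of a Scaffolder template - the parameters, skeleton path, and repo target are assumptions; see the Backstage docs for the full schema:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: New Go Microservice
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
```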

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://backstage.io/" rel="noopener noreferrer"&gt;Backstage.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/projects/backstage/" rel="noopener noreferrer"&gt;CNCF Backstage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Showdown: Crossplane vs Terraform
&lt;/h2&gt;

&lt;p&gt;Both manage your cloud infrastructure. Completely different philosophies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma2frafip2v93e8d7mku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma2frafip2v93e8d7mku.png" alt="Crossplane vs Terraform" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform: The Standard
&lt;/h3&gt;

&lt;p&gt;You write HCL files. You run &lt;code&gt;terraform plan&lt;/code&gt;. You review the diff. You run &lt;code&gt;terraform apply&lt;/code&gt;. Done.&lt;/p&gt;

&lt;p&gt;It's simple, well-understood, and has providers for everything. But it's a one-shot operation. Between applies, nothing watches your infrastructure. Someone deletes a resource manually? Terraform doesn't know until your next &lt;code&gt;plan&lt;/code&gt;. That could be days. Or weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crossplane: The K8s-Native Approach
&lt;/h3&gt;

&lt;p&gt;Crossplane runs inside your cluster. You define a custom resource -- say, &lt;code&gt;PostgreSQLCluster&lt;/code&gt; -- and Crossplane's controllers continuously reconcile it against reality. Just like how Kubernetes reconciles Deployments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform.acme.com/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgreSQLCluster&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders-db&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15"&lt;/span&gt;
  &lt;span class="na"&gt;storageGB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer doesn't know (or care) whether this creates an RDS instance, a Cloud SQL database, or something else. The platform team defines that mapping in a Composition. Developers get a simple API. Platform engineers keep control.&lt;/p&gt;

&lt;p&gt;And if someone manually deletes the RDS instance? Crossplane notices and recreates it. Automatically.&lt;/p&gt;
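That mapping lives in a Composition owned by the platform team. A heavily trimmed sketch targeting AWS RDS - the composite kind, provider API group, and field names are assumptions drawn from the Upbound AWS provider, so check the provider docs before relying on them:

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgres-aws
spec:
  compositeTypeRef:
    apiVersion: platform.acme.com/v1alpha1
    kind: XPostgreSQLCluster   # the composite behind the PostgreSQLCluster claim
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            engine: postgres
            region: eu-central-1
      patches:
        - fromFieldPath: spec.storageGB
          toFieldPath: spec.forProvider.allocatedStorage
```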

&lt;h3&gt;
  
  
  When to Choose What
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Terraform&lt;/strong&gt; -- you have a small team, simple infrastructure, or you're early in your platform journey. It's proven and everyone knows it. Don't overcomplicate things if you don't need to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crossplane&lt;/strong&gt; -- you're building a self-service platform. You want developers to request infrastructure through Kubernetes APIs without filing tickets. You need continuous reconciliation, not just plan-apply.&lt;/p&gt;

&lt;p&gt;They're not competitors at the same maturity level. They're tools for different stages of the platform engineering journey. Plenty of teams use both -- Terraform for the foundational stuff, Crossplane for the self-service layer on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.crossplane.io/" rel="noopener noreferrer"&gt;Crossplane Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/" rel="noopener noreferrer"&gt;Terraform Registry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Policy: Require PodDisruptionBudget
&lt;/h2&gt;

&lt;p&gt;Node drain. Three replicas. No PDB. All pods evicted at once. Service down.&lt;/p&gt;

&lt;p&gt;I've seen this happen in production more times than I'd like to admit. It's one of those things that doesn't matter until it really, really matters.&lt;/p&gt;

&lt;p&gt;This Kyverno policy prevents exactly that. If your Deployment has more than one replica, it must have a matching PodDisruptionBudget. Otherwise, the API server rejects it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-for-pdb&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
    &lt;span class="na"&gt;preconditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;all&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request.object.spec.replicas&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
        &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThan&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;matchingPDBs&lt;/span&gt;
      &lt;span class="na"&gt;apiCall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;urlPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/apis/policy/v1/namespaces/{{request.object.metadata.namespace}}/poddisruptionbudgets"&lt;/span&gt;
        &lt;span class="na"&gt;jmesPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items[].spec.selector.matchLabels"&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
        &lt;span class="s"&gt;Deployment with {{ request.object.spec.replicas }} replicas&lt;/span&gt;
        &lt;span class="s"&gt;requires a matching PodDisruptionBudget.&lt;/span&gt;
      &lt;span class="na"&gt;deny&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;all&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request.object.spec.template.metadata.labels&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
            &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;matchingPDBs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;preconditions&lt;/code&gt; block skips single-replica deployments. You don't need a PDB for a singleton -- there's nothing to disrupt gracefully.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;apiCall&lt;/code&gt; context actually queries the cluster for existing PDBs in the namespace, then checks whether any of them match the deployment's labels.&lt;/li&gt;
&lt;li&gt;This is a runtime guardrail -- exactly what the first section of this article describes. No documentation needed. The cluster enforces it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're starting with Kyverno, set &lt;code&gt;validationFailureAction: Audit&lt;/code&gt; first. Let it report violations for a week. Then flip to &lt;code&gt;Enforce&lt;/code&gt; once you've helped teams add their PDBs.&lt;/p&gt;
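&lt;p&gt;For reference, a minimal PDB that would satisfy this policy. The &lt;code&gt;app: payment-api&lt;/code&gt; label is a placeholder -- the selector must match your Deployment's pod template labels:&lt;/p&gt;

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb        # placeholder name
spec:
  minAvailable: 1              # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: payment-api         # must match the Deployment's template labels
```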

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kyverno.io/docs/" rel="noopener noreferrer"&gt;Kyverno Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/" rel="noopener noreferrer"&gt;K8s PDB Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The One-Liner: Check Kubernetes EOL
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://endoflife.date/api/kubernetes.json | jq &lt;span class="s1"&gt;'.[0]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Platform teams have to track version support. &lt;a href="https://endoflife.date" rel="noopener noreferrer"&gt;endoflife.date&lt;/a&gt; aggregates EOL data for hundreds of products -- it's a free API, no authentication needed. This command shows the latest Kubernetes release with its support dates, end-of-life timeline, and whether it's still getting patches.&lt;/p&gt;

&lt;p&gt;Useful for audits, upgrade planning, or just settling the "should we upgrade yet?" debate in Slack.&lt;/p&gt;

&lt;p&gt;Bookmark the API. It covers &lt;a href="https://endoflife.date/" rel="noopener noreferrer"&gt;everything&lt;/a&gt; -- Node.js, PostgreSQL, Ubuntu, Go, Python, you name it.&lt;/p&gt;
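&lt;p&gt;The per-cycle data also lets you check the exact minor version your clusters run, not just the newest release. A sketch (assumes &lt;code&gt;jq&lt;/code&gt; is installed; the &lt;code&gt;1.31&lt;/code&gt; cycle is an example -- substitute your own):&lt;/p&gt;

```shell
# Pull the EOL date for one specific Kubernetes minor version.
curl -s https://endoflife.date/api/kubernetes.json \
  | jq -r '.[] | select(.cycle == "1.31") | "\(.cycle) reaches EOL on \(.eol)"'
```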




&lt;p&gt;&lt;strong&gt;What does your platform layer look like? Are you using Backstage, Crossplane, or something else entirely? I'd love to hear what's working (and what isn't) -- drop a comment below.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;If you found this useful, consider subscribing to &lt;a href="https://podostack.com" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt; - weekly curation of Cloud Native tools ripe for production.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>platformengineering</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Lazy Pull, Smart Scale, eBPF Network</title>
      <dc:creator>Ilia Gusev</dc:creator>
      <pubDate>Thu, 05 Feb 2026 11:38:14 +0000</pubDate>
      <link>https://forem.com/persikbl/lazy-pull-smart-scale-ebpf-network-33kl</link>
      <guid>https://forem.com/persikbl/lazy-pull-smart-scale-ebpf-network-33kl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://podostack.substack.com" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Welcome back to Podo Stack. This week: how infrastructure deals with scale. Three layers of optimization — images, nodes, network. Each one solves a problem you've probably hit.&lt;/p&gt;

&lt;p&gt;Here's what's good this week.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ The Pattern: Lazy Image Pulling with Stargz
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;You're scaling up. 50 new pods need to start. Every node pulls the same 2GB image. At the same time. Your registry groans. Your NAT gateway bill spikes. Containers sit there waiting instead of running.&lt;/p&gt;

&lt;p&gt;Here's the kicker: research (the Slacker paper from FAST '16) found that a container reads only about 6% of the image data at startup. The other 94%? Downloaded "just in case."&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frj8b0h0sw5suwlf4phja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frj8b0h0sw5suwlf4phja.png" alt="Stargz" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stargz Snapshotter flips the model. Instead of "download everything, then run" — it's "run now, download what you need."&lt;/p&gt;

&lt;p&gt;The trick is a format called eStargz (extended seekable tar.gz). Normal tar.gz archives are sequential — to read a file at the end, you unpack the whole thing. eStargz adds a TOC (Table of Contents) at the start. Now you can jump directly to any file.&lt;/p&gt;

&lt;p&gt;When your container starts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Snapshotter fetches the TOC (kilobytes, not gigabytes)&lt;/li&gt;
&lt;li&gt;Mounts the image via FUSE&lt;/li&gt;
&lt;li&gt;Container starts immediately&lt;/li&gt;
&lt;li&gt;Files get fetched on-demand via HTTP Range requests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The container is running while the image is still "downloading." Wild, right?&lt;/p&gt;
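&lt;p&gt;Trying it means publishing images in the eStargz format first. A sketch using nerdctl (assumes nerdctl with the stargz snapshotter installed; the image references are placeholders):&lt;/p&gt;

```shell
# Convert an existing image to eStargz and push it.
# eStargz stays backward-compatible: runtimes without the snapshotter
# can still pull and run it as a plain tar.gz image.
nerdctl image convert --estargz --oci \
  registry.example.com/my-app:v1 registry.example.com/my-app:v1-esgz
nerdctl push registry.example.com/my-app:v1-esgz
```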

&lt;h3&gt;
  
  
  How this connects to Spegel
&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://podostack.substack.com/p/spegel-pixie-and-why-latest-is-evil" rel="noopener noreferrer"&gt;Issue #1&lt;/a&gt;, we covered Spegel — P2P caching that shares images across nodes. Stargz takes a different approach: instead of optimizing &lt;em&gt;distribution&lt;/em&gt;, it optimizes &lt;em&gt;what gets downloaded&lt;/em&gt; in the first place.&lt;/p&gt;

&lt;p&gt;They're complementary. Spegel says "pull once, share everywhere." Stargz says "only pull what you need." Use both and your image pull times will thank you.&lt;/p&gt;

&lt;h3&gt;
  
  
  The catch
&lt;/h3&gt;

&lt;p&gt;FUSE runs in userspace, so there's some overhead for I/O-heavy workloads. Databases probably shouldn't use this. But for your typical microservice that loads a few MB at startup? Perfect fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/containerd/stargz-snapshotter" rel="noopener noreferrer"&gt;GitHub: containerd/stargz-snapshotter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md" rel="noopener noreferrer"&gt;eStargz Design Doc&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚔️ The Showdown: Karpenter vs Cluster Autoscaler
&lt;/h2&gt;

&lt;p&gt;Two autoscalers. Same job. Very different approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Autoscaler (the veteran)
&lt;/h3&gt;

&lt;p&gt;CA has been around forever. It works through Node Groups (ASGs in AWS, MIGs in GCP). When pods are pending:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CA checks which Node Group could fit them&lt;/li&gt;
&lt;li&gt;Bumps the desired count on that ASG&lt;/li&gt;
&lt;li&gt;Cloud provider spins up a new node from the template&lt;/li&gt;
&lt;li&gt;Node joins cluster, scheduler places pods&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time from pending to running: &lt;strong&gt;minutes&lt;/strong&gt;. And you're stuck with whatever instance types you pre-defined in your node groups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Karpenter (the new approach)
&lt;/h3&gt;

&lt;p&gt;Karpenter skips node groups entirely. When pods are pending:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Karpenter reads their requirements — CPU, memory, affinity, tolerations&lt;/li&gt;
&lt;li&gt;Calls the cloud API directly (EC2 Fleet in AWS)&lt;/li&gt;
&lt;li&gt;Provisions a node that &lt;em&gt;exactly&lt;/em&gt; fits what's waiting&lt;/li&gt;
&lt;li&gt;Node joins, pods run&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time from pending to running: &lt;strong&gt;seconds&lt;/strong&gt;. And it picks the cheapest instance type that works.&lt;/p&gt;

&lt;h3&gt;
  
  
  The comparison
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cluster Autoscaler:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model: Node Groups (ASG)&lt;/li&gt;
&lt;li&gt;Speed: Minutes&lt;/li&gt;
&lt;li&gt;Sizing: Fixed templates&lt;/li&gt;
&lt;li&gt;Cost: Often over-provisioned&lt;/li&gt;
&lt;li&gt;Consolidation: Basic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Karpenter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model: Group-less&lt;/li&gt;
&lt;li&gt;Speed: Seconds&lt;/li&gt;
&lt;li&gt;Sizing: Right-sized&lt;/li&gt;
&lt;li&gt;Cost: Optimized&lt;/li&gt;
&lt;li&gt;Consolidation: Active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpenter also does active consolidation. It constantly checks: "Can I replace these three half-empty nodes with one smaller node?" If yes, it does.&lt;/p&gt;

&lt;h3&gt;
  
  
  The verdict
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;New cluster?&lt;/strong&gt; Go Karpenter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Already running CA successfully?&lt;/strong&gt; Maybe keep it. Migration has costs. If it's not broken, weigh carefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;Karpenter Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md" rel="noopener noreferrer"&gt;Cluster Autoscaler FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔬 The eBPF Trace: Cilium Replaces kube-proxy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;kube-proxy uses iptables. Every service creates rules. Every rule gets checked sequentially.&lt;/p&gt;

&lt;p&gt;1000 services = thousands of iptables rules. Every packet walks the chain. O(n) lookup. In 2025. In your kernel.&lt;/p&gt;

&lt;p&gt;At scale, this hurts. CPU spikes during rule updates. Latency creeps up. Source IPs get lost in the NAT maze.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Cilium replaces all of this with eBPF. Instead of iptables chains, it uses hash map lookups — O(1), regardless of how many services you have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;cilium cilium/cilium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;kubeProxyReplacement&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One flag. That's it.&lt;/p&gt;

&lt;p&gt;eBPF programs intercept packets before they hit the iptables stack, do a single hash lookup, and route directly. Faster, simpler, and source IPs stay intact.&lt;/p&gt;
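&lt;p&gt;You can confirm the takeover from a Cilium agent (a sketch; assumes the standard &lt;code&gt;cilium&lt;/code&gt; DaemonSet in &lt;code&gt;kube-system&lt;/code&gt;; newer releases name the in-pod binary &lt;code&gt;cilium-dbg&lt;/code&gt;):&lt;/p&gt;

```shell
# Ask a Cilium agent whether it has replaced kube-proxy.
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
```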

&lt;h3&gt;
  
  
  Real numbers
&lt;/h3&gt;

&lt;p&gt;I've seen clusters go from 2-3ms service latency to sub-millisecond after switching. CPU usage during endpoint updates dropped significantly. The larger your cluster, the bigger the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/" rel="noopener noreferrer"&gt;Cilium Docs: kube-proxy Replacement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF.io&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔥 The Hot Take: eBPF is Eating Kubernetes
&lt;/h2&gt;

&lt;p&gt;Look around. The data plane is being rewritten:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kube-proxy&lt;/strong&gt; → Cilium eBPF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh sidecars&lt;/strong&gt; → Cilium, Istio Ambient (ztunnel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; → Pixie, Tetragon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; → Falco, Tracee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is clear. Userspace proxies are getting replaced by kernel-level programs.&lt;/p&gt;

&lt;p&gt;My hot take: In 3 years, half of the Kubernetes data plane will run on eBPF. The kernel is the new platform.&lt;/p&gt;

&lt;p&gt;Agree? Disagree? Reply and tell me I'm wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ The One-Liner: Karpenter Drift Detection
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodeclaims &lt;span class="nt"&gt;-o&lt;/span&gt; custom-columns&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'NAME:.metadata.name,DRIFT:.status.conditions[?(@.type=="Drifted")].status'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter tracks "drift" — when a node no longer matches its NodePool spec. Maybe the AMI updated. Maybe requirements changed.&lt;/p&gt;

&lt;p&gt;This command shows which nodes are marked as drifted and will be replaced during the next consolidation cycle.&lt;/p&gt;

&lt;p&gt;Pairs nicely with the Showdown section above. Once you're on Karpenter, this becomes part of your daily toolkit.&lt;/p&gt;

&lt;h3&gt;
  
  
  If you're not on Karpenter yet
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl top nodes &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cpu | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows your most loaded nodes. Useful for spotting where scaling is needed.&lt;/p&gt;




&lt;p&gt;Questions? Feedback? Reply to this email. I read every one.&lt;/p&gt;




&lt;p&gt;🍇 &lt;strong&gt;Podo Stack&lt;/strong&gt; — Ripe for Prod.&lt;/p&gt;

</description>
      <category>karpenter</category>
      <category>kubernetes</category>
      <category>cilium</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Sidecar-Free Mesh, SLO from YAML, and Labels as Contracts</title>
      <dc:creator>Ilia Gusev</dc:creator>
      <pubDate>Wed, 28 Jan 2026 09:20:00 +0000</pubDate>
      <link>https://forem.com/persikbl/sidecar-free-mesh-slo-from-yaml-and-labels-as-contracts-4lck</link>
      <guid>https://forem.com/persikbl/sidecar-free-mesh-slo-from-yaml-and-labels-as-contracts-4lck</guid>
      <description>&lt;p&gt;Welcome back to Podo Stack. This week we're looking at how Istio finally killed the sidecar tax, a tool that turns SLO monitoring into a one-liner, and a policy that'll save your platform team from label chaos.&lt;/p&gt;

&lt;p&gt;Here's what's good this week.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://podostack.substack.com" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🚀 Sandbox Watch: Istio Ambient Mesh
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4xtgkoa8gshus93ntt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4xtgkoa8gshus93ntt2.png" alt="Istio Ambient Mesh flow" width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;Service mesh without sidecars. Istio Ambient hit GA in version 1.24, and it's not just a minor tweak — it's a completely different architecture.&lt;/p&gt;

&lt;p&gt;Here's the problem with sidecars. Every pod gets an Envoy proxy injected. Run 100 pods, you're running 100 Envoys. Each one eats 50-100MB of RAM. Each one adds startup latency — your app waits for the sidecar to be ready before it can receive traffic. Scale to thousands of pods and you're burning serious resources on proxy overhead.&lt;/p&gt;

&lt;p&gt;Ambient flips this model.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Two layers instead of one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ztunnel&lt;/strong&gt; — A lightweight L4 proxy that runs as a DaemonSet, one per node. It handles mTLS, basic routing, and telemetry. Most traffic never needs more than this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waypoint proxy&lt;/strong&gt; — An L7 proxy that only spins up when you need HTTP-level features like header routing, retries, or traffic mirroring. It's on-demand. Don't need L7? Don't pay for it.&lt;/p&gt;

&lt;p&gt;Think of it as "service mesh à la carte." You get the security baseline everywhere (ztunnel), and you add the fancy features only where they matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I like it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Memory overhead drops from ~100MB per pod to ~20MB per node&lt;/li&gt;
&lt;li&gt;No more sidecar injection drama — pods start faster&lt;/li&gt;
&lt;li&gt;Incremental migration: some namespaces on sidecar, some on ambient, same control plane&lt;/li&gt;
&lt;li&gt;mTLS everywhere by default, no config needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to use it
&lt;/h3&gt;

&lt;p&gt;You're running a large cluster. You're tired of the sidecar tax. You want mesh security without mesh complexity. You're okay with a newer (but now GA) approach.&lt;/p&gt;
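&lt;p&gt;Getting started takes two commands (a sketch; the &lt;code&gt;default&lt;/code&gt; namespace is a placeholder for whatever namespace you want in the mesh):&lt;/p&gt;

```shell
# Install Istio with the ambient profile, then opt a namespace into the mesh.
istioctl install --set profile=ambient
kubectl label namespace default istio.io/dataplane-mode=ambient
```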

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/ambient/" rel="noopener noreferrer"&gt;Ambient Mode Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/news/releases/1.24.x/" rel="noopener noreferrer"&gt;Istio 1.24 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚔️ The Showdown: Ambient vs Sidecar
&lt;/h2&gt;

&lt;p&gt;When should you stick with sidecars? When should you go ambient?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory: ~50-100MB per pod&lt;/li&gt;
&lt;li&gt;L7 features always available&lt;/li&gt;
&lt;li&gt;Sidecar must init before your app starts&lt;/li&gt;
&lt;li&gt;All-or-nothing migration per namespace&lt;/li&gt;
&lt;li&gt;5+ years in production, familiar debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ambient mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory: ~20MB per node (not per pod!)&lt;/li&gt;
&lt;li&gt;L7 features on-demand via waypoint&lt;/li&gt;
&lt;li&gt;No injection delay — pods start faster&lt;/li&gt;
&lt;li&gt;Gradual migration, per-workload&lt;/li&gt;
&lt;li&gt;GA since late 2024, newer tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose sidecar&lt;/strong&gt; when you need fine-grained L7 control on every pod, you're already running it successfully, or your team knows the debugging patterns cold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose ambient&lt;/strong&gt; when memory is tight, you want mesh security without the overhead, you're starting fresh, or you want to migrate gradually without downtime.&lt;/p&gt;

&lt;p&gt;Honestly? For new deployments in 2025+, ambient is the default choice. The sidecar tax was always the biggest complaint about service mesh — and now it's optional.&lt;/p&gt;




&lt;h2&gt;
  
  
  💎 The Hidden Gem: sloth
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;SLO monitoring without the PromQL PhD.&lt;/p&gt;

&lt;p&gt;You want error budgets. You want burn rate alerts. You want dashboards that show if you're meeting your 99.9% availability target. The standard approach: spend three days writing Prometheus recording rules, debug the math, hope you got the multi-window burn rate calculation right.&lt;/p&gt;

&lt;p&gt;sloth: write a YAML file, run one command, get everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
&lt;span class="na"&gt;slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;requests-availability&lt;/span&gt;
    &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;
    &lt;span class="na"&gt;sli&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;error_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))&lt;/span&gt;
        &lt;span class="na"&gt;total_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(http_requests_total[{{.window}}]))&lt;/span&gt;
    &lt;span class="na"&gt;alerting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;page_alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
      &lt;span class="na"&gt;ticket_alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;sloth generate&lt;/code&gt; and you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus recording rules&lt;/li&gt;
&lt;li&gt;Prometheus alert rules (multi-window burn rates)&lt;/li&gt;
&lt;li&gt;Grafana dashboard JSON&lt;/li&gt;
&lt;li&gt;Proper error budget calculation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The math is correct. The windows are correct. You focus on "what's my SLO" instead of "how do I calculate burn rates."&lt;/p&gt;
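&lt;p&gt;Generation itself is a single command (file names here are placeholders):&lt;/p&gt;

```shell
# Turn the SLO spec into Prometheus recording and alerting rules.
sloth generate -i slo-spec.yaml -o prometheus-rules.yaml
```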

&lt;h3&gt;
  
  
  Why I like it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One YAML → complete SLO monitoring stack&lt;/li&gt;
&lt;li&gt;Follows Google SRE book patterns exactly&lt;/li&gt;
&lt;li&gt;Works with any Prometheus setup&lt;/li&gt;
&lt;li&gt;The generated rules are readable — you can audit them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/slok/sloth" rel="noopener noreferrer"&gt;GitHub: slok/sloth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sloth.dev/" rel="noopener noreferrer"&gt;sloth Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  👮 The Policy: Require Labels
&lt;/h2&gt;

&lt;p&gt;Copy this, apply it, and watch your platform governance improve overnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;Labels aren't documentation — they're contracts. Without enforced labels, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost allocation that's impossible ("which team owns this $50K/month workload?")&lt;/li&gt;
&lt;li&gt;Access control that's broken (RBAC by label doesn't work if labels are missing)&lt;/li&gt;
&lt;li&gt;Incident response that's slow ("who do I page for this failing deployment?")&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-labels&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Require Labels&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Best Practices&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-team-label&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Labels&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'team',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'cost-center',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'environment'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
              &lt;span class="na"&gt;cost-center&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
              &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to roll it out
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;validationFailureAction: Audit&lt;/code&gt; — see what would be blocked&lt;/li&gt;
&lt;li&gt;Fix your existing deployments&lt;/li&gt;
&lt;li&gt;Switch to &lt;code&gt;Enforce&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Watch your platform team breathe easier&lt;/li&gt;
&lt;/ol&gt;
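&lt;p&gt;Step 1 can look like this; a minimal sketch, assuming the policy above is saved as &lt;code&gt;require-labels.yaml&lt;/code&gt; (the file name is illustrative). In Audit mode Kyverno records violations as PolicyReports instead of blocking:&lt;/p&gt;

```shell
# Apply the policy in Audit mode, then see what it would have blocked.
kubectl apply -f require-labels.yaml
kubectl get policyreports -A
```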

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kyverno.io/policies/" rel="noopener noreferrer"&gt;Kyverno Policy Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ The One-Liner: kubectl debug
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your pod runs a distroless image. No shell. No curl. No nothing. How do you debug it?&lt;/p&gt;

&lt;p&gt;This command injects an ephemeral container into the running pod. Same network namespace, and (with process namespace sharing) the target's filesystem is reachable under &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root&lt;/code&gt;. Full debugging power.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check DNS resolution from inside the pod&lt;/li&gt;
&lt;li&gt;Inspect files in a distroless image&lt;/li&gt;
&lt;li&gt;Run tcpdump without rebuilding&lt;/li&gt;
&lt;li&gt;Test connectivity to other services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works on Kubernetes 1.25+. No pod restart required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro tip
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;--share-processes&lt;/code&gt; pairs with &lt;code&gt;--copy-to&lt;/code&gt;: debug a copy of the pod with a shared process namespace so you can see the target container's process tree. Great for debugging stuck applications. (With &lt;code&gt;--target&lt;/code&gt;, as above, the debug container already joins the target container's process namespace when the runtime supports it.)&lt;/p&gt;
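&lt;p&gt;A sketch of that variant (pod and image names reuse the example above; &lt;code&gt;my-pod-debug&lt;/code&gt; is a hypothetical name for the copy):&lt;/p&gt;

```shell
# Debug a copy of the pod with a shared process namespace, so `ps` inside
# the busybox container also shows the application's processes.
kubectl debug my-pod -it --image=busybox --copy-to=my-pod-debug --share-processes
```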

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/" rel="noopener noreferrer"&gt;Ephemeral Containers Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;🍇 &lt;strong&gt;Podo Stack&lt;/strong&gt; — Ripe for Prod.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platformengineering</category>
      <category>kyverno</category>
    </item>
    <item>
      <title>Spegel, Pixie, and Why :latest Is Evil</title>
      <dc:creator>Ilia Gusev</dc:creator>
      <pubDate>Wed, 21 Jan 2026 09:48:37 +0000</pubDate>
      <link>https://forem.com/persikbl/spegel-pixie-and-why-latest-is-evil-1bk6</link>
      <guid>https://forem.com/persikbl/spegel-pixie-and-why-latest-is-evil-1bk6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://podostack.com" rel="noopener noreferrer"&gt;Podo Stack&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Welcome to the first issue of Podo Stack. No fluff, no hype — just tools that are actually ripe for prod.&lt;/p&gt;

&lt;p&gt;This week: a CNCF sandbox project that makes your cluster smarter about pulling images, an eBPF tool that sees your decrypted traffic (yes, really), a Kyverno policy you can deploy in 30 seconds, and a one-liner that gives you "terraform plan" for Kubernetes.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Sandbox Watch: Spegel
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;P2P container image caching for Kubernetes. Nodes share images directly with each other, no registry involved.&lt;/p&gt;

&lt;p&gt;Here's the problem. You're scaling up your deployment — maybe 50 new pods need to start. Every single node goes to your registry and pulls the same image. At the same time. Your NAT gateway chokes. Docker Hub rate-limits you. Your cloud bill spikes from egress traffic.&lt;/p&gt;

&lt;p&gt;Spegel fixes this with a dead-simple idea: what if nodes just shared images they already have?&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Spegel runs as a DaemonSet. When a node pulls an image, Spegel indexes its layers and announces to the cluster: "Hey, I've got this one." When another node needs that image, it asks Spegel first. If someone in the cluster has it — boom, a local transfer at node-to-node network speed instead of crawling through the internet.&lt;/p&gt;

&lt;p&gt;If nobody has it? Falls back to the external registry like normal.&lt;/p&gt;

&lt;p&gt;The best part? It's stateless. No database, no PVC, no separate storage to manage. It piggybacks on containerd's existing cache. Install and forget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I like it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One Helm chart. That's the setup.&lt;/li&gt;
&lt;li&gt;No changes to your pod specs. Spegel works at the containerd level — it's transparent to your workloads.&lt;/li&gt;
&lt;li&gt;Handles Docker Hub rate limits gracefully (the image only gets pulled once per cluster).&lt;/li&gt;
&lt;li&gt;More nodes = more cache = faster pulls. It gets better as you scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to use it
&lt;/h3&gt;

&lt;p&gt;You're on containerd (most clusters are). You scale workloads frequently. You're tired of paying egress fees or hitting registry limits. You don't want to operate Harbor or Dragonfly.&lt;/p&gt;
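&lt;p&gt;Getting started really is one chart; a sketch per the project README (double-check the OCI chart location and pin a version there, as it may change between releases):&lt;/p&gt;

```shell
# Install Spegel as a DaemonSet from its OCI Helm chart.
helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --namespace spegel --create-namespace
```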

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/spegel-org/spegel" rel="noopener noreferrer"&gt;GitHub: spegel-org/spegel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/projects/spegel/" rel="noopener noreferrer"&gt;CNCF Sandbox&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💎 The Hidden Gem: Pixie
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;eBPF-powered observability that requires zero code changes. Install it, and two minutes later you have a service map of your entire cluster.&lt;/p&gt;

&lt;p&gt;Most observability follows the same painful pattern: add a library, redeploy, wait for data. Pixie skips all of that.&lt;/p&gt;

&lt;p&gt;It uses eBPF to intercept data at the kernel level. HTTP requests, SQL queries, DNS lookups — Pixie sees them all. Automatically. Without touching your code.&lt;/p&gt;

&lt;p&gt;But here's the killer feature: &lt;strong&gt;it sees decrypted TLS traffic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you've ever tried to debug mTLS traffic in a service mesh, you know the pain. Wireshark shows garbage. Logs are empty because the developer forgot error handling. You're blind.&lt;/p&gt;

&lt;p&gt;Pixie intercepts data &lt;em&gt;before&lt;/em&gt; it hits the SSL library. You get the actual request body, in plain text, even when it's encrypted on the wire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-world use case
&lt;/h3&gt;

&lt;p&gt;Your payment service throws 500 errors. Logs say nothing. Metrics show increased latency but no obvious cause.&lt;/p&gt;

&lt;p&gt;Old way: Add logging, rebuild, redeploy, wait for the error to happen again. Hope your logging catches it.&lt;/p&gt;

&lt;p&gt;Pixie way: Open the console, run &lt;code&gt;px/http_data&lt;/code&gt;, filter by status code 500. You see the exact request body and the SQL query that was running when it failed. Time to resolution: 2 minutes.&lt;/p&gt;
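&lt;p&gt;From a terminal, the same data is one &lt;code&gt;px&lt;/code&gt; CLI call away (a sketch; filtering by status code is then done in the Live UI or with standard shell tools):&lt;/p&gt;

```shell
# Dump recent HTTP requests/responses captured by Pixie across the cluster.
px run px/http_data
```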

&lt;h3&gt;
  
  
  Other tricks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Continuous profiling without recompiling. Your Go/Rust/Java service is burning CPU? Pixie builds a flamegraph in real time. You'll see the exact function causing trouble.&lt;/li&gt;
&lt;li&gt;PxL scripting. It's like Python with Pandas, but for your cluster telemetry. Query anything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;px deploy                                                
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Seriously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://px.dev/" rel="noopener noreferrer"&gt;px.dev&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pixie-io/pixie" rel="noopener noreferrer"&gt;GitHub: pixie-io/pixie&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  👮 The Policy: Disallow :latest Tags
&lt;/h2&gt;

&lt;p&gt;Copy this, apply it, and save yourself from future headaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;:latest&lt;/code&gt; is a lie. It doesn't mean "latest" — it means "whatever happened to be built last time someone didn't specify a tag." The image changes without the tag changing. Your Tuesday deployment works fine. Your Wednesday deployment breaks. Same manifest, different image.&lt;/p&gt;

&lt;p&gt;Three nodes in your cluster might have three different versions of &lt;code&gt;nginx:latest&lt;/code&gt; cached. You'll spend hours debugging why "the same pod" behaves differently on different nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;disallow-latest-tag&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Disallow Latest Tag&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Best Practices&lt;/span&gt;
    &lt;span class="na"&gt;policies.kyverno.io/severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-specific-image-tag&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;':latest'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;allowed.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;specific&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tag."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;=(initContainers)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!*:latest"&lt;/span&gt;
            &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!*:latest"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;validationFailureAction: Enforce&lt;/code&gt; — blocks pods that violate the rule. Use &lt;code&gt;Audit&lt;/code&gt; if you just want warnings.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;image: "!*:latest"&lt;/code&gt; — the &lt;code&gt;!&lt;/code&gt; means "NOT". Any image except &lt;code&gt;:latest&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Kyverno auto-applies this to Deployments, StatefulSets, Jobs — anything that creates pods.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test before enforcing
&lt;/h3&gt;

&lt;p&gt;Start with &lt;code&gt;Audit&lt;/code&gt; mode. Check what would be blocked. Then switch to &lt;code&gt;Enforce&lt;/code&gt; when you're confident.&lt;/p&gt;
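&lt;p&gt;You can also dry-run the policy with the Kyverno CLI before it ever touches the cluster; a sketch, assuming the policy and a sample workload are saved locally under these illustrative file names:&lt;/p&gt;

```shell
# Evaluate the policy against a local manifest -- no cluster access needed.
kyverno apply disallow-latest-tag.yaml --resource deploy.yaml
```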

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kyverno.io/docs/" rel="noopener noreferrer"&gt;Kyverno Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kyverno.io/policies/" rel="noopener noreferrer"&gt;Policy Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ The One-Liner: flux diff
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux diff kustomization my-app &lt;span class="nt"&gt;--path&lt;/span&gt; ./clusters/prod/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is "terraform plan" for Kubernetes.&lt;/p&gt;

&lt;p&gt;You're about to merge a PR that changes your deployment. What actually changes in the cluster? With this command, you know before it happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it's better than kubectl diff
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Understands Flux Kustomization CRDs&lt;/li&gt;
&lt;li&gt;Handles SOPS-encrypted secrets (masks values in output)&lt;/li&gt;
&lt;li&gt;Filters out noisy fields like &lt;code&gt;status&lt;/code&gt; that change constantly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bonus — diff everything
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux diff kustomization &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows what would change across all kustomizations in your cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro tip
&lt;/h3&gt;

&lt;p&gt;Add this to your CI pipeline. Post the diff as a PR comment. Reviewers see the actual Kubernetes changes, not just YAML line diffs.&lt;/p&gt;
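&lt;p&gt;A minimal sketch of that CI step, assuming GitHub and the &lt;code&gt;gh&lt;/code&gt; CLI (&lt;code&gt;PR_NUMBER&lt;/code&gt; is a hypothetical pipeline variable; note that &lt;code&gt;flux diff&lt;/code&gt; exits non-zero when it finds drift):&lt;/p&gt;

```shell
# Render the diff; exit code 1 just means "changes found", so don't fail on it.
flux diff kustomization my-app --path ./clusters/prod/ > diff.txt || true

# Post the diff as a PR comment when it is non-empty.
if [ -s diff.txt ]; then
  gh pr comment "$PR_NUMBER" --body-file diff.txt
fi
```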

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fluxcd.io/flux/cmd/flux_diff_kustomization/" rel="noopener noreferrer"&gt;Flux CLI Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
      <category>kyverno</category>
    </item>
  </channel>
</rss>
