Forem: Pavan Madduri

Docker Model Runner Replaced My Entire Local AI Setup

Pavan Madduri — Wed, 13 May 2026 19:29:44 +0000

I used to have a ridiculous local AI setup. Ollama running as a service. A separate Python venv for LangChain experiments. Another terminal with llama.cpp because I wanted to test quantized models. Three different API formats, three different port numbers, three things that broke independently every time I updated macOS.

Then Docker shipped Model Runner and I deleted all of it.

What Model Runner Actually Is

It's built into Docker Desktop. No separate install. You pull models the same way you pull images:

docker model pull ai/llama3.1
docker model pull ai/phi3-mini
docker model pull ai/mistral

Run inference:

docker model run ai/llama3.1 "Explain NUMA topology in two sentences"

Or hit the API endpoint, which is OpenAI-compatible:

curl http://localhost:12434/engines/llama3.1/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is OKE?"}],
    "max_tokens": 100
  }'

That's it. No Python, no venv, no pip install, no CUDA drivers (it uses Metal on Mac, CPU elsewhere). It just runs.

Why I Switched From Ollama

Ollama is fine. I used it for months. But a few things bugged me:

Port conflicts. Ollama defaults to 11434. I kept forgetting it was running and then wondering why port 11434 was taken. Docker Model Runner runs inside the Docker VM so it doesn't occupy a host port in the same way — it's accessible at a consistent endpoint.

Update management. Ollama is a separate binary I have to update separately. Model Runner updates come with Docker Desktop. One less thing to think about.

API compatibility. I'm deploying vLLM in production on OKE. vLLM exposes an OpenAI-compatible API. Model Runner also exposes an OpenAI-compatible API. My client code works unchanged between local and production. With Ollama I was constantly converting between Ollama's native format and OpenAI's format.

Docker context. Model Runner models can be referenced from Docker Compose files and Dockerfiles. That means my local dev stack can include an LLM as a service alongside my API server, database, and cache — all in one docker compose up.

Using Model Runner in Docker Compose

This is the part that actually changed my workflow:

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - LLM_ENDPOINT=http://host.docker.internal:12434/engines/llama3.1/v1
    depends_on:
      - db

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: dev

My API server talks to Model Runner at host.docker.internal:12434. In production on OKE, that env var points to my vLLM service instead. Same client code, same prompt format, different backend.

// Same code works with Model Runner locally and vLLM on OKE
func callLLM(prompt string) (string, error) {
    endpoint := os.Getenv("LLM_ENDPOINT") + "/chat/completions"

    body := map[string]interface{}{
        "messages": []map[string]string{
            {"role": "user", "content": prompt},
        },
        "max_tokens": 200,
    }

    jsonBody, _ := json.Marshal(body)
    resp, err := http.Post(endpoint, "application/json", bytes.NewBuffer(jsonBody))
    // ...
}

The Dev Loop That Actually Works

Before Model Runner, testing a prompt change meant:

Edit prompt template
Rebuild Docker image
Push to OCIR
Wait for OKE to pull the 8GB GPU image
Test
Realize the prompt is wrong
Repeat from step 1

That's 15-20 minutes per iteration. With Model Runner:

Edit prompt template
docker compose up --build (rebuilds in seconds, model already cached)
Test
Fix prompt
Repeat from step 1

Two minutes per iteration, maybe less. The model is already running locally. The API server rebuild is a Go binary, takes 3 seconds. I can iterate on prompts 10x faster.

Models I Actually Use

# Code assistance (great for generating test data)
docker model pull ai/codellama

# General purpose (good balance of speed and quality)
docker model pull ai/llama3.1

# Small and fast (for quick experiments)
docker model pull ai/phi3-mini

# List what's cached
docker model list

On my M3 MacBook Pro, phi3-mini generates ~30 tokens/sec. Llama 3.1 8B does about 15 tokens/sec. Not blazing fast, but fast enough for development. I'm not benchmarking model performance locally — I'm testing that my application handles LLM responses correctly.

Limitations I've Hit

No fine-tuned models. You can't load your own fine-tuned LoRA adapters into Model Runner. For that I still need vLLM or llama.cpp. This is a local dev tool, not a training platform.

Model selection. The catalog is growing but it's not as large as Ollama's. The main open models are there (Llama, Mistral, Phi, CodeLlama) but if you need something obscure, check first.

No GPU on Linux Docker Desktop. On Mac it uses Metal. On Linux it's CPU-only through Docker Desktop (you'd need to run vLLM directly for GPU inference). Fine for dev, not for benchmarking.

My Current Setup

# Start everything
docker compose up -d

# API server at localhost:8080
# Model Runner at localhost:12434
# Postgres at localhost:5432

# Test the AI endpoint
curl localhost:8080/api/summarize \
  -d '{"text": "Long document here..."}'

# Check which models are loaded
docker model list

One command, full stack, LLM included. When it's time to deploy to OKE, I change the LLM_ENDPOINT env var to point at my vLLM service and everything else stays the same.

I deleted Ollama, uninstalled llama.cpp, removed the Python venv. My local AI setup is now just Docker.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

Every GPU Container Bug I've Hit on OKE (and How I Fixed Them)

Pavan Madduri — Tue, 12 May 2026 14:49:48 +0000

Running GPU containers on Kubernetes is one of those things that works perfectly in tutorials and then breaks in confusing ways on real clusters. I've been deploying GPU workloads on OKE for a few months now, and I've built up a decent collection of debugging war stories.

This isn't a getting-started guide. This is the post I wish existed the first time I saw CrashLoopBackOff on a GPU pod with zero useful logs.

Bug 1: Pod Stuck in Pending — "0/3 nodes are available"

This was my first GPU deployment on OKE. Created a pod requesting nvidia.com/gpu: 1, and it just sat in Pending forever.

$ kubectl describe pod vllm-inference-0
Events:
  Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient nvidia.com/gpu

Three nodes, but none had GPUs. Turns out I created the GPU node pool but it hadn't finished scaling up yet. OKE provisions GPU nodes on-demand when you create the node pool, and it takes 3-5 minutes for the instances to come up.

Fix: Just wait. But also — check that your node pool is actually using a GPU shape:

# Verify GPU nodes exist and are ready
kubectl get nodes -l nvidia.com/gpu=present
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Should show:
#   nvidia.com/gpu: 1
# under Allocatable

If nvidia.com/gpu doesn't appear in Allocatable, the NVIDIA device plugin isn't running on that node. On OKE it should be automatic, but I've seen it lag behind node creation by a minute or two.

Bug 2: CUDA Version Mismatch

This one was nasty. The container started, then immediately crashed:

CUDA error: no kernel image is available for execution on the device

My Dockerfile used nvidia/cuda:12.4-runtime-ubuntu22.04, but the GPU node's driver only supported CUDA 12.2. The container runtime is newer than the host driver can handle.

Fix: Check the driver version on the node and match your CUDA image:

# SSH to GPU node or run a debug pod
kubectl run gpu-debug --image=nvidia/cuda:12.2.0-base-ubuntu22.04 \
  --restart=Never --rm -it \
  --overrides='{"spec":{"containers":[{"name":"gpu-debug","image":"nvidia/cuda:12.2.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
  -- nvidia-smi

The output shows the driver version and maximum CUDA version it supports. Use a CUDA runtime image that doesn't exceed that version.

On OKE, Oracle controls the GPU node image. When they update the driver, you can bump your CUDA version. Don't go the other way around.

Bug 3: OOM Killed During Model Loading

vLLM loaded halfway and then the pod got OOM killed:

$ kubectl describe pod vllm-0
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137

I had resources.limits.memory: 16Gi but the model needed more during the loading phase. vLLM memory-maps the model weights, and the kernel counts that against the container's memory limit even though it's not all resident.

Fix: Set memory limits higher than you think you need, or use --gpu-memory-utilization 0.85 in vLLM to cap GPU memory usage and reduce the spill to CPU memory:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 32Gi    # was 16Gi — doubled it
  requests:
    memory: 16Gi

The gap between requests and limits gives the pod burst room during model loading without permanently reserving 32GB on the node.

Bug 4: Liveness Probe Restart Loop

Already mentioned this in a previous post but it's worth repeating because it gets everyone.

Model loading takes 60-120 seconds. Default liveness probe fires at 10 seconds. Kubernetes thinks the pod is dead, kills it, it restarts, starts loading again, gets killed again. Infinite loop.

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 180   # Give model time to load
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

Use a startup probe if you want to be cleaner about it:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10
  # Gives up to 300 seconds for model loading

Bug 5: Image Pull Timeout on Large GPU Images

GPU images are 5-15GB. OCIR pull worked fine for small images but GPU images would timeout:

Failed to pull image: context deadline exceeded

Fix: Increase the kubelet image pull deadline and use imagePullPolicy: IfNotPresent so the massive image only downloads once:

spec:
  containers:
    - name: inference
      image: iad.ocir.io/mytenancy/vllm:v1
      imagePullPolicy: IfNotPresent

Also — pull from the same OCI region as your cluster. Cross-region OCIR pulls over the internet are slow. Same-region pulls go over the internal network and are 5-10x faster.

Bug 6: GPU Not Released After Pod Deletion

Deleted a pod, created a new one, and it couldn't get the GPU:

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

But the old pod was gone. kubectl get pods showed nothing using the GPU.

Turns out the pod was stuck in Terminating state because the vLLM process wasn't handling SIGTERM properly. The GPU was still allocated to the zombie pod.

Fix: Add a preStop hook and set a reasonable terminationGracePeriodSeconds:

spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "kill -SIGTERM 1 && sleep 5"]

If a pod is truly stuck, kubectl delete pod <name> --grace-period=0 --force will release the GPU. Use it as a last resort.

My Debugging Checklist

When a GPU pod is broken on OKE, I now run through this in order:

# 1. Is the pod even scheduled?
kubectl get pod <name> -o wide

# 2. What's the event history?
kubectl describe pod <name>

# 3. Are GPU nodes ready with GPUs allocatable?
kubectl get nodes -l nvidia.com/gpu=present
kubectl describe node <gpu-node> | grep -A5 "Allocatable"

# 4. Can nvidia-smi run on the node?
kubectl exec -it <gpu-pod> -- nvidia-smi

# 5. What's the container's actual error?
kubectl logs <pod> --previous   # logs from the crashed container

# 6. Resource pressure on the node?
kubectl top node <gpu-node>

90% of the time, the answer is in steps 2 or 5. The other 10% is the CUDA version mismatch, which requires step 4.

Prevention

Most of these bugs hit me once and then I added checks so they wouldn't happen again:

CI validates CUDA version — build step runs nvidia-smi in a test container to verify compatibility
Startup probes on every GPU pod — no more liveness probe restart loops
Same-region OCIR — eliminated image pull timeouts
Memory limits 2x the model size — OOM kills are gone

None of this is complex. It's just stuff you learn by deploying real GPU workloads and watching them break.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

From Docker Compose on My Laptop to OKE in Production — Same App, Zero Rewrites

Pavan Madduri — Mon, 11 May 2026 17:06:09 +0000

I have a rule: if I can't run the full stack on my laptop with docker compose up, the architecture is too complicated.

But then you need to deploy to production, and suddenly you're rewriting everything as Kubernetes manifests. The Compose file that worked on your machine is useless. Config lives in two places and they drift apart.

Here's the workflow I settled on after trying a bunch of things that didn't work well.

The Local Stack

Standard web app — API, Redis, Postgres. Three services.

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=postgres://app:secret@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "app"]

  cache:
    image: redis:7-alpine

volumes:
  pgdata:

docker compose up and the full stack is running. No external dependencies, fast iteration.

What I Tried and Abandoned

Kompose — kompose convert technically works but the output is ugly. Tons of annotations, weird formatting, needs so much cleanup I might as well write the YAML by hand.

Docker Compose on Kubernetes — Various tools that try to run Compose files directly on K8s. They all add complexity and break in subtle ways.

Trying to share one config — I wasted a weekend trying to make the same file work for both. Local dev and production have genuinely different requirements. Pretending otherwise creates worse problems.

What Actually Works: Convention Over Tooling

I keep two config sets, aligned by convention:

project/
├── docker-compose.yml          # Local dev
├── Dockerfile
├── k8s/
│   ├── base/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── kustomization.yaml
│   └── overlays/
│       ├── staging/
│       └── production/
└── Makefile

Same image names, same env var names, same port numbers in both places. When I change a port in Compose, I grep for it in k8s/ and update it. Manual, but nothing breaks silently.

The key differences between local and OKE:

Database — Container locally, OCI managed service in production. I don't run databases on K8s.
Secrets — Plain text in Compose, OCI Vault via External Secrets Operator on OKE.
Scaling — One replica locally, HPA on OKE.

The K8s Side

# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: iad.ocir.io/mytenancy/myapp:latest
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: app-secrets
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 10

Kustomize overlays handle the per-environment differences:

# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
images:
  - name: iad.ocir.io/mytenancy/myapp
    newTag: v1.2.3

The Alignment Check I Actually Use

# Makefile
dev:
    docker compose up --build

build:
    docker build -t iad.ocir.io/$(TENANCY)/myapp:$(TAG) .
    docker push iad.ocir.io/$(TENANCY)/myapp:$(TAG)

deploy-staging:
    kubectl apply -k k8s/overlays/staging

check-alignment:
    @echo "=== Compose ports ===" && grep -A1 "ports:" docker-compose.yml
    @echo "=== K8s ports ===" && grep "containerPort" k8s/base/deployment.yaml
    @echo "=== Compose health ===" && grep "test:" docker-compose.yml
    @echo "=== K8s health ===" && grep "path:" k8s/base/deployment.yaml

make check-alignment is dumb but it catches drift. I run it before every deploy. It's saved me twice already from deploying with mismatched health check paths.

The Honest Take

This isn't elegant. I'd love a single config file that works everywhere. But every tool I tried to achieve that added more complexity than it removed.

The current setup is boring and it works. Compose for local, Kustomize for OKE, same Docker image, same env var names, a Makefile to keep me honest. I understand every piece of it, and when something breaks at 2am, that matters more than elegance.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

I Cut My Container Image Costs 60% by Building Multi-Arch Docker Images on OCI ARM

Pavan Madduri — Fri, 08 May 2026 19:40:30 +0000

I was running all my containers on AMD64 shapes because that's what I'd always done. x86, Intel/AMD, the default. Then I looked at my OCI bill and realized I was paying $0.064/OCPU/hr for AMD64 when ARM shapes cost $0.010/OCPU/hr. Six times cheaper for the same work.

The catch? My Docker images were all built for AMD64. They wouldn't run on ARM nodes. I had to figure out multi-arch builds.

It took me an afternoon to get right, and now every image I build supports both architectures. Here's what I learned.

Why ARM on OCI Is Different From ARM Everywhere Else

AWS has Graviton. GCP has Tau T2A. Azure has Ampere Altra. They're all ARM, and they're all cheaper than their x86 equivalents.

But OCI's pricing gap is the widest I've seen:

Architecture	Shape	$/OCPU/hr	4 OCPU + 24GB monthly
ARM	VM.Standard.A1.Flex	$0.010	~$29
AMD64	VM.Standard.E4.Flex	$0.064	~$184

And the Always Free tier gives you 4 ARM OCPUs and 24GB RAM forever. There's nothing comparable on the x86 side.

The problem is that most Docker images on Docker Hub are x86 only, and if you've been building images without thinking about architecture, yours probably are too.

How I Actually Did the Multi-Arch Build

Docker Buildx makes this surprisingly painless. The first time I tried it I expected hours of yak-shaving. It took about 20 minutes.

# Create a builder that supports multiple platforms
docker buildx create --name multiarch --driver docker-container --use

# Build and push for both architectures
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t iad.ocir.io/mytenancy/myapp:v1.2.0 \
  --push .

That's the core of it. Buildx uses QEMU emulation to build the ARM image on your x86 machine (or vice versa). The --push flag creates a manifest list in the registry that points to both architecture-specific images. When an ARM node pulls the image, it gets the ARM version automatically.

The Gotchas I Hit

QEMU is slow. Cross-compiling a Go binary via QEMU took 4x longer than native. For Go, I got around this by using Go's built-in cross-compilation:

FROM --platform=$BUILDPLATFORM golang:1.22-alpine AS builder
ARG TARGETARCH
WORKDIR /app
COPY . .
RUN GOARCH=$TARGETARCH CGO_ENABLED=0 go build -o server .

FROM alpine:3.20
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]

The --platform=$BUILDPLATFORM runs the build stage on your host architecture (fast), and GOARCH=$TARGETARCH tells Go to cross-compile for the target. No QEMU needed for the slow compilation step. Build went from 3 minutes back down to 40 seconds.

Python images need attention. Some pip packages have pre-built wheels for x86 but not ARM. When that happens, pip tries to compile from source inside the container, which needs gcc and build headers you might not have in your image. I hit this with numpy on an older version. Pinning to a version with ARM wheels fixed it.

Alpine vs Debian base images. Alpine uses musl libc, not glibc. Some binaries compiled for ARM + glibc won't work on Alpine. If you're getting weird segfaults on ARM, try switching to debian:bookworm-slim as your base and see if it goes away.

CI Pipeline for Multi-Arch

I have this in GitHub Actions. It builds for both architectures and pushes to OCIR:

name: Build Multi-Arch
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to OCIR
        uses: docker/login-action@v3
        with:
          registry: iad.ocir.io
          username: ${{ secrets.OCIR_USERNAME }}
          password: ${{ secrets.OCIR_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          platforms: linux/amd64,linux/arm64
          tags: iad.ocir.io/${{ secrets.OCIR_TENANCY }}/myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

The cache-from: type=gha uses GitHub's cache for layer caching, which makes subsequent builds much faster.

Deploying on OKE with Mixed Node Pools

On OKE I run two node pools — one ARM, one x86. Most workloads go to ARM because it's cheaper. GPU workloads obviously stay on x86 (NVIDIA doesn't make ARM GPUs for data centers... yet).

# ARM node pool (cheap, general workloads)
oci ce node-pool create \
  --name arm-workers \
  --node-shape VM.Standard.A1.Flex \
  --node-shape-config '{"ocpus": 4, "memoryInGBs": 24}' \
  ...

# AMD64 node pool (GPU workloads, x86-only dependencies)
oci ce node-pool create \
  --name x86-workers \
  --node-shape VM.Standard.E4.Flex \
  --node-shape-config '{"ocpus": 4, "memoryInGBs": 32}' \
  ...

Kubernetes handles the scheduling automatically. When a multi-arch image gets pulled, each node gets the right architecture variant. I didn't have to add any node selectors or affinity rules for most workloads — the manifest list takes care of it.

For workloads that must run on a specific architecture (like anything that needs NVIDIA GPUs), I use a node selector:

nodeSelector:
  kubernetes.io/arch: amd64

The Actual Savings

I moved 6 microservices from x86 to ARM over two weeks. These are Go and Python services — nothing exotic. All of them worked on ARM without code changes. The Docker images needed rebuilding with Buildx, but the Dockerfiles didn't change.

Monthly compute cost went from ~$184 to ~$29 for the same 4 OCPU / 24GB configuration per service. Across 6 services, that's about $930/month saved. Not life-changing for a company, but for my side projects and dev environments? That's real money.

When ARM Doesn't Work

Not everything runs on ARM. In my experience:

NVIDIA GPU workloads — x86 only for now
Legacy binaries — anything compiled for x86 without source code
Some Java native libraries — JNI libraries that ship x86 .so files only
Electron / desktop tools — not relevant for server containers but worth mentioning

Everything else — Go, Python, Node, Rust, Java (pure), Ruby — works fine on ARM. The ecosystem has matured a lot in the last two years.

Try It

If you're on OCI and not using ARM shapes, you're leaving money on the table. Start with one service. Build it multi-arch with Buildx. Deploy it on an A1.Flex node. Compare the bill.

The Docker workflow doesn't change. docker build, docker push, kubectl apply. The only difference is adding --platform linux/amd64,linux/arm64 to your build command.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I write about containers, Kubernetes, and GPU infrastructure. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

The Zero-Trust Docker Pipeline: Securing GPU/AI Container Images from Build to Production

Pavan Madduri — Fri, 08 May 2026 15:23:20 +0000

GPU container images are the softest target in your infrastructure. A typical vLLM image is 15GB with hundreds of packages, a CUDA runtime, Python dependencies, and model weights. Most teams build these images once, push them, and never scan them again. That's a problem.

I've been building GPU infrastructure tools on Docker and Kubernetes for the past year — keda-gpu-scaler for autoscaling, otel-gpu-receiver for observability, and GPU NUMA topology scheduling for Volcano. Every one of these ships as a Docker container. This post walks through the zero-trust pipeline I use to build, scan, sign, and deploy GPU containers — from docker build to production.

The Attack Surface

A standard GPU inference image has five layers of dependencies, and every layer is a CVE vector:

Layer	Example	Risk
OS packages	Ubuntu 22.04, libc, OpenSSL	OS-level CVEs, often unpatched in CUDA base images
CUDA toolkit	libcudart, libnvml, cuDNN, NCCL	NVIDIA releases on their own cycle, often behind on OS patches
Python runtime	CPython 3.11+	Python CVEs, pip supply chain attacks
ML framework	PyTorch, TensorFlow, vLLM	Hundreds of transitive Python deps
Application code	Custom serving logic, prompt templates	Your code — hopefully the smallest attack surface

The CUDA layer is the sneaky one. NVIDIA maintains their own base images (nvidia/cuda:12.4-base-ubuntu22.04) on their own release schedule. When Ubuntu patches a critical OpenSSL vulnerability, the NVIDIA base image might not pick it up for weeks. If you're building on top of nvidia/cuda, you inherit that lag.

Step 1: Docker Hardened Images as the Foundation

Docker Hardened Images are pre-vetted, continuously patched base images maintained by Docker. Instead of building on NVIDIA's base image directly, you can use a Docker Hardened base and selectively copy in only the CUDA libraries you need:

# Instead of inheriting the full NVIDIA base:
# FROM nvidia/cuda:12.4-base-ubuntu22.04

# Use Docker Hardened base + only the CUDA libs you need
FROM docker.io/docker/hardened-runtime:ubuntu-22.04

# Copy ONLY the NVIDIA libraries required at runtime
COPY --from=nvidia/cuda:12.4-base-ubuntu22.04 \
  /usr/local/cuda/lib64/libcudart.so* \
  /usr/local/cuda/lib64/libnvml.so* \
  /usr/local/cuda/lib64/libcublas.so* \
  /usr/local/lib/

RUN ldconfig

What this gives you:

Patched OS layer — Docker maintains the base, not NVIDIA. Patches ship within hours, not weeks.
Smaller image — only the CUDA libraries your application actually links against. Not the full 3GB toolkit.
Scout-optimized — Docker Scout has first-party provenance data for Hardened Images, so scanning is faster and more accurate.

Step 2: Multi-Stage Builds with Minimal Runtime

The goal is to ship the smallest possible runtime image. Everything that's only needed at build time — compilers, Go toolchain, npm, development headers — stays in the build stage.

Here's the pattern I use for keda-gpu-scaler:

# === Build Stage ===
FROM golang:1.22-bookworm AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=1 go build -ldflags="-s -w" -o gpu-scaler ./cmd/scaler

# === Runtime Stage ===
FROM docker.io/docker/hardened-runtime:ubuntu-22.04

# Only the NVML library — nothing else from CUDA
COPY --from=nvidia/cuda:12.4-base-ubuntu22.04 \
  /usr/local/cuda/lib64/libnvml.so* /usr/local/lib/
RUN ldconfig

# Non-root user
USER 65534:65534

# Read-only filesystem — binary is statically positioned
COPY --from=builder /app/gpu-scaler /usr/local/bin/

ENTRYPOINT ["gpu-scaler"]

Result: Runtime image is ~80MB instead of 3.5GB. Attack surface reduced by 97%.

For Python-based inference images (vLLM, Triton), the same principle applies but with pip:

# Build stage: install Python deps
FROM python:3.11-bookworm AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: minimal base + only installed packages
FROM docker.io/docker/hardened-runtime:ubuntu-22.04

# Copy CUDA runtime libs
COPY --from=nvidia/cuda:12.4-runtime-ubuntu22.04 \
  /usr/local/cuda/lib64/ /usr/local/cuda/lib64/

# Copy Python and installed packages
COPY --from=python:3.11-slim /usr/local/ /usr/local/
COPY --from=builder /install /usr/local

# Non-root
USER 65534:65534

COPY ./app /app
ENTRYPOINT ["python", "-m", "app.serve"]

Step 3: Docker Scout in CI — Block on CVEs

Docker Scout integrates into your CI pipeline to catch vulnerabilities at build time, not after deployment:

# .github/workflows/docker-security.yml
name: Docker Security Pipeline
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-buildx-action@v3

      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build image
        uses: docker/build-push-action@v5
        with:
          context: .
          load: true
          tags: my-gpu-image:${{ github.sha }}

      # Scan for CVEs — fail on critical/high
      - name: Docker Scout CVE scan
        uses: docker/scout-action@v1
        with:
          command: cves
          image: my-gpu-image:${{ github.sha }}
          sarif-file: scout-results.sarif
          only-severities: critical,high
          exit-code: true

      # Check against Docker Scout policies
      - name: Docker Scout policy evaluation
        uses: docker/scout-action@v1
        with:
          command: policy
          image: my-gpu-image:${{ github.sha }}

      # Upload SARIF to GitHub Security tab
      - name: Upload SARIF
        if: always()
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: scout-results.sarif

      # Only push if scan passes
      - name: Push image
        if: success()
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            pmady/keda-gpu-scaler:${{ github.sha }}
            pmady/keda-gpu-scaler:latest

Key configuration:

exit-code: true — the build fails if Scout finds critical or high CVEs. No exceptions.
sarif-file — results show up in GitHub's Security tab for tracking over time.
Policy evaluation — Docker Scout policies can enforce rules like "no images older than 30 days" or "must use Docker Official Images as base."

Common GPU Image Findings

Finding	Cause	Fix
Critical OpenSSL CVE	NVIDIA base image lags Ubuntu patches	Use Hardened Image as base
High-severity Python CVE	Transitive deps in PyTorch/vLLM	Pin versions, run `pip audit` in CI
Medium glibc CVE	Base image outdated	Rebuild weekly with `--no-cache`
Outdated CUDA libraries	NVIDIA release cycle	Cherry-pick only needed `.so` files from latest CUDA image

Step 4: Container Signing with Docker Content Trust

Sign your images so that Kubernetes admission controllers can verify provenance:

# Enable Docker Content Trust
export DOCKER_CONTENT_TRUST=1

# Push signs automatically
docker push pmady/keda-gpu-scaler:v1.0.0

# Verify a signed image
docker trust inspect pmady/keda-gpu-scaler:v1.0.0

For CI, use cosign (Sigstore) for keyless signing:

# In your GitHub Actions workflow
- name: Sign image with cosign
  uses: sigstore/cosign-installer@v3

- name: Sign
  run: |
    cosign sign --yes \
      pmady/keda-gpu-scaler:${{ github.sha }}
  env:
    COSIGN_EXPERIMENTAL: 1

Then enforce in Kubernetes with Kyverno:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences: ["pmady/*"]
          attestors:
            - entries:
                - keyless:
                    subject: "pavan4devops@gmail.com"
                    issuer: "https://accounts.google.com"

Now unsigned or tampered images are rejected at admission. The chain is: Build → Scout scan → Sign → Push → Kubernetes admission verifies → Runtime policy enforces.

Step 5: Runtime Security for GPU Containers

GPU containers need device access (/dev/nvidia*) but they don't need anything else privileged. Lock them down:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    fsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: inference
      image: pmady/keda-gpu-scaler:v1.0.0
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 8Gi
        requests:
          nvidia.com/gpu: 1
          memory: 4Gi
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: models
          mountPath: /models
          readOnly: true
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 100Mi
    - name: models
      persistentVolumeClaim:
        claimName: model-weights
        readOnly: true

What this enforces:

Non-root — NVML reads from sysfs, doesn't need root
Read-only root filesystem — no writes except /tmp (ephemeral) and /models (read-only PVC)
No privilege escalation — prevents container escape
Drop all capabilities — the NVIDIA device plugin handles GPU device injection, the container doesn't need any Linux capabilities
Seccomp — default syscall filter

The Full Pipeline

Source Code
    │
    ▼
docker build (multi-stage, Hardened Image base)
    │
    ▼
Docker Scout CVE scan ── FAIL? → Block merge
    │
    ▼
Docker Scout policy check ── FAIL? → Block merge
    │
    ▼
cosign sign (keyless, Sigstore)
    │
    ▼
docker push (to registry)
    │
    ▼
Kubernetes admission (Kyverno verifies signature + base image)
    │
    ▼
Runtime (non-root, read-only fs, drop caps, seccomp, resource limits)

Every stage has a gate. No image reaches production without being scanned, signed, and policy-checked. This is the same pipeline whether you're deploying a GPU inference service, a training job orchestrated by Volcano, or my keda-gpu-scaler DaemonSet.

This Isn't Optional for GPU Workloads

GPU containers are high-value targets. They run on expensive hardware ($3-30/hour per instance), they often have network access to model registries (HuggingFace, S3), and they process potentially sensitive input data. A compromised inference container is a direct path to data exfiltration and compute theft (cryptomining on your A100s).

The zero-trust pipeline adds ~2 minutes to your CI build. The alternative is finding out about a critical CVE from your security team after it's been running in production for three weeks.

Docker Scout + Hardened Images + container signing. Use all three.

GPU-Aware Autoscaling for Docker Containers: From NVML to Production

Pavan Madduri — Fri, 08 May 2026 15:21:41 +0000

Every GPU inference container has the same problem: Kubernetes HPA can't see the GPU. You scale on CPU and memory while your GPU sits at 95% utilization, completely invisible to the autoscaler. Or worse — your GPU is idle and you're paying $3/hour for an instance doing nothing.

I built keda-gpu-scaler to fix this. It's a KEDA external scaler that reads real GPU metrics via NVIDIA NVML and drives Kubernetes autoscaling decisions — including scale-to-zero. This post covers the Docker-specific parts: how GPU metrics flow from the NVIDIA Container Toolkit through Docker to KEDA, and how to build GPU-aware containers that actually scale.

How Docker Exposes GPUs to Containers

When you run a GPU container with Docker, three layers work together:

docker run --gpus all nvidia/cuda:12.4-base nvidia-smi

Docker Engine detects the --gpus flag and calls the NVIDIA Container Toolkit
The toolkit configures nvidia-container-runtime as the OCI runtime for this container
The runtime injects GPU device files (/dev/nvidia0, /dev/nvidiactl) and NVIDIA driver libraries into the container's filesystem

The container now has full access to NVML (NVIDIA Management Library), which exposes GPU utilization, memory usage, temperature, power draw, and more. This is the same mechanism my GPU scaler uses — each scaler pod runs on a GPU node and reads NVML metrics from the GPUs Docker has exposed to it.

┌─────────────┐     gRPC      ┌──────────────────────────┐
│ KEDA Operator│─────────────→│ keda-gpu-scaler (Docker)  │
│ (central pod)│              │ DaemonSet on each GPU node│
└─────────────┘              │                            │
                              │  NVML ──→ /dev/nvidia0    │
                              │  NVML ──→ /dev/nvidia1    │
                              │       (Docker-exposed)     │
                              └──────────────────────────┘

Building GPU Containers: The Dockerfile

GPU containers need CGO for NVML access. Here's the multi-stage Dockerfile I use for keda-gpu-scaler:

# Stage 1: Build
FROM golang:1.22-bookworm AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
# CGO_ENABLED=1 is required — NVML needs CGO
# This is why GPU scaling can't be a native KEDA scaler
# (KEDA builds with CGO_ENABLED=0)
RUN CGO_ENABLED=1 go build -ldflags="-s -w" -o keda-gpu-scaler ./cmd/scaler

# Stage 2: Minimal runtime
FROM nvidia/cuda:12.4-base-ubuntu22.04

# Security: non-root user
RUN useradd -r -u 65534 -s /bin/false scaler
USER 65534:65534

COPY --from=builder /app/keda-gpu-scaler /usr/local/bin/

EXPOSE 6000
ENTRYPOINT ["keda-gpu-scaler"]

Key decisions:

CGO_ENABLED=1 — NVML requires C bindings. This is the fundamental architectural reason keda-gpu-scaler exists as an external scaler instead of being built into KEDA core.
nvidia/cuda base image — provides the NVML shared libraries (libnvidia-ml.so) at runtime.
Non-root execution — NVML reads GPU data from sysfs, doesn't require root. Standard Docker security practice.
Multi-stage build — final image is ~150MB instead of 1.5GB (no Go toolchain, no build deps).

Docker Compose for Local GPU Development

Before deploying to Kubernetes, test the full stack locally with Docker Compose:

version: "3.8"
services:
  # The GPU scaler — reads NVML metrics, serves gRPC
  gpu-scaler:
    build: .
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "6000:6000"    # gRPC for KEDA
      - "9090:9090"    # Prometheus metrics
    environment:
      - LOG_LEVEL=debug

  # A real GPU workload to scale
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B
      --port 8000
      --max-model-len 2048

  # Prometheus to scrape GPU metrics
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

# Start the stack
docker compose up -d

# Check GPU metrics via gRPC
grpcurl -plaintext localhost:6000 externalscaler.ExternalScaler/GetMetrics

# Check GPU metrics via Prometheus
curl localhost:9090/metrics | grep gpu

# Send requests to vLLM and watch GPU utilization climb
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "messages": [{"role": "user", "content": "Hello"}]}'

This gives you the full autoscaling feedback loop locally: vLLM serving → GPU utilization rises → scaler reports metrics → you can verify KEDA would trigger a scale-up.

Docker Scout: Scanning GPU Container Images

GPU images are large and have deep dependency trees. A typical vLLM image pulls in CUDA, cuDNN, NCCL, Python, PyTorch, and dozens of transitive dependencies. More packages = more CVE surface.

# Scan the GPU scaler image
docker scout cves pmady/keda-gpu-scaler:latest

# Get base image recommendations
docker scout recommendations pmady/keda-gpu-scaler:latest

Common findings I've hit with GPU images:

Issue	Cause	Fix
High-severity OpenSSL CVE	CUDA base image uses older Ubuntu	Multi-stage build with patched base
Python package CVEs	Transitive deps in ML frameworks	Pin versions, use `pip audit`
Outdated CUDA libs	NVIDIA base image release lag	Use Docker Hardened Images as base

For production GPU containers, I run Scout in CI and block merges on critical/high CVEs:

# GitHub Actions
- name: Docker Scout CVE scan
  uses: docker/scout-action@v1
  with:
    command: cves
    image: pmady/keda-gpu-scaler:${{ github.sha }}
    only-severities: critical,high
    exit-code: true  # Fail the build on findings

Pre-Built Scaling Profiles

Different GPU workloads need different scaling strategies. keda-gpu-scaler ships profiles so you don't have to figure this out yourself:

# ScaledObject for vLLM inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-gpu-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000"
        profile: "vllm-inference"

Profile	Metric	Target	Scale-to-Zero	Why
`vllm-inference`	GPU memory (%)	80%	Yes (5% activation)	vLLM fills KV cache proportional to request load
`triton-inference`	GPU utilization (%)	75%	Yes (10% activation)	Triton batches requests, SM utilization is the bottleneck
`training`	GPU utilization (%)	90%	No	Training jobs should saturate GPUs
`batch`	GPU memory (%)	70%	Yes (1% activation)	Batch inference, aggressive scale-down

Scale-to-zero is the killer feature for inference. A single A100 instance costs ~$3/hour. If your inference service is idle overnight, that's $36/night wasted. keda-gpu-scaler detects GPU idle state and scales the deployment to zero pods. KEDA spins it back up on the first incoming request.

From Docker Desktop to GPU Cluster

The workflow end-to-end:

Build your GPU container with docker build — multi-stage, non-root, minimal runtime
Test locally with docker compose — verify GPU metrics, NVML access, gRPC endpoint
Scan with docker scout — catch CVEs before pushing
Push to your registry
Deploy to Kubernetes with keda-gpu-scaler for autoscaling

Docker is the consistent runtime from development to production. Same Dockerfile, same NVML metrics, same security model — just different scale.

The project is open source and being discussed for adoption under the KEDA organization. If you're running GPU workloads on Kubernetes and want autoscaling that actually looks at the GPU, give it a try: github.com/pmady/keda-gpu-scaler.

I Replaced a $3/hr GPU Dev Workflow with Docker Model Runner. Here's How

Pavan Madduri — Fri, 08 May 2026 15:19:29 +0000

Last month I was debugging a prompt template for a vLLM inference service. The change was two lines — swap the system prompt and adjust the temperature. To test it, I had to:

Rebuild a 15GB Docker image (the CUDA base alone is 3.5GB)
Push it to our registry (8 minutes on a good day)
Wait for Kubernetes to pull it on a GPU node
Realize the prompt still wasn't right
Repeat

Total cycle time: 22 minutes per iteration. For a two-line text change.

Then I tried Docker Model Runner. Pull the model once. Run inference locally. Iterate on the prompt in seconds. Push only when it's right. The same change took 14 seconds.

Docker shipped two features this year that I think every GPU/AI engineer needs to know about: Model Runner and Sandboxes. This post is the walkthrough I wish I had when I started using them.

My background: I build GPU infrastructure tools — keda-gpu-scaler for GPU autoscaling on Kubernetes, otel-gpu-receiver for GPU observability, and I contributed GPU NUMA topology scheduling to CNCF Volcano. Everything I build runs in Docker containers.

Part 1: Docker Model Runner — Run LLMs Like You Run Containers

If you've used docker pull and docker run, you already know how Model Runner works. Same mental model, same CLI patterns, but for AI models instead of containers.

Setup (one time)

Update Docker Desktop to 4.40+ and enable Model Runner:

Settings → Features in Development → Enable Docker Model Runner

Pull your first model

$ docker model pull ai/llama3.2:1B-Q8_0

This downloads quantized model weights from Docker Hub — same registry, same content-addressable storage, same layer deduplication. If two models share base weights, you only download the diff.

Run inference from the CLI

$ docker model run ai/llama3.2:1B-Q8_0 "What is NUMA topology in GPU scheduling?"

NUMA (Non-Uniform Memory Access) topology in GPU scheduling refers to the 
arrangement of CPUs and GPUs on a server where memory access speed depends 
on physical proximity. GPUs on the same NUMA node as the requesting CPU 
have faster memory access. NUMA-aware schedulers like Volcano place GPU 
workloads on nodes where the GPUs share a NUMA domain with the allocated 
CPUs, reducing cross-node memory latency by 10-20% for multi-GPU training...

That ran locally on my MacBook Pro. No cloud GPU. No 15GB container image. No Kubernetes cluster. Apple Silicon handles the inference via Metal/MLX. On Linux with NVIDIA GPUs, it uses CUDA automatically.

The killer feature: OpenAI-compatible API

Model Runner exposes a local endpoint that speaks the exact same protocol as OpenAI's API:

from openai import OpenAI

# Local — hits Docker Model Runner
client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed"  # no key required locally
)

# This SAME code works in production with one env var change:
# client = OpenAI(base_url=os.environ["VLLM_ENDPOINT"])

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[
        {"role": "system", "content": "You are a Kubernetes GPU infrastructure expert."},
        {"role": "user", "content": "Explain GPU memory fragmentation in 3 sentences."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)

Why this matters: Your application code is identical between local dev and production. Locally you hit Model Runner. In production you hit vLLM, Triton, or OpenAI. Change one environment variable. That's it.

List and manage models

$ docker model ls
NAME                       SIZE      CREATED
ai/llama3.2:1B-Q8_0       1.3 GB    10 minutes ago
ai/mistral:7B-Q4_K_M      4.1 GB    2 hours ago

$ docker model rm ai/mistral:7B-Q4_K_M

Same workflow as docker image ls and docker image rm. If you know Docker, you know Model Runner. Zero learning curve.

Part 2: The Real Problem Model Runner Solves

I run GPU inference in production on Kubernetes. Here's what the inner development loop looked like before Model Runner:

Edit prompt → docker build (8 min) → docker push (8 min) → kubectl rollout → test → repeat
                    ↑                                                              │
                    └──────────── 22 minutes per iteration ────────────────────────┘

And after:

Edit prompt → docker model run (14 sec) → test → iterate → ship when ready
                                                              │
                                            ← seconds per iteration →

The three pain points Model Runner eliminates:

Problem	Before	After
Dependency hell	CUDA 12 vs 11, cuDNN mismatches, PyTorch pinning	Model Runner handles the inference runtime
Image bloat	15GB vLLM image with full CUDA toolkit	1.3GB quantized model, no container build needed
Dev-prod gap	Can't run A100 inference on a MacBook	Model Runner uses Apple Silicon or local NVIDIA GPU

The architecture end-to-end:

┌─── Your Laptop ──────────────────┐     ┌─── Production K8s Cluster ────────────┐
│                                   │     │                                        │
│  Docker Model Runner              │     │  vLLM containers (A100/H100)           │
│  ├─ Llama 3.2 (local inference)  │     │  ├─ keda-gpu-scaler (auto-scaling)    │
│  └─ OpenAI-compatible API        │     │  ├─ otel-gpu-receiver (GPU metrics)   │
│          ↕                        │     │  ├─ Volcano (NUMA-aware scheduling)   │
│  Your Application Code            │ ──→ │  └─ OpenAI-compatible API             │
│  (same code, same SDK)           │     │          ↕                              │
│                                   │     │  Same Application Code                 │
└───────────────────────────────────┘     └────────────────────────────────────────┘

Same application code. Same API. Same Docker. Different scale.

Part 3: Docker Sandboxes — Because AI Agents Will Try to Delete Your Files

Here's a scenario that keeps me up at night: an AI coding agent decides the best way to fix a test failure is to rm -rf the test directory. Or it installs a malicious pip package. Or it curls your AWS credentials to an external server.

If you're building agentic workflows — LLMs that execute code, call APIs, or modify files — running that code on your host is reckless. Docker Sandboxes fix this.

What Sandboxes give you

Filesystem isolation — the agent gets its own filesystem. Your SSH keys, browser cookies, and credentials are invisible.
Network whitelisting — you specify exactly which hosts the agent can reach. Everything else is blocked.
Resource caps — CPU, memory, GPU limits per sandbox. No runaway processes.
Ephemeral by default — sandbox is destroyed when the task completes. No persistent state leakage.

A real example: sandboxed coding agent

# docker-compose.sandbox.yml
services:
  coding-agent:
    image: my-coding-agent:latest
    sandbox:
      enabled: true
      network:
        egress:
          - "api.openai.com:443"      # LLM API calls
          - "pypi.org:443"            # pip install
          - "github.com:443"          # git clone
      resources:
        memory: 4g
        cpus: 2
    volumes:
      - ./workspace:/workspace        # Only this directory is accessible
    environment:
      - MODEL_ENDPOINT=http://host.docker.internal:12434  # Model Runner

What the agent can do:

Read/write files in /workspace
Call OpenAI, install Python packages, clone repos
Use up to 4GB RAM, 2 CPUs

What the agent cannot do:

Touch ~/.ssh, ~/.aws, or any file outside /workspace
Reach arbitrary servers (no data exfiltration)
Consume unlimited resources (no fork bombs)
Persist state after completion (clean slate every run)

The production mirror

This is the same isolation model I enforce in production Kubernetes with PodSecurityStandards:

# Production: Kubernetes pod security
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
---
# Local: Docker Sandbox
sandbox:
  enabled: true
  network:
    egress: ["api.openai.com:443"]

Same security boundary. Same isolation model. Docker Sandboxes for local dev, Kubernetes PodSecurity for production. The gap between "works on my machine" and "works in production" shrinks to almost nothing.

Part 4: Putting It All Together — The Full GPU/AI Stack on Docker

Here's the picture I've been building toward. Docker isn't just a container runtime anymore — it's the full development platform for AI:

On Your Laptop (Docker Desktop)

Layer	Tool	What It Does
Inference	Docker Model Runner	Pull and run LLMs locally, OpenAI-compatible API
Security	Docker Sandboxes	Isolate AI agents, whitelist network access
Supply Chain	Docker Scout	Scan GPU images for CVEs in CUDA/Python dependency trees
Observability	Docker Extensions	I built a GPU Dashboard showing real-time NVML metrics in Docker Desktop
Multi-service	Docker Compose	Run inference + app + monitoring together locally

In Production (Kubernetes)

Layer	Tool	What It Does
Runtime	Docker containers (containerd)	vLLM, Triton inference servers
Autoscaling	keda-gpu-scaler	Scale on real GPU utilization, not CPU proxy metrics. Scale to zero when idle.
Observability	otel-gpu-receiver	GPU metrics → OpenTelemetry → Prometheus/Grafana
Scheduling	Volcano GPU NUMA	Place multi-GPU training on NUMA-aligned GPUs (10-20% throughput improvement)
Security	NetworkPolicy + PodSecurity	Same isolation as Docker Sandboxes, enforced by the cluster

The container is the unit of deployment from your laptop to the GPU cluster. Docker owns the inner loop. Kubernetes owns the outer loop. Both speak the same language.

Part 5: Try It Right Now (5-Minute Walkthrough)

Everything below runs on Docker Desktop. No GPU required (it'll use CPU). Takes 5 minutes.

Step 1: Pull a model

docker model pull ai/llama3.2:1B-Q8_0

Step 2: Chat with it

docker model run ai/llama3.2:1B-Q8_0 "Write a Dockerfile for a Python FastAPI app"

Step 3: Hit the API

curl http://localhost:12434/engines/llama3.2/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1B-Q8_0",
    "messages": [
      {"role": "system", "content": "You are a Docker and Kubernetes expert."},
      {"role": "user", "content": "How do I expose GPUs to a Docker container?"}
    ],
    "temperature": 0.3
  }'

Step 4: Use it from Python

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Explain Docker Scout in one paragraph"}]
)
print(response.choices[0].message.content)

That's it. You're running local LLM inference with the same API your production code uses. No CUDA installation. No PyTorch dependency conflicts. No 15GB images.

What I Want to See Next from Docker

Model Runner is already good. Here's what would make it great for production GPU engineers:

docker model stats — VRAM usage per model, like docker stats for containers. Right now I have to run nvidia-smi separately.
Multi-model serving — run Llama + Mistral + CodeLlama concurrently with per-model resource limits. Production inference servers (Triton) do this; local dev should too.
OpenTelemetry integration — emit inference latency, tokens/second, and queue depth to an OTel collector. My otel-gpu-receiver handles hardware GPU metrics — application-level model metrics would complete the picture.
Sandbox GPU passthrough — let sandboxed AI agents access the GPU for local inference. Currently sandboxes are CPU-only.

I've shared this feedback in the Docker community forums. If you're building GPU/AI infrastructure on Docker, I'd love to hear what you're missing — drop a comment or reach out.

The Bottom Line

Docker used to be where you built containers and pushed them somewhere else. In 2026, Docker is where you develop AI applications end-to-end:

Model Runner replaces your 22-minute build-push-deploy cycle with a 14-second local inference call
Sandboxes give your AI agents the same security boundary they'll have in production
Scout catches CVEs in your CUDA dependency tree before they reach a GPU cluster
Compose runs your entire AI stack locally — inference, app, monitoring, all together

And when it's time to ship to production, the same containers deploy to Kubernetes with GPU autoscaling, GPU observability, and NUMA-aware scheduling.

The gap between "works on my machine" and "works on 8 A100s" just got a lot smaller.

Pavan Madduri is a Senior Cloud Platform Engineer at W.W. Grainger, Inc., CNCF Golden Kubestronaut, and Oracle ACE Associate. He maintains keda-gpu-scaler and otel-gpu-receiver, contributed GPU NUMA topology scheduling to Volcano, and is a Dragonfly Community Member. Published: PlatformEngineering.com. Follow on Facebook: Docker AI & Cloud-Native DevOps.

Docker + OKE: Running GPU Inference Containers on Oracle Cloud

Pavan Madduri — Fri, 08 May 2026 14:59:52 +0000

I wanted to deploy an LLM inference API without spending $1,200/month on AWS GPU instances. OCI turned out to be significantly cheaper, and the Docker workflow was identical. Here's what I set up.

Why I Looked at OCI for GPU Workloads

I've been building GPU infrastructure tools for a while now (keda-gpu-scaler, otel-gpu-receiver, GPU NUMA scheduling for Volcano), and most of my testing was on AWS. The g5.xlarge instances with A10 GPUs run about $1.01/hr, plus $73/month for the EKS control plane. It adds up fast when you're iterating.

Someone on the Volcano Slack mentioned OCI's GPU pricing and I was skeptical. But when I looked it up, the numbers were real — same A10 GPU, roughly 40% cheaper, and OKE doesn't charge for the Kubernetes control plane at all. So I tried moving a vLLM inference workload over.

OCI GPU Pricing

Here's what OCI actually charges for GPU instances. I had to double-check these because they seemed too low:

Shape	GPU	GPU Memory	OCPUs	RAM	Price/hr (on-demand)
VM.GPU.A10.1	1x A10	24 GB	15	240 GB	~$1.65
VM.GPU.A10.2	2x A10	48 GB	30	480 GB	~$3.30
BM.GPU.A100-v2.8	8x A100	640 GB	128	2 TB	~$25.00
BM.GPU.H100.8	8x H100	640 GB	112	2 TB	~$38.00
VM.GPU.A10.1 (preemptible)	1x A10	24 GB	15	240 GB	~$0.50

That preemptible A10 price made me do a double-take. $0.50/hr for an A10 GPU. That's $365/year. I was paying more than that per month on AWS for the same hardware.

Building the Inference Image

I used vLLM because it's what I was already running on AWS. The Dockerfile doesn't change at all between clouds, which is the whole reason I'm using containers in the first place.

# Dockerfile.inference
FROM nvidia/cuda:12.4-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    vllm==0.6.0 \
    fastapi \
    uvicorn

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "microsoft/Phi-3-mini-4k-instruct", \
     "--max-model-len", "4096", \
     "--gpu-memory-utilization", "0.9"]

Build and test locally (you'll need an NVIDIA GPU and the NVIDIA Container Toolkit installed):

# Build
docker build -f Dockerfile.inference -t gpu-inference:v1 .

# Run with GPU access
docker run --gpus all -p 8000:8000 gpu-inference:v1

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Explain Kubernetes in one sentence:",
    "max_tokens": 50
  }'

--gpus all is the magic flag. It tells Docker to use the NVIDIA Container Toolkit, which injects the GPU device files and driver libraries into the container at runtime. Your image only needs the CUDA runtime libraries, not the full driver stack.

If You Don't Have a Local GPU

I do most of my development on a Mac, which obviously doesn't have an NVIDIA GPU. Docker Model Runner is what I use to test the LLM interaction pattern locally:

docker model pull ai/phi3-mini
docker model run ai/phi3-mini "Explain Kubernetes in one sentence"

The API is OpenAI-compatible so the client code I write against Model Runner works unchanged against vLLM in production. I've been using this for prompt template iteration and it cut my feedback loop from 20+ minutes (push to registry, wait for K8s pull, test) to about 15 seconds.

Pushing to OCIR

# Login to OCIR
docker login iad.ocir.io -u '<tenancy-namespace>/oracleidentitycloudservice/<email>'

# Tag
docker tag gpu-inference:v1 iad.ocir.io/<tenancy>/gpu-inference/vllm:v1

# Scan before push
docker scout cves gpu-inference:v1 --only-severity critical,high

# Push
docker push iad.ocir.io/<tenancy>/gpu-inference/vllm:v1

Fair warning: GPU images are big. Mine was about 8GB. The first push took a while, but after that Docker's layer caching means only changed layers get uploaded. Most rebuilds push in under a minute.

Setting Up OKE with GPU Nodes

# Create cluster (control plane is free)
oci ce cluster create \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-inference-cluster \
  --vcn-id $VCN_ID \
  --endpoint-subnet-id $API_SUBNET_ID \
  --service-lb-subnet-ids '["'$LB_SUBNET_ID'"]'

# Create GPU node pool
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-a10-pool \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$WORKER_SUBNET_ID'"
    }]
  }' \
  --node-source-details '{
    "imageId": "'$GPU_IMAGE_ID'",
    "sourceType": "IMAGE"
  }' \
  --initial-node-labels '[{
    "key": "nvidia.com/gpu",
    "value": "present"
  }]'

One thing I liked about OKE — the GPU node pools come with the NVIDIA device plugin already installed. On EKS I had to install the device plugin myself via a DaemonSet. Here it just works, and nvidia.com/gpu shows up as a schedulable resource immediately.

Deploying the Inference Service

# inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: iad.ocir.io/<tenancy>/gpu-inference/vllm:v1
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "4"
              memory: "16Gi"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
      imagePullSecrets:
        - name: ocir-secret
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: inference
  annotations:
    oci.oraclecloud.com/load-balancer-type: "lb"
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
spec:
  type: LoadBalancer
  selector:
    app: vllm-inference
  ports:
    - port: 80
      targetPort: http
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: inference
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: oci-bv
  resources:
    requests:
      storage: 50Gi

A few things I learned the hard way while setting this up:

The nvidia.com/gpu: 1 in resource limits is how Kubernetes knows to schedule this on a GPU node. Forget it and your pod lands on a CPU node and crashes.

The PVC for model cache is important. Without it, the model downloads from HuggingFace every time the pod restarts. Phi-3-mini is a few GB — that's 5-10 minutes of startup time you don't want to repeat.

The initialDelaySeconds: 120 on the liveness probe took me a restart loop to figure out. Model loading is slow. If your liveness probe fires before the model is loaded, Kubernetes kills the pod, it restarts, starts loading again, gets killed again... you get the idea. Give it at least 2 minutes.

The OCI Load Balancer annotations tell OKE to automatically provision a load balancer. No separate Terraform resource needed.

Deploy:

kubectl create namespace inference

# Create OCIR pull secret
kubectl create secret docker-registry ocir-secret \
  --namespace inference \
  --docker-server=iad.ocir.io \
  --docker-username='<tenancy>/<user>' \
  --docker-password='<auth-token>'

# Create HuggingFace token secret (from OCI Vault ideally)
kubectl create secret generic hf-token \
  --namespace inference \
  --from-literal=token=$HF_TOKEN

kubectl apply -f inference-deployment.yaml

After a few minutes (mostly model download time), the service is up and accessible via the load balancer:

LB_IP=$(kubectl get svc vllm-inference -n inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl http://$LB_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "What is Oracle Cloud Infrastructure?",
    "max_tokens": 100
  }'

Monitoring GPU Utilization

Once the inference service was running, I wanted to see actual GPU utilization. Without this you're flying blind — you have no idea if the GPU is sitting at 10% or 95%. DCGM Exporter gives you Prometheus metrics:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true

This gives you DCGM_FI_DEV_GPU_UTIL (utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory), temperature, power draw, etc. I have a Grafana dashboard that shows all of these and it's been useful for right-sizing.

I also built otel-gpu-receiver which does something similar but for OpenTelemetry. If you're already running an OTel collector, it might be a better fit than DCGM Exporter.

What I'm Actually Paying

Here's the monthly bill comparison for running Phi-3-mini on a single A10, always-on:

Platform	Setup	Monthly Cost
OCI OKE + VM.GPU.A10.1	Managed K8s + GPU node	~$1,210
OCI OKE + preemptible A10	Same, but preemptible	~$365
AWS EKS + g5.xlarge	Managed K8s + GPU node	~$1,100 + $73 (control plane)
GCP GKE + g2-standard-4	Managed K8s + GPU node	~$1,300 + $73 (control plane)
Azure AKS + NC4as_T4_v3	Managed K8s + T4 GPU	~$550 + less powerful GPU

The free control plane saves $73/mo by itself compared to EKS or GKE. And for my dev/test workloads I switched to preemptible instances, which dropped the GPU cost to $365/mo. The pods get evicted occasionally but for development that's fine.

Local Dev with Docker Model Runner

I keep coming back to this because it changed how I work. Before Model Runner, testing a prompt change meant: edit prompt, rebuild image, push to OCIR, wait for OKE to pull it, test, realize it's wrong, repeat. Twenty minutes per iteration.

Now I just run the model locally:

# Pull a model
docker model pull ai/phi3-mini

# Run inference
docker model run ai/phi3-mini "Summarize: Oracle Cloud Infrastructure provides..."

# Or use the API endpoint
curl http://localhost:12434/engines/phi3-mini/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is OKE?", "max_tokens": 50}'

Same API, same prompt format. When the prompt works locally, I rebuild the production image and push. The container is what makes this portable.

Was It Worth Switching?

Honestly, yes. The Docker workflow didn't change at all — same Dockerfile, same docker build, same docker push. I just changed the registry URL and the Kubernetes annotations. The inference service runs the same. The GPU utilization is the same. The API responses are the same.

What changed is the bill. And the fact that I don't pay $73/month for a Kubernetes control plane anymore. If you're running GPU workloads on AWS or GCP and haven't priced out OCI, it's worth 30 minutes of your time.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I write about containers, Kubernetes, and GPU infrastructure. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

Securing Docker Images on OCI — OCIR, Docker Scout, and OCI Vault

Pavan Madduri — Fri, 08 May 2026 14:56:18 +0000

I found a critical CVE in a production container image last month. It had been there for five months. Here's the setup I built on OCI so that doesn't happen again.

How I Found Out the Hard Way

A few weeks ago I ran docker scout cves on a container image that had been running in production for months. I wasn't expecting anything — I'd scanned it when I first built it and it was clean. But the base image (ubuntu:22.04) had picked up a handful of CVEs since then, including one rated critical.

Nobody had checked. The image was built once, pushed once, and forgotten. The container was humming along fine, but it was running vulnerable libraries that we never would've shipped knowingly.

This is a workflow problem more than a tooling problem. The scanners exist. I just wasn't running them at the right points. So I spent a weekend wiring together a proper pipeline on OCI using what was already available — Docker Scout, OCIR's built-in scanning, and OCI Vault for secrets.

The Setup (Three Places to Catch Problems)

The idea is simple: scan before you push, scan after you push, and never put secrets in the image. Three checkpoints, each catching different things.

Docker Scout runs on my laptop and in CI — it catches CVEs before the image leaves my machine. OCIR scans again after the push, which catches anything Scout might miss and gives me a second opinion from Oracle's vulnerability database. OCI Vault handles secrets so I'm not baking API keys into environment variables like it's 2015.

Docker Scout — Catching CVEs Before They Leave My Machine

I'd been ignoring Docker Scout for a while, thinking it was just another scanning tool. It's actually pretty good. It comes built into Docker Desktop and the CLI, so there's no extra install.

# Scan a local image
docker scout cves my-api:latest

# Quick view — just critical and high severity
docker scout cves my-api:latest --only-severity critical,high

# Compare two image versions
docker scout compare my-api:v2 --to my-api:v1

# Get remediation recommendations
docker scout recommendations my-api:latest

The recommendations command is the one that made me a convert. Instead of just listing CVE IDs and making you figure out what to do, it tells you exactly which base image version fixes the problem:

Recommended fixes:
  Base image: golang:1.22-alpine → golang:1.22.4-alpine
  Fixes: CVE-2024-24790, CVE-2024-24789

  Base image: alpine:3.19 → alpine:3.20
  Fixes: 3 vulnerabilities

That saved me probably 30 minutes of Googling CVE IDs and cross-referencing which alpine version patched what. I just updated the FROM line and rebuilt.

Putting Scout in CI

Running it locally is fine, but the real value is when it blocks bad images in CI automatically. Here's what I have in GitHub Actions:

# .github/workflows/security.yml
name: Container Security
on:
  push:
    branches: [main]
  pull_request:

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t my-api:${{ github.sha }} .

      - name: Docker Scout scan
        uses: docker/scout-action@v1
        with:
          command: cves
          image: my-api:${{ github.sha }}
          only-severities: critical,high
          exit-code: true  # Fail the build if critical/high CVEs found

      - name: Docker Scout recommendations
        if: always()
        uses: docker/scout-action@v1
        with:
          command: recommendations
          image: my-api:${{ github.sha }}

The important bit is exit-code: true. Without that flag, Scout just prints the results and the build happily continues. With it, any critical or high CVE fails the pipeline. I've had this block two PRs in the last month and both times it was a legitimate issue in the base image.

OCIR Scanning — The Second Pair of Eyes

OCIR has its own vulnerability scanner that runs against Oracle's database. It sometimes catches things Scout doesn't (different vulnerability feeds) and vice versa. I like having both.

Setting It Up

# Create a repository with scanning enabled
oci artifacts container repository create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "production-api" \
  --is-immutable true \
  --readme-enabled true

The --is-immutable true flag is one I wish I'd known about earlier. It prevents anyone from overwriting a tag. So once v1.2.3 is pushed, that's it — nobody can push a different image with the same tag. Sounds obvious but I've been bitten by :latest being silently overwritten before.

Push and Scan

# Tag and push
docker tag my-api:v1.2.3 iad.ocir.io/$TENANCY/production-api:v1.2.3
docker push iad.ocir.io/$TENANCY/production-api:v1.2.3

# Trigger a scan (or it runs automatically)
oci vulnerability-scanning container scan create \
  --compartment-id $COMPARTMENT_ID \
  --image-id <image-ocid>

# Check scan results
oci vulnerability-scanning container scan result list \
  --compartment-id $COMPARTMENT_ID \
  --image-id <image-ocid>

Scan Policies

You can also set up automated scan targets in Terraform so every push to specific repos gets scanned without anyone having to remember to do it:

# Terraform — scan policy
resource "oci_vulnerability_scanning_container_scan_recipe" "strict" {
  compartment_id = var.compartment_id
  display_name   = "strict-scan-recipe"

  scan_settings {
    scan_level = "STANDARD"
  }
}

resource "oci_vulnerability_scanning_container_scan_target" "production" {
  compartment_id            = var.compartment_id
  container_scan_recipe_id  = oci_vulnerability_scanning_container_scan_recipe.strict.id
  display_name              = "production-registry-scan"

  target_registry {
    compartment_id = var.compartment_id
    type           = "OCIR"
    repositories   = ["production-api"]
  }
}

OCI Vault — Stop Putting Secrets in Environment Variables

This one is embarrassing to admit, but I've shipped containers with API keys in environment variables more times than I'd like. Not in the Dockerfile directly (I know better than that), but in docker-compose files that ended up in git, or in Kubernetes manifests that got copy-pasted around.

# WRONG — secrets baked into the image
ENV DATABASE_PASSWORD=hunter2
ENV API_KEY=sk-abc123

# WRONG — secrets in compose file committed to git
environment:
  - DATABASE_PASSWORD=hunter2

OCI Vault stores secrets in an HSM. The container pulls them at startup. They never exist in the image, in your compose file, or in git.

Create Vault and Secrets

# Create a vault
oci kms management vault create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "docker-secrets-vault" \
  --vault-type DEFAULT

# Create a master encryption key
oci kms management key create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "docker-secrets-key" \
  --key-shape '{"algorithm": "AES", "length": 32}' \
  --endpoint $VAULT_MGMT_ENDPOINT

# Store a secret
echo -n "my-database-password" | base64 | \
oci vault secret create-base64 \
  --compartment-id $COMPARTMENT_ID \
  --vault-id $VAULT_ID \
  --key-id $KEY_ID \
  --secret-name "prod-db-password" \
  --secret-content-content "$(cat -)"

Pull Secrets at Deploy Time

For OCI Container Instances, use an init script that fetches secrets from Vault:

#!/bin/sh
# entrypoint.sh — fetch secrets from OCI Vault before starting the app

# Using instance principal authentication (no credentials needed)
export DATABASE_PASSWORD=$(oci secrets secret-bundle get-secret-bundle-by-name \
  --vault-id $VAULT_ID \
  --secret-name "prod-db-password" \
  --stage CURRENT \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 -d)

export API_KEY=$(oci secrets secret-bundle get-secret-bundle-by-name \
  --vault-id $VAULT_ID \
  --secret-name "prod-api-key" \
  --stage CURRENT \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 -d)

# Start the application
exec /server

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app/server /server
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

For OKE, I use External Secrets Operator which syncs secrets from OCI Vault into Kubernetes secrets automatically:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: oci-vault
    kind: ClusterSecretStore
  target:
    name: app-secrets
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: "prod-db-password"
    - secretKey: API_KEY
      remoteRef:
        key: "prod-api-key"

The nice thing here is that when I rotate a secret in Vault, ESO picks it up within an hour and updates the Kubernetes secret. Pods get the new value on their next restart. No redeployment, no new image push, no PR to change a YAML file.

The Full Flow

# 1. Build
docker build -t my-api:v1.2.3 .

# 2. Scan locally (fail fast)
docker scout cves my-api:v1.2.3 --only-severity critical,high --exit-code

# 3. Push to OCIR (triggers registry scan)
docker tag my-api:v1.2.3 iad.ocir.io/$TENANCY/production-api:v1.2.3
docker push iad.ocir.io/$TENANCY/production-api:v1.2.3

# 4. Deploy (secrets pulled from Vault at runtime)
oci container-instances container-instance create \
  --containers '[{
    "imageUrl": "iad.ocir.io/'$TENANCY'/production-api:v1.2.3",
    "environmentVariables": {
      "VAULT_ID": "'$VAULT_ID'"
    }
  }]' \
  ...

Two scan layers, immutable tags, no secrets in source control. None of this required buying a third-party security platform.

What I Learned

The tooling was never the problem. Docker Scout, OCIR scanning, OCI Vault — they were all available. I just wasn't using them consistently. The five-month-old CVE I found wasn't a failure of technology, it was a failure of workflow.

Now scanning happens automatically at two points (local/CI and registry), secrets are in Vault instead of YAML files, and image tags are immutable so nobody accidentally overwrites a production image. It took a weekend to set up. I should've done it a year ago.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I write about containers, Kubernetes, and GPU infrastructure. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

Running Docker Containers on OCI Without Kubernetes

Pavan Madduri — Fri, 08 May 2026 14:51:57 +0000

I needed to run a container in the cloud. Not a microservices platform. Not a service mesh. Just one container, one port, accessible from the internet. Here's how OCI Container Instances turned out to be the right tool.

Why I Stopped Reaching for Kubernetes First

I'll be honest — I'm a Kubestronaut. I have all the CNCF Kubernetes certifications. My default muscle memory is kubectl apply -f for everything. But last month I needed to deploy a small Go API for a side project and I caught myself writing a Helm chart for a single container.

That felt ridiculous.

The API had two endpoints. It needed 256MB of RAM. There was no reason to stand up a control plane, configure node pools, set up an ingress controller, and maintain all of that just to serve JSON over HTTP.

I'd used OCI Container Instances before for a quick test and remembered it being dead simple. So I tried it for real this time.

What OCI Container Instances Actually Are

The closest analogy is docker run but Oracle manages the host. You give it an image, tell it how much CPU and memory you want, point it at a subnet, and it runs. The container gets a real IP on your VCN. You can pull from OCIR, Docker Hub, or any OCI-compliant registry.

I was surprised by the resource limits — you can go up to 64 OCPUs and 1TB of RAM on a single instance. Fargate caps out at 16 vCPUs. Cloud Run at 8. For most of my use cases that doesn't matter, but it's nice to know the ceiling is high if I need it later.

Feature	OCI Container Instances	AWS Fargate	Cloud Run
Max vCPUs	64	16	8
Max Memory	1024 GB	120 GB	32 GB
GPU Support	Yes (A10, A100)	No	Yes (L4)
Cold Start	~2-3s	5-15s	2-8s
Min Billing	1 second	1 minute	100ms

The GPU support is worth mentioning — you can run NVIDIA GPU containers without managing drivers or CUDA installs on the host. I haven't used this in production yet but I've tested it with a vLLM image and it worked without any changes to the Dockerfile.

The API I Deployed

Nothing fancy. A Go service with two endpoints — /health and /info. I chose Go because the final image is tiny (under 15MB with distroless) and it starts in milliseconds, which matters when you're paying per second.

// main.go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"
)

type HealthResponse struct {
    Status    string `json:"status"`
    Timestamp string `json:"timestamp"`
    Host      string `json:"host"`
    Region    string `json:"region"`
}

type InfoResponse struct {
    Service  string `json:"service"`
    Version  string `json:"version"`
    Runtime  string `json:"runtime"`
    Platform string `json:"platform"`
}

func main() {
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }

    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        host, _ := os.Hostname()
        resp := HealthResponse{
            Status:    "healthy",
            Timestamp: time.Now().UTC().Format(time.RFC3339),
            Host:      host,
            Region:    os.Getenv("OCI_REGION"),
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(resp)
    })

    http.HandleFunc("/info", func(w http.ResponseWriter, r *http.Request) {
        resp := InfoResponse{
            Service:  "oci-docker-demo",
            Version:  os.Getenv("APP_VERSION"),
            Runtime:  "OCI Container Instances",
            Platform: "Oracle Cloud Infrastructure",
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(resp)
    })

    log.Printf("Starting server on :%s", port)
    log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}

The Dockerfile is a straightforward multi-stage build. The builder compiles the binary, and the final image is distroless so there's almost nothing in it:

FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod main.go ./
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o server .

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app/server /server
EXPOSE 8080
USER nonroot:nonroot
ENTRYPOINT ["/server"]

Build and test locally:

docker build -t oci-docker-demo:v1 .
docker run -p 8080:8080 -e OCI_REGION=us-ashburn-1 oci-docker-demo:v1

# Test it
curl http://localhost:8080/health

Pushing to OCIR

OCIR is OCI's private container registry. Free for standard usage, which is nice. The login command is a bit verbose compared to Docker Hub — you need the tenancy namespace prefix — but it works the same way after that.

# Log in to OCIR
docker login iad.ocir.io -u '<tenancy-namespace>/<username>'

# Tag for OCIR
docker tag oci-docker-demo:v1 iad.ocir.io/<tenancy-namespace>/docker-demos/oci-demo:v1

# Push
docker push iad.ocir.io/<tenancy-namespace>/docker-demos/oci-demo:v1

I ran Docker Scout on it before pushing, mostly out of habit:

docker scout cves oci-docker-demo:v1

Zero critical or high CVEs, which is expected with distroless. If you're using ubuntu or debian as your base, you'll probably see a few here.

Deploying — The Part That Surprised Me

This is where Container Instances won me over. One CLI command:

oci container-instances container-instance create \
  --compartment-id $COMPARTMENT_ID \
  --availability-domain "Uocm:US-ASHBURN-AD-1" \
  --display-name "docker-demo-api" \
  --shape "CI.Standard.A1.Flex" \
  --shape-config '{"ocpus": 1, "memoryInGBs": 2}' \
  --containers '[{
    "imageUrl": "iad.ocir.io/<tenancy>/docker-demos/oci-demo:v1",
    "displayName": "api",
    "environmentVariables": {
      "PORT": "8080",
      "OCI_REGION": "us-ashburn-1",
      "APP_VERSION": "1.0.0"
    },
    "resourceConfig": {
      "vcpusLimit": 1,
      "memoryLimitInGBs": 2
    }
  }]' \
  --vnics '[{
    "subnetId": "'$SUBNET_ID'",
    "isPublicIpAssigned": true
  }]'

I ran this and had a public IP with a working API in about 3 seconds. No joke. I spent more time writing the CLI command than waiting for it to deploy. Coming from Kubernetes where I'm used to waiting for nodes to scale up, load balancers to provision, and pods to pass readiness checks... this was refreshingly fast.

Terraform Version (for when this isn't just a side project)

I wouldn't run that CLI command manually every time in a real workflow. Here's the same thing in Terraform, which I'd use for anything that needs to be reproducible:

resource "oci_container_instances_container_instance" "demo" {
  compartment_id      = var.compartment_id
  availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
  display_name        = "docker-demo-api"

  shape = "CI.Standard.A1.Flex"
  shape_config {
    ocpus         = 1
    memory_in_gbs = 2
  }

  containers {
    image_url    = "iad.ocir.io/${var.tenancy_namespace}/docker-demos/oci-demo:v1"
    display_name = "api"

    environment_variables = {
      PORT        = "8080"
      OCI_REGION  = var.region
      APP_VERSION = "1.0.0"
    }

    resource_config {
      vcpus_limit        = 1
      memory_limit_in_gbs = 2
    }

    health_checks {
      health_check_type = "HTTP"
      port              = 8080
      path              = "/health"
      interval_in_seconds = 30
    }
  }

  vnics {
    subnet_id             = var.subnet_id
    is_public_ip_assigned = true
  }

  image_pull_secrets {
    registry_endpoint = "iad.ocir.io"
    secret_id         = oci_vault_secret.ocir_creds.id
    secret_type       = "VAULT"
  }
}

output "container_public_ip" {
  value = oci_container_instances_container_instance.demo.vnics[0].private_ip
}

Adding a Load Balancer

Container Instances give you a public IP directly, but for anything with real traffic you probably want a load balancer in front. TLS termination, health checks, the usual. Here's the Terraform for that:

resource "oci_load_balancer_load_balancer" "api_lb" {
  compartment_id = var.compartment_id
  display_name   = "docker-demo-lb"
  shape          = "flexible"
  subnet_ids     = [var.public_subnet_id]

  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 100
  }
}

resource "oci_load_balancer_backend_set" "api_backend" {
  load_balancer_id = oci_load_balancer_load_balancer.api_lb.id
  name             = "api-backends"
  policy           = "ROUND_ROBIN"

  health_checker {
    protocol          = "HTTP"
    port              = 8080
    url_path          = "/health"
    interval_ms       = 10000
    return_code       = 200
  }
}

resource "oci_load_balancer_listener" "https" {
  load_balancer_id         = oci_load_balancer_load_balancer.api_lb.id
  name                     = "https-listener"
  default_backend_set_name = oci_load_balancer_backend_set.api_backend.name
  port                     = 443
  protocol                 = "HTTP"

  ssl_configuration {
    certificate_ids          = [var.certificate_id]
    protocols                = ["TLSv1.2", "TLSv1.3"]
    server_order_preference  = "ENABLED"
  }
}

What It Actually Costs

I tracked the bill for a month. It was comically low:

Setup	Monthly Cost
OCI Container Instance (1 OCPU ARM, 2GB)	~$3.50
OCI Load Balancer (flexible, 10 Mbps)	~$12
Total	~$15.50/mo
AWS Fargate equivalent	~$35/mo
GCP Cloud Run equivalent	~$25/mo (usage-based)

ARM shapes on OCI are genuinely cheap. I was paying more for my morning coffee than for this API.

When I'd Still Use OKE

Container Instances aren't a Kubernetes replacement. They're for the cases where Kubernetes is more infrastructure than you need.

If I'm running 10+ services that talk to each other, need rolling deployments, RBAC, network policies, or auto-scaling — I'm using OKE. I work with Kubernetes daily and I'm not trying to avoid it.

But for a single API, a batch job, a webhook handler, or a quick prototype? Container Instances get me to production faster with less stuff to maintain. And the Docker workflow is the same — same Dockerfile, same docker build, same image. I just change where I deploy it.

Wrapping Up

I've been using Container Instances for about a month now for small services and side projects. The thing I keep coming back to is how little I think about infrastructure when using them. No node pools to right-size, no cluster upgrades to schedule, no ingress controllers to debug.

If you're on OCI and you haven't tried Container Instances yet, spend 10 minutes with it. You might realize, like I did, that half the containers you're running on Kubernetes don't actually need Kubernetes.

Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I write about containers, Kubernetes, and GPU infrastructure. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

Deploying a Production-Ready K3s Cluster on OCI Always Free ARM Instances

Pavan Madduri — Wed, 11 Mar 2026 14:50:00 +0000

Deploying a Production-Ready K3s Cluster on OCI Always Free ARM Instances

How I turned Oracle Cloud's free ARM compute into a fully functional Kubernetes cluster — with ingress, persistent storage, and TLS — all without spending a dollar.

Introduction

I have been running Kubernetes clusters professionally for years — managed services like EKS, AKS, GKE, and self-hosted clusters with kubeadm. They all cost money. Even the cheapest managed Kubernetes offering runs $70-80/month just for the control plane.

Then I looked at what Oracle Cloud gives away for free: 4 ARM OCPUs and 24GB of RAM on the Always Free tier. That is more compute than most developers use for their entire home lab. The question was obvious — could I run a real Kubernetes cluster on it?

The answer is yes, and it works better than I expected.

In this post, I will walk through deploying K3s — Rancher's lightweight Kubernetes distribution — on OCI Always Free ARM instances. Not a toy cluster. A cluster with ingress routing, persistent volumes, automatic TLS certificates, and enough resources to run real workloads.

Why K3s on OCI ARM?

Why K3s over full Kubernetes?

K3s strips out the components most developers never use — cloud controller, storage drivers, legacy API versions — and replaces etcd with SQLite (or embedded etcd for HA). The result is a single binary under 100MB that starts in seconds.

On resource-constrained Always Free instances, this matters. Full kubeadm clusters consume 2-3GB of RAM just for the control plane. K3s uses around 512MB.

Why OCI ARM over other clouds?

Provider	Free Compute	RAM	Duration
OCI Always Free	4 ARM OCPUs	24 GB	Forever
AWS Free Tier	1 vCPU (t2.micro)	1 GB	12 months
GCP Free Tier	0.25 vCPU (e2-micro)	1 GB	Forever
Azure Free	1 vCPU (B1S)	1 GB	12 months

There is no comparison. OCI gives you 24x the RAM of any competitor's free tier, permanently.

Architecture

Here is what we are building:

┌──────────────────────────────────────────────────────┐
│                    OCI VCN (10.0.0.0/16)             │
│                                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │           Public Subnet (10.0.1.0/24)          │  │
│  │                                                │  │
│  │  ┌──────────────────┐  ┌────────────────────┐  │  │
│  │  │   K3s Server     │  │   K3s Agent        │  │  │
│  │  │   (Control Plane)│  │   (Worker Node)    │  │  │
│  │  │                  │  │                    │  │  │
│  │  │  2 OCPU / 12GB   │  │  2 OCPU / 12GB     │  │  │
│  │  │  Oracle Linux 9  │  │  Oracle Linux 9    │  │  │
│  │  │                  │  │                    │  │  │
│  │  │  - K3s server    │  │  - K3s agent       │  │  │
│  │  │  - Traefik        │  │  - Workloads       │  │  │
│  │  │  - CoreDNS       │  │  - Pods            │  │  │
│  │  │  - Metrics       │  │                    │  │  │
│  │  └──────────────────┘  └────────────────────┘  │  │
│  │                                                │  │
│  └────────────────────────────────────────────────┘  │
│                                                      │
│  Security List:                                      │
│    Ingress: SSH(22), HTTP(80), HTTPS(443),           │
│             K8s API(6443), Kubelet(10250),           │
│             NodePort(30000-32767)                    │
│    Egress:  All traffic                               │
│                                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │  OCI Load Balancer (10 Mbps - Always Free)     │  │
│  │  → Forwards 80/443 to K3s Traefik Ingress       │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘

We split the 4 OCPUs and 24GB evenly: 2 OCPUs + 12GB for the server node, 2 OCPUs + 12GB for the worker. This gives the control plane enough room to breathe while leaving serious capacity for workloads.

Prerequisites

Before starting, you need:

OCI account with Always Free tier — Sign up at cloud.oracle.com
OCI CLI configured — Use Cloud Shell (pre-configured) or install locally
Two A1.Flex instances provisioned — Follow the VCN + compute setup from my earlier posts, but create two instances instead of one
SSH access to both instances

If you do not have the instances yet, provision them with these shapes:

# Server node
SHAPE_CONFIG='{"ocpus":2,"memoryInGBs":12}'

# Agent node (same config)
SHAPE_CONFIG='{"ocpus":2,"memoryInGBs":12}'

Both must use an aarch64 Oracle Linux 9 image — ARM architecture is critical here.

Step 1: Preparing the Instances

SSH into both instances and run the same preparation steps. OCI's Oracle Linux 9 images have firewalld and iptables rules that interfere with Kubernetes networking. We need to handle this.

# On BOTH nodes
sudo dnf update -y

# Disable firewalld — K3s manages its own iptables rules
sudo systemctl stop firewalld
sudo systemctl disable firewalld

# Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k3s.conf
br_netfilter
overlay
EOF

sudo modprobe br_netfilter
sudo modprobe overlay

# Set required sysctl parameters
cat <<EOF | sudo tee /etc/sysctl.d/k3s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system

Why these specific settings?

br_netfilter — Enables iptables to see bridged traffic (required for pod-to-pod communication across nodes)
overlay — Required by the container runtime for overlay filesystem
ip_forward — Allows the kernel to forward packets between network interfaces (essential for routing traffic to pods)

I spent two hours debugging connectivity issues on my first attempt because I forgot br_netfilter. Pods on different nodes simply could not talk to each other. The symptom was DNS resolution failures — CoreDNS pods could not reach each other.

Step 2: OCI Security List Configuration

This is where most OCI + Kubernetes guides fall short. The default security list blocks inter-node communication that K3s needs.

You need these ingress rules on the security list attached to your subnet:

# Update security list with K3s-required ports
oci network security-list update \
    --security-list-id "$SL_ID" \
    --ingress-security-rules '[
        {"source":"0.0.0.0/0","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":22,"max":22}}},
        {"source":"0.0.0.0/0","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":80,"max":80}}},
        {"source":"0.0.0.0/0","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":443,"max":443}}},
        {"source":"10.0.0.0/16","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":6443,"max":6443}}},
        {"source":"10.0.0.0/16","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":10250,"max":10250}}},
        {"source":"10.0.0.0/16","protocol":"17",
         "udpOptions":{"destinationPortRange":{"min":8472,"max":8472}}},
        {"source":"10.0.0.0/16","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":2379,"max":2380}}},
        {"source":"0.0.0.0/0","protocol":"6",
         "tcpOptions":{"destinationPortRange":{"min":30000,"max":32767}}}
    ]' \
    --egress-security-rules '[
        {"destination":"0.0.0.0/0","protocol":"all"}
    ]' \
    --force > /dev/null

Port breakdown:

Port	Protocol	Purpose	Source
22	TCP	SSH access	Anywhere
80	TCP	HTTP ingress	Anywhere
443	TCP	HTTPS ingress	Anywhere
6443	TCP	K3s API server	VCN only
10250	TCP	Kubelet metrics	VCN only
8472	UDP	VXLAN (Flannel CNI)	VCN only
2379-2380	TCP	etcd (if HA)	VCN only
30000-32767	TCP	NodePort services	Anywhere

Notice that internal K3s ports (6443, 10250, 8472) are restricted to the VCN CIDR 10.0.0.0/16. Never expose the Kubernetes API to the internet in production.

Step 3: Installing K3s Server

SSH into your first instance (the server node) and install K3s:

# On the SERVER node
export INSTALL_K3S_EXEC="server"
export K3S_NODE_NAME="k3s-server"

curl -sfL https://get.k3s.io | sh -s - \
    --write-kubeconfig-mode 644 \
    --tls-san $(curl -s http://169.254.169.254/opc/v1/instance/metadata/public_ip) \
    --node-external-ip $(curl -s http://169.254.169.254/opc/v1/instance/metadata/public_ip) \
    --flannel-iface enp0s6 \
    --disable servicelb

Let me explain each flag because they all matter on OCI:

--write-kubeconfig-mode 644 — Makes the kubeconfig readable without sudo. Useful for development but tighten this in production
--tls-san <public_ip> — Adds the public IP to the K3s API server's TLS certificate. Without this, kubectl from your laptop will get TLS errors
--node-external-ip <public_ip> — Tells K3s about the node's public IP. OCI instances only see their private IP on the network interface
--flannel-iface enp0s6 — Forces Flannel to use the correct network interface. OCI ARM instances use enp0s6 as the primary interface, not eth0. I discovered this the hard way — Flannel defaulted to the wrong interface and VXLAN tunnels failed silently
--disable servicelb — Disables K3s's built-in load balancer (ServiceLB/Klipper). We will use OCI's Always Free Load Balancer instead

The instance metadata endpoint 169.254.169.254 is OCI's equivalent of AWS's metadata service. It returns instance details without needing the OCI CLI.

Verify the server is running:

sudo systemctl status k3s

# Check node status
kubectl get nodes
# NAME         STATUS   ROLES                  AGE   VERSION
# k3s-server   Ready    control-plane,master   45s   v1.31.4+k3s1

Grab the join token — the agent node needs this:

sudo cat /var/lib/rancher/k3s/server/node-token
# K10xxxx::server:yyyy

Step 4: Joining the Agent Node

SSH into your second instance and install K3s in agent mode:

# On the AGENT node
export K3S_URL="https://<SERVER_PRIVATE_IP>:6443"
export K3S_TOKEN="<TOKEN_FROM_STEP_3>"
export K3S_NODE_NAME="k3s-agent"

curl -sfL https://get.k3s.io | sh -s - \
    --node-external-ip $(curl -s http://169.254.169.254/opc/v1/instance/metadata/public_ip) \
    --flannel-iface enp0s6

Important: use the private IP of the server node for K3S_URL, not the public IP. Both instances are in the same VCN subnet, so they communicate over the private network. This is faster, free (no egress charges), and more secure.

Back on the server node, verify both nodes are ready:

kubectl get nodes -o wide
# NAME         STATUS   ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP
# k3s-server   Ready    control-plane,master   5m      v1.31.4+k3s1   10.0.1.10     <public>
# k3s-agent    Ready    <none>                 30s     v1.31.4+k3s1   10.0.1.11     <public>

Two nodes. 4 OCPUs. 24GB RAM. Zero dollars.

Step 5: Configuring the OCI Load Balancer

OCI's Always Free tier includes a 10 Mbps Flexible Load Balancer. We will point it at our K3s nodes to route HTTP/HTTPS traffic to the Traefik ingress controller.

# Create the load balancer
LB_ID=$(oci lb load-balancer create \
    --compartment-id "$COMPARTMENT_ID" \
    --display-name "k3s-ingress-lb" \
    --shape-name "flexible" \
    --shape-details '{"minimumBandwidthInMbps":10,"maximumBandwidthInMbps":10}' \
    --subnet-ids "[\"$SUBNET_ID\"]" \
    --is-private false \
    --query 'data.id' --raw-output \
    --wait-for-state SUCCEEDED)

# Create a backend set with health check
oci lb backend-set create \
    --load-balancer-id "$LB_ID" \
    --name "k3s-backends" \
    --policy "ROUND_ROBIN" \
    --health-checker-protocol "TCP" \
    --health-checker-port 80 \
    --health-checker-interval-in-ms 10000 \
    --health-checker-timeout-in-ms 3000 \
    --health-checker-retries 3 \
    --wait-for-state SUCCEEDED

# Add both nodes as backends
oci lb backend create \
    --load-balancer-id "$LB_ID" \
    --backend-set-name "k3s-backends" \
    --ip-address "<SERVER_PRIVATE_IP>" \
    --port 80 \
    --wait-for-state SUCCEEDED

oci lb backend create \
    --load-balancer-id "$LB_ID" \
    --backend-set-name "k3s-backends" \
    --ip-address "<AGENT_PRIVATE_IP>" \
    --port 80 \
    --wait-for-state SUCCEEDED

# Create HTTP listener
oci lb listener create \
    --load-balancer-id "$LB_ID" \
    --name "http-listener" \
    --default-backend-set-name "k3s-backends" \
    --protocol "HTTP" \
    --port 80 \
    --wait-for-state SUCCEEDED

The 10 Mbps shape is Always Free. It is enough for development, personal projects, and moderate traffic. The load balancer gets its own public IP, which becomes your cluster's entry point.

Get the load balancer IP:

LB_IP=$(oci lb load-balancer get \
    --load-balancer-id "$LB_ID" \
    --query 'data."ip-addresses"[0]."ip-address"' --raw-output)
echo "Load Balancer IP: $LB_IP"

Step 6: Deploying a Test Workload

Let us deploy something real to verify the entire pipeline works:

# nginx-demo.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
  labels:
    app: nginx-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 100m
            memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-demo
spec:
  selector:
    app: nginx-demo
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-demo
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-demo
            port:
              number: 80

Apply it:

kubectl apply -f nginx-demo.yaml

# Watch the pods come up
kubectl get pods -w
# NAME                          READY   STATUS    RESTARTS   AGE
# nginx-demo-6d9f7c8b4-abc12   1/1     Running   0          10s
# nginx-demo-6d9f7c8b4-def34   1/1     Running   0          10s
# nginx-demo-6d9f7c8b4-ghi56   1/1     Running   0          10s

Three replicas spread across both nodes. Test it:

curl http://$LB_IP
# <!DOCTYPE html>
# <html>
# <head><title>Welcome to nginx!</title>...

Traffic flows: Internet → OCI Load Balancer → Traefik Ingress → nginx pods. All on free infrastructure.

Step 7: Persistent Storage with OCI Block Volumes

K3s includes the local-path storage provisioner by default, which creates volumes on the node's local disk. For Always Free instances, this works well since we have 200GB of boot volume.

# Verify the storage class exists
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY   AGE
# local-path (default)   rancher.io/local-path   Delete          10m

Test it with a PVC:

# pvc-test.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "echo 'Persistent storage works on OCI ARM' > /data/test.txt && cat /data/test.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc

kubectl apply -f pvc-test.yaml
kubectl logs pvc-test
# Persistent storage works on OCI ARM

For production workloads that need data to survive node replacement, consider the OCI CSI driver — but for Always Free instances, local-path is practical and simple.

Cluster Resource Usage

After deploying K3s with Traefik, CoreDNS, and the test workload, here is what the resource consumption looks like:

kubectl top nodes
# NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# k3s-server   180m         9%     1.2Gi           10%
# k3s-agent    95m          5%     780Mi           6%

The entire Kubernetes infrastructure — control plane, networking, DNS, ingress, and three nginx replicas — uses about 2GB of the available 24GB. That leaves 22GB free for your actual workloads.

For context, here is what fits comfortably:

Workload	CPU	Memory	Fits?
PostgreSQL	200m	512Mi	Yes
Redis	100m	256Mi	Yes
Go API server	100m	128Mi	Yes
Python Flask app	200m	256Mi	Yes
Grafana	100m	256Mi	Yes
Prometheus	200m	512Mi	Yes
Total	900m	1.9Gi	Easily

You could run a complete application stack — database, cache, API, monitoring — with room to spare.

Troubleshooting Common OCI + K3s Issues

I hit every one of these during my setup. Saving you the debugging time.

1. Pods stuck in ContainerCreating

Usually a Flannel networking issue. Check if VXLAN traffic (UDP 8472) is allowed in the security list and verify Flannel is using the correct interface:

journalctl -u k3s -f | grep flannel
# If you see "failed to find interface" — fix the --flannel-iface flag

2. Agent node shows NotReady

The agent cannot reach the server on port 6443. Verify the security list allows TCP 6443 from the VCN CIDR and that you used the private IP in K3S_URL:

# From the agent node
curl -k https://<SERVER_PRIVATE_IP>:6443
# Should return JSON (even if it says Unauthorized)

3. Ingress returns 404 for all routes

Traefik is running but not seeing your Ingress resources. Check Traefik logs:

kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

4. OCI Load Balancer shows backends as Critical

Health check is failing. Verify that Traefik is listening on port 80 on both nodes:

ss -tlnp | grep :80

5. Cannot pull container images

OCI instances need outbound internet access through a NAT gateway or Internet Gateway. Verify your route table has a default route to the Internet Gateway.

Security Hardening

For a cluster exposed to the internet, apply these minimum security measures:

# 1. Restrict API server access to your IP
# Update security list: change 6443 source from VCN to your specific IP

# 2. Create a non-root kubeconfig
kubectl create serviceaccount deploy-sa
kubectl create clusterrolebinding deploy-sa-binding \
    --clusterrole=edit --serviceaccount=default:deploy-sa

# 3. Enable Network Policies
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# 4. Set resource limits on all deployments (prevent noisy neighbors)
# 5. Use OCI Vault for Kubernetes secrets (covered in my earlier post)

Accessing kubectl from Your Laptop

Copy the kubeconfig from the server node to your local machine:

# From your laptop
scp opc@<SERVER_PUBLIC_IP>:/etc/rancher/k3s/k3s.yaml ~/.kube/oci-k3s-config

# Update the server address from 127.0.0.1 to the public IP
sed -i '' "s/127.0.0.1/<SERVER_PUBLIC_IP>/g" ~/.kube/oci-k3s-config

export KUBECONFIG=~/.kube/oci-k3s-config
kubectl get nodes

This works because we added --tls-san with the public IP during installation.

Cost Comparison

Setup	Monthly Cost	Nodes	RAM
OCI Always Free + K3s	$0	2	24 GB
EKS (t3.medium x2)	~$150	2	8 GB
GKE Autopilot (equivalent)	~$120	Auto	Auto
AKS (B2s x2)	~$65	2	8 GB
DigitalOcean K8s	~$48	2	4 GB
Civo K3s	~$40	2	4 GB

OCI gives you 3x the RAM of paid alternatives, for free. The trade-off is that you manage K3s yourself — no managed control plane. For learning, development, and personal projects, that trade-off is excellent.

What Can You Run on This Cluster?

This is not theoretical. Here are workloads I have tested on this exact setup:

Gitea (self-hosted Git) — 128Mi RAM, works perfectly
Drone CI (CI/CD) — 256Mi RAM, builds containers on ARM
PostgreSQL — 512Mi RAM, handles small-to-medium databases
Grafana + Prometheus — 768Mi combined, full monitoring stack
Go/Rust microservices — Under 64Mi each, ARM-native builds are fast
Static sites with Hugo — Trivial resources, served through Traefik

Conclusion

Oracle Cloud's Always Free ARM allocation is the best-kept secret in cloud computing for Kubernetes enthusiasts. 4 OCPUs, 24GB RAM, 200GB storage, a load balancer, and 10TB of outbound transfer — all free, permanently.

K3s is the perfect match for this hardware. It is lightweight, ARM-native, and production-tested. The combination gives you a Kubernetes cluster that would cost $100-150/month on any other provider.

The setup takes about 30 minutes from scratch, and the result is a cluster you can use for learning, development, CI/CD, or running personal projects. I have had mine running for weeks with zero issues.

Stop paying for Kubernetes clusters you use for development. OCI and K3s give you a better option.

All resources in this post use OCI Always Free tier. No charges will be incurred.

Tags: #OracleCloud #Kubernetes #K3s #ARM #OCI #AlwaysFree #CloudNative #DevOps #Containers

Why Your Kubernetes Cluster Breaks 18 Minutes After a Successful Deployment

Pavan Madduri — Sat, 07 Mar 2026 17:29:46 +0000

You merge the Pull Request. The CI/CD pipeline flashes green. ArgoCD reports that your application is "Synced" and "Healthy." You grab a coffee, thinking the deployment was a complete success.

Then, 18 minutes later, your pager goes off. The cluster is degraded, and users are experiencing errors. What just happened?

The Delay of Reactive Monitoring

This scenario is incredibly common in large-scale Kubernetes environments. The problem lies in how GitOps tools handle configuration drift. Tools like ArgoCD use continuous reconciliation loops, constantly comparing your Git manifests against the live cluster resources.

However, this is a reactive approach. It only discovers problems post-deployment. According to comprehensive production benchmarks (Madduri, 2024), traditional monitoring detects drift an average of 18 minutes after problematic deployments complete.

For 18 minutes, your system might have been starved of resources, stuck in a circular dependency, or suffering from a security policy breach. In a mission-critical platform, an 18-minute delay means dropped transactions and unhappy users.
(To see the exact performance metrics comparing reactive vs. proactive monitoring, review the full study here: [Google Scholar])

Closing the 18-Minute Gap

To fix this, we have to stop relying on monitoring tools to catch our mistakes. We need to verify our manifests mathematically during the continuous integration phase, before the deployment ever reaches the cluster.

By using formal verification, we can construct state transition models to explore every possible failure mode of a manifest. When this proactive approach was tested across 850 production applications, it reduced the mean time to detect drift from 18 minutes down to under 30 seconds. It represents a 36x improvement in detection speed, entirely eliminating the dangerous 18-minute window.

Stop waiting for your monitoring tools to tell you that your deployment failed. Prove that it will succeed before you ever click merge.