<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: sudhesh G</title>
    <description>The latest articles on Forem by sudhesh G (@sudhesh_g_b1ddefe9194fd09).</description>
    <link>https://forem.com/sudhesh_g_b1ddefe9194fd09</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2250119%2F8461747f-fb54-4cb2-a0f9-02a43d033f55.jpg</url>
      <title>Forem: sudhesh G</title>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sudhesh_g_b1ddefe9194fd09"/>
    <language>en</language>
    <item>
      <title>DevOps RealWorld Series#3 - Sudden increase in Cloud Bill - real incidents, real pain stories, real lessons.</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:55:38 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series3-sudden-increase-in-cloud-bill-real-incidents-real-pain-stories-real-105e</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series3-sudden-increase-in-cloud-bill-real-incidents-real-pain-stories-real-105e</guid>
      <description>&lt;p&gt;&lt;strong&gt;That One Debug Flag That Quietly Burned $4,200 in 48 Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At one of my previous organizations, I still remember our manager's reaction that day.&lt;br&gt;
Tuesday morning standup. He opened AWS Cost Explorer like he does every week. Scrolled down, stopped, and read it again. Then he looked up at the rest of us.&lt;br&gt;
"Why did our CloudWatch spend go from $180 last month to $4,200 in the last two days?"&lt;br&gt;
Complete silence.&lt;br&gt;
Nobody had a clue. And honestly, that silence was the most expensive part of the whole story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Normal Week, a Normal Hotfix&lt;/strong&gt;&lt;br&gt;
Nothing about that week felt unusual. We were running a mid-scale microservices platform on EKS - 24 services, a standard observability stack with Prometheus, Grafana, and Fluentd shipping logs to CloudWatch. The kind of setup that just hums along in the background.&lt;br&gt;
Two days before that standup, one of our backend engineers caught a real bug in the payments service and shipped a fix. Quick turnaround. Small PR. Good work, exactly the kind of ownership you want on your team.&lt;br&gt;
Nobody blamed him for what happened next. Not for a second.&lt;/p&gt;

&lt;p&gt;But buried inside that hotfix was a single line in the Helm values override, left over from a local debugging session - the kind of thing any of us could have done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PR was small. The reviewer was moving fast. The CI pipeline didn't care about env vars. It sailed through to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What DEBUG Actually Means When You're Handling 14k Requests a Minute&lt;/strong&gt;&lt;br&gt;
Here's the thing nobody tells you early enough in your career: &lt;strong&gt;DEBUG&lt;/strong&gt; in local dev and &lt;strong&gt;DEBUG&lt;/strong&gt; in production are completely different beasts.&lt;br&gt;
On your laptop, verbose logging is your friend. You see everything, trace your logic, fix the bug, move on.&lt;br&gt;
In production at scale? That same flag becomes a money printer — just pointed in the wrong direction.&lt;br&gt;
The payments service was handling roughly 14,000 requests per minute at peak. At INFO level, it emits maybe 3-4 log lines per request, around 50,000 log lines/minute total. Normal.&lt;br&gt;
At DEBUG level? Every internal function call gets logged. Every serialized object. Every DB query parameter. Every retry attempt. Everything.&lt;br&gt;
That same service was suddenly emitting ~380,000 log lines per minute.&lt;br&gt;
And Fluentd was doing its job perfectly, shipping every single one to CloudWatch Logs. It didn't warn us. It just kept shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers We Didn't Want to See&lt;/strong&gt;&lt;br&gt;
AWS CloudWatch Logs pricing (at the time):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: $0.50 per GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: $0.03 per GB/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;strong&gt;INFO&lt;/strong&gt;: ~50,000 lines/min × ~400 bytes avg = ~20 MB/min → ~28 GB/day&lt;br&gt;
At &lt;strong&gt;DEBUG&lt;/strong&gt;: ~380,000 lines/min × ~600 bytes avg = ~228 MB/min → ~320 GB/day&lt;br&gt;
That's an &lt;strong&gt;11x multiplier&lt;/strong&gt; on log volume. Overnight.&lt;br&gt;
Over 48 hours: ~640 GB of log data. At $0.50/GB ingestion, that's $320. Painful, but not $4,200.&lt;br&gt;
So where did the rest go? This is the part that really stung.&lt;br&gt;
Our on-call dashboards were running 6 Log Insights queries every 60 seconds, scanning the last 15 minutes of logs for anomalies. Log Insights charges $0.005 per GB scanned. At normal volumes, each query scanned ~0.4 GB. Fine. Cheap.&lt;br&gt;
But now each query was chewing through ~4.8 GB per run.&lt;br&gt;
6 queries × 4.8 GB × 1,440 runs/day × $0.005 = &lt;strong&gt;~$207/day&lt;/strong&gt; — just from dashboards refreshing. Dashboards that nobody was even watching overnight.&lt;br&gt;
Add it all up over 48 hours: $4,200.&lt;br&gt;
All from one env var.&lt;/p&gt;
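&lt;p&gt;The arithmetic above is easy to sanity-check. Here's a quick sketch using the post's own figures (line rates, per-line sizes, and prices are the estimates quoted above, not measured values):&lt;/p&gt;

```python
# Back-of-envelope CloudWatch math from the incident, using the post's figures.
GB = 1e9
INGEST_PER_GB = 0.50   # CloudWatch Logs ingestion, USD/GB (pricing at the time)
SCAN_PER_GB = 0.005    # Log Insights, USD/GB scanned

def daily_ingest_gb(lines_per_min, bytes_per_line):
    """Log volume in GB/day for a given emit rate and average line size."""
    return lines_per_min * bytes_per_line * 60 * 24 / GB

info_gb = daily_ingest_gb(50_000, 400)     # ~29 GB/day at INFO
debug_gb = daily_ingest_gb(380_000, 600)   # ~328 GB/day at DEBUG

# Dashboards: 6 Log Insights queries, one run per minute, each scanning ~4.8 GB.
dashboard_daily = 6 * 4.8 * 1440 * SCAN_PER_GB

print(round(info_gb, 1), round(debug_gb, 1), round(dashboard_daily))
# prints: 28.8 328.3 207
```

&lt;p&gt;The point of writing it down: the dashboard scan cost scales with ingest volume too, so one flag multiplied two line items at once.&lt;/p&gt;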

&lt;p&gt;&lt;strong&gt;How We Actually Found It&lt;/strong&gt;&lt;br&gt;
No fancy tooling. No AI-powered anomaly detection. Just AWS Cost Explorer with daily granularity, filtered by service.&lt;br&gt;
When we drilled into CloudWatch → Log Ingestion, the spike looked like someone had drawn a vertical wall on the graph starting at the exact minute of that deployment.&lt;br&gt;
From there it took maybe 3 minutes. Sorted services by log volume in the CloudWatch console. Payments-service was sitting at the top, ingesting 20x more than anything else. Ran one command:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;payments-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;LOG_LEVEL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DEBUG&lt;/strong&gt;.&lt;br&gt;
Two days of mystery, solved in three minutes once we knew where to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix and the Harder Conversation After&lt;/strong&gt;&lt;br&gt;
The immediate fix was almost embarrassingly simple. Redeployed with &lt;strong&gt;LOG_LEVEL=INFO&lt;/strong&gt;. Volume dropped back to normal within 2 minutes.&lt;br&gt;
But we sat with the harder question for a while: how did our platform let this happen without a single alert, a single warning, a single anything?&lt;br&gt;
That conversation led to four changes we shipped the following week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log level lives in a ConfigMap now, not Helm values&lt;/strong&gt;
LOG_LEVEL is environment-aware: &lt;strong&gt;INFO&lt;/strong&gt; in staging and prod, &lt;strong&gt;DEBUG&lt;/strong&gt; in dev. No overrides allowed in prod values files. You can't accidentally ship this anymore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fluentd has a circuit breaker&lt;/strong&gt;
We added a throttle filter: any single source exceeding 100,000 lines/minute gets sampled at 10%. You lose some data in a flood. That's a trade-off we're completely okay with.
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;filter **&amp;gt;
  @type throttle
  group_key $.kubernetes.pod_name
  group_bucket_period_s 60
  group_max_rate_per_bucket 100000
  drop_logs false
  group_drop_logs true
&amp;lt;/filter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A billing alarm that actually fires&lt;/strong&gt;
I'm still a bit embarrassed we didn't have one. SNS → PagerDuty, firing if daily CloudWatch spend crosses $50. If something spikes, we hear about it in hours, not days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One checkbox on every PR: "Does this change env vars in prod?"&lt;/strong&gt;
Three seconds to read. Would have caught this entirely.&lt;/li&gt;
&lt;/ol&gt;
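&lt;p&gt;Change #1 might look something like this - a minimal sketch, assuming one ConfigMap per environment overlay (the names here are illustrative, not our actual manifests):&lt;/p&gt;

```yaml
# Environment-scoped ConfigMap; the prod and staging overlays only ever
# ship "INFO", so there is nothing for a stray Helm override to flip.
apiVersion: v1
kind: ConfigMap
metadata:
  name: payments-logging   # illustrative name
data:
  LOG_LEVEL: "INFO"        # the dev overlay is the only place "DEBUG" appears
```

&lt;p&gt;The container then picks the value up via envFrom/configMapRef in the pod spec, instead of a literal env entry in values files.&lt;/p&gt;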

&lt;p&gt;&lt;strong&gt;What I Actually Took Away From This&lt;/strong&gt;&lt;br&gt;
The debug flag wasn't the real problem. It was a symptom.&lt;br&gt;
The real problem was that we'd built a platform with &lt;strong&gt;no opinion on log volume&lt;/strong&gt;. We gave every service a direct firehose to CloudWatch and trusted that everyone would be careful with it.&lt;br&gt;
That's not a platform design. That's hope.&lt;br&gt;
At a certain scale, hope isn't a strategy. Your observability pipeline needs guardrails just as much as your application code does.&lt;br&gt;
We got off relatively easy at $4,200. I've heard stories with another zero on the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The damage&lt;/strong&gt;: $4,200 over 48 hours&lt;br&gt;
&lt;strong&gt;The fix time&lt;/strong&gt;: 4 minutes once identified&lt;br&gt;
&lt;strong&gt;The detection time&lt;/strong&gt;: 2 days&lt;br&gt;
&lt;strong&gt;The real cost&lt;/strong&gt;: those 2 days of not knowing&lt;br&gt;
Has something like this happened to you? Drop it in the comments and I genuinely want to hear how it went down.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>cloudwatch</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps RealWorld Series #2 Ephemeral Storage: The Hidden Cause of CI/CD Failures in Kubernetes</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Thu, 26 Feb 2026 17:34:52 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series-2ephemeral-storage-the-hidden-cause-of-cicd-failures-in-kubernetes-33nd</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series-2ephemeral-storage-the-hidden-cause-of-cicd-failures-in-kubernetes-33nd</guid>
      <description>&lt;p&gt;We encountered a strange issue where build agents were failing randomly While running Jenkins pipelines on Kubernetes.&lt;br&gt;
Agent pods would start normally and builds would run successfully for some time. Then the pods would terminate unexpectedly, causing pipeline failures.&lt;br&gt;
Initially, we investigated:&lt;br&gt;
• Jenkins logs&lt;br&gt;
• Pipeline configuration&lt;br&gt;
• Docker build stages&lt;br&gt;
• SonarQube scans&lt;br&gt;
Everything looked normal.&lt;/p&gt;

&lt;p&gt;The real cause became clear after inspecting pod events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jenkins-agent-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Events section showed:&lt;br&gt;
Evicted&lt;br&gt;
The node was low on resource: &lt;strong&gt;ephemeral-storage&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;What Consumes Ephemeral Storage in CI Agents?&lt;/strong&gt;&lt;br&gt;
CI agent pods often consume more disk than expected due to:&lt;br&gt;
• Docker image layers&lt;br&gt;
• Dependency downloads&lt;br&gt;
• Temporary build files&lt;br&gt;
• Test artifacts&lt;br&gt;
• Coverage reports&lt;br&gt;
• SonarQube cache&lt;br&gt;
• Package manager caches&lt;br&gt;
Unlike CPU and memory, ephemeral storage is frequently ignored in resource configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Does This Cause Pipeline Failures?&lt;/strong&gt;&lt;br&gt;
When ephemeral storage usage exceeds node capacity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes marks the node under disk pressure&lt;/li&gt;
&lt;li&gt;Pods get evicted&lt;/li&gt;
&lt;li&gt;Jenkins agents disappear&lt;/li&gt;
&lt;li&gt;Pipelines fail unexpectedly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since Jenkins does not clearly indicate storage-related failures, this often looks like a Jenkins or pipeline problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;&lt;br&gt;
We resolved the issue by explicitly defining ephemeral storage resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: 500m
    memory: 1Gi
    ephemeral-storage: 4Gi
  limits:
    cpu: 2
    memory: 4Gi
    ephemeral-storage: 10Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
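&lt;p&gt;Requests and limits cap what the pod as a whole may use; Kubernetes can also cap a scratch volume directly via the standard emptyDir sizeLimit field. A complementary sketch (the volume name is illustrative):&lt;/p&gt;

```yaml
# Give the build workspace its own ceiling: when this volume exceeds the
# limit, only this pod is evicted, rather than the whole node going into
# disk pressure and taking other agents with it.
volumes:
  - name: build-workspace   # illustrative name
    emptyDir:
      sizeLimit: 10Gi
```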



&lt;p&gt;Additional improvements included:&lt;br&gt;
• Cleaning the workspace after builds&lt;br&gt;
• Splitting heavy pipelines across separate agents&lt;br&gt;
• Increasing node storage capacity&lt;br&gt;
• Reducing artifact retention&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;&lt;br&gt;
If Jenkins agents fail randomly in Kubernetes, always check pod events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jenkins-agent-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ephemeral storage exhaustion is one of the most common but overlooked causes of CI/CD instability.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>jenkins</category>
      <category>cicd</category>
    </item>
    <item>
      <title>DevOps RealWorld Series #1 --&gt; Jenkins Pipelines Colliding on the Same Kubernetes Agent Pod</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Mon, 16 Feb 2026 17:33:47 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-real-world-series-1-jenkins-pipelines-colliding-on-the-same-kubernetes-agent-pod-d09</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-real-world-series-1-jenkins-pipelines-colliding-on-the-same-kubernetes-agent-pod-d09</guid>
      <description>&lt;p&gt;We recently hit a strange CI/CD failure pattern in our Kubernetes-based Jenkins setup. Under parallel load, multiple pipelines were triggered but builds started failing randomly after the first job completed.&lt;/p&gt;

&lt;p&gt;At first glance, it looked like Jenkins instability. It wasn’t.&lt;/p&gt;

&lt;p&gt;The real issue was Kubernetes pod scheduling and node resource pressure.&lt;/p&gt;

&lt;p&gt;This post walks through the symptoms, investigation, root cause, and the production fix that stabilized our pipelines.&lt;/p&gt;

&lt;p&gt;🧩 &lt;strong&gt;Environment Context&lt;/strong&gt;&lt;br&gt;
Our setup looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jenkins running inside Kubernetes&lt;/li&gt;
&lt;li&gt;Dynamic Kubernetes agents created per pipeline&lt;/li&gt;
&lt;li&gt;Shared agent template across pipelines&lt;/li&gt;
&lt;li&gt;Parallel builds enabled&lt;/li&gt;
&lt;li&gt;Nodes with limited ephemeral storage&lt;/li&gt;
&lt;li&gt;No pod spread rules defined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expected behavior:&lt;br&gt;
Each pipeline → separate agent pod → isolated execution.&lt;/p&gt;

&lt;p&gt;Actual behavior:&lt;br&gt;
Pipelines indirectly collided due to scheduling concentration.&lt;/p&gt;

&lt;p&gt;🚨 &lt;strong&gt;Symptoms&lt;/strong&gt;&lt;br&gt;
When multiple pipelines ran at the same time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only one agent pod appeared initially&lt;/li&gt;
&lt;li&gt;First pipeline completed successfully&lt;/li&gt;
&lt;li&gt;Agent pod terminated afterward&lt;/li&gt;
&lt;li&gt;Other pipelines failed waiting for executors&lt;/li&gt;
&lt;li&gt;Jenkins logs showed executor loss&lt;/li&gt;
&lt;li&gt;Kubernetes events showed resource pressure&lt;/li&gt;
&lt;li&gt;Agent pods repeatedly scheduled onto the same node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made Jenkins look unstable, but Jenkins was not the failing layer.&lt;/p&gt;

&lt;p&gt;🔍 &lt;strong&gt;Investigation Steps&lt;/strong&gt;&lt;br&gt;
We verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jenkins executor configuration ✅&lt;/li&gt;
&lt;li&gt;Kubernetes plugin pod templates ✅&lt;/li&gt;
&lt;li&gt;Pipeline definitions ✅&lt;/li&gt;
&lt;li&gt;Agent provisioning logs ✅&lt;/li&gt;
&lt;li&gt;Pod lifecycle events ✅&lt;/li&gt;
&lt;li&gt;Node describe output ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key observation:&lt;br&gt;
Agent pods were consistently landing on the same node under load.&lt;/p&gt;

&lt;p&gt;That node showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral storage pressure&lt;/li&gt;
&lt;li&gt;Resource exhaustion warnings&lt;/li&gt;
&lt;li&gt;Pod eviction events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pods were being created correctly but placed poorly.&lt;/p&gt;

&lt;p&gt;⚙ &lt;strong&gt;Root Cause&lt;/strong&gt;&lt;br&gt;
No scheduling distribution rules were defined for Jenkins agent pods.&lt;br&gt;
Kubernetes scheduler packed multiple agent pods onto the same node.&lt;/p&gt;

&lt;p&gt;That caused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid ephemeral storage consumption&lt;/li&gt;
&lt;li&gt;Node pressure conditions&lt;/li&gt;
&lt;li&gt;Pod termination after first job&lt;/li&gt;
&lt;li&gt;Waiting pipelines losing executors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was a pod placement problem, not a Jenkins provisioning problem.&lt;/p&gt;

&lt;p&gt;🛠 &lt;strong&gt;Fix Implemented&lt;/strong&gt;&lt;br&gt;
We updated the Jenkins agent pod template to include topology spread constraints and better resource sizing.&lt;/p&gt;

&lt;p&gt;Added topology spread constraints&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: jenkins-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
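&lt;p&gt;A similar effect can also be expressed with pod anti-affinity, which may fit better if you prefer soft affinity rules over spread constraints. A sketch using standard Kubernetes fields (the app: jenkins-agent label matches the template above):&lt;/p&gt;

```yaml
# Soft anti-affinity: prefer nodes that don't already run a jenkins-agent
# pod, but still schedule somewhere if no such node exists (mirrors
# whenUnsatisfiable: ScheduleAnyway in the spread constraint).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: jenkins-agent
          topologyKey: kubernetes.io/hostname
```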



&lt;p&gt;&lt;strong&gt;Resource tuning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased ephemeral storage limits&lt;/li&gt;
&lt;li&gt;Increased CPU &amp;amp; memory requests&lt;/li&gt;
&lt;li&gt;Prevented node-level overload from agent bursts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ &lt;strong&gt;Result After Fix&lt;/strong&gt;&lt;br&gt;
After rollout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent pods distributed across nodes&lt;/li&gt;
&lt;li&gt;No repeated scheduling concentration&lt;/li&gt;
&lt;li&gt;No executor loss after first pipeline&lt;/li&gt;
&lt;li&gt;Stable parallel builds&lt;/li&gt;
&lt;li&gt;No unexpected agent pod termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CI behavior became predictable again.&lt;/p&gt;

&lt;p&gt;🎯 &lt;strong&gt;Key Lesson&lt;/strong&gt;&lt;br&gt;
When Jenkins pipelines fail under parallel load:&lt;br&gt;
Do not inspect Jenkins alone.&lt;br&gt;
Also check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod scheduling patterns&lt;/li&gt;
&lt;li&gt;Node resource pressure&lt;/li&gt;
&lt;li&gt;Ephemeral storage limits&lt;/li&gt;
&lt;li&gt;Pod distribution rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor pod spread can look exactly like CI instability.&lt;/p&gt;

&lt;p&gt;🔜 &lt;strong&gt;Next in This Series&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Real-World Series #2 — Ephemeral Storage: The Silent CI/CD Pipeline Killer&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>jenkins</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
