<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: sudhesh G</title>
    <description>The latest articles on Forem by sudhesh G (@sudhesh_g_b1ddefe9194fd09).</description>
    <link>https://forem.com/sudhesh_g_b1ddefe9194fd09</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2250119%2F8461747f-fb54-4cb2-a0f9-02a43d033f55.jpg</url>
      <title>Forem: sudhesh G</title>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sudhesh_g_b1ddefe9194fd09"/>
    <language>en</language>
    <item>
      <title>DevOps RealWorld Series#3 - Sudden increase in Cloud Bill - real incidents, real pain stories, real lessons.</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:55:38 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series3-sudden-increase-in-cloud-bill-real-incidents-real-pain-stories-real-105e</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series3-sudden-increase-in-cloud-bill-real-incidents-real-pain-stories-real-105e</guid>
      <description>&lt;p&gt;&lt;strong&gt;That One Debug Flag That Quietly Burned $4,200 in 48 Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At one of my previous organizations, I still remember our manager's reaction that day.&lt;br&gt;
Tuesday morning standup. He opened AWS Cost Explorer like he does every week. Scrolled down, stopped, and read it again. Then he looked up at the rest of us.&lt;br&gt;
"Why did our CloudWatch spend go from $180 last month to $4,200 in the last two days?"&lt;br&gt;
Complete silence.&lt;br&gt;
Nobody had a clue. And honestly, that silence was the most expensive part of the whole story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Normal Week, a Normal Hotfix&lt;/strong&gt;&lt;br&gt;
Nothing about that week felt unusual. We were running a mid-scale microservices platform on EKS - 24 services, a standard observability stack with Prometheus, Grafana, and Fluentd shipping logs to CloudWatch. The kind of setup that just hums along in the background.&lt;br&gt;
Two days before that standup, one of our backend engineers caught a real bug in the payments service and shipped a fix. Quick turnaround. Small PR. Good work, exactly the kind of ownership you want on your team.&lt;br&gt;
Nobody blamed him for what happened next. Not for a second.&lt;/p&gt;

&lt;p&gt;But buried inside that hotfix was a single line in the Helm values override, left over from a local debugging session - the kind of thing any of us could have done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PR was small. The reviewer was moving fast. The CI pipeline didn't care about env vars. It sailed through to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What DEBUG Actually Means When You're Handling 14k Requests a Minute&lt;/strong&gt;&lt;br&gt;
Here's the thing nobody tells you early enough in your career: &lt;strong&gt;DEBUG&lt;/strong&gt; in local dev and &lt;strong&gt;DEBUG&lt;/strong&gt; in production are completely different beasts.&lt;br&gt;
On your laptop, verbose logging is your friend. You see everything, trace your logic, fix the bug, move on.&lt;br&gt;
In production at scale? That same flag becomes a money printer — just pointed in the wrong direction.&lt;br&gt;
The payments service was handling roughly 14,000 requests per minute at peak. At INFO level, it emits maybe 3-4 log lines per request, around 50,000 log lines/minute total. Normal.&lt;br&gt;
At DEBUG level? Every internal function call gets logged. Every serialized object. Every DB query parameter. Every retry attempt. Everything.&lt;br&gt;
That same service was suddenly emitting ~380,000 log lines per minute.&lt;br&gt;
And Fluentd was doing its job perfectly, shipping every single one to CloudWatch Logs. It didn't warn us. It just kept shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers We Didn't Want to See&lt;/strong&gt;&lt;br&gt;
AWS CloudWatch Logs pricing (at the time):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: $0.50 per GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: $0.03 per GB/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;strong&gt;INFO&lt;/strong&gt;: ~50,000 lines/min × ~400 bytes avg = ~20 MB/min → ~28 GB/day&lt;br&gt;
At &lt;strong&gt;DEBUG&lt;/strong&gt;: ~380,000 lines/min × ~600 bytes avg = ~228 MB/min → ~320 GB/day&lt;br&gt;
That's an &lt;strong&gt;11x multiplier&lt;/strong&gt; on log volume. Overnight.&lt;br&gt;
Over 48 hours: ~640 GB of log data. At $0.50/GB ingestion, that's $320. Painful, but not $4,200.&lt;br&gt;
So where did the rest go? This is the part that really stung.&lt;br&gt;
Our on-call dashboards were running 6 Log Insights queries every 60 seconds, scanning the last 15 minutes of logs for anomalies. Log Insights charges $0.005 per GB scanned. At normal volumes, each query scanned ~0.4 GB. Fine. Cheap.&lt;br&gt;
But now each query was chewing through ~4.8 GB per run.&lt;br&gt;
6 queries × 4.8 GB × 1,440 runs/day × $0.005 = &lt;strong&gt;~$207/day&lt;/strong&gt; — just from dashboards refreshing. Dashboards that nobody was even watching overnight.&lt;br&gt;
Add it all up over 48 hours: $4,200.&lt;br&gt;
All from one env var.&lt;/p&gt;
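&lt;p&gt;The arithmetic above is easy to sanity-check. Here's a quick sketch using the post's own figures (line rates, per-line sizes, and prices are the estimates quoted above, not measured values):&lt;/p&gt;

```python
# Back-of-envelope CloudWatch math from the incident, using the post's figures.
GB = 1e9
INGEST_PER_GB = 0.50   # CloudWatch Logs ingestion, USD/GB (pricing at the time)
SCAN_PER_GB = 0.005    # Log Insights, USD/GB scanned

def daily_ingest_gb(lines_per_min, bytes_per_line):
    """Log volume in GB/day for a given emit rate and average line size."""
    return lines_per_min * bytes_per_line * 60 * 24 / GB

info_gb = daily_ingest_gb(50_000, 400)     # ~29 GB/day at INFO
debug_gb = daily_ingest_gb(380_000, 600)   # ~328 GB/day at DEBUG

# Dashboards: 6 Log Insights queries, one run per minute, each scanning ~4.8 GB.
dashboard_daily = 6 * 4.8 * 1440 * SCAN_PER_GB

print(round(info_gb, 1), round(debug_gb, 1), round(dashboard_daily))
# prints: 28.8 328.3 207
```

&lt;p&gt;The point of writing it down: the dashboard scan cost scales with ingest volume too, so one flag multiplied two line items at once.&lt;/p&gt;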

&lt;p&gt;&lt;strong&gt;How We Actually Found It&lt;/strong&gt;&lt;br&gt;
No fancy tooling. No AI-powered anomaly detection. Just AWS Cost Explorer with daily granularity, filtered by service.&lt;br&gt;
When we drilled into CloudWatch → Log Ingestion, the spike looked like someone had drawn a vertical wall on the graph starting at the exact minute of that deployment.&lt;br&gt;
From there it took maybe 3 minutes. Sorted services by log volume in the CloudWatch console. Payments-service was sitting at the top, ingesting 20x more than anything else. Ran one command:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;payments-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;LOG_LEVEL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DEBUG&lt;/strong&gt;.&lt;br&gt;
Two days of mystery, solved in three minutes once we knew where to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix and the Harder Conversation After&lt;/strong&gt;&lt;br&gt;
The immediate fix was almost embarrassingly simple. Redeployed with &lt;strong&gt;LOG_LEVEL=INFO&lt;/strong&gt;. Volume dropped back to normal within 2 minutes.&lt;br&gt;
But we sat with the harder question for a while: how did our platform let this happen without a single alert, a single warning, a single anything?&lt;br&gt;
That conversation led to four changes we shipped the following week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log level lives in a ConfigMap now, not Helm values&lt;/strong&gt;
LOG_LEVEL is environment-aware: &lt;strong&gt;INFO&lt;/strong&gt; in staging and prod, &lt;strong&gt;DEBUG&lt;/strong&gt; in dev. No overrides allowed in prod values files. You can't accidentally ship this anymore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fluentd has a circuit breaker&lt;/strong&gt;
We added a throttle filter: any single source exceeding 100,000 lines/minute gets sampled at 10%. You lose some data in a flood. That's a trade-off we're completely okay with.
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;filter **&amp;gt;
  @type throttle
  group_key $.kubernetes.pod_name
  group_bucket_period_s 60
  group_max_rate_per_bucket 100000
  drop_logs false
  group_drop_logs true
&amp;lt;/filter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A billing alarm that actually fires&lt;/strong&gt;
I'm still a bit embarrassed we didn't have one. SNS → PagerDuty, firing if daily CloudWatch spend crosses $50. If something spikes, we hear about it in hours, not days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One checkbox on every PR: "Does this change env vars in prod?"&lt;/strong&gt;
Three seconds to read. Would have caught this entirely.&lt;/li&gt;
&lt;/ol&gt;
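&lt;p&gt;Change #1 might look something like this - a minimal sketch, assuming one ConfigMap per environment overlay (the names here are illustrative, not our actual manifests):&lt;/p&gt;

```yaml
# Environment-scoped ConfigMap; the prod and staging overlays only ever
# ship "INFO", so there is nothing for a stray Helm override to flip.
apiVersion: v1
kind: ConfigMap
metadata:
  name: payments-logging   # illustrative name
data:
  LOG_LEVEL: "INFO"        # the dev overlay is the only place "DEBUG" appears
```

&lt;p&gt;The container then picks the value up via envFrom/configMapRef in the pod spec, instead of a literal env entry in values files.&lt;/p&gt;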

&lt;p&gt;&lt;strong&gt;What I Actually Took Away From This&lt;/strong&gt;&lt;br&gt;
The debug flag wasn't the real problem. It was a symptom.&lt;br&gt;
The real problem was that we'd built a platform with &lt;strong&gt;no opinion on log volume&lt;/strong&gt;. We gave every service a direct firehose to CloudWatch and trusted that everyone would be careful with it.&lt;br&gt;
That's not a platform design. That's hope.&lt;br&gt;
At a certain scale, hope isn't a strategy. Your observability pipeline needs guardrails just as much as your application code does.&lt;br&gt;
We got off relatively easy at $4,200. I've heard stories with another zero on the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The damage&lt;/strong&gt;: $4,200 over 48 hours&lt;br&gt;
&lt;strong&gt;The fix time&lt;/strong&gt;: 4 minutes once identified&lt;br&gt;
&lt;strong&gt;The detection time&lt;/strong&gt;: 2 days&lt;br&gt;
&lt;strong&gt;The real cost&lt;/strong&gt;: those 2 days of not knowing&lt;br&gt;
Has something like this happened to you? Drop it in the comments and I genuinely want to hear how it went down.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>cloudwatch</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps RealWorld Series #2 Ephemeral Storage: The Hidden Cause of CI/CD Failures in Kubernetes</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Thu, 26 Feb 2026 17:34:52 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series-2ephemeral-storage-the-hidden-cause-of-cicd-failures-in-kubernetes-33nd</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-realworld-series-2ephemeral-storage-the-hidden-cause-of-cicd-failures-in-kubernetes-33nd</guid>
      <description>&lt;p&gt;We encountered a strange issue where build agents were failing randomly While running Jenkins pipelines on Kubernetes.&lt;br&gt;
Agent pods would start normally and builds would run successfully for some time. Then the pods would terminate unexpectedly, causing pipeline failures.&lt;br&gt;
Initially, we investigated:&lt;br&gt;
• Jenkins logs&lt;br&gt;
• Pipeline configuration&lt;br&gt;
• Docker build stages&lt;br&gt;
• SonarQube scans&lt;br&gt;
Everything looked normal.&lt;/p&gt;

&lt;p&gt;The real cause became clear after inspecting pod events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jenkins-agent-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Events section showed:&lt;br&gt;
Evicted&lt;br&gt;
The node was low on resource: &lt;strong&gt;ephemeral-storage&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;What Consumes Ephemeral Storage in CI Agents?&lt;/strong&gt;&lt;br&gt;
CI agent pods often consume more disk than expected due to:&lt;br&gt;
• Docker image layers&lt;br&gt;
• Dependency downloads&lt;br&gt;
• Temporary build files&lt;br&gt;
• Test artifacts&lt;br&gt;
• Coverage reports&lt;br&gt;
• SonarQube cache&lt;br&gt;
• Package manager caches&lt;br&gt;
Unlike CPU and memory, ephemeral storage is frequently ignored in resource configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Does This Cause Pipeline Failures?&lt;/strong&gt;&lt;br&gt;
When ephemeral storage usage exceeds node capacity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes marks the node under disk pressure&lt;/li&gt;
&lt;li&gt;Pods get evicted&lt;/li&gt;
&lt;li&gt;Jenkins agents disappear&lt;/li&gt;
&lt;li&gt;Pipelines fail unexpectedly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since Jenkins does not clearly indicate storage-related failures, this often looks like a Jenkins or pipeline problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;&lt;br&gt;
We resolved the issue by explicitly defining ephemeral storage resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: 500m
    memory: 1Gi
    ephemeral-storage: 4Gi
  limits:
    cpu: 2
    memory: 4Gi
    ephemeral-storage: 10Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
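&lt;p&gt;Requests and limits cap what the pod as a whole may use; Kubernetes can also cap a scratch volume directly via the standard emptyDir sizeLimit field. A complementary sketch (the volume name is illustrative):&lt;/p&gt;

```yaml
# Give the build workspace its own ceiling: when this volume exceeds the
# limit, only this pod is evicted, rather than the whole node going into
# disk pressure and taking other agents with it.
volumes:
  - name: build-workspace   # illustrative name
    emptyDir:
      sizeLimit: 10Gi
```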



&lt;p&gt;Additional improvements included:&lt;br&gt;
• Cleaning the workspace after builds&lt;br&gt;
• Splitting heavy pipelines across separate agents&lt;br&gt;
• Increasing node storage capacity&lt;br&gt;
• Reducing artifact retention&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;&lt;br&gt;
If Jenkins agents fail randomly in Kubernetes, always check pod events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jenkins-agent-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ephemeral storage exhaustion is one of the most common but overlooked causes of CI/CD instability.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>jenkins</category>
      <category>cicd</category>
    </item>
    <item>
      <title>DevOps RealWorld Series #1 --&gt; Jenkins Pipelines Colliding on the Same Kubernetes Agent Pod</title>
      <dc:creator>sudhesh G</dc:creator>
      <pubDate>Mon, 16 Feb 2026 17:33:47 +0000</pubDate>
      <link>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-real-world-series-1-jenkins-pipelines-colliding-on-the-same-kubernetes-agent-pod-d09</link>
      <guid>https://forem.com/sudhesh_g_b1ddefe9194fd09/devops-real-world-series-1-jenkins-pipelines-colliding-on-the-same-kubernetes-agent-pod-d09</guid>
      <description>&lt;p&gt;We recently hit a strange CI/CD failure pattern in our Kubernetes-based Jenkins setup. Under parallel load, multiple pipelines were triggered but builds started failing randomly after the first job completed.&lt;/p&gt;

&lt;p&gt;At first glance, it looked like Jenkins instability. It wasn’t.&lt;/p&gt;

&lt;p&gt;The real issue was Kubernetes pod scheduling and node resource pressure.&lt;/p&gt;

&lt;p&gt;This post walks through the symptoms, investigation, root cause, and the production fix that stabilized our pipelines.&lt;/p&gt;

&lt;p&gt;🧩 &lt;strong&gt;Environment Context&lt;/strong&gt;&lt;br&gt;
Our setup looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jenkins running inside Kubernetes&lt;/li&gt;
&lt;li&gt;Dynamic Kubernetes agents created per pipeline&lt;/li&gt;
&lt;li&gt;Shared agent template across pipelines&lt;/li&gt;
&lt;li&gt;Parallel builds enabled&lt;/li&gt;
&lt;li&gt;Nodes with limited ephemeral storage&lt;/li&gt;
&lt;li&gt;No pod spread rules defined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expected behavior:&lt;br&gt;
Each pipeline → separate agent pod → isolated execution.&lt;/p&gt;

&lt;p&gt;Actual behavior:&lt;br&gt;
Pipelines indirectly collided due to scheduling concentration.&lt;/p&gt;

&lt;p&gt;🚨 &lt;strong&gt;Symptoms&lt;/strong&gt;&lt;br&gt;
When multiple pipelines ran at the same time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only one agent pod appeared initially&lt;/li&gt;
&lt;li&gt;First pipeline completed successfully&lt;/li&gt;
&lt;li&gt;Agent pod terminated afterward&lt;/li&gt;
&lt;li&gt;Other pipelines failed waiting for executors&lt;/li&gt;
&lt;li&gt;Jenkins logs showed executor loss&lt;/li&gt;
&lt;li&gt;Kubernetes events showed resource pressure&lt;/li&gt;
&lt;li&gt;Agent pods repeatedly scheduled onto the same node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made Jenkins look unstable, but Jenkins was not the failing layer.&lt;/p&gt;

&lt;p&gt;🔍 &lt;strong&gt;Investigation Steps&lt;/strong&gt;&lt;br&gt;
We verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jenkins executor configuration ✅&lt;/li&gt;
&lt;li&gt;Kubernetes plugin pod templates ✅&lt;/li&gt;
&lt;li&gt;Pipeline definitions ✅&lt;/li&gt;
&lt;li&gt;Agent provisioning logs ✅&lt;/li&gt;
&lt;li&gt;Pod lifecycle events ✅&lt;/li&gt;
&lt;li&gt;Node describe output ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key observation:&lt;br&gt;
Agent pods were consistently landing on the same node under load.&lt;/p&gt;

&lt;p&gt;That node showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral storage pressure&lt;/li&gt;
&lt;li&gt;Resource exhaustion warnings&lt;/li&gt;
&lt;li&gt;Pod eviction events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pods were being created correctly but placed poorly.&lt;/p&gt;

&lt;p&gt;⚙ &lt;strong&gt;Root Cause&lt;/strong&gt;&lt;br&gt;
No scheduling distribution rules were defined for Jenkins agent pods.&lt;br&gt;
Kubernetes scheduler packed multiple agent pods onto the same node.&lt;/p&gt;

&lt;p&gt;That caused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid ephemeral storage consumption&lt;/li&gt;
&lt;li&gt;Node pressure conditions&lt;/li&gt;
&lt;li&gt;Pod termination after first job&lt;/li&gt;
&lt;li&gt;Waiting pipelines losing executors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was a pod placement problem, not a Jenkins provisioning problem.&lt;/p&gt;

&lt;p&gt;🛠 &lt;strong&gt;Fix Implemented&lt;/strong&gt;&lt;br&gt;
We updated the Jenkins agent pod template to include topology spread constraints and better resource sizing.&lt;/p&gt;

&lt;p&gt;Added topology spread constraints&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: jenkins-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
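&lt;p&gt;A similar effect can also be expressed with pod anti-affinity, which may fit better if you prefer soft affinity rules over spread constraints. A sketch using standard Kubernetes fields (the app: jenkins-agent label matches the template above):&lt;/p&gt;

```yaml
# Soft anti-affinity: prefer nodes that don't already run a jenkins-agent
# pod, but still schedule somewhere if no such node exists (mirrors
# whenUnsatisfiable: ScheduleAnyway in the spread constraint).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: jenkins-agent
          topologyKey: kubernetes.io/hostname
```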



&lt;p&gt;&lt;strong&gt;Resource tuning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased ephemeral storage limits&lt;/li&gt;
&lt;li&gt;Increased CPU &amp;amp; memory requests&lt;/li&gt;
&lt;li&gt;Prevented node-level overload from agent bursts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ &lt;strong&gt;Result After Fix&lt;/strong&gt;&lt;br&gt;
After rollout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent pods distributed across nodes&lt;/li&gt;
&lt;li&gt;No repeated scheduling concentration&lt;/li&gt;
&lt;li&gt;No executor loss after first pipeline&lt;/li&gt;
&lt;li&gt;Stable parallel builds&lt;/li&gt;
&lt;li&gt;No unexpected agent pod termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CI behavior became predictable again.&lt;/p&gt;

&lt;p&gt;🎯 &lt;strong&gt;Key Lesson&lt;/strong&gt;&lt;br&gt;
When Jenkins pipelines fail under parallel load:&lt;br&gt;
Do not inspect Jenkins alone.&lt;br&gt;
Also check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod scheduling patterns&lt;/li&gt;
&lt;li&gt;Node resource pressure&lt;/li&gt;
&lt;li&gt;Ephemeral storage limits&lt;/li&gt;
&lt;li&gt;Pod distribution rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor pod spread can look exactly like CI instability.&lt;/p&gt;

&lt;p&gt;🔜 &lt;strong&gt;Next in This Series&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Real-World Series #2 — Ephemeral Storage: The Silent CI/CD Pipeline Killer&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>jenkins</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
