<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Neeraja Khanapure</title>
    <description>The latest articles on Forem by Neeraja Khanapure (@neeraja_khanapure_4a33a5f).</description>
    <link>https://forem.com/neeraja_khanapure_4a33a5f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594314%2Feb8f6250-03b3-4528-af8b-17a146fe27c2.png</url>
      <title>Forem: Neeraja Khanapure</title>
      <link>https://forem.com/neeraja_khanapure_4a33a5f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/neeraja_khanapure_4a33a5f"/>
    <language>en</language>
    <item>
      <title>Something I wish someone had told me five years earlier:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 03 Apr 2026 09:39:52 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-4lo7</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-4lo7</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-03)
&lt;/h1&gt;

&lt;p&gt;Something I wish someone had told me five years earlier:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-downtime deployments: what 'zero' actually requires most teams don't have&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams say they do zero-downtime deploys and mean 'we haven't gotten a complaint in a while.' Actually measuring it reveals the truth: connection drops, in-flight request failures, and cache invalidation spikes during rollouts that nobody's tracking because nobody defined what zero means.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What 'zero downtime' actually requires:

✓ Health checks reflect REAL readiness (not just 'process started')
✓ Graceful shutdown drains in-flight requests (SIGTERM handling)
✓ Connection draining at the load balancer (not just the pod)
✓ Rollback faster than the deploy (&amp;lt; 5 min, automated)
✓ SLI measurement during the rollout window (not just after)

Missing any one of these = not zero downtime. Just unmonitored downtime.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The most common failure mode is passing health checks before the app is actually ready — DB connections not pooled, caches not warm, background workers not started. The pod is 'Ready' and the app is still initializing. Users see errors. Nobody's dashboard shows it because nobody's measuring error rate during the rollout window.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Define 'zero downtime' with a measurable SLI: error rate &amp;lt; 0.1% during any 5-minute deploy window. Validate this in staging before calling it done. Measure it in production on every release.&lt;/p&gt;
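&lt;p&gt;That rule reduces to a pass/fail check. A minimal Python sketch, not the author's tooling; the helper name and request counts are illustrative, and in practice the counters would come from your metrics backend over the 5-minute deploy window:&lt;/p&gt;

```python
# Minimal sketch: did the rollout window hold the 0.1% error-rate SLI?
# Counter values are illustrative; a real check would query your
# metrics backend for the deploy window.

def deploy_window_sli_ok(total_requests, failed_requests, max_error_rate=0.001):
    """Return True when the deploy window stayed inside the SLI."""
    if total_requests == 0:
        return True  # no traffic observed during the window; treat as passing
    error_rate = failed_requests / total_requests
    return max_error_rate > error_rate  # error rate stayed under 0.1%

# Example: 120,000 requests during the rollout, 90 failures (0.075%)
print(deploy_window_sli_ok(120_000, 90))   # True
print(deploy_window_sli_ok(120_000, 200))  # False (0.167%)
```

&lt;p&gt;Run on every release, this turns 'zero downtime' from a feeling into a signal.&lt;/p&gt;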

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Kubernetes deployment strategies — rolling, blue/green, canary with traffic splitting&lt;br&gt;
▸ AWS ALB / GCP Cloud Load Balancing — connection draining configuration and health check tuning&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/zero-downtime-deployments-what-zero-actually-requires-most-teams-dont-have" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/zero-downtime-deployments-what-zero-actually-requires-most-teams-dont-have&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a manager reading this — it's worth asking your team where they are on this.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>End of week. Here's the thing I kept coming back to:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:42:22 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/end-of-week-heres-the-thing-i-kept-coming-back-to-3hi9</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/end-of-week-heres-the-thing-i-kept-coming-back-to-3hi9</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-02)
&lt;/h1&gt;

&lt;p&gt;End of week. Here's the thing I kept coming back to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLOs work when they create conversations, not when they create compliance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most SLOs are set once, filed in a doc, and forgotten until an incident. The teams getting real value from error budgets use them as a weekly forcing function — a number that makes the reliability vs. velocity tradeoff visible to engineers and product managers in the same room.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO as compliance (common):     SLO as conversation (effective):

Set SLO ──▶ Monitor              Set SLO ──▶ Weekly budget review
     │                                │          │
  Incident ──▶ Check SLO         Budget OK  Budget low
     │              │                │          │
   Blame       Finger-pointing    Ship fast  Freeze features
                                   │          │
                               Engineering + Product aligned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ An SLO that's never violated is almost always a problem. Either it's too loose (you're over-investing in reliability) or it's not being measured honestly. Both cost money in different ways. The goal is a number that occasionally creates productive tension.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Review error budgets in sprint planning alongside features. If engineering and product aren't having an uncomfortable conversation once a quarter, your SLO isn't tight enough.&lt;/p&gt;
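&lt;p&gt;The weekly budget review is simple arithmetic. A minimal sketch with illustrative numbers; the SLO target and event counts would come from your own monitoring:&lt;/p&gt;

```python
# Sketch of the weekly error-budget check for sprint planning.
# The SLO target and event counts are illustrative; real numbers
# would come from your monitoring.

def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left (1.0 = untouched, 0.0 = spent)."""
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad == 0:
        return 0.0
    actual_bad = total_events - good_events
    return max(0.0, 1 - actual_bad / allowed_bad)

# 99.9% SLO over 1,000,000 requests allows 1,000 bad events.
# 400 failures used, so roughly 60% of the budget remains.
print(error_budget_remaining(0.999, 999_600, 1_000_000))  # roughly 0.6
```

&lt;p&gt;When that number trends toward zero mid-quarter, that's the uncomfortable conversation showing up on schedule.&lt;/p&gt;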

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Alex Hidalgo — 'Implementing Service Level Objectives' (O'Reilly, 2020)&lt;br&gt;
▸ Google SRE Workbook — Alerting on SLOs (ch. 5, free at sre.google)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/slos-work-when-they-create-conversations-not-when-they-create-compliance" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/slos-work-when-they-create-conversations-not-when-they-create-compliance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the version of this that your org gets wrong? Drop it below.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>This pattern has saved production twice in the last year:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 31 Mar 2026 09:44:19 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/this-pattern-has-saved-production-twice-in-the-last-year-38md</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/this-pattern-has-saved-production-twice-in-the-last-year-38md</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-03-31)
&lt;/h1&gt;

&lt;p&gt;This pattern has saved production twice in the last year:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service mesh adoption: the operational debt lands before the value does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Service meshes promise mTLS, traffic splitting, and deep observability. What arrives first is a new category of production failures your team has never debugged before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Adoption curve reality:

Value
  │                              ╱ mTLS + traffic control
  │                         ╱
  │              ╱╲  complexity trough
  │         ╱╲╱
  │    ╱╲╱   ← sidecar failures, upgrade pain
  │╱
  └──────────────────────────────▶ Time
     Week 1     Month 3     Month 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Sidecar injection failures look like app bugs — hours spent debugging the wrong layer.&lt;br&gt;
▸ mTLS policy rollout in a live cluster requires namespace-by-namespace phasing — one mistake stops traffic.&lt;br&gt;
▸ Mesh upgrades require coordinated sidecar restarts across the cluster — on large deployments, that's everything.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Start mesh in observability-only mode (no policy enforcement). Prove value in one namespace first. Earn the rollout, don't mandate it.&lt;/p&gt;
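&lt;p&gt;One way to make 'earn the rollout' concrete is a per-namespace stage gate. This is a hypothetical sketch, not a real mesh API; the stage names and namespaces are illustrative:&lt;/p&gt;

```python
# Hypothetical sketch of a phased mesh rollout: each namespace moves
# off, then observe (telemetry only), then enforce (policy on), and
# can never skip a stage. Names are illustrative, not a real mesh API.

STAGES = ["off", "observe", "enforce"]

def promote(namespaces, name):
    """Advance one namespace a single stage; never skip stages."""
    current = namespaces.get(name, "off")
    idx = STAGES.index(current)
    if idx + 1 >= len(STAGES):
        return current  # already enforcing; nothing to do
    namespaces[name] = STAGES[idx + 1]
    return namespaces[name]

ns = {"payments": "off"}
print(promote(ns, "payments"))  # observe: telemetry only, no enforcement
print(promote(ns, "payments"))  # enforce: earned after observe proved out
```

&lt;p&gt;The point is the one-way ladder: enforcement is something a namespace graduates into, not something the platform team flips on cluster-wide.&lt;/p&gt;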

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Linkerd for latency-sensitive workloads — lower resource overhead than Istio's Envoy per sidecar.&lt;br&gt;
▸ Namespace-level feature flags for mesh policy — lets you roll back one team without affecting others.&lt;/p&gt;

&lt;p&gt;The difference between a senior engineer and a principal is knowing which guardrails to build before you need them.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/service-mesh-adoption-the-operational-debt-lands-before-the-value-does" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/service-mesh-adoption-the-operational-debt-lands-before-the-value-does&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this triggered a war story, I'd genuinely love to hear it.&lt;/p&gt;

&lt;p&gt;#kubernetes #devops #sre #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something every senior engineer learns the expensive way:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 28 Mar 2026 20:48:45 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-a8h</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-a8h</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-03-28)
&lt;/h1&gt;

&lt;p&gt;Something every senior engineer learns the expensive way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform DAGs at scale: when the graph becomes the hazard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform's dependency graph is elegant at small scale. At 500+ resources across a mono-repo, it becomes the most dangerous part of your infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SAFE (small module):              DANGEROUS (at scale):

[vpc] ──▶ [subnet] ──▶ [ec2]     [shared-net] ──▶ [team-a-infra]
                                          │         [team-b-infra]
                                          │         [team-c-infra]
                                          │         [data-layer]
                                  One change → fan-out destroy/create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Implicit ordering assumptions survive until a refactor exposes them — usually as an unplanned destroy chain in prod.&lt;br&gt;
▸ Fan-out graphs make blast radius review near-impossible. 'What does this change affect?' has no fast answer.&lt;br&gt;
▸ &lt;code&gt;depends_on&lt;/code&gt; papering over bad module interfaces — it fixes the symptom and couples the modules permanently.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ If a module needs &lt;code&gt;depends_on&lt;/code&gt; to be safe, the module boundary is wrong. Redesign the interface — don't paper over it.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ &lt;code&gt;terraform graph | dot -Tsvg &amp;gt; graph.svg&lt;/code&gt; — visualize fan-out and cycles before every major refactor.&lt;br&gt;
▸ Gate all applies with OPA/Conftest + mandatory human review on any planned destroy operations.&lt;/p&gt;
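&lt;p&gt;The destroy gate can be a few lines of Python over the JSON plan (from terraform show -json). Field names follow the documented plan JSON format; the sample plan below is illustrative:&lt;/p&gt;

```python
# Sketch: block applies that plan a destroy, forcing human review.
# Field names follow Terraform's documented plan JSON format; the
# sample plan below is illustrative.
import json  # a real gate would json.load the saved plan file

def planned_destroys(plan_json):
    """Return addresses of resources the plan would delete."""
    doomed = []
    for rc in plan_json.get("resource_changes", []):
        if "delete" in rc.get("change", {}).get("actions", []):
            doomed.append(rc["address"])
    return doomed

plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}
print(planned_destroys(plan))  # ['aws_instance.web']: fail the pipeline
```

&lt;p&gt;Anything in that list means the apply waits for a human, which is exactly the blast-radius review the fan-out graph makes impossible to do by eye.&lt;/p&gt;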

&lt;p&gt;The difference between a senior engineer and a principal is knowing which guardrails to build before you need them.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#terraform #iac #devops #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>A hard-earned rule from incident retrospectives:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:11:20 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1pj1</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1pj1</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-03-28)
&lt;/h1&gt;

&lt;p&gt;A hard-earned rule from incident retrospectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitOps drift: the silent accumulation that makes clusters unmanageable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitOps promises Git as the source of truth. The reality: every manual &lt;code&gt;kubectl&lt;/code&gt; during an incident is a lie you told your cluster and forgot to retract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitOps truth gap over time:

Week 1:  Git ══════════ Cluster  (clean)
Week 4:  Git ══════╌╌╌╌ Cluster  (2 manual patches)
Week 12: Git ════╌╌╌╌╌╌╌╌╌╌╌╌╌  (drift accumulates)
                         Cluster  (unknown state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Manual patches during incidents create cluster state Git doesn't know about — Argo/Flux will overwrite it silently.&lt;br&gt;
▸ Secrets managed outside GitOps (sealed-secrets, Vault agent) drift independently — invisible in sync status.&lt;br&gt;
▸ Multi-cluster setups multiply drift: each cluster diverges at its own pace once human intervention happens.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Treat every manual cluster change as a 5-minute loan. Commit it back to Git before the incident closes — or it's gone.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Argo CD drift detection dashboard — surface out-of-sync resources before they become incident contributors.&lt;br&gt;
▸ Weekly diff job: live cluster state vs Git. Opens a PR for anything untracked. Makes drift visible before it's painful.&lt;/p&gt;
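&lt;p&gt;That weekly diff job reduces to a set comparison. A minimal sketch with hypothetical manifests; a real job would render the desired state from the repo and read live objects via the Kubernetes API:&lt;/p&gt;

```python
# Sketch of the weekly drift check: compare manifests rendered from
# Git against live cluster state, keyed by kind/name. Inputs are
# hypothetical dicts; a real job would render from the repo and read
# live objects via the Kubernetes API.

def drift(git_state, live_state):
    """Resources that are live but untracked, or that differ from Git."""
    untracked = sorted(set(live_state) - set(git_state))
    changed = sorted(
        k for k in git_state if k in live_state and git_state[k] != live_state[k]
    )
    return {"untracked": untracked, "changed": changed}

git = {"Deployment/web": {"replicas": 3}}
live = {
    "Deployment/web": {"replicas": 5},      # manual kubectl scale during an incident
    "ConfigMap/hotfix": {"data": "patch"},  # leftover nobody committed back
}
print(drift(git, live))  # both show up: open a PR for each
```

&lt;p&gt;Everything this surfaces is a loan that was never paid back.&lt;/p&gt;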

&lt;p&gt;The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#gitops #kubernetes #devops #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>One insight that changed how I design systems:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 28 Mar 2026 16:48:55 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/one-insight-that-changed-how-i-design-systems-1b1m</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/one-insight-that-changed-how-i-design-systems-1b1m</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-03-28)
&lt;/h1&gt;

&lt;p&gt;One insight that changed how I design systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runbook quality decays silently — and that decay kills MTTR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Runbooks that haven't been run recently are wrong. Not outdated — wrong. The service changed. The tool was deprecated. The endpoint moved. Nobody updated the doc because nobody reads it until 3am. And at 3am, a wrong runbook is worse than no runbook — it sends engineers down confident paths that dead-end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runbook decay curve:

Quality
  │▓▓▓▓▓▓▓▓▓▓
  │         ▓▓▓▓▓
  │              ▓▓▓▓
  │                  ▓▓▓▓▓
  │                       ▓▓▓░░░░░░░
  │                             ░░░░░░░░ ← "last validated 8 months ago"
  └────────────────────────────────────▶
  Write   Month 1  Month 3  Month 6  Month 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The highest-leverage runbook improvement isn't better writing — it's a validation date and a quarterly review reminder. A runbook with 'last validated: 2 weeks ago' that's 70% accurate is worth more than a beautifully written one from 8 months ago that's 40% accurate.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Every runbook gets a 'last validated' date. Anything older than 3 months is assumed broken until proven otherwise. Review is part of the on-call rotation, not optional.&lt;/p&gt;
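&lt;p&gt;The 3-month rule is easy to automate. A minimal sketch with hypothetical runbook entries; the output is what lands in the on-call rotation's review queue:&lt;/p&gt;

```python
# Sketch: flag runbooks whose last-validated date is older than 90
# days, matching the "assumed broken after 3 months" rule. The
# runbook entries are illustrative.
from datetime import date, timedelta

def stale_runbooks(runbooks, today, max_age_days=90):
    """Return titles whose validation date has aged out."""
    cutoff = today - timedelta(days=max_age_days)
    return [title for title, validated in runbooks.items() if cutoff > validated]

books = {
    "db-failover": date(2026, 3, 1),    # fresh
    "cache-purge": date(2025, 10, 12),  # over five months old
}
print(stale_runbooks(books, today=date(2026, 3, 28)))  # ['cache-purge']
```

&lt;p&gt;Wire the output into the on-call handoff and the review stops being optional by construction.&lt;/p&gt;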

&lt;p&gt;Worth reading:&lt;br&gt;
▸ PagerDuty Incident Response guide — runbook standards and validation cadence&lt;br&gt;
▸ Post-incident review template: 'Did the runbook help, mislead, or was it missing?' (standard question)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/runbook-quality-decays-silently-and-that-decay-kills-mttr" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/runbook-quality-decays-silently-and-that-decay-kills-mttr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the version of this that your org gets wrong? Drop it below.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Insight of the Week</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 28 Mar 2026 16:06:06 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-125o</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-125o</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-03-28)
&lt;/h1&gt;

&lt;p&gt;Observability is a label strategy problem disguised as a tooling problem&lt;/p&gt;

&lt;p&gt;You can’t debug what you can’t slice. Most “noisy dashboards” are really missing ownership labels, consistent dimensions, and SLI intent.&lt;/p&gt;


&lt;p&gt;My rule:&lt;br&gt;
Define SLIs first, then design labels that let you isolate (service, env, version, tenant) without blowing up cardinality.&lt;/p&gt;
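&lt;p&gt;The cardinality side of that rule is just multiplication. A minimal sketch with hypothetical label sets, showing why one high-cardinality label dominates everything else:&lt;/p&gt;

```python
# Sketch: estimate series cardinality before adding a label.
# Multiplying per-label value counts gives the worst case; the
# label sets below are hypothetical.
import math

def worst_case_series(label_values):
    """Upper bound on time series for one metric name."""
    return math.prod(len(values) for values in label_values.values())

labels = {
    "service": ["checkout", "search", "auth"],
    "env": ["prod", "staging"],
    "version": ["v1", "v2"],
}
print(worst_case_series(labels))  # 12

# Adding a tenant label with 5,000 values multiplies that by 5,000.
# That is the cardinality blow-up the rule guards against.
```

&lt;p&gt;Run this estimate before the label ships, not after the TSDB falls over.&lt;/p&gt;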

&lt;p&gt;If you want to go deeper: &lt;a href="https://neeraja-portfolio-v1.vercel.app/projects/prometheus-scaling" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/projects/prometheus-scaling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>aiops</category>
      <category>mlops</category>
      <category>automation</category>
    </item>
    <item>
      <title>Insight of the Week</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 27 Mar 2026 09:39:22 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-2kem</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-2kem</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-03-27)
&lt;/h1&gt;

&lt;p&gt;Observability is a label strategy problem disguised as a tooling problem&lt;/p&gt;

&lt;p&gt;You can’t debug what you can’t slice. Most “noisy dashboards” are really missing ownership labels, consistent dimensions, and SLI intent.&lt;/p&gt;


&lt;p&gt;My rule:&lt;br&gt;
Define SLIs first, then design labels that let you isolate (service, env, version, tenant) without blowing up cardinality.&lt;/p&gt;

&lt;p&gt;If you want to go deeper: &lt;a href="https://neeraja-portfolio-v1.vercel.app/projects/prometheus-scaling" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/projects/prometheus-scaling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>aiops</category>
      <category>mlops</category>
      <category>automation</category>
    </item>
    <item>
      <title>Workflow Deep Dive</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 24 Mar 2026 09:36:48 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/workflow-deep-dive-2o29</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/workflow-deep-dive-2o29</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-03-24)
&lt;/h1&gt;

&lt;p&gt;End‑to‑end MLOps retraining loop: reliability is in the guardrails&lt;/p&gt;

&lt;p&gt;Auto‑retraining is easy to wire. Making it safe in production is the hard part: data drift, silent label shifts, and rollback semantics.&lt;/p&gt;

&lt;p&gt;What usually bites later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A “better” offline model can degrade live KPIs due to skew (training vs serving features) and traffic shift.&lt;/li&gt;
&lt;li&gt;Unversioned data/labels make incident RCA impossible — you can’t reproduce what trained the model.&lt;/li&gt;
&lt;li&gt;Promotion without canary + rollback turns retraining into a weekly outage generator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My default rule:&lt;br&gt;
No model ships without: dataset/version lineage, shadow/canary evaluation, and a one‑click rollback path.&lt;/p&gt;
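&lt;p&gt;The canary half of that rule can be sketched as a promotion gate. The metric (think live-slice accuracy or conversion), values, and regression threshold here are illustrative, not from a real registry:&lt;/p&gt;

```python
# Sketch of a promotion gate: the candidate is promoted only when
# its canary metric holds up against production; otherwise the
# one-click rollback path keeps production serving. The metric and
# threshold are illustrative.

def promote_model(prod_metric, canary_metric, max_regression=0.01):
    """Promote when the canary is within the allowed regression."""
    if prod_metric - canary_metric > max_regression:
        return {"action": "rollback", "serving": "prod"}
    return {"action": "promote", "serving": "canary"}

print(promote_model(prod_metric=0.942, canary_metric=0.945))  # promote
print(promote_model(prod_metric=0.942, canary_metric=0.910))  # rollback
```

&lt;p&gt;The gate is deliberately dumb: a number, a threshold, and a default of 'keep serving what works'.&lt;/p&gt;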

&lt;p&gt;When I’m sanity-checking this, I usually do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track dataset + features with DVC/LakeFS + model registry (MLflow/SageMaker Registry) for auditable promotion.&lt;/li&gt;
&lt;li&gt;Monitor drift + performance slices with Prometheus/Grafana + alert on &lt;em&gt;trend&lt;/em&gt;, not single spikes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deep dive (stable link): &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/resilient-architecture" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/resilient-architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#mlops #aiops #automation #python&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Insight of the Week</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 20 Mar 2026 09:25:33 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-28ac</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/insight-of-the-week-28ac</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-03-20)
&lt;/h1&gt;

&lt;p&gt;CI/CD isn’t speed — it’s predictable change under load&lt;/p&gt;

&lt;p&gt;Most pipelines fail not because tests are slow, but because &lt;strong&gt;rollout risk isn’t modeled&lt;/strong&gt; (blast radius, rollback, and observability gates).&lt;/p&gt;


&lt;p&gt;My rule:&lt;br&gt;
If you can’t explain rollback + SLO gates in one slide, the pipeline is not production‑ready.&lt;/p&gt;

&lt;p&gt;If you want to go deeper: &lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/cicd-isnt-speed-its-predictable-change-under-load" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/cicd-isnt-speed-its-predictable-change-under-load&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>aiops</category>
      <category>mlops</category>
      <category>automation</category>
    </item>
    <item>
      <title>Claude on AWS Bedrock was throttling requests and the billing dashboard showed zero issues</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Mon, 16 Mar 2026 18:36:21 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/claude-on-aws-bedrock-was-throttling-requests-and-the-billing-dashboard-showed-zero-issues-290g</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/claude-on-aws-bedrock-was-throttling-requests-and-the-billing-dashboard-showed-zero-issues-290g</guid>
      <description>&lt;p&gt;Most teams running Claude on Bedrock watch latency and cost.&lt;br&gt;
Neither shows you when you are about to get throttled.&lt;br&gt;
Claude Sonnet output tokens cost 5x more compute to generate than input tokens to process. AWS counts them at 5x against your TPM quota. Your bill charges for real tokens. Your quota gate reflects real compute.&lt;br&gt;
Your bill shows 100 tokens.&lt;br&gt;
Bedrock counted 500 against your limit.&lt;br&gt;
Throttling hits. Dashboard looks clean.&lt;br&gt;
What AWS just shipped&lt;br&gt;
AWS just released two CloudWatch metrics that fix this blind spot. Both are free, automatic, and already in your AWS/Bedrock CloudWatch namespace. No code changes. No opt-in.&lt;br&gt;
EstimatedTPMQuotaUsage&lt;br&gt;
Real quota consumed per request, burndown multipliers included. Not what you were billed. What Bedrock actually counted against your limit.&lt;br&gt;
TimeToFirstToken&lt;br&gt;
Server side metric. Measures time from request to first Claude response token. Tells you if slowness lives in Bedrock or your own stack. Stops the guessing. Narrows the debug in seconds.&lt;br&gt;
3 alarms worth setting today&lt;br&gt;
80% of TPM limit on EstimatedTPMQuotaUsage&lt;br&gt;
Warning before throttle hits. You get runway instead of a surprise.&lt;br&gt;
P95 threshold on TimeToFirstToken&lt;br&gt;
Catch Claude response degradation before users feel it.&lt;br&gt;
Compare TimeToFirstToken vs InvocationLatency&lt;br&gt;
If TTFT is fine but total latency is high, the problem is output generation, not model startup. Narrows the debug surface immediately.&lt;br&gt;
Were you tracking billing tokens thinking that was your quota? Most teams are.&lt;br&gt;
Source: &lt;a href="https://aws.amazon.com/blogs/machine-learning/improve-operational-visibility-for-inference-workloads-on-amazon-bedrock-with-new-cloudwatch-metrics-for-ttft-and-estimated-quota-consumption/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/machine-learning/improve-operational-visibility-for-inference-workloads-on-amazon-bedrock-with-new-cloudwatch-metrics-for-ttft-and-estimated-quota-consumption/&lt;/a&gt;&lt;/p&gt;
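&lt;p&gt;The quota math from the post, as a minimal sketch. The 5x output multiplier applies to the model discussed here; other models carry different burndown multipliers:&lt;/p&gt;

```python
# Sketch of the billing-vs-quota gap: billing counts raw tokens,
# while Bedrock's TPM quota weights output tokens more heavily
# (5x for the model discussed; multipliers vary per model).

def quota_tokens(input_tokens, output_tokens, output_multiplier=5):
    """Tokens counted against the TPM quota, next to what billing shows."""
    billed = input_tokens + output_tokens
    quota = input_tokens + output_tokens * output_multiplier
    return {"billed": billed, "quota": quota}

# The example from the post: the bill shows 100 tokens,
# Bedrock counted 500 against the limit.
print(quota_tokens(input_tokens=0, output_tokens=100))
```

&lt;p&gt;That gap is the whole blind spot: alarms built on billed tokens fire far too late.&lt;/p&gt;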

</description>
      <category>aws</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes rollouts: promote on SLOs, not on "pods are Ready"</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 14 Mar 2026 15:57:32 +0000</pubDate>
      <link>https://forem.com/neeraja_khanapure_4a33a5f/kubernetes-rollouts-promote-on-slos-not-on-pods-are-ready-27f0</link>
      <guid>https://forem.com/neeraja_khanapure_4a33a5f/kubernetes-rollouts-promote-on-slos-not-on-pods-are-ready-27f0</guid>
      <description>&lt;p&gt;Readiness is a local signal. Production impact is global.&lt;br&gt;
Pods can be Ready while your SLO window is already burning.&lt;br&gt;
The failure chain&lt;br&gt;
Rollout shifts traffic fast.&lt;br&gt;
New pods saturate before HPA reacts.&lt;br&gt;
HPA scrape window is 15 to 30 seconds minimum.&lt;br&gt;
P95 latency climbs.&lt;br&gt;
Error rate ticks up.&lt;br&gt;
SLI degrades.&lt;br&gt;
Everything looks healthy. The error budget is draining quietly.&lt;br&gt;
Why "pods are Ready" lies to you&lt;br&gt;
Ready means the container started and passed a health check.&lt;br&gt;
It says nothing about P95 latency, error rate, or whether your SLO slice is holding.&lt;br&gt;
Canary gets stuck green because metrics are too coarse.&lt;br&gt;
No labels, no slices, blast radius stays invisible.&lt;br&gt;
Three resolvers&lt;br&gt;
Pre-scale before the first canary step&lt;br&gt;
Bump replicas before traffic shifts.&lt;br&gt;
HPA catches up from a safe baseline instead of a saturated one.&lt;br&gt;
Match step interval to your HPA scaleUp window&lt;br&gt;
Stabilization windows depend on your configured scaleUp policies (Kubernetes defaults the scale-down window to 5 minutes).&lt;br&gt;
Check yours with:&lt;br&gt;
&lt;code&gt;kubectl get hpa -o yaml&lt;/code&gt;&lt;br&gt;
Promoting before that window closes is promoting blind.&lt;br&gt;
Gate steps on SLI health&lt;br&gt;
Wire an AnalysisRun in Argo Rollouts that checks error rate and P95 latency are within SLO bounds before promoting.&lt;br&gt;
If the SLI is still recovering, promotion waits.&lt;br&gt;
The rule&lt;br&gt;
Promote only when the canary holds the SLO slice that matters for a fixed window.&lt;br&gt;
Anything outside that window triggers auto-rollback.&lt;br&gt;
Rollout speed and autoscaler reaction time are tuned independently.&lt;br&gt;
That gap is where error budget burns before anyone pages.&lt;/p&gt;
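&lt;p&gt;The promotion gate reduces to: hold the window, or roll back. A minimal Python sketch with illustrative samples; in Argo Rollouts this logic would live in an AnalysisRun rather than application code:&lt;/p&gt;

```python
# Sketch of "promote on SLOs": hold the canary for a fixed window and
# advance only when every sample kept error rate and P95 latency
# inside bounds. Sample values and thresholds are illustrative.

def canary_verdict(samples, max_error_rate=0.001, max_p95_ms=300):
    """Return 'promote' only if the whole window held the SLO slice."""
    for s in samples:
        if s["error_rate"] > max_error_rate or s["p95_ms"] > max_p95_ms:
            return "rollback"
    return "promote"

window = [
    {"error_rate": 0.0004, "p95_ms": 210},
    {"error_rate": 0.0006, "p95_ms": 240},
]
print(canary_verdict(window))  # promote

window.append({"error_rate": 0.004, "p95_ms": 290})
print(canary_verdict(window))  # rollback: one bad sample ends the rollout
```

&lt;p&gt;Note what is absent: pod readiness never appears in the verdict.&lt;/p&gt;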

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.linkedin.com/posts/neerajakhanapure_kubernetes-sre-devops-ugcPost-7438609724431998976-SUaY?utm_source=share&amp;amp;amp%3Butm_medium=member_desktop&amp;amp;amp%3Brcm=ACoAABMKgGoB6y5JLsCvHdjL6I5oxdA-230Stpg" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.licdn.com%2Fdms%2Fimage%2Fv2%2FD5612AQFbRKl4bpocww%2Farticle-cover_image-shrink_720_1280%2FB56Zzs.3doGgAI-%2F0%2F1773502429603%3Fe%3D2147483647%26v%3Dbeta%26t%3D89YU0lG-UVi1F7AbfcwdPbGX94cg1wpg44HcHtqxXIQ" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.linkedin.com/posts/neerajakhanapure_kubernetes-sre-devops-ugcPost-7438609724431998976-SUaY?utm_source=share&amp;amp;amp%3Butm_medium=member_desktop&amp;amp;amp%3Brcm=ACoAABMKgGoB6y5JLsCvHdjL6I5oxdA-230Stpg" rel="noopener noreferrer" class="c-link"&gt;
            Kubernetes rollouts: promote on SLOs, not on "pods are Ready" | Neeraja Khanapure
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Pods are Ready. P95 is climbing. Error rate is ticking. HPA has not moved.
The rollout looks healthy. The SLO window is already burning.
The exact failure chain and three resolvers that actually work. Pre-scaling, matching step interval to your HPA stabilization window, and gating promotion on SLI health instead of pod status.
#kubernetes #sre #devops #platformengineering
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.licdn.com%2Faero-v1%2Fsc%2Fh%2Fal2o9zrvru7aqj8e1x2rzsrca"&gt;
          linkedin.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-promote-on-slos-not-on-pods-are-ready" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-promote-on-slos-not-on-pods-are-ready&lt;/a&gt;&lt;br&gt;
What is the step interval on your rollouts right now?&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>cloudnative</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
