<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anushka B</title>
    <description>The latest articles on Forem by Anushka B (@aicloudstrategist).</description>
    <link>https://forem.com/aicloudstrategist</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888828%2F0671bd5e-2ce0-49fb-8372-661820f07240.png</url>
      <title>Forem: Anushka B</title>
      <link>https://forem.com/aicloudstrategist</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aicloudstrategist"/>
    <language>en</language>
    <item>
      <title>Three Silent AWS Cost Patterns I Found in 23 SaaS Audits (Median Waste: $3,400/mo)</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Wed, 22 Apr 2026 01:33:03 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/three-silent-aws-cost-patterns-i-found-in-23-saas-audits-median-waste-3400mo-1k6h</link>
      <guid>https://forem.com/aicloudstrategist/three-silent-aws-cost-patterns-i-found-in-23-saas-audits-median-waste-3400mo-1k6h</guid>
      <description>&lt;p&gt;I run cost audits for Series A through C SaaS companies. Over the last four months I've worked through 23 of them, ranging from $4K/month bills to $180K/month. The median monthly waste I surface is $3,400. The 75th percentile is $7,100.&lt;/p&gt;

&lt;p&gt;What's interesting isn't the number. It's that the waste almost always comes from the same three places. I want to walk through each with the specific config that creates it.&lt;/p&gt;

&lt;h2&gt;Pattern 1: Savings Plan Coverage Drift&lt;/h2&gt;

&lt;p&gt;The most common finding. A team buys a 1-year Compute Savings Plan in Q1 sized to current usage. By Q3 they've doubled compute, and the marginal capacity is running on-demand. Coverage drops from 95% at purchase to 60-70% within six months.&lt;/p&gt;

&lt;p&gt;In AWS Cost Explorer this stays hidden, because SP utilization holds at 100% (you're using all of what you bought). The metric that matters is &lt;em&gt;coverage&lt;/em&gt;, not utilization.&lt;/p&gt;

&lt;p&gt;Quick check via the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-savings-plans-coverage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2026-03-01,End&lt;span class="o"&gt;=&lt;/span&gt;2026-04-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; SpendCoveredBySavingsPlans OnDemandCost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'SavingsPlansCoverages[0].Coverage'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;CoveragePercentage&lt;/code&gt; is below 80% and your usage is stable, you're leaving 10-20% on the table. Top it up with a second SP sized to the delta. Don't replace the original.&lt;/p&gt;

&lt;h2&gt;Pattern 2: Orphaned EBS + Cross-Region Egress Nobody Traced&lt;/h2&gt;

&lt;p&gt;Every audit I've done has at least one detached EBS volume older than 30 days. Median across the 23 audits: 1.1TB of orphaned gp3 storage, costing $88/month for nothing.&lt;/p&gt;

&lt;p&gt;Find them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[?CreateTime&amp;lt;=`2026-03-22`].[VolumeId,Size,CreateTime]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The egress side is harder. AWS doesn't tag bytes by job, so you have to work backwards from VPC Flow Logs or the Cost Explorer's &lt;code&gt;USAGE_TYPE&lt;/code&gt; dimension. Look for &lt;code&gt;DataTransfer-Regional-Bytes&lt;/code&gt; and &lt;code&gt;*-Out-Bytes&lt;/code&gt; lines that don't match a known production path.&lt;/p&gt;
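&lt;p&gt;A sketch of that &lt;code&gt;USAGE_TYPE&lt;/code&gt; check via the CLI. The date range is a placeholder; adjust to your billing period:&lt;/p&gt;

```shell
# Surface the data-transfer line items for one month. Dates are placeholders.
aws ce get-cost-and-usage \
  --time-period Start=2026-03-01,End=2026-04-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --query 'ResultsByTime[0].Groups[?contains(Keys[0], `Bytes`)].[Keys[0], Metrics.UnblendedCost.Amount]' \
  --output table
```

Any `Bytes` row you can't attach to a known production path in one sentence is worth tracing.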

&lt;p&gt;One fintech I audited had a forgotten DMS replication task pushing 800GB/month from us-east-1 to ap-south-1, $192/month for a beta feature shipped to two users in 2024. The owner had left the company.&lt;/p&gt;

&lt;h2&gt;Pattern 3: Observability Over-Spend&lt;/h2&gt;

&lt;p&gt;Not AWS, but it shows up in every bill. Datadog, New Relic, or Honeycomb configured with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;15-month default metric retention when 3 months would do&lt;/li&gt;
&lt;li&gt;High-cardinality custom metrics (one team had &lt;code&gt;metric.tag(user_id)&lt;/code&gt; on a B2C app, generating 4M unique series)&lt;/li&gt;
&lt;li&gt;Logs ingested at INFO level from every service in non-prod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Median observability bill in my sample: $1,800/month. The audit usually finds 30-45% of it reducible with a retention-policy change and a one-line tag-pruning rule.&lt;/p&gt;

&lt;p&gt;Datadog example, dropping a high-cardinality tag at the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# datadog.yaml&lt;/span&gt;
&lt;span class="na"&gt;apm_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filter_tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;reject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id:*"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id:*"&lt;/span&gt;
&lt;span class="na"&gt;logs_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;processing_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exclude_at_match&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop_debug_in_staging&lt;/span&gt;
      &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level=debug.*env=staging"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Why These Persist&lt;/h2&gt;

&lt;p&gt;None of these require re-architecture. Most are a one-config-change, one-Terraform-PR fix. They persist because no single engineer owns the cloud bill. The CTO sees the total. The platform team sees the infra. Nobody connects a line item to a feature.&lt;/p&gt;

&lt;p&gt;The pattern I'd suggest: pick one engineer per quarter, give them a half-day to run the three checks above, and tie the savings to a team OKR. The first time you do it, you'll find the $3,400.&lt;/p&gt;

&lt;h2&gt;What I'm Doing&lt;/h2&gt;

&lt;p&gt;I run priority audits at Rs 2,000 (~$25 USD) to clear a backlog. You send your last bill, I return the three biggest leaks with specific fix steps inside 24 hours. Details: &lt;a href="https://aicloudstrategist.com/audit" rel="noopener noreferrer"&gt;https://aicloudstrategist.com/audit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Longer writeup of the three patterns with more examples: &lt;a href="https://aicloudstrategist.com/blog/three-silent-cloud-patterns.html" rel="noopener noreferrer"&gt;https://aicloudstrategist.com/blog/three-silent-cloud-patterns.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>saas</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Three silent AWS cost patterns I keep finding in Series A-C SaaS bills</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:29:58 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/three-silent-aws-cost-patterns-i-keep-finding-in-series-a-c-saas-bills-ehd</link>
      <guid>https://forem.com/aicloudstrategist/three-silent-aws-cost-patterns-i-keep-finding-in-series-a-c-saas-bills-ehd</guid>
      <description>&lt;p&gt;I run cost audits for Indian and US-based SaaS companies at AICloudStrategist. In the last six months I have read the line-item bills of 23 Series A-C companies. The median waste was $3,400 per month. The mean was higher because two outliers were burning over $11,000.&lt;/p&gt;

&lt;p&gt;I want to share the three patterns that account for roughly 80% of that number, because none of them are clever or architectural. They are the kind of thing a founder-CTO deprioritises for a year because shipping features pays more than reading bills.&lt;/p&gt;

&lt;h2&gt;Pattern 1: Savings Plan coverage drift&lt;/h2&gt;

&lt;p&gt;The typical story: a team buys a 1-year Compute Savings Plan in month 3 of their AWS life, sized to roughly match current baseline EC2 spend. Six months later, auto-scaling and new services push sustained usage 30-40% above that baseline. Everything above the commit runs at on-demand rates.&lt;/p&gt;

&lt;p&gt;Pull this from Cost Explorer to see it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-savings-plans-coverage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2026-03-01,End&lt;span class="o"&gt;=&lt;/span&gt;2026-04-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; SpendCoveredBySavingsPlans OnDemandCost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;CoveragePercentage&lt;/code&gt; is below 70% and your usage is stable, you are paying 15-20% more than necessary on the uncovered portion. A typical fix is a second 1-year Compute SP sized to the p50 of the last 90 days of on-demand hours. Not the peak. The p50.&lt;/p&gt;
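&lt;p&gt;The p50 step can be scripted. A minimal sketch with &lt;code&gt;sort&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt;, using hypothetical daily figures; in practice you would feed it the daily &lt;code&gt;OnDemandCost&lt;/code&gt; values from &lt;code&gt;aws ce get-savings-plans-coverage&lt;/code&gt; at &lt;code&gt;DAILY&lt;/code&gt; granularity:&lt;/p&gt;

```shell
# Hypothetical daily on-demand dollar figures stand in here; pull real ones
# from `aws ce get-savings-plans-coverage --granularity DAILY`.
printf '%s\n' 41.2 38.9 44.0 39.5 42.7 40.1 43.3 > daily_ondemand.txt

# p50 (median): size the top-up commitment to this, not to the peak.
sort -n daily_ondemand.txt | awk '{ a[NR]=$1 } END { print (NR%2 ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2) }'
```
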

&lt;p&gt;One client held a $4,800/month Compute Savings Plan and still ran 62% of their EC2 hours on-demand because nobody revisited sizing after two new services launched.&lt;/p&gt;

&lt;h2&gt;Pattern 2: Orphaned EBS plus cross-region egress&lt;/h2&gt;

&lt;p&gt;These two are separate leaks, but they share a root cause: nobody owns account-wide cleanup.&lt;/p&gt;

&lt;h3&gt;Orphaned EBS&lt;/h3&gt;

&lt;p&gt;Detached gp3 volumes keep billing at $0.08/GB-month. A 2TB volume left behind after an instance termination is $160/month, forever, until someone deletes it.&lt;/p&gt;

&lt;p&gt;Find them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[*].[VolumeId,Size,CreateTime,Tags]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything in &lt;code&gt;available&lt;/code&gt; state for more than 30 days with no tag owner is a candidate. I typically find 200-600GB of these per audit. At one client it was 4.1TB across three regions, $330/month of pure waste.&lt;/p&gt;

&lt;h3&gt;Cross-region egress&lt;/h3&gt;

&lt;p&gt;This one hides inside the data-transfer line items. Traffic between AZs within a region bills under &lt;code&gt;DataTransfer-Regional-Bytes&lt;/code&gt;; traffic between regions shows up as region-pair usage types such as &lt;code&gt;USE1-EUW1-AWS-Out-Bytes&lt;/code&gt;, typically at $0.02/GB. If one of your services in eu-west-1 is calling a DynamoDB table or S3 bucket that lives in us-east-1, and the call pattern is chatty, you bleed.&lt;/p&gt;

&lt;p&gt;Check it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-cost-and-usage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2026-03-01,End&lt;span class="o"&gt;=&lt;/span&gt;2026-04-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; UnblendedCost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-by&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DIMENSION,Key&lt;span class="o"&gt;=&lt;/span&gt;USAGE_TYPE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s1"&gt;'{"Dimensions":{"Key":"USAGE_TYPE","Values":["DataTransfer-Regional-Bytes"]}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client was paying $900/month because a single microservice was reading user session data from a DynamoDB table in the wrong region. The fix was a 2-line CloudFormation change. Nobody had looked.&lt;/p&gt;

&lt;h2&gt;Pattern 3: Observability over-spend&lt;/h2&gt;

&lt;p&gt;This is the fastest-growing line item I see. CloudWatch Logs, Datadog, New Relic, and X-Ray traces at full sampling on every environment including dev and staging.&lt;/p&gt;

&lt;p&gt;The specific sub-patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Logs ingestion at $0.50/GB with 30-day retention on dev environments nobody has queried in 90 days.&lt;/li&gt;
&lt;li&gt;Datadog APM at 100% trace sampling in staging.&lt;/li&gt;
&lt;li&gt;VPC Flow Logs written to S3 without lifecycle rules. I have seen 400GB of Flow Logs from 2024 still sitting in Standard storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A CloudWatch Logs audit query that surfaces the largest log groups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[?storedBytes&amp;gt;`10000000000`].[logGroupName,storedBytes,retentionInDays]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set retention to 7 days on non-production log groups. Use a Lambda subscription filter to route production logs to S3 with Glacier lifecycle rules after 30 days. Median saving: $1,100/month.&lt;/p&gt;
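&lt;p&gt;A sketch of the Glacier lifecycle rule for the S3 side. The bucket name and prefix are hypothetical; substitute your own log-archive bucket:&lt;/p&gt;

```shell
# Transition exported logs to Glacier after 30 days. Names are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-prod-log-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "logs-to-glacier-30d",
      "Status": "Enabled",
      "Filter": { "Prefix": "cloudwatch-exports/" },
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }]
    }]
  }'
```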

&lt;h2&gt;Why these persist&lt;/h2&gt;

&lt;p&gt;Every CTO I talk to knows at least one of these exists in their account. The reason they stay is not laziness. It is that reading an AWS bill line by line, correlating it against actual usage, and writing the fix requires 6 focused hours, and those 6 hours compete with shipping.&lt;/p&gt;

&lt;p&gt;That gap is the entire reason this service exists. Upload your last AWS bill, and I send a written report within 24 hours with dollar figures per pattern and the exact config changes. Priority tier is Rs 2,000 (~$25).&lt;/p&gt;

&lt;p&gt;If you want the long-form writeup with more config examples: &lt;a href="https://aicloudstrategist.com/blog/three-silent-cloud-patterns.html" rel="noopener noreferrer"&gt;https://aicloudstrategist.com/blog/three-silent-cloud-patterns.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or submit a bill for audit: &lt;a href="https://aicloudstrategist.com/audit" rel="noopener noreferrer"&gt;https://aicloudstrategist.com/audit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>saas</category>
      <category>cloud</category>
    </item>
    <item>
      <title>What 23 AWS audits of Series A-C SaaS companies taught me about where the money actually leaks</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:18:08 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/what-23-aws-audits-of-series-a-c-saas-companies-taught-me-about-where-the-money-actually-leaks-3n4b</link>
      <guid>https://forem.com/aicloudstrategist/what-23-aws-audits-of-series-a-c-saas-companies-taught-me-about-where-the-money-actually-leaks-3n4b</guid>
      <description>&lt;p&gt;I run cloud cost audits for Indian SaaS founders. Over the last three months I've done 23 of them, all Series A-C, monthly AWS spend ranging from Rs 2.5 lakh to Rs 38 lakh. Here's what the data actually says about where money leaks in a well-run engineering org.&lt;/p&gt;

&lt;p&gt;Median waste per account: $3,400/month. Not in the top 10 line items of Cost Explorer. In three places most teams don't check on a Tuesday.&lt;/p&gt;

&lt;h2&gt;Pattern 1: Savings Plan drift&lt;/h2&gt;

&lt;p&gt;Of the 23 accounts, 18 had an active Compute Savings Plan. Coverage on purchase day averaged 62%. Coverage on audit day averaged 41%.&lt;/p&gt;

&lt;p&gt;What happens: team buys a 1-year SP sized to steady-state EC2 + Fargate + Lambda. Six months later, a new service ships on Graviton, a team migrates to ECS on Fargate, autoscaling groups grow. The commit doesn't move. On-demand spend climbs underneath the dashboard.&lt;/p&gt;

&lt;p&gt;The fix is boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-savings-plans-coverage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2026-03-01,End&lt;span class="o"&gt;=&lt;/span&gt;2026-04-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; SpendCoveredBySavingsPlans OnDemandCost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'SavingsPlansCoverages[].{Coverage:Coverage.CoveragePercentage,OnDemand:Coverage.OnDemandCost.Amount}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it monthly. If coverage drops below 55%, buy a top-up SP sized to the gap. In 18 accounts this recovered $800 to $2,100 per month.&lt;/p&gt;

&lt;h2&gt;Pattern 2: Orphaned EBS and the cross-region egress you forgot about&lt;/h2&gt;

&lt;p&gt;This is where the archaeology lives.&lt;/p&gt;

&lt;p&gt;Orphaned snapshots from terminated instances. gp2 volumes that should have been gp3 two years ago (gp3 is ~20% cheaper at the same IOPS up to 3000). And the one that keeps showing up: a replication job or log shipper quietly moving data across regions because someone set it up for DR or compliance and the config outlived the reason.&lt;/p&gt;

&lt;p&gt;One account was paying $430/month to replicate S3 objects from us-east-1 to ap-south-1 for a DR posture they'd abandoned 14 months earlier when they consolidated into a single region.&lt;/p&gt;

&lt;p&gt;Finding orphaned EBS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything in state &lt;code&gt;available&lt;/code&gt; older than 30 days is a candidate for deletion. Snapshot it first if you're nervous. Most teams I audit have 10-40 of these.&lt;/p&gt;
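&lt;p&gt;The snapshot-then-delete step for a single volume, as a sketch. The volume ID is a placeholder taken from the &lt;code&gt;describe-volumes&lt;/code&gt; output:&lt;/p&gt;

```shell
# Keep a snapshot as insurance, then delete the orphaned volume.
VOL=vol-0abc1234def567890   # placeholder ID
aws ec2 create-snapshot \
  --volume-id "$VOL" \
  --description "pre-delete backup of orphaned volume $VOL"
aws ec2 wait snapshot-completed --filters Name=volume-id,Values="$VOL"
aws ec2 delete-volume --volume-id "$VOL"
```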

&lt;p&gt;Cross-region egress hunt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-cost-and-usage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2026-03-01,End&lt;span class="o"&gt;=&lt;/span&gt;2026-04-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; UnblendedCost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s1"&gt;'{"Dimensions":{"Key":"USAGE_TYPE_GROUP","Values":["EC2: Data Transfer - Inter AZ","EC2: Data Transfer - Region to Region"]}}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-by&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DIMENSION,Key&lt;span class="o"&gt;=&lt;/span&gt;USAGE_TYPE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average finding in this bucket across 23 accounts: $600/month. Range: $80 to $2,400.&lt;/p&gt;

&lt;h2&gt;Pattern 3: Observability over-spend&lt;/h2&gt;

&lt;p&gt;The most expensive log line is the one nobody reads.&lt;/p&gt;

&lt;p&gt;Three sub-patterns I see repeatedly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CloudWatch Logs with retention set to &lt;code&gt;Never Expire&lt;/code&gt; or 365 days on application groups at DEBUG verbosity. One account had 2.1TB of INFO-level ALB access logs retained for 400 days. $1,900/month.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Datadog or New Relic ingesting every custom metric from every pod with high cardinality tags. Cardinality of &lt;code&gt;user_id&lt;/code&gt; as a metric tag scales linearly with your user base. One account had 180,000 unique metric series.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;X-Ray or APM tracing at 100% sample rate in production. 100% sampling is a staging default that leaked to prod and stayed there.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To check log retention across a region in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[?retentionInDays==`null` || retentionInDays&amp;gt;`30`].{Name:logGroupName,Retention:retentionInDays,StoredBytes:storedBytes}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set retention to 14-30 days on anything that isn't an audit or compliance log. Move compliance logs to S3 with a lifecycle rule to Glacier at day 30. Cost drops by 60-80% on the observability line.&lt;/p&gt;
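&lt;p&gt;Applying the retention change in bulk. The &lt;code&gt;staging&lt;/code&gt; naming convention here is an assumption; match your own:&lt;/p&gt;

```shell
# Set 14-day retention on every log group whose name contains "staging".
for lg in $(aws logs describe-log-groups \
    --query 'logGroups[?contains(logGroupName, `staging`)].logGroupName' \
    --output text); do
  aws logs put-retention-policy --log-group-name "$lg" --retention-in-days 14
done
```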

&lt;h2&gt;The pattern behind the patterns&lt;/h2&gt;

&lt;p&gt;None of these are architecture problems. They're attention problems. Every engineering org I audit has someone who could fix these in an afternoon. What's missing is the trigger to look.&lt;/p&gt;

&lt;p&gt;A quarterly external review catches all three before they compound. If you're curious what yours looks like, I run a priority audit at Rs 2,000 (~$25) via Razorpay with a 48-hour turnaround. Bring a Cost Explorer CSV and I'll tell you where your $3,400 is hiding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aicloudstrategist.com/audit" rel="noopener noreferrer"&gt;https://aicloudstrategist.com/audit&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anushka B, founder, AICloudStrategist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>saas</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Three Silent Cloud-Cost Patterns We Find in Every Series A-C SaaS Audit</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:19:40 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/the-three-silent-cloud-cost-patterns-we-find-in-every-series-a-c-saas-audit-4gg3</link>
      <guid>https://forem.com/aicloudstrategist/the-three-silent-cloud-cost-patterns-we-find-in-every-series-a-c-saas-audit-4gg3</guid>
      <description>&lt;p&gt;I read cloud bills, architecture diagrams, and CloudWatch dashboards for a living. Across 23 Series A-C SaaS environments last quarter — fintech, devtools, vertical SaaS, AWS and GCP — the same three patterns showed up &lt;em&gt;every time&lt;/em&gt;. None of them are exotic. None require a migration. They're just the specific line items that grow in the shadow of a product roadmap and nobody has the time to look at.&lt;/p&gt;

&lt;p&gt;Median finding across those 23 audits: &lt;strong&gt;$3,400 / month of addressable waste, with payback under 8 weeks.&lt;/strong&gt; The highest we found was $28,000 / month at a 180-person Series C. The smallest was $780 / month at a 55-person Series A. It's almost never zero.&lt;/p&gt;

&lt;p&gt;Here are the three, in order of how often we see them.&lt;/p&gt;

&lt;h2&gt;1. Savings Plan and Reserved Instance structures frozen at Series A&lt;/h2&gt;

&lt;p&gt;Most founders buy their first Savings Plan the day their CFO asks why AWS grew 3x last quarter. The plan is sized for the workload &lt;em&gt;that month&lt;/em&gt;. Then the product ships, traffic patterns shift, instance families get swapped (m5 → m6i → m7g), and the Savings Plan just sits there — committing to yesterday's architecture.&lt;/p&gt;

&lt;p&gt;What we find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coverage under 40% of on-demand eligible spend (the math only works past ~70%).&lt;/li&gt;
&lt;li&gt;Compute Savings Plans bought when EC2 Instance Savings Plans would have been cheaper (or vice versa).&lt;/li&gt;
&lt;li&gt;A 3-year all-upfront commitment from 18 months ago that is now 2x oversized because the team migrated half the workload to Fargate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical fix:&lt;/strong&gt; sell underused Standard RIs on the EC2 Reserved Instance Marketplace (Savings Plans can't be resold); layer a hybrid of 1-year Compute SP plus EC2 Instance SP sized to the stable baseline; leave 20–25% uncommitted for peak. Re-measure quarterly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical impact:&lt;/strong&gt; 12–22% reduction on compute line items. Payback: under 6 weeks.&lt;/p&gt;

&lt;h2&gt;2. Orphaned EBS volumes and cross-region data transfer&lt;/h2&gt;

&lt;p&gt;EBS is the cost line that grows while nobody is looking. Every CI/CD pipeline that spins up a testbed with a 100 GB gp3 root volume, every debug snapshot, every terminated-instance-whose-volume-was-not-terminated-with-it — it all accumulates on the monthly bill at $0.08/GB for gp3 or $0.125/GB for io1/io2. A 50-engineer team can easily ship 4–6 TB of orphaned volumes per year.&lt;/p&gt;

&lt;p&gt;Cross-region data transfer is worse because it does not show up in Cost Explorer's default view. It lives under &lt;code&gt;DataTransfer&lt;/code&gt; which most teams filter out as "infrastructure noise." It is not noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS replica in us-east-1, application in us-west-2 — every query pays inter-region egress.&lt;/li&gt;
&lt;li&gt;S3 bucket in ap-south-1, ECS tasks in ap-southeast-1 — every object read pays $0.02/GB.&lt;/li&gt;
&lt;li&gt;CloudWatch Logs cross-account export — charged both at source and target.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical fix:&lt;/strong&gt; a scheduled cleanup job that snapshots and deletes volumes left unattached for more than 30 days; VPC gateway endpoints for S3 and DynamoDB (they are free and eliminate NAT gateway charges for that traffic); move stateful dependencies into the same region as their consumers; put a weekly cross-region egress diff in the engineering stand-up.&lt;/p&gt;
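&lt;p&gt;The S3 gateway endpoint from that fix list, as a one-liner. All IDs and the region embedded in the service name are placeholders:&lt;/p&gt;

```shell
# Free gateway endpoint: S3 traffic bypasses the NAT gateway entirely.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234
```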

&lt;p&gt;&lt;strong&gt;Typical impact:&lt;/strong&gt; $400–$4,500 / month recovered. Payback: 1–3 weeks.&lt;/p&gt;

&lt;h2&gt;3. Observability that scaled past $5k/month without a decision&lt;/h2&gt;

&lt;p&gt;This is the one nobody wants to talk about because the whole team uses the dashboards. But when observability tooling grows faster than product revenue — which it does almost by default — something is off.&lt;/p&gt;

&lt;p&gt;The specific patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datadog / New Relic ingesting every container log&lt;/strong&gt; at $0.10/GB, when 70% of those logs are ALB access patterns that nobody reads and that already live in S3 for 10% of the cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom metric cardinality explosions&lt;/strong&gt; — a metric with a &lt;code&gt;customer_id&lt;/code&gt; tag has 15,000x the billing footprint of the same metric with a &lt;code&gt;tenant_tier&lt;/code&gt; tag. We have seen single metrics costing $1,800/month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;APM covering every service including the 40% of the stack that is stable, stateless, and already tested.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical fix:&lt;/strong&gt; ship high-volume logs to S3 first, let the observability vendor rehydrate on demand (every major vendor supports this now); audit custom metric cardinality quarterly; APM only on services where the p95 latency directly affects user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical impact:&lt;/strong&gt; 30–55% reduction in observability spend without losing a single actionable signal. We have taken one team from $14k/month to $5.2k/month on Datadog without turning anything material off.&lt;/p&gt;

&lt;h2&gt;Why nobody catches these internally&lt;/h2&gt;

&lt;p&gt;These three patterns share one property: &lt;em&gt;they do not break anything.&lt;/em&gt; Nothing alerts. Nothing degrades. Nothing is urgent. So they live in the "review next quarter" column of an engineering backlog forever.&lt;/p&gt;

&lt;p&gt;Cloud bills are an attention problem before they are a finance problem. If nobody's whole job is to sit with the billing console for a few hours and write down what is there, it does not get written down. Most teams cannot justify a headcount for that; the spending curve has not hurt enough yet.&lt;/p&gt;

&lt;h2&gt;Where to start if you want to look yourself&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost Explorer, group by Usage Type, filter last 30 days, sort descending.&lt;/strong&gt; The top 10 rows explain 85% of the bill. Anything you cannot instantly justify in one sentence is a candidate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Compute Optimizer&lt;/strong&gt; — free, underused. It flags instances running at under 40% utilization over 14 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trusted Advisor "Cost Optimization" checks&lt;/strong&gt; — also free, surfaces low-utilization EC2, unassociated Elastic IPs, idle load balancers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For GCP:&lt;/strong&gt; Cloud Billing → Reports → group by SKU, then by Project, last 90 days. Sort descending. Look for any SKU whose monthly growth exceeds your user growth.&lt;/li&gt;
&lt;/ol&gt;
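&lt;p&gt;If you prefer to script step 1, the ranking is a one-screen job once you export the rows (from the console, or a &lt;code&gt;boto3&lt;/code&gt; Cost Explorer call). A minimal sketch — the line items below are invented; substitute your own export:&lt;/p&gt;

```python
# Rank usage types by monthly cost and show the cumulative share of the
# bill they explain. The rows are made-up examples, not real billing data.
rows = [
    ("BoxUsage:m5.2xlarge", 6200.0),
    ("DataTransfer-Out-Bytes", 3100.0),
    ("TimedStorage-ByteHrs", 2400.0),
    ("NatGateway-Hours", 900.0),
    ("CW:MetricMonitorUsage", 700.0),
    ("EBS:VolumeUsage.gp3", 650.0),
]

total = sum(cost for _, cost in rows)
ranked = sorted(rows, key=lambda r: r[1], reverse=True)

running = 0.0
for usage_type, cost in ranked[:10]:
    running += cost
    print(f"{usage_type:25s} ${cost:8,.0f}   cumulative {100 * running / total:5.1f}%")
```

&lt;p&gt;Anything in the top rows you cannot justify in one sentence goes on the audit list.&lt;/p&gt;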

&lt;p&gt;If any of the line items surprise you, the audit is worth doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Or skip the hunt
&lt;/h2&gt;

&lt;p&gt;We wrote a 24-hour written audit exactly for this. Four fields, no call required, delivered as a short PDF with 3–5 ranked findings and dollar impact. Free tier, or a ₹2,000 / ~$25 Priority tier (12-hour turnaround, credited against any follow-on engagement) — whichever fits. &lt;a href="https://aicloudstrategist.com/audit" rel="noopener noreferrer"&gt;aicloudstrategist.com/audit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The only reason to skip is if you already know these patterns in your own stack. Most teams don't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anushka B is the founder of &lt;a href="https://aicloudstrategist.com" rel="noopener noreferrer"&gt;AICloudStrategist&lt;/a&gt;, a written-first cloud consultancy for Series A-C SaaS. Seven years of cloud architecture work across AWS and GCP. Writes at &lt;a href="https://aicloudstrategist.com/blog.html" rel="noopener noreferrer"&gt;aicloudstrategist.com/blog&lt;/a&gt;. Reach her at &lt;a href="mailto:contact@aicloudstrategist.com"&gt;contact@aicloudstrategist.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>saas</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Cost per 1,000 inferences: the AI workload metric founders keep missing</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:11:21 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/cost-per-1000-inferences-the-ai-workload-metric-founders-keep-missing-47o2</link>
      <guid>https://forem.com/aicloudstrategist/cost-per-1000-inferences-the-ai-workload-metric-founders-keep-missing-47o2</guid>
      <description>&lt;p&gt;Ask a founder how much their AI feature costs to run. Nine out of ten will tell you the monthly API bill. Maybe they'll quote the GPU spend. What almost none of them can tell you is the cost to serve one user action — one summarisation, one recommendation, one chat completion.&lt;/p&gt;

&lt;p&gt;That number is &lt;strong&gt;cost-per-1000-inferences (CP1Ki)&lt;/strong&gt;. It is the unit economics of your AI product. Without it, you cannot price correctly. You cannot decide when to switch models. You cannot tell your CFO why the AI line item jumped 40 percent last quarter without looking like you've lost control of the system you built.&lt;/p&gt;

&lt;p&gt;This post walks through exactly how to calculate it, shows a worked comparison across three common stacks, and explains how to instrument it so the number updates itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "inference" means here
&lt;/h2&gt;

&lt;p&gt;One inference = one round-trip through your model: prompt in, completion out. A user clicking "Summarise this document" triggers one inference. A multi-turn chat session that generates ten responses triggers ten. For batch jobs, each item processed is one inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CP1Ki = total cost to serve 1,000 inferences in a given period.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is it. Simple denominator, hard-to-get numerator — because the numerator varies by model, by hosting mode, by utilisation, and by prompt design.&lt;/p&gt;

&lt;h2&gt;
  
  
  The calculation: managed APIs
&lt;/h2&gt;

&lt;p&gt;For managed APIs (Anthropic Claude, OpenAI, Google Gemini via their direct or cloud endpoints), the cost structure is token-based:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total cost = (input tokens × input price) + (output tokens × output price)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To get CP1Ki, you need two more things: average tokens per inference, and the model's published rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CP1Ki = ((avg_input_tokens × input_$/1M) + (avg_output_tokens × output_$/1M)) × 1000 / 1,000,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or simplified: &lt;code&gt;CP1Ki = (avg_tokens_per_inference × blended_$/1M_tokens) / 1000&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The blended rate depends heavily on your input/output ratio. Most product use cases are input-heavy — system prompts, document context, retrieved chunks. A summarisation task might be 2,000 input tokens to 300 output tokens. A coding assistant might flip that ratio. Measure your actual distribution before benchmarking.&lt;/p&gt;
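&lt;p&gt;The formula is small enough to keep as a helper next to your pricing constants. A minimal sketch (the rates and token counts are the example figures used in the worked comparison, not live prices):&lt;/p&gt;

```python
# CP1Ki: cost in USD to serve 1,000 inferences, from average token
# counts and the model's published per-million-token rates.
def cp1ki(avg_in_tokens, avg_out_tokens, in_usd_per_1m, out_usd_per_1m):
    per_inference = (avg_in_tokens * in_usd_per_1m
                     + avg_out_tokens * out_usd_per_1m) / 1_000_000
    return per_inference * 1000

# Summarisation profile: 1,800 tokens in, 400 tokens out.
print(round(cp1ki(1800, 400, 0.80, 4.00), 2))   # Haiku-class rates
print(round(cp1ki(1800, 400, 2.50, 10.00), 2))  # GPT-4o-class rates
```

&lt;p&gt;The two calls reproduce the Stack 1 and Stack 2 numbers in the comparison below ($3.04 and $8.50).&lt;/p&gt;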

&lt;h2&gt;
  
  
  Worked example: same task, three stacks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Document summarisation — 1,800 input tokens (system prompt + document), 400 output tokens. Medium complexity, no streaming edge cases. Target: 10,000 inferences/day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stack 1 — Claude 3.5 Haiku (Anthropic API)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Input: $0.80 / 1M tokens&lt;/li&gt;
&lt;li&gt;Output: $4.00 / 1M tokens&lt;/li&gt;
&lt;li&gt;Per inference: (1800 × 0.80 + 400 × 4.00) / 1,000,000 = (1,440 + 1,600) / 1,000,000 = $0.00304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CP1Ki: $3.04&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Daily cost at 10K inferences: &lt;strong&gt;$30.40&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stack 2 — GPT-4o (OpenAI API)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Input: $2.50 / 1M tokens&lt;/li&gt;
&lt;li&gt;Output: $10.00 / 1M tokens&lt;/li&gt;
&lt;li&gt;Per inference: (1800 × 2.50 + 400 × 10.00) / 1,000,000 = (4,500 + 4,000) / 1,000,000 = $0.0085&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CP1Ki: $8.50&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Daily cost at 10K inferences: &lt;strong&gt;$85.00&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stack 3 — Llama 3 70B, self-hosted on AWS g5.xlarge
&lt;/h3&gt;

&lt;p&gt;The g5.xlarge has one A10G GPU (24GB VRAM). One caveat the sizing often misses: a 70B model's weights run roughly 35–40GB even at 4-bit quantisation, so they do not fit on a single A10G; in practice this stack means an 8B-class model on the g5.xlarge, or the 70B sharded across a multi-GPU g5. Treat the single-instance arithmetic below as a floor for the self-hosted cost, and expect throughput to drop under concurrent load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-Demand rate (us-east-1): $1.006/hour. 1-year Reserved: ~$0.636/hour. Use Reserved for any steady-state workload.&lt;/li&gt;
&lt;li&gt;At Reserved rate, monthly GPU cost: $0.636 × 730 = &lt;strong&gt;$464/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add: EC2 storage (50GB gp3) ≈ $4/month, data transfer ≈ $5/month (internal traffic), monitoring overhead ≈ $3/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total infrastructure: ~$476/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;At 10K inferences/day × 30 days = 300,000 inferences/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CP1Ki: ($476 / 300,000) × 1,000 = $1.59&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On paper, self-hosted wins by 2x on CP1Ki at this volume. But there are four numbers this calculation does not include: engineer time to maintain vLLM config and model updates (~4 hrs/month at senior rates), the cost of the g5.xlarge sitting at 30 percent utilisation on weekends, latency SLA misses when the single instance queues up under burst load, and the on-call rotation that now owns a GPU. At 300K inferences/month, the self-hosted advantage often disappears once you account for fully-loaded operational cost.&lt;/p&gt;

&lt;p&gt;The crossover point — where self-hosted infrastructure genuinely beats managed API cost including ops overhead — is typically above &lt;strong&gt;2–3M inferences/day&lt;/strong&gt; for a team without existing MLOps tooling. Our &lt;a href="https://aicloudstrategist.com/ai-gpu-audit.html" rel="noopener noreferrer"&gt;AI/GPU Cost Audit Checklist&lt;/a&gt; has a worksheet that lets you plug in your actual volumes and team rates to find your specific crossover.&lt;/p&gt;
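&lt;p&gt;The crossover logic itself is a few lines. A back-of-envelope sketch — every constant here is an assumption to replace with your own measurements (your measured CP1Ki, your reserved GPU rate, the throughput one instance actually sustains, your fully-loaded ops cost):&lt;/p&gt;

```python
# Compare managed-API vs self-hosted monthly cost as volume grows.
API_CP1KI = 3.04          # USD per 1,000 inferences (managed API, assumed)
GPU_MONTHLY = 476.0       # USD per reserved g5.xlarge incl. overheads
GPU_CAPACITY = 1_500_000  # inferences/month one instance sustains (assumed)
OPS_MONTHLY = 600.0       # USD, ~4 senior engineer hours/month (assumed)

def api_cost(monthly_inferences):
    return monthly_inferences / 1000 * API_CP1KI

def self_hosted_cost(monthly_inferences):
    gpus = -(-monthly_inferences // GPU_CAPACITY)  # ceiling division
    return gpus * GPU_MONTHLY + OPS_MONTHLY

for daily in (10_000, 100_000, 1_000_000):
    monthly = daily * 30
    print(f"{daily:9,}/day   api ${api_cost(monthly):10,.0f}   self ${self_hosted_cost(monthly):10,.0f}")
```

&lt;p&gt;The crossover this prints is extremely sensitive to &lt;code&gt;GPU_CAPACITY&lt;/code&gt; and &lt;code&gt;OPS_MONTHLY&lt;/code&gt;: fold real on-call, SLA headroom, and weekend utilisation into the ops line and the crossover moves far to the right, which is exactly why plugging in your own fully-loaded numbers matters.&lt;/p&gt;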

&lt;h2&gt;
  
  
  Why CP1Ki changes your pricing decisions
&lt;/h2&gt;

&lt;p&gt;Most AI product teams price on intuition or competitor benchmarking. Neither is a business model.&lt;/p&gt;

&lt;p&gt;If your CP1Ki is $3.04 (Haiku stack above) and your product charges ₹299/month for unlimited summarisations, you need to know how many summarisations a power user runs before you're underwater. At ₹299 ≈ $3.60, a user who runs 1,000 summarisations a month leaves you exactly $0.56 of gross margin before you pay for servers, support, and salary. A user who runs 10,000 costs you $30.40 against the same $3.60 of revenue: a $26.80 loss that wipes out the margin of roughly 48 other paying users.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. I see it regularly in FinOps audits of AI-native startups. The unit economics were never calculated; the product was priced on vibes.&lt;/p&gt;

&lt;p&gt;CP1Ki also tells you when to switch models mid-product. If your quality threshold is met by Haiku, there is no business case for GPT-4o at 2.8x the cost per inference. But if a specific feature — legal clause analysis, code generation — requires GPT-4o quality, you can route that feature specifically to the expensive model and keep the rest of the product on Haiku. That routing decision needs CP1Ki to justify itself in a board update.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to instrument it
&lt;/h2&gt;

&lt;p&gt;You cannot manage what you do not measure at the right granularity. API bills and GPU invoices tell you monthly totals. You need per-feature, per-inference data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Tag every inference call at the call site.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add a metadata tag to every API call that identifies the feature generating it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=500,
    messages=[...],
    metadata={"user_id": user_id}  # Anthropic API
)
# Log separately: feature_tag="document_summariser", tokens_in=X, tokens_out=Y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For self-hosted models behind vLLM or Triton, tag at the reverse proxy layer (Nginx, Envoy) or in your application middleware before the model call. The tag should carry: feature name, user tier, model version, timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Export billing data by tag into your data warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For managed APIs: pull usage logs from the provider's API (Anthropic Usage API, OpenAI Usage endpoint, AWS Cost and Usage Report for Bedrock). Join on timestamp to your application log's feature tag. For self-hosted: allocate GPU cost by the fraction of total requests that feature generated in that billing period.&lt;/p&gt;

&lt;p&gt;A simple dbt model or even a spreadsheet pivot on feature_tag × (tokens_in + tokens_out) × rate gives you CP1Ki per feature, updated daily.&lt;/p&gt;
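&lt;p&gt;As a sketch of what that pivot computes — plain Python standing in for the dbt model, with illustrative field names and the example rates from earlier:&lt;/p&gt;

```python
# Roll per-call log records up into CP1Ki per feature.
from collections import defaultdict

RATE_IN, RATE_OUT = 0.80, 4.00  # USD per 1M tokens (example rates)

calls = [  # one record per inference, joined from app logs + usage export
    {"feature": "document_summariser", "tokens_in": 1800, "tokens_out": 400},
    {"feature": "document_summariser", "tokens_in": 1750, "tokens_out": 380},
    {"feature": "chat", "tokens_in": 900, "tokens_out": 250},
]

cost = defaultdict(float)
count = defaultdict(int)
for c in calls:
    cost[c["feature"]] += (c["tokens_in"] * RATE_IN
                           + c["tokens_out"] * RATE_OUT) / 1_000_000
    count[c["feature"]] += 1

cp1ki_by_feature = {f: 1000 * cost[f] / count[f] for f in cost}
for feature, value in cp1ki_by_feature.items():
    print(feature, round(value, 2))
```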

&lt;p&gt;&lt;strong&gt;Step 3: Alert on CP1Ki drift.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Set a threshold alert: if CP1Ki for any feature exceeds 120 percent of its 30-day baseline, page the on-call engineer. Common causes — prompt bloat (someone added 800 tokens to the system prompt), model version change, or a bug causing retry storms. Catching this early has saved clients $8,000–$15,000 in a single incident.&lt;/p&gt;
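&lt;p&gt;The drift check is a few lines on top of that table. A sketch with made-up baseline and current values (wire in the real ones from your warehouse, and swap &lt;code&gt;print&lt;/code&gt; for your pager of choice):&lt;/p&gt;

```python
# Flag any feature whose current CP1Ki exceeds 120% of its 30-day baseline.
THRESHOLD = 1.20

baseline_cp1ki = {"document_summariser": 3.04, "chat": 1.72}  # 30-day avg
current_cp1ki  = {"document_summariser": 3.10, "chat": 2.45}  # today

breaches = {
    feature: current_cp1ki[feature]
    for feature, base in baseline_cp1ki.items()
    if current_cp1ki[feature] - THRESHOLD * base > 0
}
for feature, value in breaches.items():
    print(f"ALERT {feature}: CP1Ki {value} vs 30-day baseline {baseline_cp1ki[feature]}")
```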

&lt;h2&gt;
  
  
  The number you owe your team
&lt;/h2&gt;

&lt;p&gt;CP1Ki is not a finance metric. It is an engineering metric that finance can read. It tells your CTO where the AI spend is going. It tells your product manager which features are subsidising which. It tells your pricing team the floor below which a plan cannot be profitable.&lt;/p&gt;

&lt;p&gt;If you cannot answer "what does it cost us to serve 1,000 of these AI responses?" for every AI feature in production, you are flying blind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Book an AI Architecture Review
&lt;/h2&gt;

&lt;p&gt;Our AI Architecture Review service calculates CP1Ki across your full model stack, identifies the routing decisions that would cut your inference cost by 30–60 percent, and delivers a prioritised implementation plan in two to three weeks.&lt;/p&gt;

&lt;p&gt;Download the &lt;a href="https://aicloudstrategist.com/ai-gpu-audit.html" rel="noopener noreferrer"&gt;AI/GPU Cost Audit Checklist&lt;/a&gt; first — it gives you the data you'll need to walk into that conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/book.html" rel="noopener noreferrer"&gt;Book a free 30-min Cloud Cost Health Check → aicloudstrategist.com/book.html&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/aws-cost-audit-india.html" rel="noopener noreferrer"&gt;AWS Cost Audit India: 7 Leaks a ₹5L/Month Bill Hides (with Real Numbers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/k8s-cost-questions.html" rel="noopener noreferrer"&gt;The Five Kubernetes Cost Questions Nobody on Your Platform Team Can Answer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>cloud</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Ask your logs in English: AI observability for 2026</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:10:35 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/ask-your-logs-in-english-ai-observability-for-2026-335o</link>
      <guid>https://forem.com/aicloudstrategist/ask-your-logs-in-english-ai-observability-for-2026-335o</guid>
      <description>&lt;p&gt;Every observability tool has two interfaces. The first is the product — dashboards, alerts, service maps, traces. Engineers learn it in the first week. The second is the query language — the thing you have to type when something is actually broken at 2 a.m. and the dashboard is not enough. &lt;strong&gt;That second interface is where most Indian SaaS engineering teams quietly give up on observability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch Logs Insights has its own syntax. Datadog has DQL. Splunk has SPL. New Relic has NRQL. Grafana Loki has LogQL. Each is subtly different. Each has its own reserved words, its own way of filtering, its own way of aggregating, its own quirks with timestamps and field extraction. When you're paying ₹40,000 a month for Datadog, the assumption is that your SREs know DQL well enough to answer ad-hoc questions. In our engagements across Bengaluru and Mumbai mid-market teams over the last year, that assumption has been wrong about 70% of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three people who can actually query your logs
&lt;/h2&gt;

&lt;p&gt;Pick any Series B Indian SaaS company with 40–100 engineers. Audit who has actually written a log query in the last 30 days. The answer is almost always the same shape: three people. The on-call SRE lead, who learned the query language during an incident and now everyone asks them. The founding engineer, who wrote the original logging infrastructure and still remembers why the &lt;code&gt;service&lt;/code&gt; field has dots instead of underscores. And one senior backend engineer who was bored one quarter and decided to read the docs.&lt;/p&gt;

&lt;p&gt;Everyone else — the thirty, forty, sixty other engineers on the team — either pings one of those three in Slack, or gives up. They never learn the query language because the cost-benefit is wrong: you learn it once, use it twice, forget it, then re-learn it next quarter when you need it again. Nobody builds fluency in a language they use six times a year.&lt;/p&gt;

&lt;p&gt;So what happens? The observability tool goes underutilised. You're paying Datadog rates, but your engineers are grepping through CloudWatch console manually, or worse, asking each other to paste logs into Slack. We've audited accounts where &lt;strong&gt;less than 8% of the engineering team had ever written a single log query in the past quarter&lt;/strong&gt;. That's not an observability problem. That's a language problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AI log console actually changes
&lt;/h2&gt;

&lt;p&gt;When we launched AICloud Observe this month, we made a straightforward bet: the query language is a legacy interface. The actual interface is plain English, and the translation layer is an LLM. A developer asks "show me the top 10 endpoints by p99 latency in the last 6 hours", and the console generates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, @message, @duration
| parse @message /endpoint=(?&amp;lt;endpoint&amp;gt;\S+)/
| filter @duration &amp;gt; 0
| stats pct(@duration, 99) as p99 by endpoint
| sort p99 desc
| limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's a real CloudWatch Logs Insights query: valid syntax, generated and executed for you. It comes back with rows, a chart, and three follow-up prompts — "narrow to a specific service", "compare with yesterday", "show the slowest single request in each endpoint". The developer never had to remember the &lt;code&gt;parse&lt;/code&gt; syntax, or that percentile is &lt;code&gt;pct&lt;/code&gt; in Insights (not &lt;code&gt;percentile&lt;/code&gt;, not &lt;code&gt;p99&lt;/code&gt;). They just asked a question.&lt;/p&gt;

&lt;p&gt;This is not a marketing pitch; it is a pricing-model decision. When the query interface is English, the number of engineers who can use your observability tool jumps from three to fifty. That changes the math on what you're paying Datadog for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the timing is right — specifically for Indian SaaS
&lt;/h2&gt;

&lt;p&gt;Three things made this practical in 2026 that were not practical in 2023. First, LLMs got good enough at structured output that a well-scoped prompt generates syntactically correct query language 95%+ of the time. We validated this against a corpus of 400 real ad-hoc log questions sampled from SRE Slack channels across four Indian SaaS companies — Claude generates valid Insights queries on the first attempt 97% of the time, and with one round of self-correction, 99.2%. Second, inference pricing dropped far enough that running 20 log queries per engineer per day costs less than the seat license of any vendor APM. Third, and specifically for the Indian market, vendor pricing in USD against INR revenue is under more scrutiny from founders than it has been in years.&lt;/p&gt;
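&lt;p&gt;The "one round of self-correction" loop described above is structurally simple. A sketch of its shape — &lt;code&gt;generate_query&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are stand-ins for the LLM call and a dry-run against the Logs Insights parser, not our actual implementation:&lt;/p&gt;

```python
# Generate a query, validate it, and retry once with the error fed back.
def generate_query(question, error=None):
    # Stand-in for the LLM call; a retry includes the parser error in the prompt.
    if error is None:
        return "fields @timestamp | stats count() by bin(5m"   # invalid draft
    return "fields @timestamp | stats count() by bin(5m)"      # corrected

def validate(query):
    # Stand-in syntax check; here, just balanced parentheses.
    return query.count("(") == query.count(")")

def english_to_query(question, max_retries=1):
    query = generate_query(question)
    for _ in range(max_retries):
        if validate(query):
            break
        query = generate_query(question, error="unbalanced parentheses")
    return query

print(english_to_query("how many log events per 5 minutes?"))
```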

&lt;p&gt;That third point deserves a sentence. A 50-engineer team on Datadog at $70/host/month with 200 hosts is spending &lt;strong&gt;₹11.6 lakh per month&lt;/strong&gt; on observability vendor fees alone. That number is visible on every monthly burn review. The question "why does this cost more than our entire SRE team's headcount?" is a question every Indian CTO I've spoken to in the last six months has asked out loud at least once. When the answer is "because three of fifty engineers use it regularly", the economics collapse.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AICloud Observe baseline audit actually does
&lt;/h2&gt;

&lt;p&gt;We put a ₹3,500 observability baseline audit on the website this week. It's a flat fee, GST-inclusive, 24-hour turnaround. A customer shares a CloudWatch or Datadog export, answers five questions about their stack, and we produce a PDF scored across nine categories: log retention (we almost always find log groups on never-expire retention that should be on 7 or 30 days), alert coverage (we usually find 2–4 services with zero alarms), tracing (X-Ray or OTel typically covers 1–2 services out of 8), SLOs (usually missing entirely), compute insights, network observability, synthetics, cost-vs-signal, and a migration plan if the current vendor spend is not pulling its weight.&lt;/p&gt;

&lt;p&gt;Each finding carries a severity, an estimated monthly savings in INR, and an effort rating. The report averages 9 findings in our pilot engagements. Total recoverable spend across the pilot set averaged &lt;strong&gt;₹82,000 per month&lt;/strong&gt;, with the biggest individual line items being CloudWatch Logs retention waste (₹10K–₹1L/month per company) and over-provisioned Datadog host agents on ephemeral Auto Scaling fleets (₹20K–₹60K/month).&lt;/p&gt;

&lt;p&gt;But the audit is the pretext. The product is what comes after. Customers who buy the audit get access to &lt;code&gt;/logs.html&lt;/code&gt;, the AI log console, on a subscription: ₹2,999/month Basic, ₹9,999 Pro, ₹49,999 Unlimited. The Basic tier is priced at what one extra Datadog seat would cost, and it replaces the need for that seat entirely for 80% of ad-hoc questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does not replace
&lt;/h2&gt;

&lt;p&gt;Being honest: the AI log console is not a replacement for dashboards, alerts, or traces. Dashboards answer the questions you already know to ask — the console answers the ones you didn't. Alerts page you when something is wrong — the console helps you diagnose what. Traces show you the shape of a single request — the console helps you find which requests to trace.&lt;/p&gt;

&lt;p&gt;We deliberately scoped the first release narrowly: CloudWatch Logs Insights today, Datadog DQL and Grafana Loki LogQL on the roadmap for Q3 2026. We are not trying to replace your APM. We are trying to make the query interface stop being the reason nobody uses it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The claim, in one sentence
&lt;/h2&gt;

&lt;p&gt;If fewer than 30% of your engineering team can write an ad-hoc log query from scratch right now — and in our experience that's the median — you are overpaying for observability regardless of which vendor you're with, because the thing that turns observability into value is &lt;em&gt;asking questions&lt;/em&gt;, and the language barrier is the reason nobody does. An AI-native log console doesn't make your observability better. It makes the observability you already paid for accessible. For most Indian SaaS teams in 2026, that's the larger of the two gaps.&lt;/p&gt;




&lt;h3&gt;
  
  
  Want to see this on your own account?
&lt;/h3&gt;

&lt;p&gt;The AICloud Observe baseline audit is ₹3,500, flat fee, GST-inclusive. You get a scored posture report, a list of findings with estimated monthly savings, and access to the AI log console for paying customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aicloudstrategist.com/observe.html" rel="noopener noreferrer"&gt;Start the ₹3,500 observability audit →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Published 19 April 2026 · Written by &lt;a href="https://aicloudstrategist.com/author/anushka-b.html" rel="noopener noreferrer"&gt;Anushka B&lt;/a&gt;, founder of AICloudStrategist. If you run observability for an Indian SaaS team and have a contrary view on any of the claims above, I'd like to hear it: &lt;a href="mailto:anushka@aicloudstrategist.com"&gt;anushka@aicloudstrategist.com&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/why-indian-saas-gives-up-on-observability.html" rel="noopener noreferrer"&gt;Why Indian SaaS Teams Quietly Give Up on Observability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/datadog-alternatives-india.html" rel="noopener noreferrer"&gt;Why Indian Mid-Market SaaS Should Stop Paying Datadog ₹10L/Month&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>DORA metrics for the CFO: making engineering velocity legible</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:04:48 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/dora-metrics-for-the-cfo-making-engineering-velocity-legible-j9l</link>
      <guid>https://forem.com/aicloudstrategist/dora-metrics-for-the-cfo-making-engineering-velocity-legible-j9l</guid>
      <description>&lt;p&gt;The most useful conversation I've had in a DevOps engagement wasn't with an SRE or a platform engineer. It was with a CFO.&lt;/p&gt;

&lt;p&gt;Her engineering team had been asking for budget to invest in CI/CD pipeline upgrades, better observability tooling, and a dedicated platform engineer. The request had been deprioritised twice. &lt;em&gt;"We don't see the ROI,"&lt;/em&gt; she said.&lt;/p&gt;

&lt;p&gt;I pulled up four numbers. By the end of the call, the project was funded.&lt;/p&gt;

&lt;p&gt;Those four numbers were DORA metrics — and every engineering leader who has ever lost a budget battle should know how to translate them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DORA metrics actually measure
&lt;/h2&gt;

&lt;p&gt;DORA (DevOps Research and Assessment) is the largest longitudinal study of software delivery performance — seven-plus years of data, tens of thousands of teams, published annually by Google Cloud's research arm. The programme identifies four key metrics that predict both engineering performance and business outcomes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How often the team ships to production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lead Time for Changes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time from code commit to live in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Change Failure Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Percentage of deployments that cause incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How quickly service is restored after an incident&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each metric has four performance tiers: Elite, High, Medium, and Low. Elite teams deploy on demand, maintain sub-one-hour lead times, hold change failure rates below 5%, and recover from incidents in under an hour. Low-tier teams deploy a handful of times per year and take weeks to recover from serious incidents.&lt;/p&gt;

&lt;p&gt;The research finding that changes the CFO conversation: &lt;strong&gt;Elite teams are twice as likely to meet commercial goals and have 50% lower change failure rates than Low-performing teams.&lt;/strong&gt; DORA's longitudinal, multi-year design supports the causal direction — engineering performance drives business outcomes, not the other way around.&lt;/p&gt;

&lt;h2&gt;
  
  
  The translation layer
&lt;/h2&gt;

&lt;p&gt;CFOs don't read engineering dashboards. They read income statements. Here is how each DORA metric maps to a number they care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lead Time → Time to Revenue
&lt;/h3&gt;

&lt;p&gt;Every day a completed feature sits in a review queue, a staging environment, or a manual approval process is a day it isn't generating revenue. If your lead time is three weeks and your competitors are shipping in hours, you are running a three-week revenue delay on every roadmap item. At scale, this compounds: a product team shipping 12 major features per year, each carrying a 20-day lead time overage versus Elite, is deferring months of incremental revenue per year — not because the engineers are slow, but because the pipeline is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Frequency → Feature Velocity
&lt;/h3&gt;

&lt;p&gt;Low deploy frequency forces big-batch releases. Big batches mean longer feedback loops, more complex merge conflicts, higher rework rates, and slower product iteration. A team deploying twice a month cannot respond to user signal in time to close Q3 with the product improvements the sales team promised. Deployment frequency is engineering throughput, and engineering throughput is the rate at which the roadmap moves from specification to customer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Failure Rate → Customer Churn Risk
&lt;/h3&gt;

&lt;p&gt;A 45% change failure rate — typical for Low-tier teams — means nearly half of all production deployments create an incident. Each incident carries three costs: direct engineering time to diagnose and remediate, customer-facing downtime that erodes NPS and renewal rates, and SLA penalties for enterprise accounts that have them written into contracts. The churn math is unforgiving: one enterprise customer lost to a preventable incident can eliminate the ROI of an entire quarter's engineering investment.&lt;/p&gt;

&lt;h3&gt;
  
  
  MTTR → Revenue at Risk Per Incident
&lt;/h3&gt;

&lt;p&gt;This is the metric CFOs understand most viscerally once the arithmetic is in front of them. Take your annualised revenue, divide by 8,760 hours, and multiply by your average incident duration. For a ₹30 crore ARR company, one hour of production downtime is worth approximately ₹34,000 in lost revenue — before accounting for SLA credits, support cost, and the soft cost of a customer who quietly decides not to renew. A Low-tier team with a five-day MTTR on serious incidents and four P1 incidents per year is carrying &lt;strong&gt;₹30–40 lakh of annual revenue exposure from downtime alone&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Low-tier DORA actually costs: a worked INR example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Company profile:&lt;/strong&gt; B2B SaaS, ₹30 Cr ARR, 35-person engineering team, Low DORA tier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy frequency: 2x/month&lt;/li&gt;
&lt;li&gt;Lead time: 3 weeks (21 days)&lt;/li&gt;
&lt;li&gt;Change failure rate: 45%&lt;/li&gt;
&lt;li&gt;MTTR: 5 days average for P1 incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Velocity tax — lead time versus Elite baseline:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
12 major features per year, each generating ₹3L/month incremental revenue once live. At three weeks' delay versus same-day Elite deployment, each feature is deferred 20 days. That is 20/30 × ₹3L = ₹2L per feature, unrealised.&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;₹24L in deferred revenue annually&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rework cost — change failure rate at 45%:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
24 deployments per year × 45% failure rate = 11 broken releases. Each broken release pulls 3 senior engineers for an average of 2 days to diagnose, fix, and redeploy. Fully-loaded engineer cost at ₹1.2L/month = ₹5,500/day.&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;₹3.6L in direct rework cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Downtime cost — MTTR 5 days, 4 P1 incidents per year:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
4 incidents × 24 hours of customer-facing impact each = 96 incident-hours. ₹30Cr ARR ÷ 8,760 hours = ₹34,250/hour.&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;₹33L in direct revenue loss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Churn risk from incidents:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
3 enterprise accounts at elevated churn risk per incident year, ₹8L ARR each. Conservative 25% churn conversion.&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;₹6L in at-risk ARR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total annual cost of Low-tier DORA performance: ~₹67L&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is roughly ₹5.5L per month (the fully-loaded cost of four to five senior engineers) leaking out of a 35-person engineering organisation every month. The CFO conversation becomes: are we spending ₹5.5L/month to &lt;em&gt;not&lt;/em&gt; fix the pipeline, or are we spending ₹1.5–2L on a platform engineering sprint that eliminates most of it?&lt;/p&gt;

&lt;p&gt;The second number is a project proposal. The first number is a recurring P&amp;amp;L line the finance team doesn't know they're carrying.&lt;/p&gt;
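&lt;p&gt;The whole worked example reduces to a dozen lines of arithmetic (figures in lakh INR; note that ₹30Cr ÷ 8,760 hours is ≈ ₹34,250/hour). Swap in your own ARR, team size, and incident counts:&lt;/p&gt;

```python
# Annual cost of Low-tier DORA performance, in lakh (L) INR,
# using the assumptions stated in the worked example above.
ARR_LAKH = 30 * 100                    # ₹30 Cr ARR

velocity_tax = 12 * 3.0 * (20 / 30)    # 12 features, ₹3L/mo each, 20-day delay

broken_releases = round(24 * 0.45)     # 24 deploys/yr at 45% failure rate
rework = broken_releases * 3 * 2 * 0.055   # 3 engineers x 2 days x ₹5,500/day

per_hour = ARR_LAKH / 8760             # revenue per hour of downtime
downtime = 4 * 24 * per_hour           # 4 P1s x 24h customer-facing impact

churn = 3 * 8.0 * 0.25                 # 3 accounts, ₹8L ARR, 25% conversion

total = velocity_tax + rework + downtime + churn
print(round(total, 1), "lakh per year;", round(total / 12, 2), "lakh per month")
```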

&lt;h2&gt;
  
  
  Start here: DORA Maturity Self-Assessment
&lt;/h2&gt;

&lt;p&gt;Before fixing your DORA tier, you need to measure it accurately. Most teams significantly overestimate their lead time performance — they measure "days in sprint" rather than "commit to production" — and undercount their change failure rate by logging only P1 incidents, while the P2s and P3s that consume 40% of sprint capacity go untracked.&lt;/p&gt;

&lt;p&gt;Our free &lt;strong&gt;DORA Maturity Self-Assessment&lt;/strong&gt; gives you a structured worksheet to measure all four metrics against their correct definitions, identify your current performance tier, and calculate the annual business cost using your actual ARR and team size. It takes under 20 minutes to complete and produces a one-page summary you can put in front of a CFO without a supporting slide deck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/devops-assessment.html" rel="noopener noreferrer"&gt;Download the DORA Maturity Self-Assessment →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Move your tier: DevOps &amp;amp; Platform Engineering
&lt;/h2&gt;

&lt;p&gt;Knowing you are in the Low tier is step one. Moving to High — or Elite — requires targeted infrastructure changes: deployment pipeline automation, progressive delivery with automated rollback, and observability wiring that compresses MTTR from days to minutes.&lt;/p&gt;

&lt;p&gt;Our DevOps &amp;amp; Platform Engineering service embeds DORA measurement from day one of every engagement. We instrument your four metrics accurately, build the CI/CD and observability tooling that drives them upward, and deliver a post-engagement DORA report you can put in front of your board — one that shows the tier shift, the business impact, and the cost of the work relative to the revenue recovered.&lt;/p&gt;

&lt;p&gt;The CFO who funded that project I mentioned at the start? Her team went from Low to High in one sprint. The lead time dropped from three weeks to two days. The change failure rate fell from 42% to 11%. The DORA report became the opening slide of their next board update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/book.html" rel="noopener noreferrer"&gt;Book a free 30-min Health Check →&lt;/a&gt;&lt;/strong&gt; — bring your four DORA metrics (or your best estimates), and we will show you which lever moves your tier fastest.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/weekly-review.html" rel="noopener noreferrer"&gt;The 15-Minute Weekly Cloud Cost Review Every Indian Mid-Market CTO Should Run&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/k8s-cost-questions.html" rel="noopener noreferrer"&gt;The Five Kubernetes Cost Questions Nobody on Your Platform Team Can Answer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>saas</category>
      <category>engineering</category>
    </item>
    <item>
      <title>The cross-region egress mistake that costs Indian SaaS ₹4L/month</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:01:33 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/the-cross-region-egress-mistake-that-costs-indian-saas-4lmonth-ifk</link>
      <guid>https://forem.com/aicloudstrategist/the-cross-region-egress-mistake-that-costs-indian-saas-4lmonth-ifk</guid>
      <description>&lt;p&gt;Two line items. One billing export query. &lt;strong&gt;₹14 lakh recovered per year.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is the short version of Case 2 from our proof library. The longer version is this: a funded SaaS company watched their GCP bill triple over eight months. Their engineering lead was confident nothing had changed architecturally. He was technically right — no new services, no traffic spike, no notable infrastructure expansion. What had changed was &lt;strong&gt;one region flag in one Terraform file&lt;/strong&gt;, written during a late-night sprint six months earlier.&lt;/p&gt;

&lt;p&gt;Their analytics warehouse had been provisioned in &lt;code&gt;us-east1&lt;/code&gt;. Their application ran in &lt;code&gt;us-central1&lt;/code&gt;. Every dbt run, every pipeline flush, every scheduled query crossed a regional boundary. The inter-region data transfer line was &lt;strong&gt;₹12.4 lakh per year — 94% of their total GCP egress bill&lt;/strong&gt;. Nobody had noticed because it appeared as a network line item, not an infrastructure line item, and nobody was watching network costs.&lt;/p&gt;

&lt;p&gt;The fix took one weekend. The monthly egress bill dropped from &lt;strong&gt;₹1.24 lakh to under ₹7,500&lt;/strong&gt;. Annualised recovery: &lt;strong&gt;₹14 lakh&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanism: why cross-region egress compounds silently
&lt;/h2&gt;

&lt;p&gt;GCP charges for data leaving a region — to the internet, to another GCP region, or to another cloud. Within the same continent, inter-region transfer is priced at &lt;strong&gt;$0.01/GB&lt;/strong&gt;. That sounds negligible until you account for what actually crosses that boundary in a typical data stack.&lt;/p&gt;

&lt;p&gt;Every data pipeline has a fan-out problem. A single dbt model refresh doesn't move one file — it moves the source tables, the intermediate materialisation, the result set, and the audit logs. A 10 GB raw dataset becomes 60–80 GB in transit by the time transformations, tests, and exports complete. Run that pipeline 8 times a day and your "small 10 GB table" is generating &lt;strong&gt;480–640 GB of inter-region traffic daily&lt;/strong&gt;. At $0.01/GB, that's $4.80–$6.40 per day, $144–$192 per month — per pipeline. Add five more pipelines, a real-time Pub/Sub feed, and a Dataflow job flushing to BigQuery, and the line item scales to thousands of dollars before anyone notices.&lt;/p&gt;
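&lt;p&gt;The fan-out arithmetic is worth scripting per pipeline, because the inputs — fan-out factor and runs per day — are the levers you can actually tune. A quick sketch using the ranges above:&lt;/p&gt;

```python
# Inter-region transfer cost of one pipeline, using the ranges above.
FAN_OUT_GB = (60, 80)    # a 10 GB raw dataset after transforms, tests, exports
RUNS_PER_DAY = 8
USD_PER_GB = 0.01        # intra-continent inter-region transfer rate

daily_gb = tuple(gb * RUNS_PER_DAY for gb in FAN_OUT_GB)
monthly_usd = tuple(round(gb * USD_PER_GB * 30) for gb in daily_gb)

print(daily_gb)      # (480, 640) GB/day
print(monthly_usd)   # (144, 192) USD/month, per pipeline
```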

&lt;p&gt;The compounding effect has three drivers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invisibility.&lt;/strong&gt; GCP's billing console groups network costs under a single service line. Without a billing export query filtering on SKU descriptions, you cannot tell whether $5,000 in "Networking" costs is internet egress, inter-region transfer, or Cloud CDN. Most teams do not run that query until something forces them to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Habituation.&lt;/strong&gt; Bills grow gradually. A 3x increase over eight months feels different from a 3x increase overnight. Engineers adapt to the new normal rather than investigating the delta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership gaps.&lt;/strong&gt; The team that provisioned the warehouse in the wrong region has often left. The team running dbt doesn't own the infrastructure config. Nobody holds the cross-cutting network cost line.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to detect it: the BigQuery billing export query
&lt;/h2&gt;

&lt;p&gt;If you have GCP billing export enabled to BigQuery — and you should — this query surfaces every inter-region transfer SKU ranked by cost in the last 30 days:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  project.id                                AS project_id,
  resource.location                         AS source_region,
  sku.description                           AS sku,
  ROUND(SUM(usage.amount_in_pricing_units), 2) AS total_gib,  -- network SKUs price per GiB
  ROUND(SUM(cost), 2)                       AS cost_usd_30d,
  ROUND(SUM(cost) * 12, 2)                  AS annualised_cost_usd
FROM
  `&amp;lt;YOUR_PROJECT&amp;gt;.&amp;lt;YOUR_DATASET&amp;gt;.gcp_billing_export_v1_*`
WHERE
  DATE(usage_start_time) &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND (
      LOWER(sku.description) LIKE '%inter region%'
   OR LOWER(sku.description) LIKE '%interregion%'
   OR (
        LOWER(sku.description) LIKE '%egress%'
    AND LOWER(sku.description) NOT LIKE '%internet%'
      )
  )
  AND cost &amp;gt; 0
GROUP BY
  1, 2, 3
ORDER BY
  cost_usd_30d DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Replace &lt;code&gt;&amp;lt;YOUR_PROJECT&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;YOUR_DATASET&amp;gt;&lt;/code&gt; with your billing export destination. The &lt;code&gt;*&lt;/code&gt; wildcard matches the billing-account-ID suffix in the standard export table name; if you use the detailed export, point the query at the &lt;code&gt;gcp_billing_export_resource_v1_*&lt;/code&gt; tables instead.&lt;/p&gt;

&lt;p&gt;What to look for: any row where &lt;code&gt;annualised_cost_usd&lt;/code&gt; exceeds $1,000 and the source region is your primary application region. If a single SKU line exceeds $5,000 annualised, you have a topology problem, not a pricing problem. Sort by &lt;code&gt;annualised_cost_usd&lt;/code&gt; descending and work top to bottom.&lt;/p&gt;

&lt;p&gt;If you haven't enabled billing export yet, do it now. GCP retains billing data in the console for 12 months; the export to BigQuery is the only way to query it programmatically and retain history beyond that window.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: three options, one right answer
&lt;/h2&gt;

&lt;p&gt;Once you've identified the offending traffic, you have three remediation paths in order of preference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Warehouse co-location.&lt;/strong&gt; Move the analytics resource into the same region as the application generating the data. This eliminates the transfer entirely. It is the correct fix in 80% of cases. The weekend migration in Case 2 was this: a BigQuery dataset recreation in &lt;code&gt;us-central1&lt;/code&gt;, a pipeline repoint, a dbt &lt;code&gt;profiles.yml&lt;/code&gt; change, a &lt;code&gt;terraform apply&lt;/code&gt;. No architectural change; one flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. VPC peering with regional enforcement.&lt;/strong&gt; If your services are spread across regions for legitimate reasons, establish VPC Network Peering within a single region and route internal traffic through it. Peered VPC traffic within a region does not incur inter-region charges. This is the right fix when multi-region deployment is intentional but data transfer patterns aren't designed around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scheduled sync with regional staging.&lt;/strong&gt; For scenarios where real-time cross-region transfer is unavoidable — a source system that cannot move, a partner integration with a fixed endpoint — implement a scheduled sync that batches transfers and lands data in a regional staging bucket. Downstream consumers read from the local copy. Transfer happens once, on schedule, not continuously for every query. This reduces both cost and latency for read-heavy workloads.&lt;/p&gt;

&lt;p&gt;What is not a fix: buying committed use discounts against a network topology that's wrong. You would be paying for the privilege of making the mistake more efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  When cross-region IS the right architecture
&lt;/h2&gt;

&lt;p&gt;Cross-region deployments are not inherently a cost mistake. There are two scenarios where they are the correct call and the egress cost is a justified line item:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Genuinely global customer base.&lt;/strong&gt; If you have users in North America, Europe, and APAC, serving them from a single region means high latency for two of the three groups. The performance cost to users exceeds the egress cost to you. Architect for proximity, monitor transfer costs per region as a known budget item, and optimise routing rather than trying to eliminate multi-region entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulatory data residency.&lt;/strong&gt; India's DPDP Act, the EU's GDPR, and several BFSI-sector mandates require specific data categories to remain within defined geographic boundaries. If your compliance posture requires &lt;code&gt;eu-west1&lt;/code&gt; to hold European customer records, that data cannot move to your application region in &lt;code&gt;us-central1&lt;/code&gt; without a legal review. Here, egress cost is the price of compliance. Model it explicitly rather than treating it as waste.&lt;/p&gt;

&lt;p&gt;The test: if cross-region transfer exists for &lt;em&gt;convenience&lt;/em&gt; — because the first engineer to provision the warehouse chose a region at random, or because a staging environment was never migrated to match production — that is waste. If it exists because your architecture requires geographic distribution, it is cost of goods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The ₹14 lakh Case 2 recovery was not from a complex optimisation. It was from reading a billing export, identifying one wrong region, and moving a dataset. The entire intervention was scoped and executed in under 72 hours.&lt;/p&gt;

&lt;p&gt;Most GCP environments we audit have at least one active inter-region topology mistake. The transfer line is small enough to miss in a high-level review and large enough to matter at year-end. The query above costs nothing to run.&lt;/p&gt;

&lt;p&gt;If you want a structured review — not just egress, but the full network cost profile, commitment coverage, and idle resource footprint — book a free Cloud Cost Health Check. We document every finding in writing before the first invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/book.html" rel="noopener noreferrer"&gt;Book your free Cloud Cost Health Check → aicloudstrategist.com/book.html&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cost of looking is zero. The cost of the wrong region flag is, apparently, ₹14 lakh a year.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/aws-cost-audit-india.html" rel="noopener noreferrer"&gt;AWS Cost Audit India: 7 Leaks a ₹5L/Month Bill Hides (with Real Numbers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/cost-explorer-blindspots.html" rel="noopener noreferrer"&gt;What Your AWS Cost Explorer Dashboard Is Not Showing You&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloud</category>
      <category>india</category>
    </item>
    <item>
      <title>Why 60% of Indian AWS accounts have RI coverage under 30%</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:55:46 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/why-60-of-indian-aws-accounts-have-ri-coverage-under-30-41hg</link>
      <guid>https://forem.com/aicloudstrategist/why-60-of-indian-aws-accounts-have-ri-coverage-under-30-41hg</guid>
      <description>&lt;p&gt;Every AWS account we audit has the same conversation.&lt;/p&gt;

&lt;p&gt;"What is our Reserved Instance coverage?"&lt;/p&gt;

&lt;p&gt;"High. Most of the fleet. Maybe 70, 75 percent."&lt;/p&gt;

&lt;p&gt;We pull the report. It is 28 percent.&lt;/p&gt;

&lt;p&gt;The gap between what engineering teams believe about their RI coverage and what the billing data actually shows is the single most expensive blind spot in mid-market AWS accounts. It is not a rounding error. It is a structural pricing penalty that compounds every month nobody looks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math, in rupees
&lt;/h2&gt;

&lt;p&gt;A t3.xlarge in ap-south-1 runs &lt;strong&gt;$0.1856 per hour On-Demand&lt;/strong&gt; and &lt;strong&gt;$0.1178 per hour on a 1-year No Upfront Reserved&lt;/strong&gt;. That is a &lt;strong&gt;36.5 percent discount&lt;/strong&gt; — for a keystroke.&lt;/p&gt;

&lt;p&gt;Here is what that means in practice. Say you run 20 t3.xlarge instances continuously. Six are RI-covered. Fourteen are not.&lt;/p&gt;

&lt;p&gt;Those fourteen On-Demand instances cost &lt;strong&gt;$1,896 a month&lt;/strong&gt;. If they were on a 1-year No Upfront RI, they would cost &lt;strong&gt;$1,202&lt;/strong&gt;. The delta is &lt;strong&gt;$694 a month. $8,330 a year.&lt;/strong&gt; For one instance family, in one region, at one commitment level.&lt;/p&gt;
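&lt;p&gt;The unhedged-position math is worth scripting so it can be rerun whenever the fleet changes. A minimal sketch with the rates quoted above; the article's $694/$8,330 figures use a slightly longer hours-per-month convention, so the output differs by about a dollar:&lt;/p&gt;

```python
# Unhedged on-demand position for the example fleet above.
HOURS_PER_MONTH = 730            # one common billing convention
ON_DEMAND_USD_HR = 0.1856        # t3.xlarge, ap-south-1
RI_1Y_NO_UPFRONT_USD_HR = 0.1178

uncovered = 14
od_monthly = uncovered * ON_DEMAND_USD_HR * HOURS_PER_MONTH
ri_monthly = uncovered * RI_1Y_NO_UPFRONT_USD_HR * HOURS_PER_MONTH
delta_monthly = od_monthly - ri_monthly

print(round(od_monthly), round(ri_monthly), round(delta_monthly))  # 1897 1204 693
print(round(delta_monthly * 12))                                   # ≈ 8315/year
```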

&lt;p&gt;Scale that exposure across the m5, r5, and c5 families that most mid-market engineering orgs run, and the unhedged position is typically &lt;strong&gt;₹35 lakh to ₹1 crore a year&lt;/strong&gt; — sitting there, month after month, because nobody is running the coverage report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 15-minute Cost Explorer check
&lt;/h2&gt;

&lt;p&gt;You can do this now, on your own account:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Cost Explorer → &lt;strong&gt;Reservations&lt;/strong&gt; → &lt;strong&gt;Coverage Report&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Granularity: &lt;strong&gt;Monthly.&lt;/strong&gt; Date range: &lt;strong&gt;last three months.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Group by &lt;strong&gt;Instance Type&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Filter to &lt;strong&gt;ap-south-1&lt;/strong&gt; (or whichever region is your primary)&lt;/li&gt;
&lt;li&gt;Sort by &lt;strong&gt;On-Demand Cost&lt;/strong&gt;, descending&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any row where coverage is below 70 percent and On-Demand spend is above $500 a month is a purchase decision waiting to be made. The report also timestamps when coverage dipped — invaluable for tracing the dip back to the autoscaling change or the new service deployment that quietly outgrew the original commitment.&lt;/p&gt;
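&lt;p&gt;The "purchase decision waiting to be made" rule is mechanical enough to script against a CSV export of the same report. A sketch with illustrative rows (the 70 percent and $500 thresholds are the ones above; the instance figures are made up):&lt;/p&gt;

```python
# rows: (instance_type, coverage_pct, on_demand_usd_per_month) — illustrative values
rows = [
    ("t3.xlarge", 28.0, 1896.0),
    ("m5.large",  82.0, 310.0),
    ("r5.xlarge", 55.0, 940.0),
]

# Below the coverage threshold AND material on-demand spend → purchase candidate.
candidates = [r for r in rows if r[1] < 70.0 and r[2] > 500.0]

for itype, cov, od in candidates:
    print(f"{itype}: {cov:.0f}% covered, ${od:,.0f}/mo on-demand")
```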

&lt;h2&gt;
  
  
  Why this is worth doing before the CFO asks
&lt;/h2&gt;

&lt;p&gt;Cloud costs become a board-level question about 12 months before engineering teams are ready for them. The conversation usually arrives as: &lt;em&gt;"Why has the bill grown 40 percent year-on-year?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is almost always the same. Compute scaled. Commitments did not.&lt;/p&gt;

&lt;p&gt;Running the coverage report once a quarter, with a named owner and a documented purchase policy, is the cheapest governance you will ever implement. It takes 15 minutes. The findings are usually uncomfortable. That is exactly the point.&lt;/p&gt;

&lt;p&gt;If you want a second set of eyes on yours, our free 30-minute Cloud Cost Health Check is built for this conversation specifically. You share a 7-day Cost and Usage Report sample. We send back a two-page written summary by end of the same day, covering your top three leaks and the recoverable amount against each — founder-led, enterprise-reviewed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/book.html" rel="noopener noreferrer"&gt;Book the Health Check → aicloudstrategist.com/book.html&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · FinOps for AWS, Azure, and GCP teams. Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/ri-coverage-india-governance.html" rel="noopener noreferrer"&gt;Why 70% of Indian Mid-Market Cloud Accounts Have Reserved Instance Coverage Below 30%&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/aws-cost-audit-india.html" rel="noopener noreferrer"&gt;AWS Cost Audit India: 7 Leaks a ₹5L/Month Bill Hides (with Real Numbers)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloud</category>
      <category>india</category>
    </item>
    <item>
      <title>Orphaned EBS volumes: the ₹80K/month silent drain</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:54:59 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/orphaned-ebs-volumes-the-80kmonth-silent-drain-m45</link>
      <guid>https://forem.com/aicloudstrategist/orphaned-ebs-volumes-the-80kmonth-silent-drain-m45</guid>
      <description>&lt;p&gt;Every engineering team I talk to has done a cloud cost review at some point. Reserved Instance coverage, right-sizing EC2, maybe a pass at S3 storage tiers. What almost none of them have done is pull a list of every EBS volume currently sitting in &lt;code&gt;available&lt;/code&gt; state — detached, idle, billing at full price, forgotten.&lt;/p&gt;

&lt;p&gt;In one mid-market engagement in ap-south-1 last year, that list came back with &lt;strong&gt;38 volumes, 4.2 TB of storage, and a quiet ₹4.2 lakh annual bill&lt;/strong&gt; attached to infrastructure nobody had touched in months. No alert had fired. No ticket existed. The spend was just compounding in the background.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DeleteOnTermination defaults to false (and why that's the root cause)
&lt;/h2&gt;

&lt;p&gt;When you launch an EC2 instance and attach an EBS volume — whether at launch time or afterwards — AWS sets &lt;code&gt;DeleteOnTermination&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; by default for any volume that isn't the root device. The root volume defaults to &lt;code&gt;true&lt;/code&gt;; everything else defaults to &lt;code&gt;false&lt;/code&gt;. The reasoning made sense when EBS was younger: detaching a data volume and reattaching it to a replacement instance is a legitimate operational pattern. Databases, persistent logs, shared storage — there are real use cases where you want a volume to outlive its host instance. AWS was being conservative, and in 2010 that was the right call.&lt;/p&gt;

&lt;p&gt;The problem is that most workloads in 2026 are not doing any of that. They're running application servers, microservices, and ephemeral build agents that get terminated and respawned by Auto Scaling. The data volumes those instances used — 20 GB here, 100 GB there — don't get cleaned up. They enter &lt;code&gt;available&lt;/code&gt; state, and AWS keeps billing for them at the full provisioned storage rate: &lt;strong&gt;$0.096 per GB-month for gp3&lt;/strong&gt; in ap-south-1, &lt;strong&gt;$0.114 per GB-month for gp2&lt;/strong&gt;. A single forgotten 100 GB gp2 volume costs &lt;strong&gt;₹970 per month&lt;/strong&gt;. Thirty-eight of them, accumulated over two years of sprint cycles and team turnover, costs you &lt;strong&gt;₹35,000 every month for nothing&lt;/strong&gt;.&lt;/p&gt;
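&lt;p&gt;The per-volume figure is a one-liner to verify. A sketch using the gp2 rate above and an assumed ₹85/USD conversion (the conversion rate is mine, not the article's):&lt;/p&gt;

```python
# Monthly cost of one forgotten 100 GB gp2 volume in ap-south-1.
GP2_USD_PER_GB_MONTH = 0.114
INR_PER_USD = 85                 # assumed conversion rate, not from the article

one_orphan_usd = 100 * GP2_USD_PER_GB_MONTH
one_orphan_inr = one_orphan_usd * INR_PER_USD
print(round(one_orphan_usd, 2), round(one_orphan_inr))   # 11.4 USD ≈ ₹969/month
```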

&lt;h2&gt;
  
  
  A pattern study: 38 volumes, 4.2 TB, ₹4.2 lakh per year
&lt;/h2&gt;

&lt;p&gt;The engagement referenced above was a Bengaluru-based SaaS company — Series B, 60-person engineering team, monthly AWS spend of roughly ₹18 lakh. Not a small operation, but not an enterprise with a dedicated FinOps function either. They ran a quarterly cost review, had reasonable Reserved Instance coverage on their production RDS fleet, and believed their AWS environment was reasonably tidy.&lt;/p&gt;

&lt;p&gt;When we ran the initial discovery scan, the volume list told a different story. Thirty-eight EBS volumes in &lt;code&gt;available&lt;/code&gt; state across ap-south-1a, ap-south-1b, and ap-south-1c. Total provisioned storage: &lt;strong&gt;4.2 TB&lt;/strong&gt;. The breakdown: 24 volumes were gp2 (legacy, never migrated to gp3), averaging 87 GB each. Fourteen were gp3, averaging 140 GB each. Oldest orphan: 19 months. Most recent: 11 days — from a failed staging deployment that no one had cleaned up. Annualised cost at ap-south-1 list pricing: &lt;strong&gt;₹4,18,000&lt;/strong&gt; (~$5,000 USD).&lt;/p&gt;

&lt;p&gt;Not catastrophic in isolation. Compounded with the S3 storage sprawl and unused Elastic IPs we found in the same review, the total came to &lt;strong&gt;₹11.2 lakh in recoverable annual spend — 6.2% of their total cloud bill&lt;/strong&gt;. That's real money for a company that size.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-minute detection command
&lt;/h2&gt;

&lt;p&gt;You can surface every orphaned EBS volume in your AWS account in under ten minutes. Open your terminal, configure your credentials for the right account, and run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 describe-volumes \
  --region ap-south-1 \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{
    ID:VolumeId,
    SizeGB:Size,
    Type:VolumeType,
    AZ:AvailabilityZone,
    Created:CreateTime,
    IOPS:Iops
  }' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This returns every volume not currently attached to a running or stopped instance. Add &lt;code&gt;--output json&lt;/code&gt; and pipe through &lt;code&gt;jq&lt;/code&gt; if you want to calculate total provisioned GB on the spot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 describe-volumes \
  --region ap-south-1 \
  --filters Name=status,Values=available \
  --query 'Volumes[*].Size' \
  --output json | jq 'add'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Multiply that number by $0.096 (gp3) or $0.114 (gp2) and then by 12 for your annual orphan cost. If the number surprises you, you're not alone. Run this across every region you operate in — us-east-1, eu-west-1, wherever your teams have ever spun something up — and total the figure before drawing any conclusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The governance fix: tagging and a 30-day quarantine window
&lt;/h2&gt;

&lt;p&gt;Deletion should be deliberate, not automatic. The correct fix is a quarantine pattern, not a nightly purge.&lt;/p&gt;

&lt;p&gt;When a volume enters &lt;code&gt;available&lt;/code&gt; state, tag it immediately: &lt;code&gt;quarantine-start: &amp;lt;ISO-8601-date&amp;gt;&lt;/code&gt;. An EventBridge rule watching for EBS state-change events to &lt;code&gt;available&lt;/code&gt; handles this without any polling. Pair it with a Lambda function or a daily cron job that queries for all volumes where that tag exists and the date is more than 30 days old — then either deletes them or files a Jira ticket for manual review depending on your team's risk appetite.&lt;/p&gt;
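&lt;p&gt;The Lambda's decision logic is a pure function of the tag and today's date, which makes it easy to unit-test before wiring it to EventBridge. A sketch of that check — the &lt;code&gt;quarantine-start&lt;/code&gt; tag key is the convention proposed above, and the volume dict mirrors the shape &lt;code&gt;describe-volumes&lt;/code&gt; returns:&lt;/p&gt;

```python
from datetime import date, timedelta

QUARANTINE_TAG = "quarantine-start"
QUARANTINE_DAYS = 30

def past_quarantine(volume: dict, today: date) -> bool:
    """True when the volume's quarantine-start tag is 30+ days old."""
    tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}
    started = tags.get(QUARANTINE_TAG)
    if started is None:
        return False  # untagged volumes go to the alert path, not deletion
    return today - date.fromisoformat(started) >= timedelta(days=QUARANTINE_DAYS)

vol = {"VolumeId": "vol-0abc", "Tags": [{"Key": QUARANTINE_TAG, "Value": "2026-03-01"}]}
print(past_quarantine(vol, date(2026, 4, 15)))  # True: 45 days in quarantine
```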

&lt;p&gt;On the prevention side: update your launch templates and Auto Scaling group configurations to set &lt;code&gt;DeleteOnTermination: true&lt;/code&gt; for all non-root volumes unless there's a documented reason otherwise. Enforce this as a policy check in your CI pipeline using a tool like &lt;code&gt;cfn-guard&lt;/code&gt; or Open Policy Agent against your CloudFormation and Terraform plans. Tag every EBS volume at creation with &lt;code&gt;owner&lt;/code&gt;, &lt;code&gt;team&lt;/code&gt;, and &lt;code&gt;environment&lt;/code&gt;. Any volume that reaches &lt;code&gt;available&lt;/code&gt; state without those tags triggers an immediate alert. The tagging discipline pays dividends well beyond storage — it's the same metadata that makes cost allocation reports meaningful at the team and product level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CFO view: why this compounds
&lt;/h2&gt;

&lt;p&gt;The ₹4.2 lakh figure in the pattern study above is a point-in-time snapshot. It compounds in two ways that matter to finance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, it grows monotonically.&lt;/strong&gt; Every sprint cycle, every staging environment spun up and torn down, every developer testing a new database configuration — all of it generates orphaned volumes unless the deletion policy is enforced at the infrastructure layer. Without a quarantine workflow, the volume of orphaned storage in an active engineering organisation roughly tracks headcount growth. A team that adds five engineers per quarter can expect its orphaned EBS spend to grow &lt;strong&gt;15–20% annually&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, it obscures accountability.&lt;/strong&gt; When cost is spread across dozens of untagged, ownerless volumes, it shows up as a diffuse line in the AWS Cost Explorer rather than attributable team or product spend. Finance sees the bill; engineering sees no actionable signal. That friction — the inability to connect cloud spend to business outcomes — is what makes cost governance feel like bureaucracy instead of engineering hygiene. Fix the tagging, fix the deletion policy, and you also fix the visibility problem.&lt;/p&gt;

&lt;p&gt;The ₹4.2 lakh number is not the point. The point is that it existed for over a year without being visible to anyone in the organisation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free Cloud Cost Health Check
&lt;/h2&gt;

&lt;p&gt;If you want to run this audit across your full AWS environment — not just EBS, but idle load balancers, unattached Elastic IPs, forgotten snapshots, and Reserved Instance gaps — we offer a free 30-minute Cloud Cost Health Check. No sales follow-up. A structured review call, a written summary of what we find, and a prioritised remediation list you can action immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/book.html" rel="noopener noreferrer"&gt;Book your free Cloud Cost Health Check → aicloudstrategist.com/book.html&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cost of looking is zero. The cost of not looking is, apparently, ₹4.2 lakh a year.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/aws-cost-audit-india.html" rel="noopener noreferrer"&gt;AWS Cost Audit India: 7 Leaks a ₹5L/Month Bill Hides (with Real Numbers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/cost-explorer-blindspots.html" rel="noopener noreferrer"&gt;What Your AWS Cost Explorer Dashboard Is Not Showing You&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>The 7 Kubernetes cost questions every CTO should be able to answer</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:49:11 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/the-7-kubernetes-cost-questions-every-cto-should-be-able-to-answer-lk1</link>
      <guid>https://forem.com/aicloudstrategist/the-7-kubernetes-cost-questions-every-cto-should-be-able-to-answer-lk1</guid>
      <description>&lt;p&gt;&lt;em&gt;By Anushka B&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Here’s a situation that plays out in engineering retrospectives across India every quarter.&lt;/p&gt;

&lt;p&gt;A senior platform engineer pulls up the AWS or GCP bill. The number has climbed — again. ₹18 lakh last month, ₹22 lakh this month. Leadership wants a breakdown. The platform team goes quiet. Not because they don’t care, but because the honest answer is: &lt;em&gt;we don’t know where it went.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You have observability for everything. Grafana dashboards. Datadog traces. PagerDuty alerts at 2 AM. You know exactly when your p99 latency spikes. You can tell which pod restarted and why. But ask your platform team what it cost to serve Tenant A last week versus Tenant B, and they’ll stare at their laptops like the answer might crawl out of the terminal if they wait long enough.&lt;/p&gt;

&lt;p&gt;This is the Kubernetes cost visibility gap — and it’s not a niche problem. It’s endemic to any team running 100+ pods across shared clusters without proper cost attribution tooling. You’ve built excellent operational intelligence. You’ve built almost zero financial intelligence.&lt;/p&gt;

&lt;p&gt;The consequences compound quietly. Engineering leaders can’t make resource trade-offs on data. Product teams can’t price features accurately. FinOps reviews turn into guesswork sessions with spreadsheets. And the cloud bill keeps climbing because no one can point to &lt;em&gt;exactly&lt;/em&gt; what changed and &lt;em&gt;exactly&lt;/em&gt; what it cost.&lt;/p&gt;

&lt;p&gt;Let’s go through the seven questions your platform team almost certainly cannot answer right now — and the commands and queries that start changing that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question 1: What Does One Tenant Cost to Serve?
&lt;/h2&gt;

&lt;p&gt;This is the foundational question for any SaaS platform on Kubernetes. If you’re multi-tenant with namespaces or labels per tenant, you should be able to say: “Serving Tenant Acme Corp cost us ₹47,000 last month.” Almost nobody can.&lt;/p&gt;

&lt;p&gt;Without tenant-level cost attribution, you cannot have cost-based pricing conversations. You cannot identify your most expensive customers. You cannot tell whether that enterprise discount you gave is actually margin-positive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With OpenCost:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Query cost allocation by tenant label over the last 30 days
curl -G http://opencost.kube-system.svc:9003/allocation \
  --data-urlencode 'window=30d' \
  --data-urlencode 'aggregate=label:tenant' \
  --data-urlencode 'accumulate=true' | jq '.data[0][] | {name: .name, totalCost: .totalCost}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This requires your pods to carry a &lt;code&gt;tenant&lt;/code&gt; label — which they should if you’re multi-tenant. If they don’t, that’s your first infrastructure debt to address.&lt;/p&gt;

&lt;p&gt;What you’ll get back: per-tenant CPU cost, memory cost, GPU cost (if applicable), network, and PV storage — all denominated in dollars by default, which you convert at the prevailing USD/INR rate. A tenant consuming ₹2.4 lakh/month on a shared ₹15 lakh cluster is information that changes pricing conversations immediately.&lt;/p&gt;
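
&lt;p&gt;OpenCost reports dollars, so the conversion is one &lt;code&gt;awk&lt;/code&gt; step over the name/cost pairs. A sketch, where the inline sample stands in for the live query output and the ₹83/USD rate is an assumed placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sample "tenant totalCost" pair stands in for live /allocation output;
# the 83 rate is an assumed placeholder. Use the prevailing USD/INR rate.
printf '%s\n' 'acme-corp 566.27' |
  awk -v rate=83 '{printf "%s ₹%.0f/month\n", $1, $2 * rate}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;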




&lt;h2&gt;
  
  
  Question 2: Which Namespace Is Most Expensive?
&lt;/h2&gt;

&lt;p&gt;This sounds trivial. It is not. Namespace-level cost visibility is where most teams think they have coverage, and most teams are wrong.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl top pods -n payments&lt;/code&gt; gives you current resource &lt;em&gt;usage&lt;/em&gt;. It tells you nothing about cost. Your actual spend depends on resource &lt;em&gt;requests&lt;/em&gt; (what Kubernetes reserves for scheduling), usage (what the pod actually consumes), and the blended compute cost of the nodes those pods land on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With kubectl alone (requested capacity, not cost):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Aggregate CPU (cores) and memory (MiB) requests by namespace,
# normalising "500m" CPU and Ki/Gi memory suffixes (plain Mi passes through)
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | .metadata.namespace as $ns | .spec.containers[] |
    [$ns, (.resources.requests.cpu // "0"), (.resources.requests.memory // "0")] | @tsv' | \
  awk '{
    cpu = ($2 ~ /m$/) ? $2 / 1000 : $2 + 0
    mem = $3 + 0; if ($3 ~ /Gi$/) mem *= 1024; else if ($3 ~ /Ki$/) mem /= 1024
    ns_cpu[$1] += cpu; ns_mem[$1] += mem
  } END { for (n in ns_cpu) printf "%s  %.2f cores  %.0f MiB\n", n, ns_cpu[n], ns_mem[n] }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;With OpenCost (actual cost attribution):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -G http://opencost.kube-system.svc:9003/allocation \
  --data-urlencode 'window=7d' \
  --data-urlencode 'aggregate=namespace' \
  --data-urlencode 'accumulate=true' | \
  jq '.data[] | {namespace: .name, totalCost: .totalCost, cpuCost: .cpuCost, memoryCost: .memoryCost}' | \
  sort -t: -k2 -rn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In teams we’ve worked with, the answer is almost always surprising. The &lt;code&gt;data-pipeline&lt;/code&gt; namespace running batch jobs nobody monitors is frequently the most expensive. Not the customer-facing APIs everyone obsesses over.&lt;/p&gt;

&lt;p&gt;At ₹18 lakh/month cluster spend, knowing that one namespace accounts for ₹6.2 lakh — and that it could be optimised with smarter job scheduling — is actionable intelligence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question 3: What Percentage of Cluster Capacity Is Idle Overnight?
&lt;/h2&gt;

&lt;p&gt;Indian SaaS companies predominantly serve Indian customers. Traffic drops sharply between midnight and 7 AM. But most Kubernetes clusters run at their daytime provisioning levels 24/7.&lt;/p&gt;

&lt;p&gt;If your cluster is sized for peak daytime load and you’re not scaling down overnight, you’re paying full price for empty capacity — potentially 30–40% of your monthly bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check actual node utilisation right now:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Node-level CPU and memory utilisation
kubectl top nodes

# Requested vs allocatable capacity per node
kubectl describe nodes | grep -A 5 "Allocated resources"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;List allocatable capacity per node, to compare against the requested totals above:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes -o json | jq -r '.items[] | 
  .metadata.name + " " + 
  .status.allocatable.cpu + " " + 
  .status.allocatable.memory'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cross-reference this with your actual &lt;code&gt;kubectl top nodes&lt;/code&gt; output at 2 AM. The delta is your idle capacity tax. For a team on a 20-node cluster at ₹4,500/node/month (typical GKE n2-standard-4 equivalent in Mumbai region), 8 nodes sitting idle for 8 hours a night works out to roughly ₹12,000/month in pure waste (8 × ₹4,500 × 8/24). Across a year: about ₹1.4 lakh, just for leaving the lights on.&lt;/p&gt;
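
&lt;p&gt;The same arithmetic as a script you can re-run with your own node count and rate (all inputs are the assumptions stated above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idle capacity tax: 8 nodes idle 8 h/day at ₹4,500/node/month (assumed inputs)
awk 'BEGIN {
  nodes = 8; node_month = 4500; idle_frac = 8 / 24
  monthly = nodes * node_month * idle_frac
  printf "monthly: ₹%.0f, yearly: ₹%.0f\n", monthly, monthly * 12
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;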

&lt;p&gt;Horizontal Pod Autoscaler and Cluster Autoscaler with appropriate &lt;code&gt;--scale-down-unneeded-time&lt;/code&gt; and &lt;code&gt;--scale-down-delay-after-add&lt;/code&gt; settings fix this. But you can’t prioritise what you can’t see.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question 4: Which Services Are Over-Requested vs Actual Usage?
&lt;/h2&gt;

&lt;p&gt;This is where most Kubernetes cost waste hides — not in dramatically over-provisioned nodes, but in the quiet accumulation of conservative resource requests that developers set once and never revisit.&lt;/p&gt;

&lt;p&gt;A service that requests 2 CPUs and 4Gi memory but consistently uses 0.3 CPUs and 600Mi memory is holding 1.7 CPUs and 3.4Gi memory hostage from the scheduler. Multiply this across 40 microservices and you’re paying for a cluster that’s 2–3x larger than it needs to be.&lt;/p&gt;
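
&lt;p&gt;Scaled out, the example’s numbers get stark. Assuming, for illustration, that all 40 services over-request by the same margin:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stranded capacity if each of 40 services requests 2 CPU / 4 Gi
# but uses 0.3 CPU / ~0.6 Gi (the example above; uniform spread is assumed)
awk 'BEGIN {
  svcs = 40
  printf "stranded: %.0f cores, %.0f GiB\n", svcs * (2 - 0.3), svcs * (4 - 0.6)
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sixty-eight stranded cores is several nodes’ worth of capacity doing nothing.&lt;/p&gt;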

&lt;p&gt;&lt;strong&gt;Find the worst offenders:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compare requests vs actual usage per pod
kubectl top pods --all-namespaces --sort-by=cpu | head -40

# For a specific namespace, list CPU request and limit per container
kubectl get pods -n production -o json | jq -r '
  .items[] | .metadata.name as $name |
  .spec.containers[] | 
  [$name, .name, 
   (.resources.requests.cpu // "none"), 
   (.resources.limits.cpu // "none")] | @tsv'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;With OpenCost efficiency metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -G http://opencost.kube-system.svc:9003/allocation \
  --data-urlencode 'window=7d' \
  --data-urlencode 'aggregate=pod' \
  --data-urlencode 'accumulate=true' | \
  jq '.data[] | select(.cpuEfficiency &amp;lt; 0.3) | 
    {pod: .name, cpuEfficiency: .cpuEfficiency, 
     memorEfficiency: .ramEfficiency, waste: .totalEfficiency}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything with &lt;code&gt;cpuEfficiency&lt;/code&gt; below 0.3 is using less than 30% of what it requested. These are your VPA (Vertical Pod Autoscaler) candidates. For teams with 50+ services and a ₹20 lakh monthly bill, rightsizing the bottom 20% of efficiency is typically a ₹3–5 lakh monthly saving — without touching a single line of application code.&lt;/p&gt;
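
&lt;p&gt;For those candidates, a recommendation-only VPA is the low-risk first step. A sketch, assuming the VPA components are already installed in the cluster (the workload names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: reports-worker-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: reports-worker
  updatePolicy:
    updateMode: "Off"   # recommend only; read with: kubectl describe vpa reports-worker-vpa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;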




&lt;h2&gt;
  
  
  Question 5: What Does One CI/CD Pipeline Run Cost?
&lt;/h2&gt;

&lt;p&gt;This question makes platform engineers uncomfortable because the answer exists — it’s just that nobody has ever calculated it.&lt;/p&gt;

&lt;p&gt;Your CI/CD pipelines run on your cluster (or consume cluster-adjacent compute). Every &lt;code&gt;helm upgrade&lt;/code&gt;, every &lt;code&gt;kubectl apply&lt;/code&gt;, every integration test suite has a cost. If your pipelines run 200 times a day across 15 teams, and each run consumes meaningful CPU and memory for 4–8 minutes, that’s a non-trivial monthly line item hiding inside your “general compute” bucket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument pipeline jobs with cost labels:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In your pipeline pod spec or job template
metadata:
  labels:
    cost-center: ci-cd
    team: payments
    pipeline: release-deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Then query:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -G http://opencost.kube-system.svc:9003/allocation \
  --data-urlencode 'window=30d' \
  --data-urlencode 'aggregate=label:pipeline' \
  --data-urlencode 'accumulate=true' | \
  jq '.data[0][] | {pipeline: .name, totalCost: .totalCost}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Quick approximation without OpenCost:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find CI namespace pod resource consumption
kubectl top pods -n ci-cd --sort-by=cpu

# Get average run duration from your CI tool logs, then multiply by:
# (CPU cores requested × per-core-hour rate) + (memory GiB requested × per-GiB-hour rate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
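
&lt;p&gt;Making that formula concrete. Every input below is an assumed placeholder; derive your own rates from your node pricing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Per-run cost: resources held by one pipeline run, times assumed blended rates
awk 'BEGIN {
  cpu = 4; mem_gib = 8; minutes = 8        # per-run footprint (assumed)
  cpu_rate = 1.6; mem_rate = 0.2           # ₹/core-hour and ₹/GiB-hour (assumed)
  per_run = (cpu * cpu_rate + mem_gib * mem_rate) * minutes / 60
  printf "per-run: ₹%.2f, monthly at 6000 runs: ₹%.0f\n", per_run, per_run * 6000
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;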

&lt;p&gt;If each pipeline run costs ₹12 and you’re running 6,000 runs/month, that’s ₹72,000/month on CI alone. Teams that see this number start making different decisions: caching aggressively, parallelising smarter, killing redundant test stages. The ones who don’t see it keep clicking “re-run pipeline” without consequence.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Thread
&lt;/h2&gt;

&lt;p&gt;Every one of these questions is answerable. The tools exist: OpenCost (open-source, CNCF sandbox), Kubecost, native cloud cost allocation tags, VPA recommendations. The kubectl commands exist. The Prometheus metrics exist.&lt;/p&gt;

&lt;p&gt;What doesn’t exist on most platform teams is the 2–3 week investment to set up the attribution framework, label consistently, configure the tooling, and build the dashboards that turn raw cost data into actionable team-level visibility.&lt;/p&gt;

&lt;p&gt;That’s not a criticism — platform teams are stretched. But the cost of &lt;em&gt;not&lt;/em&gt; building this visibility isn’t abstract. It’s ₹3–8 lakh per month in avoidable waste for a mid-size cluster, compounding every month you don’t look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want This Built for Your Cluster?
&lt;/h2&gt;

&lt;p&gt;We run cost attribution engagements for platform and DevOps teams: OpenCost or Kubecost setup, namespace and tenant labelling strategy, Grafana cost dashboards, and a rightsizing report that identifies your top 10 waste sources with specific fixes.&lt;/p&gt;

&lt;p&gt;Teams typically see 20–35% cluster cost reduction within 60 days. The engagement pays for itself before the second invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Talk to us about our DevOps &amp;amp; Platform Engineering service →]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re running ₹10 lakh+/month on Kubernetes and can’t answer these five questions, that’s not an observability gap. It’s a revenue leak.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anushka B writes about platform engineering, FinOps, and the infrastructure decisions Indian SaaS teams avoid until they can’t. If this resonated, share it with your platform lead.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/cost-per-1000-inferences.html" rel="noopener noreferrer"&gt;Cost-Per-1000-Inferences: The One Number Every AI Product Team Should Know&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/orphaned-ebs-volumes.html" rel="noopener noreferrer"&gt;Orphaned EBS Volumes: The Quiet Compounding Cost Indian Engineering Teams Keep Missing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>finops</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>DPDPA 2023 cloud compliance: what Indian SaaS must actually do</title>
      <dc:creator>Anushka B</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:48:24 +0000</pubDate>
      <link>https://forem.com/aicloudstrategist/dpdpa-2023-cloud-compliance-what-indian-saas-must-actually-do-6nn</link>
      <guid>https://forem.com/aicloudstrategist/dpdpa-2023-cloud-compliance-what-indian-saas-must-actually-do-6nn</guid>
      <description>&lt;p&gt;The Digital Personal Data Protection Act, 2023 (DPDPA) is no longer a "coming soon" regulation. The rules are notified, the Data Protection Board is stood up, and enforcement has begun. For Indian SaaS companies storing customer personal data — which is effectively every Indian SaaS — the question has moved from "do we need to comply" to "how fast can we close the biggest gaps".&lt;/p&gt;

&lt;p&gt;The penalty math is unforgiving. A single instance of "failure to implement reasonable security safeguards" exposes a company to &lt;strong&gt;up to ₹250 crore&lt;/strong&gt;. For a ₹40 crore ARR SaaS, that is 6x annual revenue on a single finding. The largest Indian SaaS funding rounds have been wiped out for less.&lt;/p&gt;

&lt;p&gt;This post is the 30-day path. The six controls most Indian SaaS companies miss today, with the exact cloud-layer changes needed on AWS to close each one. A checklist you can hand to your engineering lead on Monday and expect results by the end of May.&lt;/p&gt;

&lt;h2&gt;
  
  
  DPDPA in one paragraph (for engineering teams)
&lt;/h2&gt;

&lt;p&gt;DPDPA regulates how organisations ("Data Fiduciaries") collect, process, store, and transfer the "Digital Personal Data" of individuals in India ("Data Principals"). It requires explicit, informed, itemised consent; data minimisation; retention limits; breach notification; and it creates new rights for individuals — access, correction, erasure, grievance redressal. Compliance is not optional and is not relative to the size of your company. A 20-person seed-stage startup handling 1,000 user records has the same legal obligations as a listed enterprise, just at smaller potential penalty scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The penalty ceiling, broken down
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Violation&lt;/th&gt;
&lt;th&gt;Max penalty&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Failure to take reasonable security safeguards&lt;/td&gt;
&lt;td&gt;₹250 crore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure to notify breach (Board + Data Principals)&lt;/td&gt;
&lt;td&gt;₹200 crore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-fulfilment of obligations re: children's data&lt;/td&gt;
&lt;td&gt;₹200 crore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-fulfilment of obligations of Significant Data Fiduciary&lt;/td&gt;
&lt;td&gt;₹150 crore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-compliance with a direction of the Data Protection Board&lt;/td&gt;
&lt;td&gt;₹50 crore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Breach by Data Principal (false claims, spam filing)&lt;/td&gt;
&lt;td&gt;₹10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Data Protection Board may impose these penalties in aggregate; they are per-breach ceilings, not annual caps. A quarter of reported EU GDPR enforcement fines have landed above the equivalent of ₹40 crore. DPDPA enforcement is not yet at that rhythm, but the regulator has signalled intent to prioritise consumer-facing tech companies and SaaS handling financial/health data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The six controls most Indian SaaS miss
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Control 1 — Explicit, granular consent capture (not pre-ticked boxes)
&lt;/h3&gt;

&lt;p&gt;DPDPA demands consent that is "free, specific, informed, unconditional and unambiguous with a clear affirmative action." Blanket "I agree to the terms" checkboxes do not meet this. Pre-ticked opt-ins for marketing communication do not meet this. Bundling data-processing consent with service-signup consent does not meet this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-layer fix (AWS):&lt;/strong&gt; log every consent event in a tamper-evident audit trail. We recommend a dedicated &lt;code&gt;consent_events&lt;/code&gt; DynamoDB table with stream capture to S3 via Kinesis, with S3 Object Lock in Compliance mode. The immutability is the evidence — you need to prove, six months later, the exact moment a specific Data Principal gave or withdrew consent.&lt;/p&gt;
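
&lt;p&gt;A minimal shape for one record in that &lt;code&gt;consent_events&lt;/code&gt; table. The field names are illustrative, not a mandated schema; the stream capture to S3 with Object Lock is configured separately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "principal_id": "dp_8f3a",
  "occurred_at": "2026-04-20T10:32:05Z",
  "action": "granted",
  "purpose": "marketing_email",
  "consent_text_version": "v4",
  "source": "signup_form"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;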

&lt;h3&gt;
  
  
  Control 2 — Data retention with automated purge
&lt;/h3&gt;

&lt;p&gt;Data must be erased "as soon as reasonable to assume that the specified purpose is no longer being served." In practice, this means every PII field in your database needs a documented retention period and a scheduled purge process. Most Indian SaaS we audit have never set retention — user records live forever, even for churned accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-layer fix:&lt;/strong&gt; DynamoDB TTL for event-sourced data, RDS stored procedures for relational PII, and S3 Lifecycle policies for file uploads. The harder architectural pattern — one we implement in our &lt;a href="https://aicloudstrategist.com/blog/dpdp-act-cloud-security-checklist.html" rel="noopener noreferrer"&gt;DPDP Act checklist&lt;/a&gt; — is separating PII columns into a dedicated table so purge is a targeted DELETE, not a destructive schema change.&lt;/p&gt;
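
&lt;p&gt;For the S3 piece, a lifecycle rule puts a hard ceiling on how long uploads live. The prefix and the 365-day window below are placeholders; set them from your documented retention periods:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Rules": [
    {
      "ID": "purge-user-uploads-after-retention",
      "Filter": { "Prefix": "user-uploads/" },
      "Status": "Enabled",
      "Expiration": { "Days": 365 }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;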

&lt;h3&gt;
  
  
  Control 3 — Breach detection, containment, and notification within 72 hours
&lt;/h3&gt;

&lt;p&gt;The DPDPA rules use the phrase "as soon as possible" for breach notification, but the Data Protection Board is moving toward a 72-hour norm modelled on GDPR Article 33. The operational gap we see: Indian SaaS detects breaches weeks after they happen, because nobody has wired alerts on the signals that matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-layer fix:&lt;/strong&gt; GuardDuty + Security Hub, plus CloudTrail-driven alerts on the events that matter: an IAM role gaining new permissions, an S3 bucket going public, an RDS snapshot being shared cross-account, an EC2 instance profile being modified. Route to PagerDuty or Opsgenie with a 15-minute acknowledgement SLA. Detection is half the job; the documented 72-hour notification playbook with legal on speed-dial is the other half.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control 4 — Data Protection Officer / Grievance Officer appointment
&lt;/h3&gt;

&lt;p&gt;Only Significant Data Fiduciaries &lt;em&gt;must&lt;/em&gt; appoint a DPO. But every Data Fiduciary must publish a "grievance mechanism" and name a contact who can respond to Data Principal rights requests. In practice, a single email inbox monitored by a single human counts — as long as requests are logged, tracked, and resolved within the statutory timelines (30 days for most rights).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational fix:&lt;/strong&gt; a dedicated &lt;code&gt;privacy@&lt;/code&gt; inbox routed to a ticketing system (Zendesk, HelpScout, or a Jira project), with SLA enforcement and monthly reporting. Your privacy policy must list this contact. Most Indian SaaS privacy policies we audit still list a founder's personal email — that breaks at the first scale-up or personnel change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control 5 — Cross-border data transfer documentation
&lt;/h3&gt;

&lt;p&gt;DPDPA uses a negative-list model. Transfers are permitted unless the Central Government specifically restricts a destination country. This is less restrictive than many companies feared, but it still requires documentation — for every cross-border data flow, you need to record &lt;em&gt;what&lt;/em&gt; data, &lt;em&gt;where&lt;/em&gt; it goes, &lt;em&gt;why&lt;/em&gt;, and &lt;em&gt;what safeguards&lt;/em&gt; apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-layer fix:&lt;/strong&gt; a data-flow register as code (we maintain ours in a single YAML file checked into the security repo). For AWS specifically: document every S3 cross-region replication rule, every RDS read replica in a foreign region, every analytics pipeline that ships CURs to a global BigQuery instance. The register must be reviewed quarterly and updated whenever a new third-party processor is onboarded.&lt;/p&gt;
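
&lt;p&gt;One way a register-as-code entry can look. The fields are our convention, not a statutory format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One entry per cross-border flow; reviewed quarterly
- dataset: user_profile
  fields: [email, phone]
  source: ap-south-1 (RDS production)
  destination: us-east-1 (Snowflake)
  purpose: product analytics
  safeguards: pseudonymised before export; vendor DPA signed
  last_reviewed: 2026-04-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;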

&lt;h3&gt;
  
  
  Control 6 — Children's data handling
&lt;/h3&gt;

&lt;p&gt;Data of individuals under 18 gets materially stricter treatment. Parental consent is required. Targeted advertising to children is banned. Behavioural monitoring is restricted. For edtech and gaming companies, this is a large operational lift. For B2B SaaS, the question is subtler — your product might not target children, but what if your customers onboard users under 18 (HR SaaS, healthtech, coaching platforms)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational fix:&lt;/strong&gt; a signup-time age declaration field, consent capture flow that branches based on declared age, and an explicit policy on what happens if age is undeclared. The cloud-layer piece is that these events live in the same tamper-evident consent audit trail as Control 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pattern study: the gaps we found in an Indian healthtech audit
&lt;/h2&gt;

&lt;p&gt;Healthtech SaaS, 70 engineers, Series B, ap-south-1 primary, processing Electronic Health Records. When we ran the DPDPA-aligned cloud security audit in February 2026, the findings were illustrative of the typical mid-market exposure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control 1 — Consent:&lt;/strong&gt; consent checkbox existed on signup, but consent events were not logged anywhere after the initial UI interaction. Evidence gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control 2 — Retention:&lt;/strong&gt; patient records retained indefinitely on production RDS. No purge mechanism, no stated retention period, churned customer data still in live tables 14 months after contract termination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control 3 — Breach detection:&lt;/strong&gt; GuardDuty enabled but not routed to any alerting tier. Median alert-to-acknowledgement time in a tabletop exercise: 6 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control 4 — Grievance mechanism:&lt;/strong&gt; privacy policy listed the founder's personal Gmail. No ticketing, no SLA enforcement, no audit trail of rights requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control 5 — Cross-border transfers:&lt;/strong&gt; two analytics pipelines shipping pseudonymised EHR data to us-east-1 Snowflake. No documented basis, no data-flow register, no vendor DPA signed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control 6 — Children's data:&lt;/strong&gt; paediatric records represented 22% of the database. No age-gating, no parental consent workflow, no differentiated handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remediation plan we delivered: a 30-day sprint (exactly what's described below), priced at ₹1.8 lakh for the Secure module engagement. Post-implementation audit at day 60 closed 5 of 6 controls; the sixth (full paediatric consent workflow rebuild) slid to a 90-day plan because it required product design work outside cloud infrastructure. Penalty exposure the company could document as closed to its insurance carrier: the statutory ceiling for each control remediated, up to ₹250 crore apiece.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-day sprint plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Week 1 — diagnose.&lt;/strong&gt; Run a PII inventory: every field in every production database that qualifies as Digital Personal Data. Pair with a data-flow audit — where does each field go (backups, logs, analytics, third-party processors)? Output: a single living document that every engineer on the team can read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2 — quick wins.&lt;/strong&gt; Fix Control 2 (retention) and Control 5 (cross-border register). These are the two lowest-effort items with clear compliance trails. Retention is a database change + a scheduled job; the register is a YAML file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3 — detection.&lt;/strong&gt; Stand up GuardDuty, Security Hub, and CloudWatch alarms per Control 3. Write the breach notification playbook. Run a tabletop exercise. Document the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 4 — consent + governance.&lt;/strong&gt; Ship the consent audit trail (Control 1) and the grievance mechanism (Control 4). Update the public privacy policy. Publish the DPO/grievance officer contact. If you handle children's data (Control 6), add the age-gating flow.&lt;/p&gt;

&lt;p&gt;This is not a full compliance programme — a full programme includes ongoing DPIAs, vendor risk reviews, internal audit cadence, and board-level reporting. But the 30-day sprint above will close the penalty exposure for the six controls that the Data Protection Board is likeliest to examine first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The board-level conversation this enables
&lt;/h2&gt;

&lt;p&gt;One benefit of closing the six controls above that doesn't get enough attention: it reframes the DPDPA conversation at board level. Without documented controls, "are we DPDPA-compliant?" is a question the CTO fields with uncomfortable hedges. With documented controls — consent audit trails, retention policies, breach playbooks, data-flow registers — the answer is factual: "We have closed six of the six highest-exposure controls as of ${date}. Residual risk lives in areas X, Y, Z with a remediation roadmap through ${quarter}."&lt;/p&gt;

&lt;p&gt;Insurance carriers have started asking for the same evidence before underwriting cyber liability for Indian SaaS. Two of our customers renewed policies with DPDPA-aligned evidence packs and reported 15–22% premium reduction in the first post-renewal cycle. The controls pay for themselves in the year they ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free DPDPA compliance checklist
&lt;/h2&gt;

&lt;p&gt;We publish a full checklist — 47 items across the six controls above, mapped to AWS services, with the exact IAM policies and Terraform modules we use. It's a lead magnet on the site; no salesperson calls, no upsell, just the document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/downloads/" rel="noopener noreferrer"&gt;Download DPDPA checklist → aicloudstrategist.com/downloads/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Secure module
&lt;/h2&gt;

&lt;p&gt;If you'd rather have someone run the 30-day sprint alongside your engineering team, our &lt;a href="https://aicloudstrategist.com/secure.html" rel="noopener noreferrer"&gt;Secure module&lt;/a&gt; is built for exactly this. Cloud security audit ₹1,00,000–₹2,00,000, DPDPA-aligned, with a 30-day implementation sprint included. First three customers at ₹40,000 under our launch cohort offer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aicloudstrategist.com/audit.html" rel="noopener noreferrer"&gt;Start your free 24-hour Cloud Security Audit → aicloudstrategist.com/audit.html&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Founder-led by Anushka B.&lt;/strong&gt; AICloudStrategist advises Indian mid-market SaaS and fintech on cloud security and cost. DPDPA content here is operational, not legal counsel — we partner with external law firms for regulatory interpretation. See &lt;a href="https://aicloudstrategist.com/proof.html" rel="noopener noreferrer"&gt;how we prove what we claim&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related writing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/rbi-cybersecurity-aws-posture.html" rel="noopener noreferrer"&gt;RBI Cyber Security Framework on AWS: A Practical Mapping for Indian Fintech, NBFCs and Payment Aggregators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aicloudstrategist.com/blog/dpdp-act-cloud-security-checklist.html" rel="noopener noreferrer"&gt;DPDP Act 2023 Cloud Security Checklist: What Indian SaaS and Fintech Teams Actually Need to Do Before the Rules Land&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>cloud</category>
      <category>compliance</category>
      <category>india</category>
    </item>
  </channel>
</rss>
