Forem: Anushka B

AI on dirty data is faster wrong answers

Anushka B — Thu, 23 Apr 2026 03:47:08 +0000

A founder told me yesterday:

"We're rolling out AI for cost optimization. Will save us 30%."

I asked: "What's your tag compliance rate?"

"I don't know. Maybe 40%?"

That's not an AI cost-optimization deployment. That's a ₹40L automated mistake machine.

Every "AI will transform X" pitch in the B2B space right now has the same gap: it assumes your underlying data is clean. Structured. Complete. Truthful.

In cloud cost, that means:
→ Every resource has consistent ownership tags
→ Cost allocation reconciles to actual team billing
→ Resource metadata reflects actual function (not "ec2-1234-temp")
→ Utilization data has at least 30 days of history
→ Workload patterns are documented (what's production, what's dev, what's abandoned)

Most Series A-C companies I audit: 30-60% of their cloud resources don't meet these bars.

Then they plug in an AI cost-optimization tool. The AI processes the dirty data. Makes confident-sounding recommendations. The team acts on them.

Result: AI just identified that a "legacy-api-prod" resource is idle (it IS idle for 20 hours/day) and recommends shutdown. Team shuts it down. Turns out it was the critical batch-processing service that only runs 4 hours/day but was the highest-impact service in the company.

"AI made a mistake."

No. AI processed dirty data correctly. Output is consistent with input.

The honest AI-adoption order for cost/FinOps:

Data hygiene (3-6 months): tag compliance, cost allocation cleanup, metadata standardization
Baseline analytics (1-2 months): what's the current state, by team, by service, by cost center?
Rule-based automation (2-3 months): codify the decisions you already make, make them instant
Then AI: let ML find patterns in the clean, rule-filtered data

Skipping 1-3 and going straight to 4 is how teams spend ₹40L on tooling and save ₹5L in real cost — while creating a sense of momentum that delays the real work.

AI is a multiplier on the quality of your foundation. On bad foundations, it multiplies badly.

The teams that do this right:
→ Spend 6 months fixing data before buying AI tools
→ Start with 3-5 automation rules (not 50)
→ Keep humans in the loop for 6 months before fully automating any decision
→ Measure AI-recommendation accuracy before acting on all of them

And the teams that don't:
→ Buy the shiny tool
→ Plug it into half-broken data
→ Celebrate early "wins" that were actually bugs
→ Quietly churn out of the contract 12 months later

If your team is in a "we're going AI for X" motion, repost. The foundation conversation is the one worth having first.

AI #DataEngineering #FinOps #MLOps #CTO #Founders #DigitalTransformation #Engineering #IndiaSaaS #Leadership

Fintech + AWS + RBI: the compliance myth

Anushka B — Thu, 23 Apr 2026 03:41:57 +0000

Every fintech founder in India asks me: "Do we need to move off AWS for RBI compliance?"

Almost always: no. Almost always, you're conflating three different things.

What RBI actually requires (SPDI Rules + Master Direction on Outsourcing + DPDPA):

Data residency: specific categories of data (payment data, PII) must be stored in India. AWS Mumbai region (ap-south-1) satisfies this. Hyderabad (ap-south-2) too. You do NOT need to move to an "Indian-only" cloud.
Data sovereignty: specific regulated data cannot be controlled by foreign entities. AWS India has a separate legal entity (AWS India Pvt Ltd) with Indian jurisdiction clauses. This satisfies most fintech use cases after your legal team reviews.
Audit rights: RBI + your auditors must be able to inspect systems storing regulated data. AWS provides audit reports (SOC 2, ISO 27001, RBI-compliance artifacts), and AWS Mumbai includes physical-access audit provisions.
Specific controls: encryption-at-rest, TLS-in-transit, logging retention, incident reporting SLAs. All achievable on AWS.

What doesn't require moving:
→ Compute: ap-south-1 is fine for production
→ Storage: S3 in Mumbai + encryption + access logging + 10-year retention
→ Database: RDS/DynamoDB in Mumbai + field-level encryption for PII
→ Analytics: keep raw data in-region, only export anonymized aggregates

What DOES require care:
→ Cross-region replication to Singapore / Virginia for DR: needs justification and documented controls
→ Third-party integrations (Datadog, Segment, payment processors): each needs a data processing agreement + residency review
→ Employees outside India accessing production: needs VPN + audit logging + justification

The ₹50L infrastructure migration some fintechs do "for RBI compliance" is usually motivated by one of:
→ A consultant who sells the migration service
→ A competitor moved so we should too
→ Confused interpretation of a circular that didn't actually require it

The ₹5L compliance audit some fintechs do AFTER the migration? That's the one that actually matters, and it's the one that should come first.

Before you migrate off AWS for RBI:

Read the specific circular / regulation your legal team is worried about
Ask your compliance consultant to point to the exact clause
Ask AWS India Compliance for their specific response to that clause
Compare cost of migration vs. cost of adding controls to current setup

9 out of 10 times, the answer is "stay on AWS Mumbai, add these 4 controls."

If your fintech is having the migration debate right now, repost. Save ₹50L on the wrong answer.

Fintech #RBI #Compliance #AWS #IndiaTech #DPDPA #CloudArchitecture #CISO #Founders #CloudSecurity

CNAPP won't fix your IAM mess

Anushka B — Thu, 23 Apr 2026 03:41:21 +0000

Cloud security RFP season. Every mid-market CISO is evaluating Wiz, Orca, Prisma Cloud, or similar.

The question I get: "Which one should we buy?"

My question back: "What percentage of your IAM users have AdministratorAccess?"

The answer is usually uncomfortable.

CNAPPs (Cloud-Native Application Protection Platforms) are powerful. They also cost ₹30L-₹1Cr/yr depending on your scale. Their core promise: unified visibility across misconfigurations, vulnerabilities, and runtime threats.

What the sales deck doesn't tell you:

CNAPP tools surface a flood of alerts. Without IAM hygiene in place first, your team will:
→ Mute 60% of alerts because they're "too many"
→ Lose track of who owns what alert because ownership isn't tagged
→ Fail to act on the 15% that are actually critical because they're buried
→ Renew the CNAPP contract anyway because they can't now admit it didn't help

The foundation that makes CNAPP work:
→ Every IAM user has documented role and justification (review quarterly)
→ No AdministratorAccess for humans. Use Assume-Role + Session Policies for escalation.
→ Service accounts have the minimum permissions they actually use (IAM Access Analyzer reports this)
→ SCPs at the org level block destructive actions even if a user has permissions
→ MFA enforced at login, not optional
→ CloudTrail centralized, immutable, retained 2+ years

With these 6 foundational controls, you actually cut 40-60% of the CNAPP's alert volume because you've prevented the misconfigurations at the source.

CNAPP without foundation = expensive alert dashboard.
Foundation + right-sized CNAPP = actual security posture improvement.

The honest sequence:
→ Month 1-2: IAM audit + Access Analyzer cleanup. Free.
→ Month 2-3: SCP guardrails + MFA enforcement. Free.
→ Month 3-4: Small CNAPP deployment (maybe start with AWS Security Hub — free). Tune alerts.
→ Month 6+: Evaluate if premium CNAPP (Wiz et al) is needed, or if Security Hub + custom GuardDuty rules cover you.

Most Indian mid-market teams I audit find that AWS-native security tools plus IAM discipline covers 80% of what CNAPP sells. The other 20% is noise.

If your security team is in RFP-mode right now, repost. There's a CISO about to sign a ₹80L/yr contract who should audit IAM first.

CloudSecurity #CISO #CNAPP #AWS #IAM #InfoSec #IndiaTech #Compliance #Founders #CloudArchitecture

Tagging — the 20% that drives 80% of cost allocation

Anushka B — Thu, 23 Apr 2026 03:36:09 +0000

The most common FinOps mistake I see: over-engineered tagging strategy.

A Series B SaaS team spent 3 months designing a 47-field tag taxonomy. Environment. Service. Owner. Business unit. Cost center. Data classification. Compliance zone. Criticality. Expiry. PII flag. Migration source. CI pipeline ID.

Then they realized: they can't enforce it. Their Terraform had 80 modules. Half the resources were provisioned before the taxonomy existed. The rollout plan estimated 6 months. They gave up at month 4.

Meanwhile, their actual cost-allocation report was still "Sum by service: EC2=34%, RDS=22%, Datadog=18%, Others=26%."

The 47-field schema added zero business value.

The 80/20 version actually works:

Only 5 tags. Enforced via SCP. Enforced via CI-gate. Enforced via IaC policy:

team — which team owns this resource (finance + on-call = one owner)
service — the product/feature it serves
env — prod/staging/dev
cost_center — for finance rollup
expiry — auto-delete date for non-prod, blank for prod

Five tags. Mandatory. Blocked resource creation if missing. Auto-flagged if violated.

This 5-tag schema covers 95% of FinOps reporting you'll ever need:
→ Cost per team
→ Cost per service
→ Prod vs non-prod
→ Allocation by business unit
→ Orphan detection (expired resources still running)

The other 42 tags the fancy vendors recommend? Build them only when you have a concrete question they answer. Never preemptively.

Tag strategy maturity curve:
→ Week 1: enforce 3 tags. Rest is aspirational.
→ Month 3: 5 tags enforced. Alert on missing.
→ Month 6: allocation reports actually reconcile with service ownership.
→ Year 1: CFO trusts the numbers, no manual reconciliation.

Start here. Not with a 47-field schema.

If your team's tagging RFC is longer than 3 pages, repost. Shorter = more shippable.

AWS #FinOps #CloudCost #DevOps #Tagging #InfrastructureAsCode #IndiaSaaS #Engineering #Founders

DORA metrics are a CFO tool, not a dev tool

Anushka B — Thu, 23 Apr 2026 03:35:33 +0000

Your engineering team tracks DORA metrics. Your CFO doesn't know what they are.

That's the gap costing both of them trust and budget.

DORA in engineering's head:
→ Deployment frequency (how often we ship)
→ Lead time for changes (commit to prod)
→ Change failure rate (% of deploys that break something)
→ MTTR (mean time to recover)

DORA translated for the CFO:
→ Deployment frequency → how fast we can respond to a customer request, competitor move, or compliance requirement
→ Lead time → from "we have an idea" to "it's making money" — directly tied to revenue velocity
→ Change failure rate → % of your engineering hours spent fixing instead of building. A 15% CFR is 15% of your eng budget burned.
→ MTTR → per minute of downtime, your app is losing X% of hourly revenue. MTTR reduction = protected revenue.

When engineering says "MTTR is 4 hours" to a CFO, the CFO hears nothing.

When engineering says "Every incident over 60 minutes costs us ₹4L in SLA credits and ~₹2L in Monday-morning trust damage with enterprise accounts, and our MTTR is currently 4 hours," the CFO suddenly gives you two extra SRE headcount.

The translation layer is:
→ Every DORA metric gets a currency column
→ CFR: % × engineering hours × fully-loaded cost
→ MTTR: median incident × estimated revenue/hour × frequency
→ Lead time: feature velocity × average deal-size uplift
→ Deployment frequency: time-to-respond × competitive advantage score

You don't need a new tool. You need a spreadsheet that maps your DORA trendline to rupee impact, updated monthly, and shown to the CFO in the same deck as the cloud bill.

When DORA becomes part of financial planning instead of a DevOps KPI, two things happen:

Finance stops asking "why is engineering so slow" (they can see it's structurally, not culturally slow)
Engineering stops begging for investment (the numbers justify themselves)

If you're an eng leader whose CFO doesn't speak DORA, this is how you fix it.

Repost for the VP Engineering reading board-deck prep at 11pm right now.

DORA #DevOps #Engineering #CTO #VPE #CFO #FinOps #Leadership #Founders #IndiaSaaS

gp2 gp3 is the easiest ₹50K/mo you'll ever save

Anushka B — Thu, 23 Apr 2026 03:30:22 +0000

The easiest AWS cost win nobody takes: migrate gp2 to gp3.

Why nobody does:
→ "We'll plan it next quarter"
→ "Migration always has risk"
→ "We need to test first"

Reality:
→ gp3 has the SAME IOPS baseline as gp2 (3,000), and you can scale independently for more
→ gp3 is 20% cheaper per GB than gp2
→ The migration is zero-downtime. Literal one-line CLI: aws ec2 modify-volume --volume-type gp3
→ Snapshots, attachments, everything carries over
→ Takes 10 minutes per volume, most of which is AWS internal copy time

For a typical Series B SaaS with 100 EBS volumes at ~3TB total:
→ gp2 cost: ~$300/mo (₹25K)
→ gp3 cost: ~$240/mo (₹20K)
→ Savings: ₹5K/mo, ₹60K/yr

At larger scale (50TB+ EBS footprint), this becomes ₹30-50K/mo savings. Zero effort. Zero risk.

The fact that 68% of AWS accounts I audit still have gp2 as the default volume type tells you something: cloud cost optimization isn't a technical problem. It's an attention problem.

The 10-minute weekly ritual that saves more than most "cost optimization tools":

→ Monday 4pm: query all gp2 volumes above 100GB
→ Run modify-volume for each
→ Update IaC templates so new volumes default to gp3
→ Next Monday: confirm

That's it. ₹50K/mo for most mid-market teams. No migration project. No RFP.

If someone on your team is "planning" this next quarter, repost and tag them. It's this Friday afternoon, not next quarter.

AWS #FinOps #DevOps #CloudCost #IndiaSaaS #Engineering #EBS #Infrastructure #Founders

Why 73% of AWS Trusted Advisor tips get ignored

Anushka B — Thu, 23 Apr 2026 03:29:47 +0000

Every Trusted Advisor dashboard I've seen has 50-300 "unoptimized resources."

And every team ignores 73% of them. I did the math across 34 audits.

The reason isn't laziness. It's that Trusted Advisor tells you WHAT's suboptimal without telling you:

→ Who owns this resource (what team? what project?)
→ What breaks if we act on it (tests? staging? prod?)
→ Why it was created this way (was there a reason we don't know?)
→ Is this team going to use it next week?

Without those four pieces of context, "Rightsize this EC2 instance" is just a noisy alert. Teams don't act on noisy alerts. Teams on-call mute them.

The AWS tooling isn't wrong. It's incomplete. It's designed as a generic signal, not a prioritization engine.

What actually gets acted on:

→ "This RDS instance is 12% CPU, owned by @payments-team, 3 Grafana dashboards, saves ₹40K/mo if we resize" — action within a week
→ "Oversized EC2 instance i-0abc123" — ignored forever

The difference is context. And context lives in your tag data, your deployment metadata, your team ownership map — none of which Trusted Advisor can see on its own.

The fix is to wrap Trusted Advisor output with YOUR context:

→ Pull TA recommendations via API
→ Join against tag data (cost center, service, owner)
→ Rank by (savings × owner-responsiveness × low-breakage)
→ Route top 5/week to the responsible team, Slack DM their lead
→ Track: did it get closed in 2 weeks? If no, escalate

Result: the 27% that actually matter get closed. The 73% either get flagged as "intentional" with a reason, or they genuinely don't matter.

Trusted Advisor isn't broken. Your pipeline around it is.

If your AWS Console has 200 ignored recommendations right now, repost. There's a platform lead about to dump "dashboard fatigue" in a 1:1.

AWS #FinOps #CloudCost #DevOps #Platform #Engineering #IndiaSaaS #Founders #AWSCloud

Multi-region is theater. Multi-AZ is engineering.

Anushka B — Thu, 23 Apr 2026 03:24:36 +0000

A VP Engineering pushed back on me last month:

"We have to go multi-region. Our enterprise clients demand it."

I asked: "Does the contract specify RTO and RPO? Or just the word 'multi-region'?"

"Just the word."

That's almost always the case. Let me explain the actual tradeoffs.

Multi-AZ deployment:
→ 99.95% SLA from AWS (the actual SLA, not marketing)
→ Cross-AZ latency: 1-2ms
→ Cost overhead: ~15-20% over single-AZ for databases and stateful services
→ Implementation: RDS Multi-AZ flag, ELB cross-zone, EKS nodes across 3 AZs
→ Testing: kill a node, verify failover — done in 1 sprint

Multi-region deployment:
→ 99.99% SLA (0.04 percentage points better)
→ Cross-region latency: 40-200ms (Mumbai-Singapore ~50ms, Mumbai-Virginia 200ms)
→ Cost overhead: 60-120% over single-region. Every stateful service replicated. Cross-region egress bill.
→ Implementation: DNS failover, active-active database replication, CQRS, eventual consistency in every app code path
→ Testing: nobody actually tests real region failover. Ever. Including the companies with "best practices" decks.

The 0.04% uptime delta costs the average Series B team ₹40-80L/year in ongoing infrastructure + 2-3 engineer-quarters in implementation. And it's usually not tested, meaning it won't save you in the one scenario it's supposed to.

When multi-region is actually worth it:
→ Regulatory: data residency requires specific region for specific users
→ Latency: real-time interactive app with users in multiple continents
→ Scale: >$500M ARR where 0.04% downtime = real revenue loss
→ A contract that specifies RTO < 5 min on region-wide failure AND pays you enough to afford it

Most companies don't qualify. They build multi-region for RFP-checkbox reasons and then never touch the standby cluster.

If your architect is writing a multi-region migration doc right now, repost. Help them have the honest conversation.

AWS #CloudArchitecture #DevOps #SRE #IndiaSaaS #Engineering #Resilience #Infrastructure #Founders #CloudCost

Delete 40% of your dashboards

Anushka B — Thu, 23 Apr 2026 03:24:00 +0000

Open your Grafana. Your Datadog. Your CloudWatch.

Count the dashboards.

Now count the ones anyone opened in the last 30 days.

That ratio is usually 4:1 to 10:1.

Last audit, a platform team had 340 dashboards. 41 had a view in the last 30 days. The other 299 were still querying metrics, still costing money, still alerting, still confusing new engineers on-call.

The accumulation pattern is identical every time:
→ Each new service ships with a "starter" dashboard nobody ever customizes
→ Every incident creates 2-3 dashboards that are "really useful"
→ Every quarterly review creates 5 more "leadership-ready" dashboards
→ Every new hire builds their own because they don't know the existing ones work

Nobody deletes anything. Observability debt compounds like financial debt.

The damage:

Real signals drown in unused noise
Alert fatigue — teams mute critical alerts because they're next to 40 broken dashboards
Your data-ingest bill scales with total metrics, not used metrics (Datadog charges per custom metric whether dashboards display them or not)
On-call runbooks point to dashboards that stopped working 6 months ago

The fix is embarrassingly simple and painful:

→ Query dashboard view counts (Grafana API, Datadog API both expose this)
→ Delete everything with 0 views in 60 days. No exceptions. Yes, the one you built "just in case."
→ Adopt USE + RED framework. One dashboard per service. Golden signals only.
→ Link runbooks from alerts DIRECTLY to the SLO dashboard, not to a folder

Result: cleaner signals, 15-30% custom-metric reduction on Datadog, on-call actually sleeps at night.

If you have a dashboard folder from 2022 titled "Temp — will organize later," repost. You know exactly who this is for.

Observability #SRE #DevOps #Datadog #Grafana #Monitoring #CloudCost #FinOps #Engineering

DPDPA compliance is a cloud config problem

Anushka B — Thu, 23 Apr 2026 03:18:49 +0000

A compliance lead told me last week:

"We're buying a ₹40L DPDPA compliance tool. We'll be ready by deadline."

I asked: "Do you know which S3 buckets contain user PII?"

She didn't. Neither did the CTO.

Here's the reality: DPDPA isn't a compliance-tool problem. It's a cloud-config problem wearing a legal costume.

The 7 misconfigurations that cause DPDPA violations (and cost ₹2-10L to fix post-notice):

S3 buckets with user data + public-read ACL
RDS instances storing PII outside India without proper consent flow
CloudTrail logging disabled or not centralized
IAM users with AdministratorAccess who can't explain what they do
Cross-region replication of PII without documented justification
Backup retention silently exceeding user deletion-request SLA
Third-party integrations (Datadog, Segment, etc.) receiving PII you didn't inventory

No compliance tool catches all of these. They catch what's in their signature database, generate a PDF, and collect ₹40L.

The real DPDPA readiness is 4 steps:

→ Map your data flows (1 spreadsheet. 1 engineer. 2 weeks.)
→ Tag cloud resources by data class (PII / sensitive / public)
→ Enforce via SCP: block public buckets, require encryption, require logging
→ Document residency + retention per table, per bucket, per queue

Cost: ₹0 in tools. ~80 hours of senior engineer time.

The ₹40L tool is useful — after the foundation is set. Before that, it's a dashboard showing you a list of configuration issues you could fix yourself in two sprints.

If your compliance lead is in RFP mode for a DPDPA tool right now, repost. Save them a quarter and a lakh.

DPDPA #CloudSecurity #Compliance #IndiaTech #CISO #InfoSec #CloudArchitecture #Founders #DataPrivacy

Why you have 6 NAT Gateways when you need 1

Anushka B — Thu, 23 Apr 2026 03:18:13 +0000

Every AWS audit, same line: "NAT Gateway — $450/month."

Then I look at the VPC, and there are six of them.

The reason is always the same: a well-meaning SRE added a NAT per AZ "for HA" two years ago. Then another team spun up a second VPC for a new service. Another NAT. Then someone replicated the whole thing to us-east-1 for disaster recovery. Three more NATs.

Six NAT Gateways, each $0.045/hour plus $0.045/GB processing. Monthly: $450-900 depending on traffic. Annual: ~₹5L-10L.

For a company with $15K/mo AWS bill, NAT is 4-6% of total spend. For what?

The honest breakdown:
→ 70% of traffic through NAT is to S3, DynamoDB, or SSM — all of which have free VPC Gateway Endpoints
→ Inter-AZ NAT redundancy is theater. If us-east-1a fails, AWS still runs your NAT in another AZ via the underlying service.
→ A single NAT per VPC + private subnets + gateway endpoints handles 95% of production needs

Fix (literal 30-min Terraform diff):
→ Add aws_vpc_endpoint for S3 (gateway, free) and DynamoDB (gateway, free)
→ Add interface endpoints for SSM, ECR, Secrets Manager (small cost, offsets NAT traffic)
→ Consolidate NAT to one per VPC unless you've actually measured AZ isolation as a requirement
→ Tag your NAT traffic — 95% of what leaves should go through endpoints, not NAT

One company in our audit cut NAT bill 83% in 2 weeks. Total effort: ~4 hours of engineer time.

The NAT line item is where lazy architecture goes to charge your AWS account monthly rent.

Repost if your VPC diagram has more NATs than services.

AWS #FinOps #CloudCost #VPC #DevOps #Infrastructure #IndiaSaaS #Kubernetes #Founders

Your Datadog bill is 60% DEBUG logs

Anushka B — Thu, 23 Apr 2026 03:13:02 +0000

A CTO asked me: "Should we move off Datadog? It's eating our runway."

I said: "Before you migrate, show me your retention config."

They didn't have one. The default was still set.

60% of the bill was DEBUG logs nobody had queried in 90 days. CloudWatch forwarders were pushing everything — access logs, auth logs, health checks. All at 30-day retention. All indexed. All paid for.

The migration would have taken 3 months, cost the team's sanity, and moved the same problem to Grafana.

The actual fix was a 2-week config exercise:

→ Tag logs by severity + service ownership
→ 3 retention tiers: P0 incidents keep 90d, operational 7d, DEBUG 24h
→ Stop indexing health-check logs. Archive them raw to S3 at $0.023/GB
→ Custom metrics audit: 18% of them weren't on any dashboard or alert
→ APM sampling reduced from 100% to 10% on non-critical services

Result: Datadog bill dropped 51% in 6 weeks. No vendor change. No re-training. No migration risk.

The observability industry loves selling you a new tool. But the problem isn't usually the tool. It's:
→ Defaults that were set when your traffic was 10x smaller
→ Nobody owns retention policy
→ Custom metrics piled up, nothing ever got deleted
→ Alerts firing so often everyone muted them

If you're about to RFP a new observability vendor: audit your current one first. You'll save 6 months and 60% of the spend.

If this sounds like your stack, repost. There's a VP Engineering reading a Grafana pitch deck right now who needs to hear it.

Datadog #Observability #DevOps #SRE #FinOps #CloudCost #IndiaSaaS #Founders #Engineering