<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bijay Singh deo</title>
    <description>The latest articles on Forem by Bijay Singh deo (@bijay_raj_singh_deo).</description>
    <link>https://forem.com/bijay_raj_singh_deo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3504187%2Fded9bcd1-5a83-4017-bb53-c412d3253c1e.png</url>
      <title>Forem: Bijay Singh deo</title>
      <link>https://forem.com/bijay_raj_singh_deo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bijay_raj_singh_deo"/>
    <language>en</language>
    <item>
      <title>Taming S3 Costs: Automated Reports with Lambda, Python, Athena &amp; SES (because weekends are for coffee, not cost reports ☕)</title>
      <dc:creator>Bijay Singh deo</dc:creator>
      <pubDate>Tue, 16 Sep 2025 17:54:57 +0000</pubDate>
      <link>https://forem.com/bijay_raj_singh_deo/taming-s3-costs-automated-reports-with-lambda-python-athena-ses-because-weekends-are-for-5476</link>
      <guid>https://forem.com/bijay_raj_singh_deo/taming-s3-costs-automated-reports-with-lambda-python-athena-ses-because-weekends-are-for-5476</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Let’s be honest—S3 is cheap... until it isn’t.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One fine day, I was staring at my AWS bill and realized S3 storage had quietly been eating away a chunk of it. With buckets spread across teams, projects, and experiments that nobody remembered, I had a problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;How do I figure out which buckets are costing me the most?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;And how do I make sure I get a report every Monday without manually pulling Athena queries (and pretending to be productive)?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, like any lazy engineer who doesn’t want to do repetitive work, I decided to automate it.😎&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Inventory, Inventory, Inventory 📦
&lt;/h2&gt;

&lt;p&gt;First, I enabled S3 Inventory reports in Parquet format (JSON would’ve been too chatty). These reports contain all the details about objects, storage classes, and sizes. Perfect for Athena queries.&lt;/p&gt;
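&lt;p&gt;For reference, enabling an equivalent inventory from code might look like this. A minimal boto3 sketch (bucket names are placeholders; I used the console for the real setup):&lt;/p&gt;

```python
# Sketch: enabling a daily S3 Inventory report in Parquet format.
# Bucket names/ARNs below are placeholders, not the real setup.
INVENTORY_CONFIG = {
    "Id": "cost-report-inventory",
    "IsEnabled": True,
    "IncludedObjectVersions": "Current",
    "Schedule": {"Frequency": "Daily"},
    # Fields the Athena cost queries will need.
    "OptionalFields": ["Size", "StorageClass", "IntelligentTieringAccessTier"],
    "Destination": {
        "S3BucketDestination": {
            "Bucket": "arn:aws:s3:::my-inventory-reports",  # placeholder
            "Format": "Parquet",
            "Prefix": "inventory",
        }
    },
}

def enable_inventory(source_bucket):
    """Apply the inventory config to a source bucket (needs AWS credentials)."""
    import boto3  # imported lazily so the sketch runs without AWS configured
    s3 = boto3.client("s3")
    s3.put_bucket_inventory_configuration(
        Bucket=source_bucket,
        Id=INVENTORY_CONFIG["Id"],
        InventoryConfiguration=INVENTORY_CONFIG,
    )
```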

&lt;p&gt;Once enabled, the reports started landing in a dedicated S3 bucket. From there, I built multiple Athena tables pointing to these Parquet files—because SQL &amp;gt; endless scrolling through the S3 console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Writing Athena Queries Like a Detective 🕵️
&lt;/h2&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each table (inventory per bucket/prefix), calculate storage size.&lt;/li&gt;
&lt;li&gt;Estimate cost based on storage class pricing.&lt;/li&gt;
&lt;li&gt;Find out the top 'N' offenders (buckets hogging most of the bill).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a flavour of one Athena query I used to get the cost and size (per storage class) of a bucket/table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH filtered_data AS (
    SELECT 
        storage_class, 
        intelligent_tiering_access_tier,
        SUM(size) AS total_size
    FROM 
        my_bucket_1
    WHERE 
        dt = '2025-08-19-01-00'
    GROUP BY 
        storage_class, intelligent_tiering_access_tier
)
SELECT 
    storage_class,
    intelligent_tiering_access_tier,
    total_size,
    CASE 
        WHEN storage_class = 'STANDARD' THEN total_size * 0.0210 / (1024 * 1024 * 1024)
        WHEN storage_class = 'GLACIER' THEN total_size * 0.0036 / (1024 * 1024 * 1024)
        WHEN storage_class = 'DEEP_ARCHIVE' THEN total_size * 0.00099 / (1024 * 1024 * 1024)
        ELSE 0 -- Intelligent-Tiering access tiers are priced similarly in the full query
    END AS estimated_cost_usd
FROM 
    filtered_data;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, I’m asking Athena:&lt;br&gt;
💡 “Hey buddy, tell me which storage class is silently burning my credits.”&lt;/p&gt;
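&lt;p&gt;The same per-class arithmetic in plain Python, handy for sanity-checking the query output (the prices mirror the illustrative ones in the query above, not an authoritative price list):&lt;/p&gt;

```python
# Per-GB monthly prices (USD) mirroring the CASE in the Athena query above.
# Illustrative only: check current S3 pricing for your region.
PRICE_PER_GB = {
    "STANDARD": 0.0210,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}

GIB = 1024 * 1024 * 1024  # bytes per GB, as used in the query

def estimated_cost_usd(storage_class, total_size_bytes):
    """Replicates the query's estimate: size in GB times price, 0 for unknown classes."""
    return total_size_bytes * PRICE_PER_GB.get(storage_class, 0) / GIB
```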

&lt;h2&gt;
  
  
  Step 3: Lambda + Boto3 = Automation ❤️
&lt;/h2&gt;

&lt;p&gt;Now came the fun part—wrapping it up in a Lambda function so I don’t have to run queries by hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Lambda does:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For each S3 Inventory Athena table, it runs a single SQL query that returns per-prefix and per-storage-class (and intelligent-tier) metrics: object_count, total_size (in bytes), and estimated_cost_usd (I hard-coded the per-GB prices for STANDARD, GLACIER, DEEP_ARCHIVE, and the Intelligent-Tiering access tiers right in the query).&lt;/li&gt;
&lt;li&gt;It polls Athena (the classic get_query_execution loop with time.sleep(5)) until each query finishes, then fetches the CSV result written to S3.&lt;/li&gt;
&lt;li&gt;It aggregates the query rows into a Python structure keyed by (table, prefix) — summing object counts, bytes, and the estimated cost, and keeping a breakdown list for each storage class/tier.&lt;/li&gt;
&lt;li&gt;Once all tables are processed, the Lambda computes the total cost per (table, prefix) and sorts descending to pick the top N (configurable; I used 15).&lt;/li&gt;
&lt;li&gt;It then builds a detailed CSV report with a parent summary row per ranked prefix (total cost, total size, total objects), followed by breakdown rows for each storage class/tier (object_count, size_GB, cost_USD).&lt;/li&gt;
&lt;li&gt;The CSV is uploaded to S3, so the report is versioned, shareable, and easy to inspect.&lt;/li&gt;
&lt;li&gt;Finally, the Lambda sends a short SES email summary (top lines + sizes/costs) that includes the S3 path to the CSV — so every Monday morning (via an EventBridge schedule) I get an actionable report I can forward to teams or use to trigger cleanups.&lt;/li&gt;
&lt;/ul&gt;
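&lt;p&gt;The aggregation and ranking steps can be sketched like this, a simplified stand-in for the repo code (the CSV column names are assumptions):&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

def aggregate(results_by_table, top_n=15):
    """Roll up Athena CSV results per (table, prefix) and rank by estimated cost.

    results_by_table maps an Athena table name to the raw CSV text the query
    wrote to S3. Column names below are illustrative assumptions.
    """
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0, "cost": 0.0, "breakdown": []})
    for table, csv_text in results_by_table.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            key = (table, row.get("prefix", ""))
            entry = totals[key]
            entry["objects"] += int(row["object_count"])
            entry["bytes"] += int(row["total_size"])
            entry["cost"] += float(row["estimated_cost_usd"])
            entry["breakdown"].append(
                (row["storage_class"], int(row["total_size"]), float(row["estimated_cost_usd"]))
            )
    # Sort descending by estimated cost and keep the top N offenders.
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["cost"], reverse=True)
    return ranked[:top_n]
```

A defaultdict keyed by (table, prefix) keeps the summing logic to one pass over the rows, which matters when you have dozens of inventory tables.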

&lt;p&gt;&lt;strong&gt;Check out the full Python code in my &lt;a href="https://github.com/cranky420/s3-cost-inspector" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and give it a shot. 💪&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, instead of me digging into Athena on Mondays, I just sip coffee and open my inbox ☕📧.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Scheduling with EventBridge (aka my weekend butler)
&lt;/h2&gt;

&lt;p&gt;To make this truly hands-off, I scheduled the Lambda to run every Sunday at midnight via EventBridge. That way, when I walk into Monday standup, I already know which buckets are the villains of the week.&lt;/p&gt;

&lt;p&gt;(And yes, this occasionally makes me look more prepared than I actually am. 😎)&lt;/p&gt;
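&lt;p&gt;Roughly, the wiring via the AWS CLI (the rule name, function name, and account/ARN are placeholders; I set mine up in the console, so treat this as a sketch):&lt;/p&gt;

```shell
# Fire every Sunday at 00:00 UTC.
# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
aws events put-rule \
  --name s3-cost-report-weekly \
  --schedule-expression "cron(0 0 ? * SUN *)"

# Allow EventBridge to invoke the Lambda (function name is a placeholder).
aws lambda add-permission \
  --function-name s3-cost-inspector \
  --statement-id eventbridge-weekly \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com

# Point the rule at the Lambda (ARN is a placeholder).
aws events put-targets \
  --rule s3-cost-report-weekly \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:s3-cost-inspector"
```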

&lt;h2&gt;
  
  
  Final Report Look 📊
&lt;/h2&gt;

&lt;p&gt;The email I get looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Table: my_bucket_1 | Bucket: my-bucket-1 | Size: 12.3 TB | Cost: $257.23&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Table: my_bucket_2 | Bucket: my.bucket.2 | Size: 9.1 TB | Cost: $198.72&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;… and so on.&lt;/li&gt;
&lt;li&gt;Neat, simple, actionable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Visibility: S3 is easy to ignore, but costs add up fast.&lt;/li&gt;
&lt;li&gt;Automation: No manual queries, no “oops I forgot.”&lt;/li&gt;
&lt;li&gt;Cost Savings: Spot old projects, test buckets, or misconfigured storage classes early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What started as a weekend experiment has now become my weekly cost sanity check.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;It’s not rocket science, but it saves $$$ and brain cycles.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re also haunted by “mystery buckets” on your AWS bill, give this approach a shot. You’ll thank yourself every Monday morning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you tried something similar to tame your S3 costs? Or maybe you’ve found an even better way? Drop a comment—I’d love to learn (and maybe steal your idea 😅).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>serverless</category>
      <category>python</category>
    </item>
    <item>
      <title>Masking Sensitive Data in CloudWatch Logs for APIs (and keeping your secrets safe!)</title>
      <dc:creator>Bijay Singh deo</dc:creator>
      <pubDate>Tue, 16 Sep 2025 13:15:19 +0000</pubDate>
      <link>https://forem.com/bijay_raj_singh_deo/masking-sensitive-data-in-cloudwatch-logs-for-apis-and-keeping-your-secrets-safe-49dk</link>
      <guid>https://forem.com/bijay_raj_singh_deo/masking-sensitive-data-in-cloudwatch-logs-for-apis-and-keeping-your-secrets-safe-49dk</guid>
      <description>&lt;h2&gt;
  
  
  😬 The Problem
&lt;/h2&gt;

&lt;p&gt;So, picture this: you’ve built a shiny API that people love. Life is good, until one day you peek into your CloudWatch Logs and — BAM 💥 — staring right back at you are… user credentials. Passwords, tokens, maybe even PAN numbers. Not exactly the kind of surprise you want in your logs, especially when regulators (and your security team) are watching.&lt;/p&gt;

&lt;p&gt;As much as I love CloudWatch for debugging and monitoring, nobody wants to see their production logs looking like an open diary of sensitive user data. In fintech, that’s a big no-no (think PCI, GDPR, SOC2 nightmares).&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Mission
&lt;/h2&gt;

&lt;p&gt;My mission was clear: 👉 &lt;strong&gt;Mask sensitive data&lt;/strong&gt; (like passwords, tokens, card details) in CloudWatch Logs — while still keeping logs useful for troubleshooting.&lt;br&gt;
Bonus challenge: make the solution scalable, beginner-friendly, and fun.&lt;/p&gt;
&lt;h2&gt;
  
  
  🛠️ The Fix
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. First, a refresher
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" rel="noopener noreferrer"&gt;What are CloudWatch Logs?&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
CloudWatch Logs is like your app’s black box recorder — every API call, error, and debug line can land here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can stream logs from API Gateway, Lambda, ECS, EC2, you name it.&lt;/li&gt;
&lt;li&gt;You can search them in CloudWatch Logs Insights or ship them to OpenSearch for SQL-style queries.&lt;/li&gt;
&lt;li&gt;You can even set alarms when things go sideways.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But — and it’s a big BUT — by default, CloudWatch logs whatever you give it. If you send passwords, it happily stores passwords. 🙈&lt;/p&gt;
&lt;h3&gt;
  
  
  2. My “Aha!” moment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-data-protection-policies.html#what-are-data-protection-policies" rel="noopener noreferrer"&gt;CloudWatch Data Protection Policies&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
Instead of writing messy regex scripts or re-engineering logging middleware, AWS now gives us Data Protection Policies for log groups.&lt;/p&gt;

&lt;p&gt;Think of it like a secret filter on your logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You tell CloudWatch which sensitive data patterns to look for.&lt;/li&gt;
&lt;li&gt;It uses managed data identifiers (e.g., credit cards, AWS keys, emails, etc.).&lt;/li&gt;
&lt;li&gt;When a match is found → it’s automatically masked (like ****).&lt;/li&gt;
&lt;li&gt;Only users with the special logs:Unmask permission can see raw values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you create a data protection policy in the AWS Console, you’ll go through a few options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managed Data Identifiers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS provides a long list of preconfigured data types (AWS keys, financial numbers, PII, PHI, etc.) you can select via checkboxes.&lt;/li&gt;
&lt;li&gt;You can also define custom data identifiers using regex if you need something AWS doesn’t provide.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Audit vs Mask (Deidentify)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit: Detect and record sensitive data findings without altering the logs. Great for discovery — you can see where secrets show up before deciding to mask.&lt;/li&gt;
&lt;li&gt;Deidentify (Mask): Actually redacts those values, so they show up as **** everywhere logs are consumed. This ensures secrets never persist unmasked in CloudWatch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Findings Destination&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you choose Audit, you can direct audit findings to another log group, S3 bucket, or Kinesis Firehose. Useful for compliance reports.&lt;/li&gt;
&lt;li&gt;If you don’t specify, it can be left empty ({}), which means no special destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apply Policy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After you save the policy, only newly ingested logs are scanned and masked. Old logs are not retroactively changed.&lt;/li&gt;
&lt;li&gt;Masking only works on Standard log class groups.&lt;/li&gt;
&lt;li&gt;Also, the DataIdentifier arrays in Audit and Deidentify statements must match exactly; otherwise AWS rejects the policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a sample policy combining both audit and mask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;`{
"Name": "data-protection-policy",
"Description": "",
"Version": "2021-06-01",
"Statement": [
{
"Sid": "audit-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/AwsSecretKey",
"arn:aws:dataprotection::aws:data-identifier/BankAccountNumber-FR"
],
"Operation": {
"Audit": {
"FindingsDestination": {}
}
}
},
{
"Sid": "redact-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/AwsSecretKey",
"arn:aws:dataprotection::aws:data-identifier/BankAccountNumber-FR"
],
"Operation": {
"Deidentify": {
"MaskConfig": {}
}
}
}
]
}`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, I like to start with Audit only so I can measure where sensitive data actually shows up. Once I’m confident, I enable Deidentify to make sure those values are masked everywhere.&lt;/p&gt;

&lt;p&gt;Boom 💥 — no more plaintext credentials in logs!&lt;/p&gt;
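&lt;p&gt;The same policy can be attached from code. A boto3 sketch (the log group name is a placeholder), with a guard for the matching-arrays rule mentioned above:&lt;/p&gt;

```python
import json

# Both statements must list identical DataIdentifier arrays, or AWS rejects the policy.
IDENTIFIERS = [
    "arn:aws:dataprotection::aws:data-identifier/AwsSecretKey",
    "arn:aws:dataprotection::aws:data-identifier/BankAccountNumber-FR",
]

POLICY = {
    "Name": "data-protection-policy",
    "Version": "2021-06-01",
    "Statement": [
        {"Sid": "audit-policy",
         "DataIdentifier": IDENTIFIERS,
         "Operation": {"Audit": {"FindingsDestination": {}}}},
        {"Sid": "redact-policy",
         "DataIdentifier": IDENTIFIERS,
         "Operation": {"Deidentify": {"MaskConfig": {}}}},
    ],
}

def validate(policy):
    """Check that the Audit and Deidentify DataIdentifier arrays match exactly."""
    audit, redact = (s["DataIdentifier"] for s in policy["Statement"])
    return audit == redact

def apply_policy(log_group_name):
    """Attach the policy to a log group (requires AWS credentials)."""
    import boto3  # lazy import so the sketch runs without AWS configured
    logs = boto3.client("logs")
    logs.put_data_protection_policy(
        logGroupIdentifier=log_group_name,
        policyDocument=json.dumps(POLICY),
    )
```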

&lt;h3&gt;
  
  
  3. Using the logs after masking
&lt;/h3&gt;

&lt;p&gt;Once logs are masked, you can still consume them as usual:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Logs Insights → run queries like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, @message
  | filter @message like /ERROR/
  | sort @timestamp desc
  | limit 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;OpenSearch SQL → even fancier! You can connect your log group to OpenSearch and query like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    requestId,
    httpMethod,
    status,
    message
  FROM apigateway_logs
  WHERE status &amp;gt;= 500
  ORDER BY timestamp DESC
  LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Masked fields still show up as ****, which means you keep debugging power without exposing secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ The Outcome
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No more sensitive data (passwords, tokens, PAN) leaking into logs.&lt;/li&gt;
&lt;li&gt;Security and compliance team = happy campers.&lt;/li&gt;
&lt;li&gt;Developers still get useful logs for troubleshooting.&lt;/li&gt;
&lt;li&gt;And me? I sleep better at night knowing I won’t wake up to a compliance ticket.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  😂 Fun Takeaway
&lt;/h2&gt;

&lt;p&gt;Think of CloudWatch Data Protection like an automatic censor beep on live TV. Your logs may still be dramatic, but at least they’re PG-13 instead of Rated R for Regulatory Nightmares.&lt;/p&gt;

&lt;p&gt;So if you’re running a fintech API (or really any API), do yourself a favor — let CloudWatch keep the logs, but mask the secrets. After all, what happens in production logs should stay in production logs... safely masked.🥺&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do you think? Have you tried CloudWatch Data Protection yet, or do you still rely on custom masking? I’d love to hear how you’re handling log hygiene in your stack!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudwatch</category>
      <category>devops</category>
      <category>security</category>
    </item>
  </channel>
</rss>
