<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Satyam Gupta</title>
    <description>The latest articles on Forem by Satyam Gupta (@satyam_gupta).</description>
    <link>https://forem.com/satyam_gupta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2506195%2F84c79105-4f8d-4bf7-937e-888ac4344728.jpg</url>
      <title>Forem: Satyam Gupta</title>
      <link>https://forem.com/satyam_gupta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/satyam_gupta"/>
    <language>en</language>
    <item>
      <title>Pandas Series – Part 3: The Power of groupby()</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Sat, 25 Oct 2025 08:19:00 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/pandas-series-part-3-the-power-of-groupby-4m83</link>
      <guid>https://forem.com/satyam_gupta/pandas-series-part-3-the-power-of-groupby-4m83</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of the Pandas Series, where we explore common Pandas gotchas that can level up your interview answers and your daily data work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're in a FAANG interview. The hiring manager says, "We have a petabyte-scale dataset of user transactions. I need you to generate a daily summary report showing, for each country, the total revenue, the average order value, and the number of unique customers who made a purchase. How would you approach this with Pandas?"&lt;/p&gt;

&lt;p&gt;This is the quintessential &lt;code&gt;groupby&lt;/code&gt; problem. Your ability to answer it cleanly and efficiently is a massive signal of your skill level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Concept: Split-Apply-Combine&lt;/strong&gt;&lt;br&gt;
This is the mental model for any groupby operation. It's a three-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Split:&lt;/strong&gt; The data is broken into smaller groups based on the criteria you specify (e.g., all rows for 'USA', all rows for 'India', etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apply:&lt;/strong&gt; A function is applied to each of these smaller groups independently (e.g., &lt;code&gt;sum()&lt;/code&gt;, &lt;code&gt;mean()&lt;/code&gt;, &lt;code&gt;count()&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Combine:&lt;/strong&gt; The results from each group are collected and combined into a new DataFrame or Series.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
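&lt;p&gt;The three steps above can be seen on a toy DataFrame (the data here is illustrative):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "India", "USA", "India"],
    "revenue": [100.0, 50.0, 200.0, 150.0],
})

# Split by country, apply sum() to each group, combine into a Series
totals = df.groupby("country")["revenue"].sum()
print(totals["India"], totals["USA"])  # 200.0 300.0
```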

&lt;p&gt;&lt;strong&gt;From Basic Aggregations to .agg()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest groupby is applying one function to one column: &lt;code&gt;df.groupby('country')['revenue'].sum()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But what if you need the sum, mean, AND count? You could run the code three times, but that's inefficient. The answer is the &lt;code&gt;.agg()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1:&lt;/strong&gt; Multiple functions on one column &lt;code&gt;df.groupby('country')['revenue'].agg(['sum', 'mean', 'count'])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2:&lt;/strong&gt; Different functions on different columns. This is where it gets powerful. You use a dictionary to specify which function to apply to which column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary = df.groupby('country').agg(
    {'revenue': 'sum',      # Sum of the revenue column
     'order_id': 'count',   # Count of all orders
     'customer_id': 'nunique'} # Count of unique customers
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Level 3: Named Aggregations:&lt;/strong&gt; The dictionary method is great, but the column names in the output can be messy. The modern, preferred, and most readable method is Named Aggregation. It lets you control the output column names directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary = df.groupby('country').agg(
    total_revenue=('revenue', 'sum'),
    number_of_orders=('order_id', 'count'),
    unique_customers=('customer_id', 'nunique')
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This syntax, &lt;code&gt;new_column_name=('source_column', 'function')&lt;/code&gt;, is clean, explicit, and produces a perfectly formatted DataFrame. This is what interviewers love to see.&lt;br&gt;
⚡ For petabyte-scale data, this logic translates directly to PySpark’s groupBy().agg() — same concept, distributed execution.&lt;/p&gt;

&lt;p&gt;And that’s a wrap for Part 3 of the Pandas Mastery Series 🎯&lt;br&gt;
Next up: Should we dive into .apply() magic or merging/joining DataFrames next? Drop your pick below 👇&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>pandas</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Pandas Series – Part 2: Common Gotchas Around Indexing</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Tue, 14 Oct 2025 08:46:53 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/pandas-series-part-2-common-gotchas-around-indexing-97h</link>
      <guid>https://forem.com/satyam_gupta/pandas-series-part-2-common-gotchas-around-indexing-97h</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of the Pandas Series, where we explore common Pandas gotchas that can level up your interview answers and your daily data work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What is an Index?&lt;/strong&gt;&lt;br&gt;
Think of the DataFrame index not as a row number, but as a set of labels or addresses for your rows. Like a dictionary key, it's optimized for fast lookups. When you write &lt;code&gt;df[df['column'] == 'value']&lt;/code&gt;, Pandas has to scan the entire column. When you use a well-structured index with &lt;code&gt;df.loc['label']&lt;/code&gt;, Pandas can jump directly to the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;set_index()&lt;/code&gt; and &lt;code&gt;reset_index()&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
These are your primary tools for shaping your index.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.set_index('column_name')&lt;/code&gt;: Promotes one or more columns to become the index.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.reset_index()&lt;/code&gt;: Demotes the index level(s) back to being regular columns.&lt;/p&gt;
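&lt;p&gt;A minimal sketch of both, on made-up data:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "plan": ["free", "pro", "pro"],
})

indexed = df.set_index("user_id")  # 'user_id' now serves as the row labels
row = indexed.loc["u2"]            # direct label lookup, no column scan
back = indexed.reset_index()       # index demoted back to a regular column

print(row["plan"])  # pro
```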

&lt;p&gt;Use these wisely — a well-structured index can cut query time dramatically, especially on large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The MultiIndex (Hierarchical Indexing)&lt;/strong&gt;&lt;br&gt;
This is the "power move." A MultiIndex allows you to have multiple levels of indexing. Imagine a book's table of contents with Chapters, then Sections within Chapters. This structure lets you "drill down" to your data with incredible speed and precision.&lt;/p&gt;

&lt;p&gt;Let's look at an example. A DataFrame with an index on &lt;code&gt;(store, product)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;store&lt;/td&gt;
&lt;td&gt;product&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Store_A&lt;/td&gt;
&lt;td&gt;Apples&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Oranges&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Store_B&lt;/td&gt;
&lt;td&gt;Apples&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Bananas&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Oranges&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this structure, you can easily select:&lt;/p&gt;

&lt;p&gt;All data for Store_A: &lt;br&gt;
&lt;code&gt;df.loc['Store_A']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Data for Apples in Store_B: &lt;br&gt;
&lt;code&gt;df.loc[('Store_B', 'Apples')]&lt;/code&gt;&lt;/p&gt;
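&lt;p&gt;The table and both selections above can be reproduced like this (a small sketch):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["Store_A", "Store_A", "Store_B", "Store_B", "Store_B"],
    "product": ["Apples", "Oranges", "Apples", "Bananas", "Oranges"],
    "sales": [100, 150, 80, 120, 90],
}).set_index(["store", "product"]).sort_index()  # sorted index = fast slicing

store_a = df.loc["Store_A"]                # all rows for Store_A
apples_b = df.loc[("Store_B", "Apples")]   # one (store, product) row

print(apples_b["sales"])  # 80
```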

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; For range-based slicing (like with dates), the index must be sorted for optimal performance. You can do this with &lt;code&gt;df.sort_index()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And that’s it for Part 2 👏&lt;br&gt;
Indexing is one of those Pandas topics most people use without really understanding — but once you do, you unlock serious performance and clarity.&lt;/p&gt;

&lt;p&gt;What’s your favourite indexing trick or gotcha you’ve run into while working with Pandas?&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>pandas</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Pandas Series – Part 1: .loc, SettingWithCopyWarning, and Chained Indexing</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Thu, 02 Oct 2025 11:28:25 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/pandas-loc-settingwithcopywarning-and-chained-indexing-29kn</link>
      <guid>https://forem.com/satyam_gupta/pandas-loc-settingwithcopywarning-and-chained-indexing-29kn</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part 1 of my Pandas Gotchas series — short, sharp lessons on mistakes that trip up even experienced developers (and show up in interviews).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with a classic:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df[df['category'] == 'electronics']['price'] *= 0.9&lt;/code&gt;&lt;br&gt;
At first glance, it looks fine. You’re applying a 10% discount to electronics.&lt;br&gt;
But here’s the catch: sometimes this works, sometimes it doesn’t.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the original DataFrame change?&lt;/li&gt;
&lt;li&gt;Or did your code silently fail?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re unsure, welcome to one of Pandas’ most confusing traps: &lt;strong&gt;views vs. copies&lt;/strong&gt; and the dreaded &lt;strong&gt;chained indexing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This question isn't just about syntax; it's about your understanding of how Pandas works under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Problem: Views vs. Copies&lt;/strong&gt;&lt;br&gt;
When you select a subset of a DataFrame, Pandas might return one of two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A View&lt;/strong&gt;: This is a "window" into the original DataFrame. If you modify a view, the original DataFrame is also modified. This is memory-efficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Copy&lt;/strong&gt;: This is a brand new DataFrame, completely independent of the original. Modifying the copy will not change the original.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The danger is that Pandas doesn't guarantee which one you'll get. This ambiguity is the source of the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Culprit: Chained Indexing&lt;/strong&gt;&lt;br&gt;
The problematic code &lt;code&gt;df[df['category'] == 'electronics']['price'] *= 0.9&lt;/code&gt; is an example of &lt;strong&gt;chained indexing&lt;/strong&gt;. Let's break it down:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df[df['category'] == 'electronics']&lt;/code&gt;: This is the first operation. Pandas executes this and returns a DataFrame. Is it a view or a copy? We don't know for sure. Let's call this temporary result &lt;code&gt;temp_df&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;['price'] *= 0.9&lt;/code&gt;: This second operation is performed on &lt;code&gt;temp_df&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;code&gt;temp_df&lt;/code&gt; was a copy, you just modified a temporary object that is immediately discarded. The original df remains unchanged. This is a silent failure – the worst kind of bug.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.errors.SettingWithCopyWarning.html" rel="noopener noreferrer"&gt;&lt;code&gt;SettingWithCopyWarning&lt;/code&gt;&lt;/a&gt; is Pandas' way of telling you, "Hey, you're trying to modify something that might be a copy. I can't be sure if this will work as you expect."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65h1ja3ta2foo8vhnqmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65h1ja3ta2foo8vhnqmc.png" alt=" " width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Use .loc for Assignments&lt;/strong&gt;&lt;br&gt;
The correct, unambiguous way to perform this operation is with the &lt;code&gt;.loc&lt;/code&gt; indexer. &lt;code&gt;.loc&lt;/code&gt; allows you to specify the rows and columns you want to access or modify in a single operation.&lt;/p&gt;

&lt;p&gt;The syntax is &lt;code&gt;.loc[row_indexer, column_indexer]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatzflljwpab6ua7wyf0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatzflljwpab6ua7wyf0l.png" alt=" " width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Incorrect (Chained Indexing):&lt;br&gt;
&lt;code&gt;df[df['category'] == 'electronics']['price'] *= 0.9&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Correct (Using .loc):&lt;br&gt;
&lt;code&gt;df.loc[df['category'] == 'electronics', 'price'] *= 0.9&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This code guarantees that the modification happens on the original df. You are explicitly telling Pandas: "In the DataFrame df, find the rows where category is electronics, select the price column for those rows, and update it." &lt;br&gt;
No ambiguity, no warning, no silent failures.&lt;/p&gt;
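&lt;p&gt;Here is the fix end-to-end on a toy DataFrame (the prices are illustrative):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["electronics", "books", "electronics"],
    "price": [100.0, 20.0, 200.0],
})

# One .loc call selects rows and column together — no chained indexing,
# so the assignment is guaranteed to hit the original df
df.loc[df["category"] == "electronics", "price"] *= 0.9

print(df["price"].tolist())  # [90.0, 20.0, 180.0]
```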

&lt;p&gt;Takeaway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚨 Never trust chained indexing in Pandas. 
✅ Always use .loc when modifying data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s not just cleaner — it saves you from nasty bugs and makes your intent unambiguous.&lt;/p&gt;

&lt;p&gt;🚀 Stay tuned — this is just Part 1 of the Pandas Gotchas series.&lt;br&gt;
In the upcoming parts, we’ll cover more subtle traps, performance quirks, and interview-style puzzles that every Pandas user should know.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
      <category>career</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ACID vs BASE: The Foundations of Reliable Databases</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Tue, 23 Sep 2025 09:39:51 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/acid-vs-base-the-foundations-of-reliable-databases-hag</link>
      <guid>https://forem.com/satyam_gupta/acid-vs-base-the-foundations-of-reliable-databases-hag</guid>
      <description>&lt;p&gt;When I first started working with databases, I built a small app that updated a database, ran backend logic, and exported a CSV. Everything looked fine—until multiple users ran transactions at once.&lt;/p&gt;

&lt;p&gt;One day, a transaction went through halfway: the database update was still in progress, but the backend logic continued, producing a broken CSV. That was my wake-up call.&lt;/p&gt;

&lt;p&gt;It taught me why ACID properties aren’t just academic—they’re the silent contract that keeps systems reliable.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;The ACID Breakdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Atomicity → All or nothing. If one step fails, the entire transaction rolls back.&lt;/p&gt;

&lt;p&gt;Consistency → The database always moves from one valid state to another.&lt;/p&gt;

&lt;p&gt;Isolation → Transactions don’t interfere. Running in parallel should look like they happened sequentially.&lt;/p&gt;

&lt;p&gt;Durability → Once a transaction is committed, it persists—even after a crash.&lt;/p&gt;
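&lt;p&gt;Atomicity in action — a minimal SQLite sketch (the table and amounts are made up):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
con.commit()

try:
    with con:  # one transaction: commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transaction")  # simulate a failure
except RuntimeError:
    pass

# The partial debit was rolled back — all or nothing
balance = con.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```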

&lt;p&gt;These guarantees make relational databases rock-solid for financial, healthcare, and mission-critical workloads.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Enter BASE in Distributed Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But as systems scale, strict ACID becomes hard (and expensive) to maintain across distributed nodes. That’s where BASE comes in:&lt;/p&gt;

&lt;p&gt;Basically Available → The system guarantees availability even under failures.&lt;/p&gt;

&lt;p&gt;Soft state → Data doesn’t have to be instantly consistent; the system’s state may change over time as replicas sync.&lt;/p&gt;

&lt;p&gt;Eventually consistent → Data replicas converge to the same state over time.&lt;/p&gt;

&lt;p&gt;BASE sacrifices immediacy of correctness for speed and resilience.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Practical Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ACID use cases → banking, inventory management, booking systems.&lt;/p&gt;

&lt;p&gt;BASE use cases → social media feeds, recommendation engines, analytics dashboards.&lt;/p&gt;

&lt;p&gt;Think about your Instagram “likes” count. It may not update instantly on all devices, but it eventually reflects the true number. That’s BASE in action.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Choosing Between ACID and BASE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s never black-and-white. Many modern systems blend both:&lt;/p&gt;

&lt;p&gt;Use ACID guarantees for the critical path (e.g., recording a payment).&lt;/p&gt;

&lt;p&gt;Use BASE for derived, less critical flows (e.g., analytics, user engagement metrics).&lt;/p&gt;

&lt;p&gt;Takeaway: As engineers, our job is not just to know ACID and BASE definitions—but to make conscious trade-offs when designing architectures. Correctness and availability are both valuable—it’s about knowing where to lean on each.&lt;/p&gt;

&lt;p&gt;👉 Where have you seen ACID vs BASE trade-offs in your projects?&lt;/p&gt;

</description>
      <category>programming</category>
      <category>database</category>
      <category>mongodb</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Communicating with Data: A Simple Framework That Changed My Approach</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Wed, 10 Sep 2025 19:48:12 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/communicating-with-data-a-simple-framework-that-changed-my-approach-ie2</link>
      <guid>https://forem.com/satyam_gupta/communicating-with-data-a-simple-framework-that-changed-my-approach-ie2</guid>
      <description>&lt;p&gt;As engineers and analysts, we spend a lot of time building dashboards, pipelines, and reports. But here’s the uncomfortable truth I’ve learned:&lt;/p&gt;

&lt;p&gt;📊Even the best-looking dashboard can still fail.&lt;/p&gt;

&lt;p&gt;Why? Because if the audience doesn’t know what to do with it, the insight is wasted.&lt;/p&gt;

&lt;p&gt;This happened to me multiple times — polishing a dashboard, sending it off, and then being asked: “Cool, but… what now?”&lt;/p&gt;

&lt;p&gt;That’s when I started applying the Who, What, How framework (from Storytelling with Data). It’s simple, but powerful.&lt;/p&gt;

&lt;p&gt;🔹 Who&lt;/p&gt;

&lt;p&gt;Be crystal clear about your audience. Is it a VP making a budget decision? A PM prioritizing a feature? Another engineer debugging performance? Each requires a different lens.&lt;/p&gt;

&lt;p&gt;🔹 What&lt;/p&gt;

&lt;p&gt;Always tie data to an action. Don’t just show that user churn increased — recommend what should be done. Decide, approve, invest, support — make it actionable.&lt;/p&gt;

&lt;p&gt;🔹 How&lt;/p&gt;

&lt;p&gt;Pick the right channel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live deck = your voice carries nuance.&lt;/li&gt;
&lt;li&gt;Written doc/email = more detail, less control.&lt;/li&gt;
&lt;li&gt;Slideument = mix of both, often overused.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And don’t underestimate tone: urgent vs. celebratory vs. exploratory makes a difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools for Structuring Your Story&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;3-Minute Story: If you can’t explain it in 3 minutes, you probably don’t understand it well enough. This forces you to distill the essence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Big Idea: Write down one sentence that combines your unique perspective + what’s at stake. That becomes the anchor for your narrative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storyboarding: Don’t open PowerPoint first. Use paper, a whiteboard, or Post-its to lay out the flow. It saves time and gets stakeholder buy-in before you over-invest in slides.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for developers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We often think communication is “extra.” But if our work doesn’t drive decisions, it’s just numbers on a screen. By clarifying Who, What, and How, I’ve seen my work get adopted faster and conversations move from “interesting” to “decisive.”&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>beginners</category>
      <category>presentation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Building an End-to-End Data Engineering Pipeline with DuckDB and Python</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Mon, 01 Sep 2025 17:39:42 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/building-an-end-to-end-data-engineering-pipeline-with-duckdb-and-python-53g2</link>
      <guid>https://forem.com/satyam_gupta/building-an-end-to-end-data-engineering-pipeline-with-duckdb-and-python-53g2</guid>
      <description>&lt;p&gt;TL;DR: An end‑to‑end data engineering + analytics walkthrough that takes a public dataset from raw → cleaned → star schema (fact + dims) → data quality checks → business marts → charts, all in a single Jupyter notebook.&lt;/p&gt;

&lt;p&gt;This article is also available on Medium and LinkedIn, but here I’ve included more code.&lt;/p&gt;

&lt;p&gt;Data engineering isn’t just about moving data around — it’s about building pipelines that make data usable. In this tutorial, I’ll walk through how I turned a raw dataset into BI-ready marts and visualizations, all inside a single Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;We’ll follow the Medallion Architecture pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze → Raw data&lt;/li&gt;
&lt;li&gt;Silver → Cleaned and standardized&lt;/li&gt;
&lt;li&gt;Gold → Star schema (fact + dimensions)&lt;/li&gt;
&lt;li&gt;QA → Quality checks&lt;/li&gt;
&lt;li&gt;Marts → Business-friendly aggregates&lt;/li&gt;
&lt;li&gt;Visualization → Final charts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Full repo: &lt;a href="https://github.com/Satyam-gupta20/data-engineering-endToend" rel="noopener noreferrer"&gt;https://github.com/Satyam-gupta20/data-engineering-endToend&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6sn2o7u9y2bnd5ig46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6sn2o7u9y2bnd5ig46.png" alt=" " width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingestion (Bronze Layer)
&lt;/h2&gt;

&lt;p&gt;At this stage, everything is “as-is” — messy but captured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_ds = load_dataset('Hacker0x01/disclosed_reports', split = "train").to_pandas()
test_ds = load_dataset('Hacker0x01/disclosed_reports', split = 'test').to_pandas()
validate_ds = load_dataset('Hacker0x01/disclosed_reports', split = 'validation').to_pandas()

df = pd.concat([train_ds, test_ds, validate_ds], ignore_index = True)

# for any null dict fields, add {} to them instead of them being blank
for c in ['reporter', 'team', 'weakness', 'structured_scope']:
  df[c] = df[c].apply(lambda x : x if isinstance(x, dict) else {})

  df[c + "_json"] = df[c].apply(
      lambda d : json.dumps(
          {k : (v.tolist() if isinstance(v, np.ndarray) else v) for k,v in d.items()},
          sort_keys = True,
          ensure_ascii =  False
      )
  )

#convert date fields to datetime
df['created_at'] = pd.to_datetime(df.get("created_at"), errors = "coerce")
df['disclosed_at'] = pd.to_datetime(df.get("disclosed_at"), errors = "coerce")

bronze_cols = [ "id", "title", "created_at", "disclosed_at", "substate", "visibility", "has_bounty?", "vote_count",
               "original_report_id", "reporter_json", "team_json", "weakness_json", "structured_scope_json", "vulnerability_information" ]

bronze = df[[c for c in bronze_cols if c in df.columns]]
bronze.to_parquet("bronze_hackerone_reports.parquet", index = False)
bronze.to_csv("raw_data.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Transformation (Silver Layer)
&lt;/h2&gt;

&lt;p&gt;Silver = cleaned, consistent, ready for modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create staging table with clean/standardized scalar columns
con.sql("""
    CREATE OR REPLACE TABLE stg_reports AS
    SELECT
    CAST(id as BIGINT) as report_id,
    title,
    LOWER(NULLIF(substate,'')) as substate, --normalised casing
    visibility,
    "has_bounty?" as has_bounty,
    CAST(vote_count AS INTEGER) as vote_count,
    CAST(created_at AS TIMESTAMP) as created_at,
    CAST(disclosed_at AS TIMESTAMP) as disclosed_at,
    CAST(original_report_id AS BIGINT) as original_report_id,
    reporter_json,
    weakness_json,
    team_json,
    structured_scope_json,
    vulnerability_information
    FROM bronze;
""")

#stage normalised tables ie, flatten JSON into typed columns
con.sql("""
    CREATE OR REPLACE TABLE stg_reporter AS
    SELECT DISTINCT
    reporter_json,
    json_extract_string(reporter_json, '$.username') as username,
    CAST(json_extract(reporter_json, '$.verified') AS BOOLEAN) as verified
    from bronze;
""")

con.sql("""
    CREATE OR REPLACE TABLE stg_team AS
    SELECT DISTINCT
    team_json,
    json_extract_string(team_json, '$.handle') as handle,
    CAST(json_extract(team_json, '$.id') AS BIGINT) as id,
    CAST(json_extract(team_json,'$.offers_bounties') AS BOOLEAN) AS offers_bounties
    from bronze;
""")

con.sql("""
    CREATE OR REPLACE TABLE stg_weakness AS
    SELECT DISTINCT
    weakness_json,
    json_extract_string(weakness_json,'$.name') AS weakness_name,
    CAST(json_extract(weakness_json, '$.id') AS BIGINT) as id
    FROM bronze;
""")

con.sql("""
CREATE OR REPLACE TABLE stg_asset AS
SELECT DISTINCT
  structured_scope_json,
  json_extract_string(structured_scope_json,'$.asset_identifier') AS asset_identifier,
  json_extract_string(structured_scope_json,'$.asset_type')       AS asset_type,
  json_extract_string(structured_scope_json,'$.max_severity')     AS max_severity
FROM bronze;
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Star Schema (Gold Layer)
&lt;/h2&gt;

&lt;p&gt;We separate fact and dimension tables:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Build a Source of Truth model (star schema) with conformed dimensions and a single fact table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate Surrogate Keys:&lt;/strong&gt; Apply hash(JSON) on each entity’s raw JSON to generate a stable, privacy-safe surrogate key (reporter_id, team_id, weakness_id, asset_id). Why: this maintains join consistency across all downstream systems while protecting sensitive identifiers (handles, usernames, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimension Tables:&lt;/strong&gt; dim_reporter, dim_team, dim_weakness, dim_structured_scope — one row per unique entity, keyed by surrogate ID, and conformed (consistent) across all marts and use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fact Table:&lt;/strong&gt; fact_report — one row per report, with a natural key (report_id), foreign keys to all four dimensions, and core measures and attributes (has_bounty, vote_count, created_at, disclosed_at, substate).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why Star Schema:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Easy to query for BI tools (Looker/Tableau/Power BI).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clear separation of measures (facts) and descriptive attributes (dims).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Facilitates incremental loads and SCD handling in production.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How we do this in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate surrogate keys once in a controlled transformation job to guarantee stability.&lt;/li&gt;
&lt;li&gt;Enforce PK/FK relationships via schema constraints or dbt tests.&lt;/li&gt;
&lt;li&gt;Store Gold layer in a governed warehouse (Snowflake/BigQuery/Redshift) with strict access control, making it the single source of truth for all analytics &amp;amp; AI workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AI tie-in:&lt;/strong&gt; Clean, normalized entity attributes make it easier to build safe, PII-free ML feature sets downstream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;con.sql("""
    CREATE OR REPLACE TABLE dim_reporter as
    SELECT DISTINCT
    hash(reporter_json) as reporter_id, -- surrogate_key
    username,
    verified
    FROM stg_reporter;
""")

con.sql("""
    CREATE OR REPLACE TABLE dim_team as
    SELECT DISTINCT
    hash(team_json) as team_id, -- surrogate_key
    id,
    handle,
    offers_bounties
    FROM stg_team;
""")

con.sql("""
    CREATE OR REPLACE TABLE dim_weakness as
    SELECT DISTINCT
    hash(weakness_json) as weakness_id, -- surrogate_key
    id,
    weakness_name
    FROM stg_weakness;
""")

con.sql("""
    CREATE OR REPLACE TABLE dim_structured_scope as
    SELECT DISTINCT
    hash(structured_scope_json) as asset_id, -- surrogate_key
    asset_identifier,
    asset_type,
    max_severity
    FROM stg_asset;
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famuhvdei5svq1dux5s59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famuhvdei5svq1dux5s59.png" alt=" " width="581" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  QA Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Validate that Gold layer tables meet data integrity, completeness, and consistency standards before they are exposed to BI tools or AI models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Record Counts:&lt;/strong&gt; Ensure no unexpected row loss or duplication between Bronze → Silver → Gold. Example: COUNT(DISTINCT report_id) in fact_report should match the original dataset count (minus intentional filters).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key Integrity Checks:&lt;/strong&gt; All fact_report foreign keys (reporter_id, team_id, weakness_id, asset_id) must exist in their respective dimension tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Null and Data Type Checks:&lt;/strong&gt; Confirm mandatory fields (e.g., created_at, substate) are non-null, and verify correct data types for dates, booleans, and integers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Referential Consistency:&lt;/strong&gt; No orphaned dimension entries (dimensions without a matching fact) unless intentional, e.g., for slowly changing dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guarantees trust in the Source of Truth.&lt;/li&gt;
&lt;li&gt;Prevents BI dashboards or ML pipelines from producing misleading insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How we do this in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use dbt tests (unique, not_null, relationships) or Great Expectations to automate QA.&lt;/li&gt;
&lt;li&gt;Set up CI/CD checks so broken data never reaches production.&lt;/li&gt;
&lt;li&gt;Implement data quality alerts (Slack/Email) when thresholds fail.&lt;/li&gt;
&lt;/ul&gt;
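&lt;p&gt;Under the hood, the key-integrity check is just an anti-join: find fact foreign keys with no matching dimension key. Here is a minimal pure-Python sketch of that logic (the kind of check a dbt &lt;code&gt;relationships&lt;/code&gt; test automates); the sample rows and values are hypothetical:&lt;/p&gt;

```python
# Anti-join sketch of the key-integrity check: return every fact foreign-key
# value that has no matching key in the dimension table.
def orphaned_keys(fact_rows, fk_column, dim_keys):
    """Return fact foreign-key values with no matching dimension key."""
    dim_key_set = set(dim_keys)
    return [row[fk_column] for row in fact_rows if row[fk_column] not in dim_key_set]

# Hypothetical sample data following the article's star schema
fact_report = [
    {"report_id": 1, "reporter_id": "a1"},
    {"report_id": 2, "reporter_id": "b2"},
    {"report_id": 3, "reporter_id": "zz"},  # orphan: no such reporter
]
dim_reporter_ids = ["a1", "b2"]

print(orphaned_keys(fact_report, "reporter_id", dim_reporter_ids))  # → ['zz']
```

&lt;p&gt;A QA gate would fail (and alert) whenever this list is non-empty.&lt;/p&gt;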

&lt;h2&gt;
  
  
  Phase 5 – Aggregation &amp;amp; Marts (Analytics Layer)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Create pre-aggregated, business-friendly datasets optimized for consumption by BI tools, APIs, and AI feature stores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Marts Matter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplify queries for analysts and business users.&lt;/li&gt;
&lt;li&gt;Improve dashboard performance by avoiding heavy aggregations at runtime.&lt;/li&gt;
&lt;li&gt;Provide feature-ready datasets for AI/ML models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How we do this in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create materialized views or incremental tables in the warehouse.&lt;/li&gt;
&lt;li&gt;Store in BI schema separate from operational schemas.&lt;/li&gt;
&lt;li&gt;Automate refreshes using orchestration tools (Airflow/Prefect) on a schedule or event trigger.&lt;/li&gt;
&lt;li&gt;For AI, register these marts in a feature store (Feast/Tecton) so ML teams can use them without re-engineering features.&lt;/li&gt;
&lt;/ul&gt;
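&lt;p&gt;Conceptually, a mart just precomputes an aggregation once so dashboards don't re-scan the fact table at query time. A tiny pure-Python sketch of that idea, with hypothetical rows following the article's star schema:&lt;/p&gt;

```python
# Sketch of what a mart precomputes: report counts per weakness, aggregated
# once up front instead of at every dashboard query.
from collections import Counter

# Hypothetical fact rows
fact_report = [
    {"report_id": 1, "weakness_id": "xss"},
    {"report_id": 2, "weakness_id": "xss"},
    {"report_id": 3, "weakness_id": "sqli"},
]

mart_reports_per_weakness = Counter(row["weakness_id"] for row in fact_report)
print(mart_reports_per_weakness.most_common())  # → [('xss', 2), ('sqli', 1)]
```

&lt;p&gt;In the warehouse the same aggregation would be a GROUP BY materialized as a table or view on a refresh schedule.&lt;/p&gt;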

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0g9zi01j1fgav8x0r6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0g9zi01j1fgav8x0r6u.png" alt=" " width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB is a game-changer for local SQL + analytics.&lt;/li&gt;
&lt;li&gt;The Medallion approach keeps data modeling organized.&lt;/li&gt;
&lt;li&gt;Star schemas still matter — they power BI/analytics-friendly datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Full notebook: GitHub – &lt;a href="https://github.com/Satyam-gupta20/data-engineering-endToend" rel="noopener noreferrer"&gt;data-engineering-end-to-end&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback: What tools do you use for building marts — DuckDB, dbt, or something else?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a Fast URL Shortener with Flask and Redis (Beginner-Friendly 15 minute Tutorial)</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Sun, 22 Jun 2025 07:55:00 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/url-shortener-using-flask-and-redis-1pbo</link>
      <guid>https://forem.com/satyam_gupta/url-shortener-using-flask-and-redis-1pbo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Project Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project implements a basic URL shortening service using Python, the Flask web framework, and Redis as its primary data store. It demonstrates how to build a modular RESTful API that leverages Redis's speed as an in-memory key-value store for efficient URL mapping management and atomic ID generation. This is an excellent project for understanding fundamental backend development, database integration, and the practical application of Redis in a web context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Shorten URL: Accepts a long URL via a POST request and returns a unique, shortened URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redirect: Redirects users from the generated short URL to its original long URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fast Lookups: Utilizes Redis for extremely fast storage and retrieval of URL mappings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unique Code Generation (Custom): Generates unique short codes by atomically incrementing a counter in Redis and converting the resulting ID to a Base62 string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modular Design: Organized into separate Python files for application setup, route handling, and business logic (Redis interactions) using Flask Blueprints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Basic Error Handling: Provides appropriate responses for missing URLs or invalid requests.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Technologies Used&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Python 3.x&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flask: A lightweight and flexible web framework for building the API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redis: An open-source, in-memory data structure store used for fast storage of URL mappings and atomic counter operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redis-py: The official Python client library for interacting with Redis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Project Structure&lt;/strong&gt;&lt;br&gt;
This project follows a common Flask application structure to promote modularity and separation of concerns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url_shortener/
├── app.py                          # Main Flask application entry point, Redis initialization, blueprint registration.
├── routes.py                       # Defines all Flask application routes (e.g., '/', '/shorten', '/&amp;lt;short_code&amp;gt;') within a Blueprint.
├── services.py                     # Contains the core business logic and Redis interaction functions (e.g., generate/store/retrieve URLs).
├── templates/                      # Flask's default directory for HTML templates.
│   └── index.html                  # The simple homepage HTML for the app.
└── README.md                       # This project documentation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Initialization (&lt;code&gt;app.py&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;create_app()&lt;/code&gt; factory function initializes the Flask application instance.&lt;/p&gt;

&lt;p&gt;It establishes the connection to the Redis server by calling &lt;code&gt;services.init_redis()&lt;/code&gt;, ensuring Redis is ready to serve requests.&lt;/p&gt;

&lt;p&gt;It registers the &lt;code&gt;main_bp&lt;/code&gt; Blueprint (defined in &lt;code&gt;routes.py&lt;/code&gt;), which integrates all defined routes into the Flask application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL Shortening (&lt;code&gt;routes.py&lt;/code&gt; &amp;amp; &lt;code&gt;services.py&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A client sends a POST request to the /shorten endpoint (handled by a route in &lt;code&gt;routes.py&lt;/code&gt;) with a JSON body containing the &lt;code&gt;long_url&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The route handler performs basic URL validation and then delegates the core logic to &lt;code&gt;services.generate_and_store_url(long_url)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Inside &lt;code&gt;services.py&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;It uses &lt;code&gt;r.incr('next_short_id')&lt;/code&gt; to atomically increment a counter in Redis. This ensures that every generated ID is globally unique, even with concurrent requests, preventing collisions.&lt;/p&gt;

&lt;p&gt;The unique numeric ID is then converted into a compact, alphanumeric Base62 short_code using a custom encoding function.&lt;/p&gt;
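&lt;p&gt;A minimal sketch of that Base62 step (the function name here is illustrative; the article's &lt;code&gt;services.py&lt;/code&gt; implements its own version):&lt;/p&gt;

```python
# Convert the atomically incremented numeric ID into a compact alphanumeric
# code via repeated division by 62.
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def base62_encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    code = []
    while num > 0:
        num, rem = divmod(num, 62)
        code.append(ALPHABET[rem])
    return "".join(reversed(code))  # most significant digit first

print(base62_encode(1))    # → '1'
print(base62_encode(125))  # → '21'
```

&lt;p&gt;Because the counter only ever increases, every ID maps to a distinct code with no collision checks needed.&lt;/p&gt;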

&lt;p&gt;Finally, &lt;code&gt;r.set(short_code, long_url)&lt;/code&gt; stores this mapping directly in Redis, with the &lt;code&gt;short_code&lt;/code&gt; as the key and the &lt;code&gt;long_url&lt;/code&gt; as its value.&lt;/p&gt;

&lt;p&gt;The route then constructs the full shortened URL and returns it as a JSON response to the client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL Redirection (&lt;code&gt;routes.py&lt;/code&gt; &amp;amp; &lt;code&gt;services.py&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user accesses a shortened URL (e.g., &lt;code&gt;http://localhost:5000/1&lt;/code&gt;), the route handler in &lt;code&gt;routes.py&lt;/code&gt; extracts the &lt;code&gt;short_code&lt;/code&gt; from the URL path.&lt;/p&gt;

&lt;p&gt;It calls &lt;code&gt;services.get_long_url(short_code)&lt;/code&gt; to retrieve the original long_url from Redis.&lt;/p&gt;

&lt;p&gt;If the long_url is found in Redis, the Flask application performs an HTTP redirect to the original URL.&lt;/p&gt;

&lt;p&gt;If the short_code is not found (meaning it's invalid or never existed), a 404 Not Found error is returned.&lt;/p&gt;

&lt;p&gt;Detailed code and a step-by-step implementation can be found on my GitHub here: &lt;a href="https://github.com/Satyam-gupta20/url_shortener_using_redis_flask/tree/main" rel="noopener noreferrer"&gt;URL Shortener&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbel7l40mr806qiahrxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbel7l40mr806qiahrxc.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>flask</category>
      <category>redis</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Beginner’s Guide to Building Your First Serverless Data Pipeline on AWS</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Fri, 20 Jun 2025 20:03:05 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/a-beginners-guide-to-building-your-first-serverless-data-pipeline-on-aw-1cp0</link>
      <guid>https://forem.com/satyam_gupta/a-beginners-guide-to-building-your-first-serverless-data-pipeline-on-aw-1cp0</guid>
      <description>&lt;p&gt;Stepping into the world of cloud computing, especially AWS, can feel daunting with its vast array of services.&lt;br&gt;
This article will walk you through building a simple, cost-effective serverless data ingestion pipeline that fetches real-time weather data and stores it in AWS S3. It's perfect for anyone with basic Python knowledge looking to get their hands dirty with AWS Lambda and S3 on the free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Serverless and AWS?&lt;/strong&gt;&lt;br&gt;
Serverless computing, like AWS Lambda, allows you to run code without provisioning or managing servers. You only pay for the compute time you consume, making it incredibly cost-effective and scalable for many use cases. Combining this with AWS S3, a highly durable and scalable object storage service, gives you a powerful foundation for data pipelines, microservices, and more.&lt;/p&gt;

&lt;p&gt;For a beginner, this project is ideal because it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uses services with generous AWS Free Tier limits.&lt;/li&gt;
&lt;li&gt;Introduces fundamental AWS concepts: Compute (Lambda), Storage (S3), Permissions (IAM), and Automation (EventBridge).&lt;/li&gt;
&lt;li&gt;Leverages Python, a language many data professionals are familiar with.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Project Goal &amp;amp; Architecture&lt;/strong&gt;&lt;br&gt;
Our goal is simple: automatically fetch current weather data for a specific location and store it as a historical record in an AWS S3 bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------+     +-------------------+     +--------------------+
| AWS EventBridge| --&amp;gt; | AWS Lambda        | --&amp;gt; | AWS S3 Bucket      |
| (Scheduled Rule)|     | (Python Function) |     | (Weather Data JSON)|
+----------------+     +-------------------+     +--------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Core Components&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS S3 (Simple Storage Service): Our data lake! This is where all the fetched weather data (as JSON files) will be stored. S3 is designed for 99.999999999% durability, making it extremely reliable.&lt;/li&gt;
&lt;li&gt;AWS IAM (Identity and Access Management): To manage who can do what in our AWS account. We'll create a specific role for our Lambda function, granting it only the necessary permissions (e.g., to write to S3).&lt;/li&gt;
&lt;li&gt;AWS Lambda: The heart of our serverless magic. Our Python code will run here, fetching data, processing it, and pushing it to S3. We don't worry about servers, patching, or scaling; Lambda handles it all.&lt;/li&gt;
&lt;li&gt;Open-Meteo API: Our external data source. This is a free and open-source API that provides weather data without requiring an API key for basic usage.&lt;/li&gt;
&lt;li&gt;AWS EventBridge (formerly CloudWatch Events): This service will act as our scheduler. We'll set up a rule to trigger our Lambda function at regular intervals (e.g., daily).&lt;/li&gt;
&lt;li&gt;AWS CloudWatch: Lambda automatically integrates with CloudWatch for logging and monitoring, which is essential for debugging and observing our function's execution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do Check out the full project README on my &lt;a href="https://github.com/Satyam-gupta20/aws_serverless_data_weather_fetcher" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Set up Your AWS Account (If you haven't already)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to aws.amazon.com.&lt;/li&gt;
&lt;li&gt;Click "Create an AWS Account" and follow the instructions. You will need a credit card, but you won't be charged for free-tier usage.
Important: Note down your AWS Account ID.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create an S3 Bucket (Storage)&lt;/strong&gt;&lt;br&gt;
This is where your weather data files will be stored.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the AWS Management Console.&lt;/li&gt;
&lt;li&gt;Search for "S3" in the search bar at the top and click on "S3".
Click "Create bucket".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99f49ypxl0ki2cyg05qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99f49ypxl0ki2cyg05qy.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket name: Choose a globally unique name (e.g., yourname-weather-data-bucket).
&lt;strong&gt;Rule:&lt;/strong&gt; Bucket names must be lowercase, contain no spaces, and be unique across all of AWS.&lt;/li&gt;
&lt;li&gt;AWS Region: Choose a region close to you (e.g., us-east-1 (N. Virginia), eu-west-1 (Ireland)). This impacts latency and some pricing, but usually stays within free tier for small usage.&lt;/li&gt;
&lt;li&gt;Object Ownership: Keep the default "ACLs disabled (recommended)".&lt;/li&gt;
&lt;li&gt;Block Public Access settings for this bucket: Keep "Block all public access" checked. This is crucial for security. Your Lambda function will access it, not the public internet.&lt;/li&gt;
&lt;li&gt;Leave other settings as default for now.&lt;/li&gt;
&lt;li&gt;Click "Create bucket".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltq67gc6i6xakxcqsdvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltq67gc6i6xakxcqsdvs.png" alt=" " width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Create an IAM Role for Lambda (Permissions)&lt;/strong&gt;&lt;br&gt;
Your Lambda function needs permission to interact with other AWS services (like S3 for putting objects and CloudWatch for logging).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the AWS Management Console, search for "IAM" and click on "IAM".&lt;/li&gt;
&lt;li&gt;In the left navigation pane, click "Roles".&lt;/li&gt;
&lt;li&gt;Click "Create role".&lt;/li&gt;
&lt;li&gt;Trusted entity type: Select "AWS service".&lt;/li&gt;
&lt;li&gt;Use case: Select "Lambda" from the list. Click "Next".&lt;/li&gt;
&lt;li&gt;Permissions policies:&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;AWSLambdaBasicExecutionRole&lt;/strong&gt; and select it. This grants permission to write logs to CloudWatch.&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;AmazonS3FullAccess&lt;/strong&gt; and select it. This grants permission to put objects into S3. (Fine for a learning project; in production, scope this down to s3:PutObject on your specific bucket.)&lt;/li&gt;
&lt;li&gt;Click "Next".&lt;/li&gt;
&lt;li&gt;Role name: Give it a descriptive name, e.g., lambda-weather-s3-role.&lt;/li&gt;
&lt;li&gt;Leave other settings as default.&lt;/li&gt;
&lt;li&gt;Click "Create role".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Create Your Lambda Function (Compute)&lt;/strong&gt;&lt;br&gt;
This will host your Python code.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the AWS Management Console, search for "Lambda" and click on "Lambda".&lt;/li&gt;
&lt;li&gt;Click "Create function".&lt;/li&gt;
&lt;li&gt;Author from scratch is usually the best option for custom code.&lt;/li&gt;
&lt;li&gt;Function name: e.g., WeatherDataFetcher&lt;/li&gt;
&lt;li&gt;Runtime: Choose Python 3.10 (or the latest Python 3.x available).&lt;/li&gt;
&lt;li&gt;Architecture: x86_64 (default).&lt;/li&gt;
&lt;li&gt;Permissions:&lt;/li&gt;
&lt;li&gt;Under "Change default execution role", select "Use an existing role".&lt;/li&gt;
&lt;li&gt;Choose the IAM role you just created: lambda-weather-s3-role.&lt;/li&gt;
&lt;li&gt;Click "Create function".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Write Lambda Python Code and Configure Environment&lt;/strong&gt;&lt;br&gt;
Now you'll add your Python code to the Lambda function.&lt;/p&gt;

&lt;p&gt;Once your Lambda function is created, you'll be on its configuration page. Scroll down to the "Code source" section.&lt;/p&gt;

&lt;p&gt;You'll see a default lambda_function.py file. Replace its content with the following Python code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import datetime
import logging
import urllib.request

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

import boto3
s3 = boto3.client('s3')

# Configuration (make sure these are set in the Lambda environment variables)
S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME')
CITY_LATITUDE = os.environ.get('CITY_LATITUDE')
CITY_LONGITUDE = os.environ.get('CITY_LONGITUDE')

WEATHER_API_URL = "https://api.open-meteo.com/v1/forecast"

def lambda_handler(event, context):
    logger.info(f"Lambda function triggered at {datetime.datetime.now()}")

    if not S3_BUCKET_NAME or not CITY_LATITUDE or not CITY_LONGITUDE:
        logger.error("Missing environment variables: S3_BUCKET_NAME, CITY_LATITUDE, or CITY_LONGITUDE.")
        return {
            'statusCode': 500,
            'body': json.dumps('Configuration error: Missing environment variables.')
        }

    try:
        # Build API URL with parameters
        params = (
            f"latitude={CITY_LATITUDE}&amp;amp;longitude={CITY_LONGITUDE}"
            "&amp;amp;current_weather=true"
            "&amp;amp;temperature_unit=fahrenheit"
            "&amp;amp;windspeed_unit=mph"
            "&amp;amp;timezone=auto"
        )
        full_url = f"{WEATHER_API_URL}?{params}"
        logger.info(f"Fetching weather data from: {full_url}")

        with urllib.request.urlopen(full_url) as response:
            weather_data = json.loads(response.read().decode())

        logger.info(f"Weather data: {weather_data}")

        current_weather = weather_data.get('current_weather', {})
        logger.info(f"Current weather: {current_weather}")

        # Parse timestamp safely
        time_value = current_weather.get('time')

        if time_value:
            try:
                # Open-Meteo returns ISO8601 string like "2025-06-20T18:30"
                timestamp = datetime.datetime.fromisoformat(time_value).isoformat()
            except ValueError:
                # If format isn't ISO, assume UNIX timestamp
                timestamp = datetime.datetime.fromtimestamp(float(time_value)).isoformat()
        else:
            timestamp = datetime.datetime.now().isoformat()

        formatted_data = {
            "timestamp": timestamp,
            "latitude": CITY_LATITUDE,
            "longitude": CITY_LONGITUDE,
            "temperature": current_weather.get('temperature'),
            "windspeed": current_weather.get('windspeed'),
            "winddirection": current_weather.get('winddirection'),
            "weathercode": current_weather.get('weathercode'),
            "is_day": current_weather.get('is_day'),
            "source_api": "Open-Meteo"
        }

        current_utc_time = datetime.datetime.utcnow()
        s3_key_prefix = current_utc_time.strftime("data/%Y/%m/%d/")
        s3_file_name = current_utc_time.strftime("weather_%Y-%m-%d-%H-%M-%S.json")
        s3_object_key = s3_key_prefix + s3_file_name
        logger.info(f"Uploading data to S3 key: {s3_object_key}")

        s3.put_object(
            Bucket=S3_BUCKET_NAME,
            Key=s3_object_key,
            Body=json.dumps(formatted_data, indent=2),
            ContentType='application/json'
        )
        logger.info(f"Successfully uploaded to s3://{S3_BUCKET_NAME}/{s3_object_key}")

        return {
            'statusCode': 200,
            'body': json.dumps(f'Weather data saved to S3 bucket {S3_BUCKET_NAME} as {s3_object_key}!')
        }

    except Exception as e:
        logger.error(f"Error occurred: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps(f'An error occurred: {e}')
        }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deploy the code: After pasting the code, click the "Deploy" button at the top right of the "Code source" section.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure Environment Variables:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Below the "Code source" section, click on the "Configuration" tab.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the left sidebar, click "Environment variables".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click "Edit" and then "Add environment variable".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the following three variables:&lt;br&gt;
&lt;code&gt;&lt;br&gt;
Key: S3_BUCKET_NAME, Value: yourname-weather-data-bucket(your actual S3 bucket name)&lt;br&gt;
Key: CITY_LATITUDE, Value: YOUR_CITY_LATITUDE (e.g., 28.61 for New Delhi)&lt;br&gt;
Key: CITY_LONGITUDE, Value: YOUR_CITY_LONGITUDE (e.g., 77.20 for New Delhi)&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click "Save".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adjust Basic Settings (Memory &amp;amp; Timeout):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still under the "Configuration" tab, click "General configuration".&lt;/li&gt;
&lt;li&gt;Click "Edit".&lt;/li&gt;
&lt;li&gt;Memory: Keep it at 128 MB (default, lowest cost).&lt;/li&gt;
&lt;li&gt;Timeout: Increase it slightly to 30 seconds (or 1 minute) to allow enough time for API calls and S3 uploads.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Click "Save".&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Test Your Lambda Function&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On your Lambda function page, click the "Test" tab (or "Test" button near the top right).&lt;/li&gt;
&lt;li&gt;Click the "Create new event" dropdown.&lt;/li&gt;
&lt;li&gt;Event name: test-invoke&lt;/li&gt;
&lt;li&gt;Event template: "hello-world" (the content of the JSON doesn't matter for this function as it doesn't use the event payload).&lt;/li&gt;
&lt;li&gt;Click "Save".&lt;/li&gt;
&lt;li&gt;Click the "Test" button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j4xviu4m5zk5oxql364.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j4xviu4m5zk5oxql364.png" alt=" " width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should see "Execution results" indicating Status: Succeeded.&lt;br&gt;
Check the "Log output" for messages from logger.info() which will confirm the API call and S3 upload.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify in S3: Go back to your S3 bucket in the AWS console. You should see a new folder data/ and inside it, subfolders for the current year/month/day, containing a new JSON file (e.g., weather_YYYY-MM-DD-HH-MM-SS.json).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Configure a Trigger (EventBridge Schedule)&lt;/strong&gt;&lt;br&gt;
This makes your function run automatically.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On your Lambda function page, click the "Configuration" tab.&lt;/li&gt;
&lt;li&gt;In the left sidebar, click "Triggers".&lt;/li&gt;
&lt;li&gt;Click "Add trigger".&lt;/li&gt;
&lt;li&gt;Select a source: Choose "EventBridge (CloudWatch Events)".&lt;/li&gt;
&lt;li&gt;Rule: "Create a new rule".&lt;/li&gt;
&lt;li&gt;Rule name: e.g., daily-weather-fetcher&lt;/li&gt;
&lt;li&gt;Rule type: "Schedule expression".&lt;/li&gt;
&lt;li&gt;Schedule expression: cron(0 0 * * ? *) (This will run the function once every 24 hours at midnight UTC).
Tip: You can use rate(1 day) for daily, or cron(0 12 ? * MON-FRI *) for 12 PM UTC, Monday-Friday. Be mindful of frequency to stay within free tier! For testing, you could use rate(5 minutes) but change it back after testing.&lt;/li&gt;
&lt;li&gt;Leave other settings as default.&lt;/li&gt;
&lt;li&gt;Click "Add".
Your Lambda function will now automatically run according to your schedule, fetching and storing weather data daily!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Clean Up Your AWS Resources (Crucial for Free Tier)&lt;/strong&gt;&lt;br&gt;
Always remember to clean up resources after a learning project to avoid unexpected charges, especially if you step outside the free tier. Delete the EventBridge rule, the Lambda function, the S3 bucket (after emptying its contents), and the IAM role.&lt;/p&gt;

&lt;p&gt;I hope this article provides a clear guide and inspires others to get hands-on with AWS. The cloud journey is full of exciting possibilities!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>s3</category>
      <category>cloud</category>
    </item>
    <item>
      <title>What is Caching</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Thu, 19 Jun 2025 07:11:52 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/what-is-caching-1ecg</link>
      <guid>https://forem.com/satyam_gupta/what-is-caching-1ecg</guid>
      <description>&lt;p&gt;Caching is a fundamental technique in computing that aims to improve performance by storing frequently accessed data in a faster, more accessible location. Think of it like having a small, super-fast notepad next to your main, much larger bookshelf. When you need a book, you first check your notepad. If it's there, great, you grab it instantly. If not, you go to the bookshelf, find the book, and then also write down its name on your notepad for next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Idea: Speeding Up Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Accessing data from its original source (e.g., a database, a remote server, a hard drive) can be slow.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Store copies of this data in a "cache" – a temporary storage area that is faster to access.&lt;/p&gt;
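&lt;p&gt;This check-the-fast-store-first flow is often called the cache-aside pattern. A minimal sketch in Python, where &lt;code&gt;slow_fetch&lt;/code&gt; is a hypothetical stand-in for a database or remote call:&lt;/p&gt;

```python
# Cache-aside sketch: check the cache first, fall back to the slow source on
# a miss, then populate the cache for next time.
cache = {}
call_count = {"slow_fetch": 0}

def slow_fetch(key):
    # Stand-in for a slow database query or remote API call
    call_count["slow_fetch"] += 1
    return f"value-for-{key}"

def get(key):
    if key in cache:             # cache hit: fast path
        return cache[key]
    value = slow_fetch(key)      # cache miss: go to the original source
    cache[key] = value           # populate the cache
    return value

get("user:42")                   # miss: hits the slow source
get("user:42")                   # hit: served from the cache
print(call_count["slow_fetch"])  # → 1
```

&lt;p&gt;The second request never touches the slow source, which is exactly the speedup caching is after.&lt;/p&gt;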

&lt;p&gt;&lt;strong&gt;Key Concepts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cache Hit: When the requested data is found in the cache. This is the desired outcome, as it means fast access.&lt;/li&gt;
&lt;li&gt;Cache Miss: When the requested data is not found in the cache. In this case, the system has to go to the original (slower) source to retrieve the data. Once retrieved, the data is often then added to the cache for future requests.&lt;/li&gt;
&lt;li&gt;Cache Eviction Policy: What happens when the cache is full and new data needs to be added? An eviction policy determines which existing data to remove to make space. Common policies include:
&lt;ul&gt;
&lt;li&gt;LRU (Least Recently Used): Discard the item that hasn't been accessed for the longest time.&lt;/li&gt;
&lt;li&gt;LFU (Least Frequently Used): Discard the item that has been accessed the fewest times.&lt;/li&gt;
&lt;li&gt;FIFO (First-In, First-Out): Discard the oldest item in the cache.&lt;/li&gt;
&lt;li&gt;MRU (Most Recently Used): Discard the item that was accessed most recently. (Less common for general caching, sometimes used in specific scenarios.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cache Invalidation: How do you ensure the data in the cache is up to date with the original source? This is a critical and often complex aspect. If the original data changes, the cached copy becomes "stale" and needs to be updated or removed. Invalidation strategies include:
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Time-based expiration:&lt;/em&gt; Data expires after a certain period.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Event-driven invalidation:&lt;/em&gt; When the original data changes, a notification is sent to invalidate the cache entry.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Manual invalidation:&lt;/em&gt; Explicitly removing data from the cache.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cache Coherence: In distributed systems, ensuring all caches have a consistent view of the data. This is a very challenging problem.&lt;/li&gt;
&lt;/ol&gt;
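
&lt;p&gt;To make eviction concrete, here is a minimal LRU cache sketch in pure Python, built on the standard library's &lt;code&gt;OrderedDict&lt;/code&gt; (the &lt;code&gt;LRUCache&lt;/code&gt; name and the capacity of 2 are just for illustration):&lt;/p&gt;

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                  # cache miss: caller fetches from the slow source
        self._data.move_to_end(key)      # mark as most recently used
        return self._data[key]           # cache hit

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touching "a" makes it the most recently used
cache.put("c", 3)      # cache is full, so "b" (the LRU entry) is evicted
print(cache.get("b"))  # None -> miss
print(cache.get("a"))  # 1 -> hit
```

&lt;p&gt;Swapping the eviction rule (e.g. tracking access counts for LFU) changes only the &lt;code&gt;popitem&lt;/code&gt; decision; the hit/miss flow stays the same.&lt;/p&gt;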

&lt;p&gt;&lt;strong&gt;Where is Caching Used?&lt;/strong&gt;&lt;br&gt;
Caching is ubiquitous in computing, from hardware to software:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU Caches (L1, L2, L3): Small, extremely fast caches built directly into the CPU to store frequently used instructions and data.&lt;/li&gt;
&lt;li&gt;Web Browsers: Your browser caches static assets (images, CSS, JavaScript) from websites you visit to speed up subsequent visits.&lt;/li&gt;
&lt;li&gt;DNS Caching: Your operating system and local DNS servers cache IP address lookups to resolve domain names faster.&lt;/li&gt;
&lt;li&gt;Content Delivery Networks (CDNs): Geographically distributed servers that cache content (images, videos, web pages) closer to users, reducing latency.&lt;/li&gt;
&lt;li&gt;Database Caching: Caching query results or frequently accessed data in memory to reduce the load on the database server. Examples include Redis, Memcached.&lt;/li&gt;
&lt;li&gt;Application-Level Caching: Developers implement caching within their applications to store computed results or frequently accessed data.&lt;/li&gt;
&lt;li&gt;Operating System Caches: Disk caches, file system caches, etc., to speed up file access.&lt;/li&gt;
&lt;/ol&gt;
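
&lt;p&gt;Application-level caching is often one decorator away. A sketch using the standard library's &lt;code&gt;functools.lru_cache&lt;/code&gt; to memoize an expensive computation (&lt;code&gt;slow_fib&lt;/code&gt; and the call counter are illustrative stand-ins for any costly call):&lt;/p&gt;

```python
import functools

call_count = 0

@functools.lru_cache(maxsize=128)   # stdlib LRU cache wrapped around the function
def slow_fib(n):
    """Pretend-expensive computation; the result is cached per argument."""
    global call_count
    call_count += 1
    if n < 2:
        return n
    return slow_fib(n - 1) + slow_fib(n - 2)

print(slow_fib(30))   # 832040: computed once per distinct n
print(call_count)     # 31 real computations (n = 0..30)
slow_fib(30)          # second call is served entirely from the cache
print(call_count)     # still 31 -> pure cache hits, zero recomputation
```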

&lt;p&gt;&lt;strong&gt;Benefits of Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Improved Performance/Speed: Faster data retrieval, leading to a more responsive user experience.&lt;/li&gt;
&lt;li&gt;Reduced Latency: Less time waiting for data from slow sources.&lt;/li&gt;
&lt;li&gt;Reduced Load on Backend Systems: By serving requests from the cache, the original data source (e.g., database, API) is hit less often, reducing its workload.&lt;/li&gt;
&lt;li&gt;Cost Savings: Fewer requests to backend systems can lead to lower infrastructure costs (e.g., database queries, bandwidth).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Challenges of Caching&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cache Invalidation: The biggest challenge. Ensuring cached data is fresh and consistent with the source. &lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;"There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors."&lt;/p&gt;

&lt;p&gt;(the popular off-by-one twist on Phil Karlton's original quote)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Cache Coherence: In distributed systems, maintaining consistency across multiple caches.&lt;/li&gt;
&lt;li&gt;Cache Miss Penalties: If the cache hit rate is low, the overhead of checking the cache plus going to the original source can sometimes be slower than just going to the original source directly.&lt;/li&gt;
&lt;li&gt;Memory Usage: Caches consume memory, which needs to be managed efficiently.&lt;/li&gt;
&lt;li&gt;Complexity: Implementing robust caching can add complexity to system design.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When to Use Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When data is accessed frequently.&lt;/li&gt;
&lt;li&gt;When the data changes infrequently.&lt;/li&gt;
&lt;li&gt;When the cost of retrieving data from the original source is high (e.g., slow network, expensive computation).&lt;/li&gt;
&lt;li&gt;When you need to reduce load on backend systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When Not to Use Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When data is constantly changing (high churn).&lt;/li&gt;
&lt;li&gt;When the data is rarely accessed.&lt;/li&gt;
&lt;li&gt;When the overhead of caching (memory, invalidation logic) outweighs the benefits.&lt;/li&gt;
&lt;li&gt;When strict real-time consistency is paramount and simple caching mechanisms can't guarantee it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In essence, caching is a trade-off: you gain speed and efficiency by sacrificing some degree of real-time consistency and adding a layer of complexity. Understanding these trade-offs is crucial for effective system design.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>computerscience</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>Activation Functions in Neural Networks — The Real MVPs of Deep Learning</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Thu, 29 May 2025 18:39:36 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/activation-functions-in-neural-networks-the-real-mvps-of-deep-learning-26g6</link>
      <guid>https://forem.com/satyam_gupta/activation-functions-in-neural-networks-the-real-mvps-of-deep-learning-26g6</guid>
      <description>&lt;p&gt;If you've ever dipped your toes into the world of neural networks, you’ve probably heard about &lt;em&gt;activation functions&lt;/em&gt;. At first glance, they sound like some secret weapon only AI wizards know about. But once you understand what they are and why they matter, it all clicks.&lt;/p&gt;

&lt;p&gt;Let me break it down in a way that feels less like a math textbook and more like a conversation over coffee.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is an Activation Function?
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a neural network. You pass in some inputs (like pixel values from an image), multiply them by some weights, add a bias, and boom — you get a number.&lt;/p&gt;

&lt;p&gt;But then what?&lt;/p&gt;

&lt;p&gt;That’s where activation functions come in. They act like &lt;em&gt;gatekeepers&lt;/em&gt; or &lt;em&gt;decision-makers&lt;/em&gt; for each neuron. Once your neuron computes a value, the activation function decides whether it should "fire" or not — and by how much.&lt;/p&gt;

&lt;p&gt;Without this decision-making layer, your entire neural network would just be a glorified linear equation. And that means it wouldn’t be able to understand complex things like recognizing faces, translating languages, or even recommending you cat videos.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do We Need Them?
&lt;/h2&gt;

&lt;p&gt;In a nutshell:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They bring &lt;strong&gt;non-linearity&lt;/strong&gt; to the model.&lt;/li&gt;
&lt;li&gt;They help the network learn &lt;strong&gt;complex patterns&lt;/strong&gt; in data.&lt;/li&gt;
&lt;li&gt;They allow &lt;strong&gt;backpropagation&lt;/strong&gt; by being differentiable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like this: If you’re trying to model something complex like "is this a dog or a cat?", you need a network that can think in &lt;strong&gt;curves&lt;/strong&gt;, not straight lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Activation Functions
&lt;/h2&gt;

&lt;p&gt;Let’s take a tour of the most popular ones, minus the jargon:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Sigmoid (𝜎)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;^-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Range: (0, 1)&lt;/p&gt;

&lt;p&gt;Good for: Binary classification (e.g., yes/no problems)&lt;/p&gt;

&lt;p&gt;Analogy: Like turning a dimmer switch between 0 and 1.&lt;/p&gt;

&lt;p&gt;Downside: Can lead to the vanishing gradient problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;ReLU (Rectified Linear Unit)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Range: [0, ∞)&lt;/p&gt;

&lt;p&gt;Default choice for most hidden layers.&lt;/p&gt;

&lt;p&gt;Fast, simple, and introduces non-linearity beautifully.&lt;/p&gt;

&lt;p&gt;Problem: Sometimes neurons "die" and only output 0.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Softmax&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Used only in the output layer for multi-class classification problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example: If you're building a model to detect whether an image is a cat, dog, or rabbit, softmax gives you something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cat: 0.7&lt;/li&gt;
&lt;li&gt;Dog: 0.2&lt;/li&gt;
&lt;li&gt;Rabbit: 0.1&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How Do You Choose the Right Activation Function?
&lt;/h2&gt;

&lt;p&gt;Here’s a cheat sheet I wish someone gave me earlier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer Type&lt;/th&gt;
&lt;th&gt;Recommended Activation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hidden Layers&lt;/td&gt;
&lt;td&gt;ReLU or Leaky ReLU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output (Binary)&lt;/td&gt;
&lt;td&gt;Sigmoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output (Multi-class)&lt;/td&gt;
&lt;td&gt;Softmax&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
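
&lt;p&gt;The formulas above are easy to run for yourself. A minimal pure-Python sketch of all three functions (standard &lt;code&gt;math&lt;/code&gt; module only; the sample scores are made up):&lt;/p&gt;

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1), like a dimmer switch."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Passes positive values through, clamps negatives to 0."""
    return max(0.0, x)

def softmax(xs):
    """Turns a list of raw scores into probabilities that sum to 1."""
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0))                 # 0.5 -> dead center of the (0, 1) range
print(relu(-3), relu(2))          # 0.0 2 -> negatives are zeroed out
probs = softmax([2.0, 1.0, 0.1])  # three class scores -> a probability distribution
print(probs, sum(probs))          # largest score gets the largest probability; sum is 1.0
```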

</description>
      <category>machinelearning</category>
      <category>tensorflow</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Machine Learning: A Quick Intro to different types of ML</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Wed, 07 May 2025 06:59:31 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/machine-learning-a-quick-intro-to-different-types-of-ml-5gme</link>
      <guid>https://forem.com/satyam_gupta/machine-learning-a-quick-intro-to-different-types-of-ml-5gme</guid>
      <description>&lt;p&gt;Machine learning is everywhere these days—from the recommendations on Netflix to self-driving cars cruising down the streets. But if you’ve ever tried to dive deeper, you’ve probably come across terms like supervised, unsupervised, and reinforcement learning. It can get overwhelming, especially if you're just starting out.&lt;/p&gt;

&lt;p&gt;Let’s break it down in simple terms, without the jargon overload, so you can actually understand what these types of machine learning are all about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6baqf9dkymf00c3sajj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6baqf9dkymf00c3sajj8.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Supervised Learning: Learning with a Teacher&lt;/strong&gt;&lt;br&gt;
Imagine you’re learning math in school. Your teacher gives you a bunch of problems and the answers. Over time, you start figuring out patterns—"Oh, whenever I see this kind of equation, I should solve it this way." That’s exactly what supervised learning is like.&lt;/p&gt;

&lt;p&gt;In supervised learning, the machine is given input data and the correct output (the “label”). It learns from these examples so that, later, when it sees new data, it can predict the right answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spam detection in emails&lt;/li&gt;
&lt;li&gt;Predicting house prices&lt;/li&gt;
&lt;li&gt;Recognizing handwritten digits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example algorithms:&lt;/strong&gt; Linear regression, decision trees, random forests, support vector machines.&lt;/p&gt;
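&lt;p&gt;Here's the "teacher" idea as code: a minimal linear-regression sketch using closed-form least squares in pure Python. The toy data (y = 2x + 1) is invented for illustration:&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Labeled" training data: inputs paired with the correct answers (here y = 2x + 1)
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0 -> the model recovered the pattern
print(slope * 10 + intercept)  # 21.0 -> prediction for the unseen input x = 10
```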

&lt;p&gt;&lt;strong&gt;2. Unsupervised Learning: Figuring it Out Alone&lt;/strong&gt;&lt;br&gt;
Now, imagine you’re handed a bunch of puzzles, but no one tells you what the final picture should look like. You have to sort things out on your own—group similar pieces together, notice patterns, and maybe discover some hidden structure.&lt;/p&gt;

&lt;p&gt;That’s unsupervised learning in a nutshell. The machine is only given data—no answers, no labels. Its job is to find patterns or groups within the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer segmentation (grouping customers based on buying behavior)&lt;/li&gt;
&lt;li&gt;Market basket analysis (what products are bought together)&lt;/li&gt;
&lt;li&gt;Anomaly detection (spotting unusual activity in bank transactions)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example algorithms&lt;/strong&gt;: K-means clustering, hierarchical clustering, principal component analysis.&lt;/p&gt;
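&lt;p&gt;And here is "figuring it out alone" in miniature: a tiny 1-D k-means sketch in pure Python. The "customer spend" numbers are invented to contain two obvious groups; note the algorithm never sees a label:&lt;/p&gt;

```python
def kmeans_1d(points, k=2, iters=20):
    """Tiny 1-D k-means: group numbers around k centroids, no labels needed."""
    centroids = sorted(points)[:k]  # naive init: the k smallest points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Unlabeled data with two natural groups (small spenders vs. big spenders)
points = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]
centroids, clusters = kmeans_1d(points, k=2)
print(sorted(centroids))  # roughly [1.0, 10.1]: two segments discovered on their own
```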

&lt;p&gt;&lt;strong&gt;3. Reinforcement Learning: Learning by Trial and Error&lt;/strong&gt;&lt;br&gt;
Picture a child learning to ride a bike. At first, they fall a lot. But every time they manage to stay balanced a little longer, they get excited and try again. Over time, they figure out how to keep going without falling. That’s the spirit of reinforcement learning.&lt;/p&gt;

&lt;p&gt;In this type of learning, an agent (the learner) interacts with an environment. It takes actions, gets rewards or penalties, and uses that feedback to improve over time. The goal? Maximize the reward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Training AI to play video games&lt;/li&gt;
&lt;li&gt;Robotics (teaching robots to walk or grasp objects)&lt;/li&gt;
&lt;li&gt;Dynamic pricing models&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example algorithms:&lt;/strong&gt; Q-learning, Deep Q Networks (DQNs), Policy gradients.&lt;/p&gt;
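&lt;p&gt;To see trial and error in action, here is a minimal Q-learning sketch on a made-up five-state corridor where the only reward sits at the far end. The environment, rewards, and hyperparameters are all illustrative:&lt;/p&gt;

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4              # a tiny 1-D corridor: 0 1 2 3 [goal]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

# Q[state][action]: estimated long-term reward (action 0 = left, 1 = right)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(200):
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)            # explore: try a random action
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1  # exploit: best known action
        s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0         # reward only on reaching the goal
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = ["left" if Q[s][0] > Q[s][1] else "right" for s in range(GOAL)]
print(policy)  # every non-goal state now prefers stepping toward the goal
```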

&lt;p&gt;✨ &lt;strong&gt;Wrapping It Up&lt;/strong&gt;&lt;br&gt;
At its core, machine learning is about enabling computers to learn patterns from data. The different types—supervised, unsupervised, reinforcement—just describe how they’re learning.&lt;/p&gt;

&lt;p&gt;Each type has its strengths and is suited for different problems. Knowing which one to use is like picking the right tool for the job. And as machine learning continues to evolve, hybrid approaches and new techniques are being developed every day.&lt;/p&gt;

&lt;p&gt;Whether you’re a curious beginner or someone considering a career in AI, understanding these basics is the perfect first step.&lt;/p&gt;

&lt;p&gt;So next time you hear “machine learning,” you won’t just nod along—you’ll actually get it!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>AWS S3 Storage Classes Explained: Choosing the Right One</title>
      <dc:creator>Satyam Gupta</dc:creator>
      <pubDate>Sat, 26 Apr 2025 19:16:57 +0000</pubDate>
      <link>https://forem.com/satyam_gupta/aws-s3-storage-classes-explained-choosing-the-right-one-55</link>
      <guid>https://forem.com/satyam_gupta/aws-s3-storage-classes-explained-choosing-the-right-one-55</guid>
      <description>&lt;p&gt;Did you know that &lt;a href="https://docs.aws.amazon.com/s3/" rel="noopener noreferrer"&gt;AWS S3&lt;/a&gt; offers &lt;strong&gt;8 major storage classes&lt;/strong&gt;, each optimized for different use cases? Choosing the right one can save you thousands in cloud costs! In this article, we'll break down S3 storage classes and help you pick the best one for your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon S3?
&lt;/h2&gt;

&lt;p&gt;Amazon S3 (Simple Storage Service) is a highly scalable object storage service that enables you to store and retrieve data from anywhere on the web. Objects live in a &lt;strong&gt;flat structure&lt;/strong&gt; (keys in a bucket, not a true directory tree), and unlike block storage, S3 writes each object as a whole: modify even a single character in a 1GB file and the entire object must be re-uploaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features of S3:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage&lt;/strong&gt;: Data is stored as objects in a flat structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buckets&lt;/strong&gt;: Objects are stored in &lt;strong&gt;buckets&lt;/strong&gt;—logical containers where you define a region and a unique name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Automatically scales to accommodate vast amounts of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AWS S3 Storage Classes &amp;amp; Their Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS offers multiple storage classes to balance cost, performance, and durability. Here’s a breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Storage Class&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Retrieval Time&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequently accessed data&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data with unpredictable access patterns&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Medium (monitoring &amp;amp; automation fees apply)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequently accessed, needs quick retrieval&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower than Standard; 30-day minimum storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent access (Single AZ)&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower than Standard-IA; 30-day minimum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Archival data needing fast access&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Low; 90-day minimum storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Archive with flexible retrieval time&lt;/td&gt;
&lt;td&gt;Expedited: 1-5 min&lt;br&gt;Standard: 3-5 hrs&lt;br&gt;Bulk: 5-12 hrs&lt;/td&gt;
&lt;td&gt;Lower than Instant Retrieval; 90-day minimum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-term archival (rarely accessed)&lt;/td&gt;
&lt;td&gt;Standard: 12 hrs&lt;br&gt;Bulk: 48 hrs&lt;/td&gt;
&lt;td&gt;Lowest; 180-day minimum storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance AI/ML applications&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Medium; Single AZ storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(source: &lt;a href="https://aws.amazon.com/s3/storage-classes/" rel="noopener noreferrer"&gt;AWS S3 Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deep Dive into Each Storage Class&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Standard (General Purpose)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best for &lt;strong&gt;frequently accessed data&lt;/strong&gt; (e.g., cloud apps, websites, gaming, and big data analytics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability&lt;/strong&gt; and &lt;strong&gt;low-latency retrieval&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Intelligent-Tiering&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Automatically moves objects between access tiers (&lt;strong&gt;frequent&lt;/strong&gt;, &lt;strong&gt;infrequent&lt;/strong&gt;, and archive instant access) based on observed access patterns, with optional deeper archive tiers.&lt;/li&gt;
&lt;li&gt;Ideal when access patterns are unpredictable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Standard-IA (Infrequent Access)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data that’s &lt;strong&gt;accessed less often but needs rapid retrieval&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Lower storage cost than Standard, but &lt;strong&gt;higher retrieval cost&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Suitable for &lt;strong&gt;backups, disaster recovery, and long-term data storage&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. One Zone-IA&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Like Standard-IA but stores data in &lt;strong&gt;a single Availability Zone&lt;/strong&gt; (AZ), reducing costs by ~20%.&lt;/li&gt;
&lt;li&gt;Best for &lt;strong&gt;non-critical data where high availability isn’t required&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Glacier Instant Retrieval&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low-cost archival storage&lt;/strong&gt; with &lt;strong&gt;millisecond retrieval times&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Good for long-term storage of data that’s &lt;strong&gt;rarely accessed&lt;/strong&gt; but still needs fast retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Glacier Flexible Retrieval&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Designed for &lt;strong&gt;data accessed 1-2 times per year&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower retrieval cost&lt;/strong&gt; than Glacier Instant Retrieval, but &lt;strong&gt;retrieval times vary from minutes to hours&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7. Glacier Deep Archive&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lowest-cost storage class&lt;/strong&gt; for &lt;strong&gt;long-term retention&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Ideal for &lt;strong&gt;compliance and regulatory archives&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval time: Hours&lt;/strong&gt; (best for data you rarely need to access).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8. S3 Express One Zone&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon’s newest storage class, optimized for fast, low-latency workloads that don’t require high availability across multiple regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval time:  Faster than Standard S3 due to its single-zone architecture&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
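
&lt;p&gt;In practice you rarely pick one class forever; a lifecycle rule tiers objects down automatically as they age. A sketch of the rule set you would pass to boto3's &lt;code&gt;put_bucket_lifecycle_configuration&lt;/code&gt; (the rule ID, prefix, and day thresholds are hypothetical; the actual API call is shown only as a comment):&lt;/p&gt;

```python
# Hypothetical lifecycle rules in the shape boto3 expects, e.g.:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle)
lifecycle = {
    "Rules": [
        {
            "ID": "tier-down-old-logs",        # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # only applies under this prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},       # Flexible Retrieval
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},       # delete after a year
        }
    ]
}

# Sanity check: transitions should be ordered and respect minimum storage durations
days = [t["Days"] for t in lifecycle["Rules"][0]["Transitions"]]
print(days)  # [30, 90, 180]
```

&lt;p&gt;Note how the thresholds line up with the minimum storage durations in the table above (30 days for Standard-IA, 90 for Glacier, 180 for Deep Archive), so you never pay early-deletion fees.&lt;/p&gt;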

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Choosing the right &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; S3 storage class depends on &lt;strong&gt;how frequently you access your data&lt;/strong&gt; and &lt;strong&gt;your cost constraints&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Which S3 storage class do you use the most? Let me know in the comments!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>programming</category>
      <category>cloudcomputing</category>
      <category>cloudstorage</category>
    </item>
  </channel>
</rss>
