<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Salma Aga Shaik</title>
    <description>The latest articles on Forem by Salma Aga Shaik (@salma_aga).</description>
    <link>https://forem.com/salma_aga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3211368%2F9cfb9755-78f8-443d-9b01-43b5f8a5bc97.png</url>
      <title>Forem: Salma Aga Shaik</title>
      <link>https://forem.com/salma_aga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/salma_aga"/>
    <language>en</language>
    <item>
      <title>Understand Hadoop and Apache Spark</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:41:49 +0000</pubDate>
      <link>https://forem.com/salma_aga/understand-hadoop-and-apache-spark-f74</link>
      <guid>https://forem.com/salma_aga/understand-hadoop-and-apache-spark-f74</guid>
      <description>&lt;p&gt;Imagine a company that runs a very popular online platform. Every day, millions of users visit the website, make purchases, click on products, and generate application logs. All these activities produce &lt;strong&gt;a very large amount of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After some time, the company collects &lt;strong&gt;terabytes of data&lt;/strong&gt;. This data includes customer transactions, website clicks, machine logs, and system events.&lt;/p&gt;

&lt;p&gt;Now the company wants to &lt;strong&gt;analyze this data&lt;/strong&gt; to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which products are selling the most?&lt;/li&gt;
&lt;li&gt;What time do customers visit the website?&lt;/li&gt;
&lt;li&gt;Are there any system errors?&lt;/li&gt;
&lt;li&gt;How can the company improve its services?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first, the company tries to process the data using &lt;strong&gt;one computer&lt;/strong&gt;, but the data is too large. The computer becomes slow and cannot process the data efficiently.&lt;/p&gt;

&lt;p&gt;To solve this problem, the company decides to use a &lt;strong&gt;distributed system&lt;/strong&gt;, where many machines work together to store and process the data.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Hadoop&lt;/strong&gt; and &lt;strong&gt;Apache Spark&lt;/strong&gt; come into the picture.&lt;/p&gt;




&lt;h1&gt;
  
  
  Hadoop: Storing and Processing Large Data
&lt;/h1&gt;

&lt;p&gt;The company first starts using &lt;strong&gt;Hadoop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is a &lt;strong&gt;big data framework&lt;/strong&gt; that helps companies &lt;strong&gt;store and process large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One important part of Hadoop is &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of storing a large file on one machine, Hadoop &lt;strong&gt;splits the file into smaller blocks&lt;/strong&gt; and stores those blocks across many machines in the cluster. This allows the system to &lt;strong&gt;store huge amounts of data reliably&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop also uses a processing model called &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MapReduce processes the data step by step across the cluster. However, during processing it &lt;strong&gt;writes intermediate data to disk many times&lt;/strong&gt;, which makes the processing slower.&lt;/p&gt;

&lt;p&gt;Hadoop works well for &lt;strong&gt;batch processing&lt;/strong&gt;, where large data is processed in stages.&lt;/p&gt;




&lt;h1&gt;
  
  
  Spark: Faster Data Processing
&lt;/h1&gt;

&lt;p&gt;Later, the company learns about &lt;strong&gt;Apache Spark&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Spark is a &lt;strong&gt;fast distributed data processing engine&lt;/strong&gt; designed to process large datasets quickly.&lt;/p&gt;

&lt;p&gt;Like Hadoop, Spark also processes data across &lt;strong&gt;multiple machines in a cluster&lt;/strong&gt;. However, Spark has a major advantage.&lt;/p&gt;

&lt;p&gt;Spark performs &lt;strong&gt;in-memory computation&lt;/strong&gt;, which means it processes data in &lt;strong&gt;memory (RAM)&lt;/strong&gt; instead of repeatedly writing data to disk.&lt;/p&gt;

&lt;p&gt;Because memory is much faster than disk, Spark can process data &lt;strong&gt;much faster than Hadoop MapReduce&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Spark Works
&lt;/h1&gt;

&lt;p&gt;In a Spark system, many machines work together.&lt;/p&gt;

&lt;p&gt;At the center of the system is the &lt;strong&gt;Driver Program&lt;/strong&gt;. The driver acts like the &lt;strong&gt;manager&lt;/strong&gt; of the Spark application. It starts the job, creates the execution plan, and manages the processing.&lt;/p&gt;

&lt;p&gt;The actual data processing happens in &lt;strong&gt;Executors&lt;/strong&gt;. Executors run on worker machines in the cluster and perform the real computation.&lt;/p&gt;

&lt;p&gt;When a Spark job starts, the driver creates a plan called a &lt;strong&gt;DAG (Directed Acyclic Graph)&lt;/strong&gt;. This plan shows how the data will be processed step by step.&lt;/p&gt;

&lt;p&gt;Spark then divides the job into smaller tasks and sends those tasks to executors. The executors process the data in parallel and return the results to the driver.&lt;/p&gt;
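&lt;p&gt;The driver and executor roles can be imitated in plain Python with a thread pool: one coordinating function splits the data into partitions, workers compute partial results in parallel, and the coordinator merges them. This is only an analogy, not real Spark code:&lt;/p&gt;

```python
# Plain-Python analogy of the Spark driver/executor model (not real Spark:
# real executors are separate processes on worker machines; threads stand
# in for them here).
from concurrent.futures import ThreadPoolExecutor

def count_errors(partition):
    """Task run by a 'worker': count error lines in one partition."""
    return sum(1 for line in partition if "ERROR" in line)

def run_job(log_lines, num_partitions=4):
    """The 'driver': plan partitions, dispatch tasks, merge the results."""
    partitions = [log_lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = pool.map(count_errors, partitions)
    return sum(partial_counts)

logs = ["INFO ok", "ERROR disk", "INFO ok", "ERROR net", "ERROR auth"]
print(run_job(logs))  # 3
```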




&lt;h1&gt;
  
  
  Transformations and Actions in Spark
&lt;/h1&gt;

&lt;p&gt;Spark operations are divided into two types.&lt;/p&gt;

&lt;p&gt;The first type is &lt;strong&gt;Transformations&lt;/strong&gt;. These operations describe how a new dataset should be derived from an existing one, but they do not execute immediately. Examples include filtering rows or selecting columns.&lt;/p&gt;


&lt;p&gt;The second type is &lt;strong&gt;Actions&lt;/strong&gt;. Actions trigger the actual execution of the Spark job. Examples include counting records or saving results.&lt;/p&gt;

&lt;p&gt;Spark waits until an action is called before executing the full computation. This concept is called &lt;strong&gt;lazy evaluation&lt;/strong&gt;, which helps improve performance.&lt;/p&gt;
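&lt;p&gt;Lazy evaluation can be illustrated with plain Python generators, which also describe work without running it until something consumes the result. This is an analogy, not Spark itself:&lt;/p&gt;

```python
# Pure-Python analogy of Spark's lazy evaluation (not real Spark).
# Generator expressions, like transformations, only describe work;
# nothing runs until a terminal operation (the "action") consumes them.
data = [3, 7, 1, 9, 4, 8]

# "Transformations": build a lazy pipeline; no element is processed yet.
filtered = (x for x in data if x > 4)
doubled = (x * 2 for x in filtered)

# "Action": consuming the pipeline triggers the whole computation at once.
result = sum(doubled)
print(result)  # (7 + 9 + 8) * 2 = 48
```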




&lt;h1&gt;
  
  
  Where Spark Is Used
&lt;/h1&gt;

&lt;p&gt;Spark is widely used in &lt;strong&gt;data engineering and analytics pipelines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Data Sources&lt;br&gt;
→ Streaming systems or APIs&lt;br&gt;
→ Spark processing&lt;br&gt;
→ Data lake (Amazon S3 or HDFS)&lt;br&gt;
→ Data warehouse (Redshift or Snowflake)&lt;br&gt;
→ BI tools like Power BI or Tableau&lt;/p&gt;

&lt;p&gt;Spark processes and transforms the data so that companies can analyze it and generate insights.&lt;/p&gt;




&lt;h1&gt;
  
  
  Difference Between Hadoop and Spark
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Hadoop&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it is&lt;/td&gt;
&lt;td&gt;Hadoop is a big data framework used to store and process large data.&lt;/td&gt;
&lt;td&gt;Spark is a fast data processing engine used to process large data quickly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How it processes data&lt;/td&gt;
&lt;td&gt;Hadoop processes data using MapReduce and writes data to disk many times.&lt;/td&gt;
&lt;td&gt;Spark processes data mostly in memory (RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Hadoop is slower because it reads and writes data to disk frequently.&lt;/td&gt;
&lt;td&gt;Spark is faster because it processes data in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;Hadoop is mainly used for storing large data and batch processing.&lt;/td&gt;
&lt;td&gt;Spark is used for fast data processing and analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type of processing&lt;/td&gt;
&lt;td&gt;Hadoop mostly supports batch processing.&lt;/td&gt;
&lt;td&gt;Spark supports batch processing, streaming, machine learning, and SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of coding&lt;/td&gt;
&lt;td&gt;Hadoop MapReduce requires more code and is harder to write.&lt;/td&gt;
&lt;td&gt;Spark is easier to use because it offers high-level APIs in Python, Scala, and Java, as well as SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is used&lt;/td&gt;
&lt;td&gt;Hadoop is often used for distributed storage using HDFS.&lt;/td&gt;
&lt;td&gt;Spark is used for ETL pipelines, real-time analytics, and big data processing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop and Spark are both technologies used to process very large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is mainly used for &lt;strong&gt;distributed storage and batch processing&lt;/strong&gt;, while Spark is designed for &lt;strong&gt;fast data processing using in-memory computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today, many companies use Spark with cloud platforms such as &lt;strong&gt;AWS EMR, AWS Glue, and Databricks&lt;/strong&gt; to build modern data engineering and analytics systems.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Modern Data Engineering Architecture Across AWS, GCP, and Azure</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:58:47 +0000</pubDate>
      <link>https://forem.com/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</link>
      <guid>https://forem.com/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</guid>
      <description>&lt;p&gt;In modern data platforms, organizations build &lt;strong&gt;end-to-end data pipelines&lt;/strong&gt; to &lt;strong&gt;collect, process, store, and analyze large volumes of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Although different cloud providers offer different services, the &lt;strong&gt;core architecture pattern remains the same&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering architecture&lt;/strong&gt; contains the following stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Generation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Lake Storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL Query Layer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Intelligence Visualization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;End-to-End Data Pipeline Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" alt="Image" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" alt="Image" width="720" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" alt="Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" alt="Image" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above represents a &lt;strong&gt;typical enterprise data pipeline architecture&lt;/strong&gt; used by modern companies.&lt;/p&gt;

&lt;p&gt;The goal of this architecture is to move data from &lt;strong&gt;operational systems&lt;/strong&gt; into &lt;strong&gt;analytics platforms&lt;/strong&gt; where it can generate &lt;strong&gt;business insights&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Cloud Data Engineering Architecture Comparison&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Architecture Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;What Happens in This Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GCP Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Azure Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Data Sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is generated from &lt;strong&gt;applications, IoT devices, databases, logs, and user transactions&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;RDS databases&lt;/strong&gt;, server logs, IoT sensors&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Cloud SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Azure SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Data Ingestion (Streaming)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Real-time data&lt;/strong&gt; is continuously collected and streamed into the data pipeline.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Amazon Kinesis&lt;/strong&gt; or &lt;strong&gt;Managed Kafka (MSK)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Pub/Sub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Event Hubs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Batch Data Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batch data from &lt;strong&gt;files, APIs, or databases&lt;/strong&gt; is periodically ingested.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Dataflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Data Processing (ETL / Big Data Processing)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is &lt;strong&gt;cleaned, transformed, and enriched&lt;/strong&gt; using distributed processing frameworks.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon EMR running Apache Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dataproc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Databricks&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw and processed data is stored in &lt;strong&gt;scalable object storage systems&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Metadata &amp;amp; Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores &lt;strong&gt;metadata information&lt;/strong&gt; such as schema definitions and table structures.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Purview&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. SQL Query Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineers and analysts run &lt;strong&gt;SQL queries on large datasets&lt;/strong&gt; stored in the data lake.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Data Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Processed data is loaded into a &lt;strong&gt;data warehouse optimized for analytics queries&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;9. Workflow Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipelines are &lt;strong&gt;scheduled and automated&lt;/strong&gt; to manage dependencies.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Step Functions / Managed Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Composer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory Pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10. Monitoring &amp;amp; Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline performance and failures are tracked using &lt;strong&gt;monitoring tools&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Monitor&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Visualization / BI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business teams analyze data using &lt;strong&gt;dashboards and reports&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon QuickSight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Looker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Power BI&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Data Pipeline Flow
&lt;/h1&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering pipeline&lt;/strong&gt; works like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: Applications, transaction systems, and log systems generate raw data.&lt;br&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: Streaming platforms like Apache Kafka or Amazon Kinesis capture real-time events.&lt;br&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: Processing engines such as Apache Spark perform data cleaning, transformation, and aggregation.&lt;br&gt;
&lt;strong&gt;Data Lake Storage&lt;/strong&gt;: Data is stored in scalable Data Lakes such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage.&lt;br&gt;
&lt;strong&gt;SQL Query Layer&lt;/strong&gt;: Tools like Amazon Athena, BigQuery, or Azure Synapse allow engineers to run SQL queries on big data.&lt;br&gt;
&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;: Structured analytics data is stored in Amazon Redshift, BigQuery, or Synapse Analytics.&lt;br&gt;
&lt;strong&gt;BI Dashboards&lt;/strong&gt;: Visualization tools such as Power BI, Looker, or Amazon QuickSight create interactive dashboards and reports.&lt;/p&gt;
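&lt;p&gt;The "Data Processing" stage above can be sketched in plain Python; in a real pipeline the same clean-and-aggregate logic would run on Spark, Dataflow, or Databricks rather than on one machine:&lt;/p&gt;

```python
# Minimal sketch of the "Data Processing" stage (illustrative only;
# a production pipeline runs this logic in a distributed engine).
from collections import defaultdict

def transform(raw_events):
    """Clean raw events: drop malformed rows, normalize fields."""
    cleaned = []
    for e in raw_events:
        if "product" not in e or e.get("amount") is None:
            continue  # discard malformed records
        cleaned.append({"product": e["product"].strip().lower(),
                        "amount": float(e["amount"])})
    return cleaned

def aggregate(cleaned):
    """Aggregate revenue per product, ready for the warehouse layer."""
    totals = defaultdict(float)
    for e in cleaned:
        totals[e["product"]] += e["amount"]
    return dict(totals)

events = [{"product": " Laptop ", "amount": "999.0"},
          {"product": "mouse", "amount": 25.0},
          {"amount": 10.0},                       # malformed: no product
          {"product": "laptop", "amount": 450.0}]
print(aggregate(transform(events)))  # {'laptop': 1449.0, 'mouse': 25.0}
```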

</description>
    </item>
    <item>
      <title>Understanding Hadoop Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 15:52:27 +0000</pubDate>
      <link>https://forem.com/salma_aga/understanding-hadoop-architecture-16al</link>
      <guid>https://forem.com/salma_aga/understanding-hadoop-architecture-16al</guid>
      <description>&lt;p&gt;Imagine a company that collects a &lt;strong&gt;large amount of data&lt;/strong&gt; every day, such as &lt;strong&gt;website logs, transactions, or user activity&lt;/strong&gt;. After some time, the data becomes &lt;strong&gt;too large for a single computer&lt;/strong&gt; to store and process. This is where &lt;strong&gt;Hadoop&lt;/strong&gt; helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hadoop&lt;/strong&gt; is a &lt;strong&gt;big data framework&lt;/strong&gt; designed to &lt;strong&gt;store and process very large datasets across many machines&lt;/strong&gt;. Instead of using one powerful computer, Hadoop uses &lt;strong&gt;multiple machines working together&lt;/strong&gt;, which are called &lt;strong&gt;nodes&lt;/strong&gt;. These machines together form a &lt;strong&gt;Hadoop cluster&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Storage Layer — &lt;strong&gt;HDFS&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The storage system used by Hadoop is called &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a &lt;strong&gt;large file&lt;/strong&gt; is stored in Hadoop, it is automatically &lt;strong&gt;split into smaller pieces called blocks&lt;/strong&gt;. These &lt;strong&gt;blocks&lt;/strong&gt; are then &lt;strong&gt;distributed across multiple machines&lt;/strong&gt; in the cluster.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Block 1 → Machine 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 2 → Machine 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 3 → Machine 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is called &lt;strong&gt;distributed storage&lt;/strong&gt;, and it allows Hadoop to store &lt;strong&gt;very large datasets efficiently&lt;/strong&gt;.&lt;/p&gt;
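&lt;p&gt;A toy simulation of this block placement is shown below. It is illustrative only: real HDFS defaults to 128 MB blocks and also replicates each block, typically three times, for fault tolerance:&lt;/p&gt;

```python
# Toy simulation of HDFS block placement (no replication modeled here).
def place_blocks(file_size_mb, block_size_mb=128,
                 datanodes=("dn1", "dn2", "dn3")):
    """Split a file into blocks and assign them round-robin to DataNodes."""
    placement = []
    remaining, block_id = file_size_mb, 0
    while remaining > 0:
        size = min(block_size_mb, remaining)  # last block may be smaller
        node = datanodes[block_id % len(datanodes)]
        placement.append((block_id, size, node))
        remaining -= size
        block_id += 1
    return placement

for block_id, size_mb, node in place_blocks(300):
    print(f"Block {block_id} ({size_mb} MB) -> {node}")
# Block 0 (128 MB) -> dn1
# Block 1 (128 MB) -> dn2
# Block 2 (44 MB) -> dn3
```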




&lt;h1&gt;
  
  
  Hadoop Nodes
&lt;/h1&gt;

&lt;p&gt;In a Hadoop cluster, there are two important types of nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;NameNode (Master Node)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;NameNode&lt;/strong&gt; acts like the &lt;strong&gt;manager of the system&lt;/strong&gt;. It stores &lt;strong&gt;metadata&lt;/strong&gt;, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;file names&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;block locations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;which machine stores each block&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;NameNode does not store actual data&lt;/strong&gt;. It only &lt;strong&gt;manages the file system and keeps track of the data&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;DataNodes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DataNodes&lt;/strong&gt; are the machines that &lt;strong&gt;store the actual data blocks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DataNode 1 → Block 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 2 → Block 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 3 → Block 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is known as the &lt;strong&gt;storage layer&lt;/strong&gt; of Hadoop.&lt;/p&gt;




&lt;h1&gt;
  
  
  Processing Layer — &lt;strong&gt;MapReduce&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;After the data is stored, Hadoop needs to &lt;strong&gt;process the data&lt;/strong&gt;. This is done using &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; is a &lt;strong&gt;distributed data processing framework&lt;/strong&gt; that works in two main steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Map Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Map phase&lt;/strong&gt;, a &lt;strong&gt;large task is divided into smaller tasks&lt;/strong&gt;.&lt;br&gt;
Each machine processes a &lt;strong&gt;small part of the data&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reduce Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Reduce phase&lt;/strong&gt;, the &lt;strong&gt;results from all machines are combined&lt;/strong&gt; to produce the &lt;strong&gt;final output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This process allows Hadoop to &lt;strong&gt;process huge datasets&lt;/strong&gt; by using &lt;strong&gt;parallel processing&lt;/strong&gt; across many machines.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;Imagine a company wants to analyze &lt;strong&gt;millions of website log records&lt;/strong&gt; to see &lt;strong&gt;how many users visited from each country&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;log data is stored in HDFS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hadoop &lt;strong&gt;splits the logs into blocks&lt;/strong&gt; and stores them across &lt;strong&gt;multiple DataNodes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MapReduce processes the data in parallel&lt;/strong&gt; on different machines.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Reduce phase combines the results&lt;/strong&gt; and generates the &lt;strong&gt;final report&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
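&lt;p&gt;The four steps above can be sketched in pure Python; real Hadoop runs the map and reduce tasks on different machines, with a shuffle step in between:&lt;/p&gt;

```python
# Pure-Python sketch of the MapReduce flow for the country-count example
# (not real Hadoop: all three phases run locally here).
from collections import defaultdict

def map_phase(log_records):
    """Map: emit a (country, 1) pair for each log record."""
    return [(record["country"], 1) for record in log_records]

def shuffle(pairs):
    """Shuffle: group all values by key before reducing."""
    groups = defaultdict(list)
    for country, count in pairs:
        groups[country].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: combine per-key values into the final counts."""
    return {country: sum(counts) for country, counts in groups.items()}

logs = [{"country": "IN"}, {"country": "US"}, {"country": "IN"},
        {"country": "UK"}, {"country": "IN"}, {"country": "US"}]
print(reduce_phase(shuffle(map_phase(logs))))  # {'IN': 3, 'US': 2, 'UK': 1}
```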




&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn82eqotbsoliwv04ih1m.png" alt="Hadoop architecture diagram" width="800" height="533"&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop architecture&lt;/strong&gt; works with two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HDFS → for distributed storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MapReduce → for distributed processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By &lt;strong&gt;splitting data across multiple machines and processing it in parallel&lt;/strong&gt;, Hadoop allows organizations to &lt;strong&gt;store and analyze massive datasets efficiently&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>AWS S3 Storage Classes (Start to End)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 19:34:16 +0000</pubDate>
      <link>https://forem.com/salma_aga/aws-s3-storage-classes-start-to-end-258c</link>
      <guid>https://forem.com/salma_aga/aws-s3-storage-classes-start-to-end-258c</guid>
      <description>&lt;h2&gt;
  
  
  1) What is &lt;strong&gt;Amazon S3&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is an AWS service used to store files like &lt;strong&gt;images, videos, logs, backups, datasets, and reports&lt;/strong&gt; as &lt;strong&gt;objects&lt;/strong&gt; inside &lt;strong&gt;buckets&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; = main container (like a top-level folder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; = the actual file (data + metadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 is widely used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data lakes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups and disaster recovery&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application logs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Static website files&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics and machine learning datasets&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term archiving and compliance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2) Why does S3 have &lt;strong&gt;multiple storage classes&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;Not all data is used in the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some data is used &lt;strong&gt;daily&lt;/strong&gt; (hot data)&lt;/li&gt;
&lt;li&gt;Some data is used &lt;strong&gt;sometimes&lt;/strong&gt; (cold data)&lt;/li&gt;
&lt;li&gt;Some data is &lt;strong&gt;almost never&lt;/strong&gt; used (archive data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So AWS provides different &lt;strong&gt;S3 storage classes&lt;/strong&gt; to help you balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; – how much you pay for storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; – how fast you can read data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt; – how often data is accessible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt; – multi-AZ vs single-AZ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee&lt;/strong&gt; – extra cost when you download data in some classes&lt;/li&gt;
&lt;/ul&gt;
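&lt;p&gt;A hypothetical rule of thumb for this trade-off can be written as a small function. The thresholds below are illustrative, not official AWS guidance, though the class names are the real S3 API values:&lt;/p&gt;

```python
# Hypothetical storage-class picker; thresholds are made up for
# illustration, but the returned names match S3's StorageClass values.
def suggest_storage_class(accesses_per_month, pattern_known=True):
    if not pattern_known:
        return "INTELLIGENT_TIERING"   # let AWS move objects between tiers
    if accesses_per_month >= 10:
        return "STANDARD"              # hot data: fast, no retrieval fee
    if accesses_per_month >= 1:
        return "STANDARD_IA"           # cold, but needed quickly sometimes
    return "DEEP_ARCHIVE"              # almost never read: cheapest storage

print(suggest_storage_class(30))                      # STANDARD
print(suggest_storage_class(2))                       # STANDARD_IA
print(suggest_storage_class(0, pattern_known=False))  # INTELLIGENT_TIERING
```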




&lt;h2&gt;
  
  
  3) Key Terms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Simple Meaning&lt;/th&gt;
&lt;th&gt;Easy Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How safe your data is from being lost&lt;/td&gt;
&lt;td&gt;Even if disks fail, AWS still keeps your file safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11 nines durability (99.999999999%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely high safety&lt;/td&gt;
&lt;td&gt;“Almost never lost”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How often data is accessible&lt;/td&gt;
&lt;td&gt;99.99% means very little downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast you can access data&lt;/td&gt;
&lt;td&gt;Milliseconds = very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much data can be read/written per second&lt;/td&gt;
&lt;td&gt;Important for big analytics jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval fee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extra cost when you download data&lt;/td&gt;
&lt;td&gt;Some classes charge when you read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability Zone (AZ)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One data center inside a region&lt;/td&gt;
&lt;td&gt;Multi-AZ is safer than single AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4) The 8 S3 Storage Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1  &lt;strong&gt;S3 Standard&lt;/strong&gt; – Hot Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Used for &lt;strong&gt;frequently accessed&lt;/strong&gt; and &lt;strong&gt;business-critical&lt;/strong&gt; data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very fast access (milliseconds):&lt;/strong&gt; Suitable for real-time applications and user-facing systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability:&lt;/strong&gt; Designed to be available almost all the time for applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Data is safely stored across multiple data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No retrieval fee:&lt;/strong&gt; You don’t pay extra when reading or downloading data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website images and videos served to users&lt;/li&gt;
&lt;li&gt;Daily application logs used by engineers&lt;/li&gt;
&lt;li&gt;Active analytics datasets queried many times per day&lt;/li&gt;
&lt;li&gt;Frequently used ML training and inference data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Today’s sales data used every hour → &lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Standard = &lt;strong&gt;Hot + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.2  &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt; – AWS Decides Automatically
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data where you &lt;strong&gt;don’t know&lt;/strong&gt; how often it will be accessed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic movement between tiers:&lt;/strong&gt; AWS moves objects to cheaper tiers when access reduces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No performance impact:&lt;/strong&gt; Applications access data the same way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small monitoring fee:&lt;/strong&gt; Charged for AWS to track access patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes where new data is hot and old data becomes cold&lt;/li&gt;
&lt;li&gt;ML datasets where some features are used more than others&lt;/li&gt;
&lt;li&gt;Analytics history that changes in access frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Some months of logs are queried often, others not → &lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Intelligent = &lt;strong&gt;“I don’t know access pattern”&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.3  &lt;strong&gt;S3 Standard-IA&lt;/strong&gt; – Cold but Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data accessed &lt;strong&gt;rarely&lt;/strong&gt;, but must be accessed &lt;strong&gt;immediately&lt;/strong&gt; when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower storage cost than Standard:&lt;/strong&gt; Helps save money for infrequently used data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still milliseconds when you retrieve data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee applies:&lt;/strong&gt; Extra cost when you download data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Safe across multiple data centers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups used only during failures&lt;/li&gt;
&lt;li&gt;Disaster recovery data&lt;/li&gt;
&lt;li&gt;Old reports accessed occasionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Weekly backups restored only during failure → &lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; IA = &lt;strong&gt;Rare, but fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.4  &lt;strong&gt;S3 One Zone-IA&lt;/strong&gt; – Cheaper but Risky
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Same as Standard-IA, but stored in &lt;strong&gt;one Availability Zone only&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheaper than Standard-IA:&lt;/strong&gt; Cost saving for non-critical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single AZ risk:&lt;/strong&gt; If that AZ goes down, data can become unavailable, and if the AZ is destroyed, it can be lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still millisecond latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval fee applies.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-creatable ETL outputs&lt;/li&gt;
&lt;li&gt;Temporary pipeline files&lt;/li&gt;
&lt;li&gt;Secondary backups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Temporary pipeline files → &lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; One Zone = &lt;strong&gt;Cheap + Risk&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.5 &lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt; – Archive + Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Instant Retrieval is a storage class for archived data that is rarely accessed, but when you need it, you can open it immediately. It is mainly used for long-term storage where data is kept for compliance or record-keeping, but still needs instant access sometimes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low storage cost&lt;/li&gt;
&lt;li&gt;Instant (milliseconds) access&lt;/li&gt;
&lt;li&gt;Retrieval fee applies&lt;/li&gt;
&lt;li&gt;Multi-AZ durability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance documents that must open quickly&lt;/li&gt;
&lt;li&gt;Audit logs needed during investigations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Legal docs opened only during audits → &lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Glacier Instant = &lt;strong&gt;Archive + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.6 &lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt; – Archive + Wait
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Flexible Retrieval is used for archived data that is almost never accessed, and when it is accessed, you are okay to wait some time before getting the data back. This class is mainly for long-term backups and historical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low cost for long-term storage&lt;/li&gt;
&lt;li&gt;Multiple retrieval speeds: expedited, standard, bulk&lt;/li&gt;
&lt;li&gt;Suitable for large archive restores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old backups&lt;/li&gt;
&lt;li&gt;Historical logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Flexible = &lt;strong&gt;Waiting is okay&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.7 &lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt; – Cheapest + Slowest
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Deep Archive is the lowest-cost storage class in Amazon S3. It is used for data that must be kept for many years and is almost never accessed. This is mainly for legal, regulatory, and compliance requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheapest storage class&lt;/li&gt;
&lt;li&gt;Retrieval time 12–48 hours&lt;/li&gt;
&lt;li&gt;Best for compliance and legal retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial records&lt;/li&gt;
&lt;li&gt;Government data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Deep Archive = &lt;strong&gt;Coldest + Slowest + Cheapest&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.8 &lt;strong&gt;S3 Express One Zone&lt;/strong&gt; – Extra Fast, Single AZ
&lt;/h3&gt;

&lt;p&gt;S3 Express One Zone is a storage class designed for very high-performance workloads. It is used when applications need very low latency and very high request rates for reading and writing data. Data is stored in only one Availability Zone, so it is faster but less resilient compared to multi-AZ classes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ultra-fast performance for request-heavy workloads&lt;/li&gt;
&lt;li&gt;High throughput for many small reads/writes&lt;/li&gt;
&lt;li&gt;Stored in one AZ only (less resilient)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics&lt;/li&gt;
&lt;li&gt;ML feature stores&lt;/li&gt;
&lt;li&gt;Hot ETL intermediate data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Pipeline reading millions of small files → &lt;strong&gt;Express One Zone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Express = &lt;strong&gt;Extra fast&lt;/strong&gt;, One Zone = &lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5) Comparison Table for All 8 S3 Storage Classes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Class&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Retrieval Speed&lt;/th&gt;
&lt;th&gt;Storage Cost&lt;/th&gt;
&lt;th&gt;Extra Cost&lt;/th&gt;
&lt;th&gt;Availability / Risk&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequently accessed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Multi-AZ, very safe&lt;/td&gt;
&lt;td&gt;Hot data, websites, active logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown / changing&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Monitoring fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Unpredictable workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent, fast access needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Backups, DR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent, non-critical&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Cheaper&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Single AZ risk&lt;/td&gt;
&lt;td&gt;Re-creatable data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rare, instant access needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Compliance archives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very rare access&lt;/td&gt;
&lt;td&gt;Minutes → Hours&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Old backups, logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Almost never accessed&lt;/td&gt;
&lt;td&gt;12–48 hours&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Legal &amp;amp; long-term records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very frequent, high-performance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ultra-fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Request-based pricing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance analytics, ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
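&lt;p&gt;In code, each class in the table above corresponds to a &lt;strong&gt;StorageClass&lt;/strong&gt; value in the S3 API. The mapping below is a reference sketch; the commented boto3 upload uses a hypothetical bucket and key, and would need AWS credentials to actually run.&lt;/p&gt;

```python
# Map each storage class (as named in the table above) to the
# StorageClass string used by the S3 API and accepted by boto3.
STORAGE_CLASS_API_VALUES = {
    "S3 Standard": "STANDARD",
    "S3 Intelligent-Tiering": "INTELLIGENT_TIERING",
    "S3 Standard-IA": "STANDARD_IA",
    "S3 One Zone-IA": "ONEZONE_IA",
    "S3 Glacier Instant Retrieval": "GLACIER_IR",
    "S3 Glacier Flexible Retrieval": "GLACIER",
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
    "S3 Express One Zone": "EXPRESS_ONEZONE",
}

# Sketch only: uploading a weekly backup as Standard-IA.
# "my-backup-bucket" and the key are hypothetical names.
# import boto3
# s3 = boto3.client("s3")
# with open("week-07.tar.gz", "rb") as body:
#     s3.put_object(
#         Bucket="my-backup-bucket",
#         Key="backups/week-07.tar.gz",
#         Body=body,
#         StorageClass=STORAGE_CLASS_API_VALUES["S3 Standard-IA"],
#     )
```

&lt;p&gt;If no &lt;strong&gt;StorageClass&lt;/strong&gt; is given, uploads default to S3 Standard.&lt;/p&gt;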




&lt;h2&gt;
  
  
  6) How to Choose Quickly
&lt;/h2&gt;

&lt;p&gt;Ask yourself these &lt;strong&gt;3 simple questions&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  i) How often will the data be accessed?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily or many times a day&lt;/strong&gt; → &lt;strong&gt;S3 Standard&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not sure / changes over time&lt;/strong&gt; → &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rarely&lt;/strong&gt; → Use IA or Glacier classes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ii) When needed, how fast must I get the data?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant (milliseconds)&lt;/strong&gt; → Standard, Standard-IA, Glacier Instant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait minutes or hours&lt;/strong&gt; → Glacier Flexible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait 1–2 days&lt;/strong&gt; → Glacier Deep Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  iii) Is the data critical or can it be recreated?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical data&lt;/strong&gt; → Choose &lt;strong&gt;multi-AZ classes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-critical or re-creatable data&lt;/strong&gt; → Choose &lt;strong&gt;single-AZ classes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
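&lt;p&gt;The three questions above can be turned into a tiny decision helper. This is an illustrative sketch, not an AWS API; the function name and its inputs are assumptions made for this example.&lt;/p&gt;

```python
def choose_storage_class(access, retrieval_wait="instant", critical=True):
    """Pick an S3 storage class from the three questions above.

    access: "frequent", "unknown", or "rare"
    retrieval_wait: "instant", "hours", or "days" (matters only for rare data)
    critical: False means the data is re-creatable and may live in one AZ
    """
    # Question 1: how often is the data accessed?
    if access == "frequent":
        return "S3 Standard"
    if access == "unknown":
        return "S3 Intelligent-Tiering"
    # Question 2: for rare data, how long can retrieval wait?
    if retrieval_wait == "days":
        return "S3 Glacier Deep Archive"
    if retrieval_wait == "hours":
        return "S3 Glacier Flexible Retrieval"
    # Question 3: critical data stays multi-AZ; re-creatable data can be single-AZ.
    return "S3 Standard-IA" if critical else "S3 One Zone-IA"

print(choose_storage_class("rare", "days"))  # S3 Glacier Deep Archive
```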

&lt;h3&gt;
  
  
  Quick Mapping Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;App serving images every second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs with changing access patterns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weekly backups&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporary ETL output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance docs needing instant access&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large archive restores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Flexible&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-year legal retention&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-performance ML feature reads&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  7) How to Remember
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot&lt;/strong&gt; → Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unknown&lt;/strong&gt; → Intelligent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold&lt;/strong&gt; → IA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very Cold&lt;/strong&gt; → Glacier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coldest&lt;/strong&gt; → Deep Archive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-fast hot data&lt;/strong&gt; → Express One Zone&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8) What is Amazon S3 and What is a Bucket?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is a &lt;strong&gt;cloud storage service&lt;/strong&gt; provided by &lt;strong&gt;AWS&lt;/strong&gt;. It is used to store &lt;strong&gt;files and data&lt;/strong&gt; such as &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;backups&lt;/strong&gt;, &lt;strong&gt;datasets&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;Amazon S3 bucket&lt;/strong&gt; is the &lt;strong&gt;main container&lt;/strong&gt; where all your &lt;strong&gt;files (objects)&lt;/strong&gt; are stored. You cannot upload a file directly to S3 without a bucket. Every file must be inside a bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; is like a &lt;strong&gt;main folder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; is like a &lt;strong&gt;file inside the folder&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You create a bucket named &lt;strong&gt;company-data-bucket&lt;/strong&gt;.&lt;br&gt;
Inside this bucket, you store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;logs/app-logs-2026.json&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reports/sales-jan.csv&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;images/profile.png&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, &lt;strong&gt;company-data-bucket&lt;/strong&gt; is the &lt;strong&gt;bucket&lt;/strong&gt;, and each file is an &lt;strong&gt;object&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9) Basic Structure of Amazon S3
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning in Simple Words&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;top-level container&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;company-analytics-bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Object&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;actual file&lt;/strong&gt; stored&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;full path&lt;/strong&gt; of the file inside the bucket&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Region&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;AWS location&lt;/strong&gt; where the bucket lives&lt;/td&gt;
&lt;td&gt;us-east-1, ap-south-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Important points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;bucket belongs to one AWS region&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your &lt;strong&gt;data is physically stored&lt;/strong&gt; in that region&lt;/li&gt;
&lt;li&gt;You can access the bucket from anywhere if &lt;strong&gt;permissions&lt;/strong&gt; allow it&lt;/li&gt;
&lt;/ul&gt;
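&lt;p&gt;Bucket, region, and key combine into the object's web address. A minimal sketch of the virtual-hosted-style URL (the function name is an assumption for this example):&lt;/p&gt;

```python
def object_url(bucket, region, key):
    """Build the virtual-hosted-style URL for an S3 object."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

print(object_url("company-analytics-bucket", "us-east-1", "2026/jan/sales.csv"))
# https://company-analytics-bucket.s3.us-east-1.amazonaws.com/2026/jan/sales.csv
```

&lt;p&gt;Whether the URL actually works depends on the bucket's &lt;strong&gt;permissions&lt;/strong&gt;, as noted above.&lt;/p&gt;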




&lt;h2&gt;
  
  
  10) Why Do We Need Amazon S3 Buckets?
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are used to store and manage &lt;strong&gt;almost all types of data&lt;/strong&gt; in the cloud.&lt;/p&gt;

&lt;p&gt;Common real-world use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data lakes&lt;/strong&gt; Store &lt;strong&gt;raw data&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;CSV&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, and &lt;strong&gt;Parquet&lt;/strong&gt; files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backups&lt;/strong&gt; Store &lt;strong&gt;database backups&lt;/strong&gt;, &lt;strong&gt;server backups&lt;/strong&gt;, and &lt;strong&gt;application backups&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application files&lt;/strong&gt; Store &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt; used by web and mobile apps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytics and Big Data&lt;/strong&gt; Store data for &lt;strong&gt;Athena&lt;/strong&gt;, &lt;strong&gt;Glue&lt;/strong&gt;, &lt;strong&gt;EMR&lt;/strong&gt;, and &lt;strong&gt;Redshift Spectrum&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static website hosting&lt;/strong&gt; Store &lt;strong&gt;HTML&lt;/strong&gt;, &lt;strong&gt;CSS&lt;/strong&gt;, and &lt;strong&gt;JavaScript&lt;/strong&gt; files for static websites&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, &lt;strong&gt;Amazon S3 buckets&lt;/strong&gt; are the &lt;strong&gt;foundation of data storage&lt;/strong&gt; in AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  11) Amazon S3 Bucket Naming Rules
&lt;/h2&gt;

&lt;p&gt;S3 bucket names follow &lt;strong&gt;strict global rules&lt;/strong&gt;. These rules exist because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt; and must work with the &lt;strong&gt;internet DNS system&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 1: Globally Unique Name
&lt;/h3&gt;

&lt;p&gt;Every &lt;strong&gt;bucket name must be globally unique&lt;/strong&gt; across all AWS accounts and regions. If someone else has already created a bucket with that name, you cannot use it.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mybucket may already be taken&lt;/li&gt;
&lt;li&gt;mycompany-analytics-2026 is more likely to be available&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 2: Length Rules
&lt;/h3&gt;

&lt;p&gt;Bucket name length must be between &lt;strong&gt;3 and 63 characters&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 3: Allowed Characters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can use only: lowercase letters (a–z), numbers (0–9), hyphens, and dots&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You cannot use: uppercase letters, underscores, spaces, or special characters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Valid examples: my-data-bucket, company.logs.backup, analytics2026&lt;/p&gt;

&lt;p&gt;Invalid examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MyBucket&lt;/li&gt;
&lt;li&gt;my_bucket&lt;/li&gt;
&lt;li&gt;my bucket&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 4: Start and End with Letter or Number
&lt;/h3&gt;

&lt;p&gt;A bucket name must &lt;strong&gt;start and end with a letter or number&lt;/strong&gt;. It cannot start or end with a hyphen or dot.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 5: No IP Address Format
&lt;/h3&gt;

&lt;p&gt;Bucket names cannot look like an &lt;strong&gt;IP address&lt;/strong&gt; such as 192.168.1.1. This is because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt;.&lt;/p&gt;
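&lt;p&gt;Rules 2–5 can be checked locally before calling AWS (Rule 1, global uniqueness, can only be verified by AWS itself). A minimal sketch:&lt;/p&gt;

```python
import re

def is_valid_bucket_name(name):
    """Check Rules 2-5 for an S3 bucket name (Rule 1 needs AWS)."""
    # Rule 2: length must be between 3 and 63 characters.
    if len(name) not in range(3, 64):
        return False
    # Rule 3: only lowercase letters, digits, hyphens, and dots.
    if not re.fullmatch(r"[a-z0-9.-]+", name):
        return False
    # Rule 4: must start and end with a letter or number.
    if not (name[0].isalnum() and name[-1].isalnum()):
        return False
    # Rule 5: must not look like an IP address such as 192.168.1.1.
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", name):
        return False
    return True

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("MyBucket"))        # False
```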




&lt;h2&gt;
  
  
  12) Why These Rules Exist
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are accessed using &lt;strong&gt;web URLs&lt;/strong&gt; like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://my-data-bucket.s3.amazonaws.com/file.csv" rel="noopener noreferrer"&gt;https://my-data-bucket.s3.amazonaws.com/file.csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make sure these URLs work correctly with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internet routing&lt;/li&gt;
&lt;li&gt;the DNS system&lt;/li&gt;
&lt;li&gt;SSL certificates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS enforces strict &lt;strong&gt;bucket naming rules&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  13) Important Features of Amazon S3 Buckets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Region
&lt;/h3&gt;

&lt;p&gt;When you create a &lt;strong&gt;bucket&lt;/strong&gt;, you select a &lt;strong&gt;region&lt;/strong&gt;. Your &lt;strong&gt;data stays in that region&lt;/strong&gt;. This helps with &lt;strong&gt;low latency&lt;/strong&gt;, &lt;strong&gt;cost control&lt;/strong&gt;, and &lt;strong&gt;legal compliance&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;p&gt;By default, &lt;strong&gt;buckets are private&lt;/strong&gt;. You control access using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM users and roles&lt;/li&gt;
&lt;li&gt;Bucket policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Public access is usually used only for &lt;strong&gt;public website content&lt;/strong&gt;.&lt;/p&gt;
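&lt;p&gt;As a sketch, a minimal bucket policy that allows public read of website content might look like this. The bucket name &lt;strong&gt;my-public-site&lt;/strong&gt; is hypothetical, and this pattern should only be used for genuinely public content:&lt;/p&gt;

```python
import json

# A standard public-read bucket policy: anyone may GET objects,
# but nothing else (no listing, writing, or deleting).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-public-site/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```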




&lt;h3&gt;
  
  
  Versioning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Versioning&lt;/strong&gt; keeps &lt;strong&gt;multiple versions&lt;/strong&gt; of the same file. If someone overwrites or deletes a file, older versions are still stored. This helps with &lt;strong&gt;data recovery&lt;/strong&gt; and &lt;strong&gt;mistake protection&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Encryption
&lt;/h3&gt;

&lt;p&gt;Amazon S3 supports &lt;strong&gt;encryption&lt;/strong&gt; to protect your data. Data can be encrypted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at rest, in transit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption is important for &lt;strong&gt;security&lt;/strong&gt; and &lt;strong&gt;compliance requirements&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Lifecycle Rules
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lifecycle rules&lt;/strong&gt; help you &lt;strong&gt;automate storage management&lt;/strong&gt;. You can move old data to &lt;strong&gt;cheaper storage classes&lt;/strong&gt; or &lt;strong&gt;delete data&lt;/strong&gt; after a fixed time. This helps reduce &lt;strong&gt;storage cost&lt;/strong&gt; automatically.&lt;/p&gt;
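&lt;p&gt;A lifecycle configuration is expressed as a set of rules. The sketch below shows the structure that boto3's put_bucket_lifecycle_configuration expects; the &lt;strong&gt;logs/&lt;/strong&gt; prefix and the day counts are illustrative assumptions:&lt;/p&gt;

```python
# Move logs to Standard-IA after 30 days, to Glacier Flexible
# Retrieval after 90 days, and delete them after one year.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="company-raw-logs", LifecycleConfiguration=lifecycle)
```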




&lt;h2&gt;
  
  
  14) Real-Life Example from Data Engineering
&lt;/h2&gt;

&lt;p&gt;In a real data engineering project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New logs come &lt;strong&gt;every day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Old logs are accessed &lt;strong&gt;rarely&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compliance rules require keeping data for &lt;strong&gt;many years&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may create different buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;company-raw-logs&lt;/strong&gt; for daily logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-processed-data&lt;/strong&gt; for transformed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-archive-data&lt;/strong&gt; for long-term storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lifecycle rules can move old files automatically to &lt;strong&gt;cheaper storage classes&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  15) How to Remember Amazon S3 Bucket Rules
&lt;/h2&gt;

&lt;p&gt;Use the word &lt;strong&gt;BUCKET&lt;/strong&gt; as a memory trick:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; means Bucket is the main container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;U&lt;/strong&gt; means Unique globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt; means Characters allowed are lowercase letters, numbers, hyphens, and dots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K&lt;/strong&gt; means Keep name length between 3 and 63&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt; means End with a letter or number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt; means Tied to one AWS region&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Engineering Basics: From What is Data to Modern Lakehouse Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 05:19:08 +0000</pubDate>
      <link>https://forem.com/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</link>
      <guid>https://forem.com/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</guid>
      <description>&lt;p&gt;This post explains &lt;strong&gt;data fundamentals&lt;/strong&gt;, &lt;strong&gt;databases&lt;/strong&gt;, &lt;strong&gt;data warehousing&lt;/strong&gt;, &lt;strong&gt;data lakes&lt;/strong&gt;, and &lt;strong&gt;modern lakehouse architecture&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is &lt;strong&gt;Data&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt; is &lt;strong&gt;raw facts or raw information&lt;/strong&gt; collected from &lt;strong&gt;applications, users, and machines&lt;/strong&gt;. On its own, data has little meaning. When we &lt;strong&gt;process, clean, and analyze data&lt;/strong&gt;, it becomes &lt;strong&gt;useful information&lt;/strong&gt; for &lt;strong&gt;business decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Customer name&lt;/strong&gt;, &lt;strong&gt;email&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order amount&lt;/strong&gt;, &lt;strong&gt;order time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Website clicks&lt;/strong&gt;, &lt;strong&gt;error logs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sensor readings&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of &lt;strong&gt;Data&lt;/strong&gt; (Structured, Semi-Structured, Unstructured)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Semi-Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Unstructured Data&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it means&lt;/td&gt;
&lt;td&gt;Data stored in &lt;strong&gt;rows and columns&lt;/strong&gt; with a &lt;strong&gt;fixed schema&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;some structure&lt;/strong&gt; (keys/tags), but no fixed table schema.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;no predefined structure&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is stored&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Relational Databases&lt;/strong&gt;, &lt;strong&gt;Data Warehouses&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Data Lakes&lt;/strong&gt;, modern warehouses&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Object storage&lt;/strong&gt;, file systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How easy to query&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Very easy&lt;/strong&gt; with &lt;strong&gt;SQL&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;parsing/flattening&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;preprocessing/AI-ML&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Customer table, Orders table&lt;/td&gt;
&lt;td&gt;JSON from APIs, Web logs, Avro/Parquet&lt;/td&gt;
&lt;td&gt;Images, videos, PDFs, emails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Databases and Data Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt; are systems used to &lt;strong&gt;store and manage structured data&lt;/strong&gt; for applications.&lt;br&gt;
&lt;strong&gt;Data storage&lt;/strong&gt; includes databases plus &lt;strong&gt;file systems&lt;/strong&gt; and &lt;strong&gt;cloud object storage&lt;/strong&gt; (for raw files).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databases:&lt;/strong&gt; PostgreSQL, MySQL, Oracle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage:&lt;/strong&gt; AWS S3, Azure ADLS, Google GCS&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;SQL &amp;amp; Relational Databases&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is &lt;strong&gt;SQL&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt; is used to &lt;strong&gt;read and write data&lt;/strong&gt; in &lt;strong&gt;relational databases&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is a &lt;strong&gt;Relational Database (RDBMS)&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;RDBMS&lt;/strong&gt; stores data in &lt;strong&gt;tables with relationships&lt;/strong&gt; (primary keys and foreign keys).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;, MySQL, SQL Server&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;DDL&lt;/strong&gt; vs &lt;strong&gt;DML&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDL (Data Definition Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines or changes &lt;strong&gt;table structure&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;CREATE, ALTER, DROP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DML (Data Manipulation Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads and modifies &lt;strong&gt;data inside tables&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;INSERT, UPDATE, DELETE, SELECT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example (DDL):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example (DML):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Salma'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;OLTP&lt;/strong&gt; vs &lt;strong&gt;OLAP&lt;/strong&gt; (Databases vs Analytics)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main purpose&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;daily transactions&lt;/strong&gt; for apps&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;analytics and reporting&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Many small, fast writes/reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Large scans and aggregations&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Current operational data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Historical, aggregated data&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;Snowflake, BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Placing an order&lt;/td&gt;
&lt;td&gt;Yearly sales analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
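&lt;p&gt;The difference in query patterns can be sketched with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and a toy &lt;code&gt;orders&lt;/code&gt; table (the table and data here are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# Toy "orders" table to contrast the two query patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, customer TEXT, amount REAL, year INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Salma", 120.0, 2025), (2, "Ali", 80.0, 2025), (3, "Salma", 50.0, 2026)],
)

# OLTP-style: a small, fast point lookup (one order).
oltp = conn.execute("SELECT * FROM orders WHERE id = 2").fetchone()
print(oltp)  # (2, 'Ali', 80.0, 2025)

# OLAP-style: a full scan with aggregation (yearly sales analysis).
olap = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
print(olap)  # [(2025, 200.0), (2026, 50.0)]
```

&lt;p&gt;An OLTP system runs millions of queries like the first one per day; an OLAP system runs a smaller number of queries like the second one over far more rows.&lt;/p&gt;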




&lt;h2&gt;
  
  
  &lt;strong&gt;ACID Transactions&lt;/strong&gt; (Why Databases are Reliable)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A transaction is &lt;strong&gt;all-or-nothing&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data stays &lt;strong&gt;valid and correct&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parallel users &lt;strong&gt;do not interfere&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Once saved, data &lt;strong&gt;will not be lost&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If a payment fails halfway, &lt;strong&gt;Atomicity&lt;/strong&gt; ensures the whole transaction is rolled back.&lt;/p&gt;
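&lt;p&gt;A minimal sketch of atomicity, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; and a made-up two-account transfer: if the second step fails, the rollback also undoes the first step.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
conn.commit()

# Transfer 50 from A to B; simulate a failure halfway through.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("payment service failed")  # failure before the credit step
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # Atomicity: the debit is undone along with everything else

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 100.0, 'B': 0.0} -- nothing changed
```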




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehouse vs Data Lake&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data types&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Primarily structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Structured, semi-structured, unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-write&lt;/strong&gt; (define before load)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-read&lt;/strong&gt; (define at query time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher storage cost&lt;/td&gt;
&lt;td&gt;Lower storage cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BI reports, dashboards&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Raw storage, ML/AI, exploration&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Snowflake, Redshift, BigQuery&lt;/td&gt;
&lt;td&gt;AWS S3, Azure ADLS, Google GCS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Formats&lt;/strong&gt;: Avro vs Parquet vs ORC
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Storage Style&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Example use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Row-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Streaming, fast writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kafka pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics, fast reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BI queries in Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ORC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics with compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hive/Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Row-Based vs Column-Based Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Row-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Column-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How data is stored&lt;/td&gt;
&lt;td&gt;Entire &lt;strong&gt;rows together&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Same &lt;strong&gt;columns together&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLTP transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLAP analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift, Parquet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Fetch one customer record&lt;/td&gt;
&lt;td&gt;Aggregate one column across millions of rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
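&lt;p&gt;A toy sketch of the two layouts in plain Python (not a real storage engine) shows why each one favors a different access pattern:&lt;/p&gt;

```python
# The same three records laid out two ways.
row_store = [
    (1, "Salma", 120.0),
    (2, "Ali", 80.0),
    (3, "Mia", 50.0),
]

column_store = {
    "id": [1, 2, 3],
    "name": ["Salma", "Ali", "Mia"],
    "amount": [120.0, 80.0, 50.0],
}

# OLTP-style: fetch one whole record -- natural in the row layout,
# because all of a record's fields sit next to each other.
record = next(r for r in row_store if r[0] == 2)
print(record)  # (2, 'Ali', 80.0)

# OLAP-style: aggregate one column -- the column layout reads only
# "amount" and skips the ids and names entirely.
total = sum(column_store["amount"])
print(total)  # 250.0
```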




&lt;h2&gt;
  
  
  &lt;strong&gt;RDBMS (Row-Based) vs Columnar Databases&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;RDBMS (Row-Based)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Columnar Databases&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload&lt;/td&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writes&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reads (aggregations)&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehousing Concepts: Facts and Dimensions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Fact Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store &lt;strong&gt;measurable numbers&lt;/strong&gt; (metrics).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; sales_amount, quantity, revenue&lt;/p&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Dimension Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; store &lt;strong&gt;descriptive attributes&lt;/strong&gt; to analyze facts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; customer, product, date, location&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Types of Facts&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactional Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One row per transaction&lt;/td&gt;
&lt;td&gt;Each order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State at a point in time&lt;/td&gt;
&lt;td&gt;Daily inventory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accumulating Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracks process over time&lt;/td&gt;
&lt;td&gt;Order lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Characteristics of Fact vs Dimension Tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Fact Table&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Dimension Table&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it stores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Metrics (numbers)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Descriptions (attributes)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;Very large&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keys&lt;/td&gt;
&lt;td&gt;Foreign keys to dimensions&lt;/td&gt;
&lt;td&gt;Primary keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Sales fact&lt;/td&gt;
&lt;td&gt;Customer dimension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
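&lt;p&gt;A minimal star-schema sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (the table and column names are made up for illustration): the fact table holds the numbers and a foreign key, the dimension holds the descriptions, and a typical warehouse query joins them to slice metrics by an attribute.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension: descriptive attributes, with a primary key.
conn.execute("CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, city TEXT)")
# Fact: metrics, plus a foreign key pointing to the dimension.
conn.execute("CREATE TABLE fact_sales (sale_id INT, customer_id INT, sales_amount REAL)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Salma", "Hyderabad"), (2, "Ali", "Mumbai")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(10, 1, 120.0), (11, 1, 30.0), (12, 2, 80.0)])

# Typical warehouse query: aggregate the fact, sliced by a dimension attribute.
result = conn.execute("""
    SELECT d.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(result)  # [('Hyderabad', 150.0), ('Mumbai', 80.0)]
```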




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Lakehouse Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Lakehouse layer adds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; for reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; for faster queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; for governance and discovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimizations&lt;/strong&gt; for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples of Lakehouse Technologies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake (Databricks)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Informatica?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Informatica&lt;/strong&gt; is an &lt;strong&gt;enterprise ETL tool&lt;/strong&gt; used to &lt;strong&gt;extract, transform, and load data&lt;/strong&gt; from source systems into &lt;strong&gt;data warehouses or data lakes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Move sales data from PostgreSQL → clean it → load into Snowflake.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final End-to-End Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases&lt;/strong&gt; run daily business transactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP systems (data warehouses)&lt;/strong&gt; support analytics and reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lakes&lt;/strong&gt; store raw data of all types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse architecture&lt;/strong&gt; combines low-cost storage with fast analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facts and dimensions&lt;/strong&gt; organize data for reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avro/Parquet/ORC&lt;/strong&gt; and &lt;strong&gt;row vs column storage&lt;/strong&gt; decide performance.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Traditional vs Modern Data Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 00:47:51 +0000</pubDate>
      <link>https://forem.com/salma_aga/traditional-vs-modern-data-architecture-37cn</link>
      <guid>https://forem.com/salma_aga/traditional-vs-modern-data-architecture-37cn</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;In many companies, data comes from different systems like ERP, CRM, application databases, and web logs. This data is used for reports, dashboards, and business decisions. To use this data properly, we need a data architecture.&lt;/p&gt;

&lt;p&gt;There are two main types of data architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Traditional Data Architecture (ETL + Data Warehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modern Data Architecture (ELT + Data Lake + Lakehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This document explains both approaches. It also explains why we use tools like Data Lake, Data Warehouse, Spark, Databricks, Delta Lake, Iceberg, Snowflake, BigQuery, Redshift, ADLS, GCS, S3, and Datadog.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High-Level Data Flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Sources → Ingestion → Data Lake → Processing → Lakehouse Tables → Data Warehouse → BI &amp;amp; Reports → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data comes from source systems.&lt;/li&gt;
&lt;li&gt;Data is ingested (copied) into the platform.&lt;/li&gt;
&lt;li&gt;Raw data is stored in a data lake.&lt;/li&gt;
&lt;li&gt;Data is cleaned and transformed using processing tools.&lt;/li&gt;
&lt;li&gt;Clean and reliable tables are created.&lt;/li&gt;
&lt;li&gt;Final data is loaded into a data warehouse for reports.&lt;/li&gt;
&lt;li&gt;The full system is monitored using monitoring tools.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. Data Sources (Where data comes from)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ERP systems:&lt;/strong&gt; Finance, HR, inventory data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRM systems:&lt;/strong&gt; Customer and sales data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases:&lt;/strong&gt; Application transaction data (orders, payments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web logs:&lt;/strong&gt; Website or app activity (clicks, errors, requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;br&gt;
These systems run the business. They create the data that we later analyze.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When we use them:&lt;/strong&gt;&lt;br&gt;
All the time. These are live systems used daily by the business.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Traditional Data Architecture (ETL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 What is Traditional Architecture?
&lt;/h3&gt;

&lt;p&gt;In traditional architecture, data is transformed &lt;strong&gt;before&lt;/strong&gt; it is loaded into the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → ETL Tool → Data Warehouse → BI/Reports&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What is ETL?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL = Extract → Transform → Load&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from source systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean the data, fix formats, remove duplicates, join tables, and calculate metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Put the clean data into the data warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Why Traditional Architecture was used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses were expensive.&lt;/li&gt;
&lt;li&gt;Storage and compute were limited.&lt;/li&gt;
&lt;li&gt;Only clean data was allowed in the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.4 Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not easy to scale for big data.&lt;/li&gt;
&lt;li&gt;Raw data is lost after transformation.&lt;/li&gt;
&lt;li&gt;Not flexible for machine learning and advanced analytics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Modern Data Architecture (ELT + Data Lake + Lakehouse)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What is Modern Architecture?
&lt;/h3&gt;

&lt;p&gt;In modern architecture, raw data is first stored in a data lake. Transformations happen later using powerful compute engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → Ingestion → Data Lake → Transform (Spark/Databricks) → Lakehouse Tables → Data Warehouse → BI &amp;amp; ML → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What is ELT?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ELT = Extract → Load → Transform&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Store raw data directly in the data lake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean and process data later using Spark or Databricks.&lt;/li&gt;
&lt;/ul&gt;
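&lt;p&gt;The three ELT steps can be sketched in plain Python, with a list standing in for the data lake and made-up order records as the source. The key point is that the raw data lands untouched; cleaning happens later, from the raw copy:&lt;/p&gt;

```python
import json

# Extract: raw records pulled from a source system (toy data).
raw_records = [
    '{"order_id": 1, "amount": "120.0", "status": "OK"}',
    '{"order_id": 2, "amount": "80.0", "status": "ok"}',
    'not valid json',  # real feeds contain bad rows too
]

# Load: land the raw lines as-is in the "lake" (here, just a list).
data_lake = list(raw_records)

# Transform (later, on demand): parse, clean, and normalize from the raw copy.
clean = []
for line in data_lake:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # quarantine bad rows instead of failing the whole pipeline
    clean.append({"order_id": rec["order_id"],
                  "amount": float(rec["amount"]),
                  "status": rec["status"].upper()})

print(clean)
```

&lt;p&gt;Because the raw lines are still in the lake, a new transformation (for a new use case) can always be run against them later.&lt;/p&gt;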

&lt;h3&gt;
  
  
  5.3 Why Modern Architecture is used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud storage is cheap and scalable.&lt;/li&gt;
&lt;li&gt;We can store raw data and use it later for new use cases.&lt;/li&gt;
&lt;li&gt;We can support both analytics and machine learning.&lt;/li&gt;
&lt;li&gt;Compute can scale up and down based on need.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Data Lake (S3, ADLS, GCS)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Lake?&lt;/strong&gt;&lt;br&gt;
A data lake is a storage system that stores raw data in any format (CSV, JSON, images, logs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheap storage&lt;/li&gt;
&lt;li&gt;Store raw data for future use&lt;/li&gt;
&lt;li&gt;Useful for big data and machine learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3 (AWS cloud)&lt;/li&gt;
&lt;li&gt;Azure ADLS (Azure cloud)&lt;/li&gt;
&lt;li&gt;Google GCS (GCP cloud)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Data Warehouse (Snowflake, BigQuery, Redshift, Synapse)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Warehouse?&lt;/strong&gt;&lt;br&gt;
A data warehouse stores clean, structured data for analytics and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast SQL queries&lt;/li&gt;
&lt;li&gt;Business reports and dashboards&lt;/li&gt;
&lt;li&gt;Used by analysts and managers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;AWS Redshift&lt;/li&gt;
&lt;li&gt;Azure Synapse&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Data Lakehouse (Delta Lake, Apache Iceberg)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Lakehouse?&lt;/strong&gt;&lt;br&gt;
A lakehouse combines the low-cost storage of a data lake with the reliability of a data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use Delta Lake and Iceberg:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transactions (safe updates and deletes)&lt;/li&gt;
&lt;li&gt;Schema changes without breaking pipelines&lt;/li&gt;
&lt;li&gt;Time travel (see old versions of data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;br&gt;
On top of the data lake, usually with Databricks and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Processing Layer (Spark and Databricks)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Spark?&lt;/strong&gt;&lt;br&gt;
Spark is a fast distributed engine to process large data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Databricks?&lt;/strong&gt;&lt;br&gt;
Databricks is a platform that manages Spark and provides notebooks, clusters, and job scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To clean and transform large data&lt;/li&gt;
&lt;li&gt;To run batch and streaming jobs&lt;/li&gt;
&lt;li&gt;To build machine learning pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. File Formats (Avro, Parquet, ORC)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Avro:&lt;/strong&gt;&lt;br&gt;
Used for data movement and streaming. Good for schema evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parquet:&lt;/strong&gt;&lt;br&gt;
Column-based format. Very fast for analytics queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ORC:&lt;/strong&gt;&lt;br&gt;
Column-based format. Used in big data systems like Hive and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OLTP:&lt;/strong&gt;&lt;br&gt;
Used by applications for daily transactions (orders, payments).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP:&lt;/strong&gt;&lt;br&gt;
Used for analytics and reporting (data warehouse queries).&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Monitoring with Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Datadog?&lt;/strong&gt;&lt;br&gt;
Datadog is a monitoring and observability tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor data pipelines&lt;/li&gt;
&lt;li&gt;Monitor Spark jobs&lt;/li&gt;
&lt;li&gt;Monitor servers and applications&lt;/li&gt;
&lt;li&gt;Get alerts when something fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When we use it:&lt;/strong&gt;&lt;br&gt;
In production environments to keep the system healthy and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. ETL vs ELT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ETL (Traditional)&lt;/th&gt;
&lt;th&gt;ELT (Modern)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;Before load&lt;/td&gt;
&lt;td&gt;After load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;Data Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Cases&lt;/td&gt;
&lt;td&gt;Reports&lt;/td&gt;
&lt;td&gt;Reports + ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  14. Example End-to-End Use Case
&lt;/h2&gt;

&lt;p&gt;Data from ERP and CRM systems and web logs is ingested into a data lake on AWS S3. Raw data is stored in Parquet format. Spark on Databricks processes and cleans the data. Clean tables are stored using Delta Lake. Final analytics data is loaded into Snowflake. Business users use dashboards to view reports. Datadog monitors the pipelines and sends alerts when jobs fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Traditional architecture uses &lt;strong&gt;ETL + Data Warehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Modern architecture uses &lt;strong&gt;ELT + Data Lake + Lakehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Data Lake stores raw data.&lt;/li&gt;
&lt;li&gt;Data Warehouse stores clean data for reporting.&lt;/li&gt;
&lt;li&gt;Spark and Databricks handle large-scale processing.&lt;/li&gt;
&lt;li&gt;Delta Lake and Iceberg make data lakes reliable.&lt;/li&gt;
&lt;li&gt;Datadog monitors the entire system.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Learned AI by Building an Offline PDF Chatbot with Local LLMs</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Mon, 23 Jun 2025 17:33:14 +0000</pubDate>
      <link>https://forem.com/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</link>
      <guid>https://forem.com/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</guid>
      <description>&lt;p&gt;Hey everyone! I’m &lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;, and I love learning by building. Instead of just reading theory, I built something that helped me &lt;strong&gt;understand AI practically&lt;/strong&gt; and also &lt;strong&gt;prepare better for my interviews&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me walk you through what I built, how it works, and how you can try it too.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal: Learning AI by Building, Not Just Reading
&lt;/h2&gt;

&lt;p&gt;I didn’t just want to use AI tools. I wanted to &lt;strong&gt;build one from scratch&lt;/strong&gt; and see what happens under the hood.&lt;/p&gt;

&lt;p&gt;I was exploring concepts like embeddings, vector search, and local LLMs, but theory alone wasn’t sticking. So I built this project, &lt;strong&gt;an Offline PDF Analyzer&lt;/strong&gt;, to learn how documents are split, embedded, and searched, and how local models generate smart responses.&lt;/p&gt;

&lt;p&gt;This project became my practical journey into AI and now it helps others too, especially those preparing for interviews.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upload one or more PDFs through a simple web UI.&lt;/li&gt;
&lt;li&gt;Ask your questions in simple English.&lt;/li&gt;
&lt;li&gt;The system reads and understands the content, then gives you a relevant answer from the document.&lt;/li&gt;
&lt;li&gt;Everything runs &lt;strong&gt;locally&lt;/strong&gt;: no internet or API keys needed.&lt;/li&gt;
&lt;li&gt;It can also count how many questions are in the PDF, which is useful for exam prep.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Upload a PDF on Machine Learning and ask: “What is the difference between supervised and unsupervised learning?”&lt;br&gt;
You get a clear, to-the-point answer pulled directly from the relevant section of the document instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Below is the complete flow of how the &lt;strong&gt;Offline PDF Analyzer&lt;/strong&gt; works behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PDF Upload&lt;/strong&gt;: The user uploads one or more PDF files through the UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Extraction&lt;/strong&gt;: The app reads all pages using &lt;code&gt;PyMuPDF&lt;/code&gt; and extracts clean text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Long text is split into overlapping chunks using &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; to preserve context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Each chunk is converted into a vector (a list of numbers) using &lt;code&gt;OllamaEmbeddings&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS Vector Search&lt;/strong&gt;: When a question is asked, the most similar chunks are retrieved with a fast vector similarity search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer Generation&lt;/strong&gt;: The selected chunks are passed to a &lt;strong&gt;local LLM&lt;/strong&gt; (like &lt;code&gt;phi&lt;/code&gt;, &lt;code&gt;mistral&lt;/code&gt;, or &lt;code&gt;llama2&lt;/code&gt;) to generate the final answer.&lt;/li&gt;
&lt;/ol&gt;
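&lt;p&gt;Step 3 (chunking with overlap) can be sketched in a few lines of plain Python; this is a simplified stand-in for &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, with made-up chunk sizes:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks so context at the boundaries
    is not lost between neighboring chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "Supervised learning uses labeled data; unsupervised learning finds structure."
chunks = chunk_text(text)
print(len(chunks))   # 3
print(chunks[0])     # first 40 characters of the text
```

&lt;p&gt;Each chunk shares its last &lt;code&gt;overlap&lt;/code&gt; characters with the start of the next one, so a sentence cut at a boundary is still fully present in one of the two chunks.&lt;/p&gt;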




&lt;h2&gt;
  
  
  Choose Your Local AI Model
&lt;/h2&gt;

&lt;p&gt;You can select models like &lt;strong&gt;phi&lt;/strong&gt;, &lt;strong&gt;mistral&lt;/strong&gt;, or &lt;strong&gt;llama2&lt;/strong&gt; all running locally on your laptop using &lt;strong&gt;Ollama&lt;/strong&gt; for fast and efficient results.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Design Diagram: How PDF Analyzer Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" alt=" " width="542" height="339"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt;: For building a user-friendly frontend with just a few lines of Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMuPDF (fitz)&lt;/strong&gt;: To extract text from all pages of uploaded PDFs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: To handle end-to-end chaining from query to retrieval to LLM response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;: Breaks the text into chunks with overlaps, so context is preserved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: Runs local LLMs (phi, mistral, llama2) directly on your machine without internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: A super-fast vector search library to retrieve relevant chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: For the backend logic, caching, state management, and pre-processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges I Faced &amp;amp; How I Solved Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Wrong Answers from Wrong Sections
&lt;/h3&gt;

&lt;p&gt;In the beginning, the app pulled answers from the wrong part of the PDF; they didn’t match the question and made things confusing.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I adjusted the chunk overlap size, used better metadata like page numbers and source file names, and added tagging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Answers Coming from the Previous PDF
&lt;/h3&gt;

&lt;p&gt;Even after uploading a different PDF, it still showed answers from the old one. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added &lt;strong&gt;file hashing&lt;/strong&gt; to detect newly uploaded PDFs. If the incoming file is different from the previous one, the system discards the old data and processes the new file from scratch.&lt;/p&gt;
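&lt;p&gt;Here’s a minimal sketch of that file-hashing check in plain Python (the function names are illustrative, not the app’s real code):&lt;/p&gt;

```python
import hashlib

def file_hash(data):
    """Return a fingerprint for an uploaded file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def should_rebuild_index(new_bytes, last_hash):
    """True when the upload differs from the previously indexed PDF."""
    return file_hash(new_bytes) != last_hash

# Example: a second upload with different content triggers a rebuild
last = file_hash(b"old pdf bytes")
print(should_rebuild_index(b"old pdf bytes", last))  # False: same file, keep index
print(should_rebuild_index(b"new pdf bytes", last))  # True: new file, re-embed
```

&lt;p&gt;Hashing the raw bytes means even a renamed re-upload of the same file is recognized as unchanged, so the index is only rebuilt when the content actually differs.&lt;/p&gt;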

&lt;h3&gt;
  
  
  3. Short Queries Gave Confusing Answers
&lt;/h3&gt;

&lt;p&gt;If I typed "types?" or "examples?", the app didn’t understand what I meant. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added logic to automatically expand short questions into full ones. For example, if someone types &lt;strong&gt;"types?"&lt;/strong&gt;, it becomes &lt;strong&gt;"What are the different types mentioned in the document?"&lt;/strong&gt; so the model understands better.&lt;/p&gt;
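&lt;p&gt;That expansion step can be as simple as a lookup table; the keywords and templates below are illustrative, not the app’s actual list:&lt;/p&gt;

```python
# Map terse keywords to full questions the LLM can work with.
EXPANSIONS = {
    "types": "What are the different types mentioned in the document?",
    "examples": "What examples are given in the document?",
    "summary": "Can you summarize the document?",
}

def expand_query(query):
    """Expand a terse query like 'types?' into a full question; pass others through."""
    key = query.strip().rstrip("?").lower()
    return EXPANSIONS.get(key, query)

print(expand_query("types?"))  # What are the different types mentioned in the document?
```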

&lt;h3&gt;
  
  
  4. No Info on Where the Answer Came From
&lt;/h3&gt;

&lt;p&gt;I wasn’t sure if the answer was right because it didn’t show where in the PDF it found the info.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Now it shows the &lt;strong&gt;PDF name&lt;/strong&gt; and &lt;strong&gt;page number&lt;/strong&gt; where the answer came from, and you can &lt;strong&gt;click to see more details&lt;/strong&gt; if you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Techniques I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@st.cache_data&lt;/code&gt;: To avoid reloading the same PDF again and again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Hashing&lt;/strong&gt;: So that the app resets only when a new PDF is uploaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session State&lt;/strong&gt;: Used in Streamlit to store user-uploaded files and questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex Matching&lt;/strong&gt;: To support question formats like “How many questions are in this PDF?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Help the model understand and answer better when the user's question is short or unclear.&lt;/li&gt;
&lt;/ul&gt;
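&lt;p&gt;As an example of the regex technique from the list above, a query like “How many questions are in this PDF?” can be answered by counting question-shaped lines in the extracted text. The pattern here is an illustrative sketch, not the app’s actual one:&lt;/p&gt;

```python
import re

def count_questions(text):
    """Count lines that look like questions (optionally numbered, ending in '?')."""
    pattern = re.compile(r"^\s*(?:\d+[\.\)]\s*)?.+\?\s*$", re.MULTILINE)
    return len(pattern.findall(text))

sample = "1. What is RAG?\nRAG stands for retrieval.\n2. Why use FAISS?\n"
print(count_questions(sample))  # 2
```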




&lt;h2&gt;
  
  
  Any Frontend?
&lt;/h2&gt;

&lt;p&gt;Yes! I made a clean and user-friendly interface using &lt;strong&gt;Streamlit&lt;/strong&gt; that makes it easy to upload PDFs and get answers quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose your preferred LLM (phi / mistral / llama2)&lt;/li&gt;
&lt;li&gt;Upload one or more PDFs&lt;/li&gt;
&lt;li&gt;Ask your question&lt;/li&gt;
&lt;li&gt;See the answer + source (page number + filename)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No delays, no registration: everything happens on your own system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Here’s what I want to add next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF Summarizer&lt;/strong&gt;: Get a quick summary of the whole PDF.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export Chat History&lt;/strong&gt;: Save your Q&amp;amp;A for later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find All Questions&lt;/strong&gt;: List all questions found inside the PDF.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tech Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Breaking a big document into small, readable parts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Turning text into numbers so that the model understands meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: Finds the best match for your question from the chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLMs&lt;/strong&gt;: Small AI models running on your laptop (no internet needed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: Connects everything (PDFs, questions, answers) in one neat pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions You Can Expect
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;How does chunk overlap affect retrieval quality?&lt;/li&gt;
&lt;li&gt;What’s the role of FAISS in a RAG pipeline?&lt;/li&gt;
&lt;li&gt;Why are prompt templates useful in real-world applications?&lt;/li&gt;
&lt;li&gt;How do you make vector indexes update-safe when files change?&lt;/li&gt;
&lt;li&gt;What are the trade-offs of using local LLMs vs cloud APIs?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project started as a way to &lt;strong&gt;learn AI deeply&lt;/strong&gt; by building something useful. It taught me how to use embeddings, vector search, local LLMs, and chaining tools all while helping me with interview prep.&lt;/p&gt;

&lt;p&gt;If you want to learn by doing: start small, build real, and break things.&lt;/p&gt;

&lt;p&gt;Let’s keep learning. Let’s keep building.&lt;br&gt;&lt;br&gt;
✍️&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/ShaikSalmaAga/offline-pdf-analyzer" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Agentic AI Changed My Job Prep Journey</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Tue, 03 Jun 2025 05:02:15 +0000</pubDate>
      <link>https://forem.com/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</link>
      <guid>https://forem.com/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</guid>
      <description>&lt;p&gt;Hi everyone! I'm &lt;strong&gt;Salma&lt;/strong&gt;, a student and software engineer preparing for full-time roles. While applying for jobs and preparing for interviews, I realized something big:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Just knowing how to code is no longer enough."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today’s tech world is changing fast. We see &lt;strong&gt;AI everywhere&lt;/strong&gt;, and one term you’ll hear again and again is &lt;strong&gt;Agentic AI&lt;/strong&gt;. Some people know what it is, many don’t. But if you’re a student or professional looking for a job, understanding Agentic AI gives you a huge advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Imagine you're building your own travel assistant app.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Traditional AI (like ChatGPT):
&lt;/h3&gt;

&lt;p&gt;You: "Book a flight to Delhi."&lt;br&gt;&lt;br&gt;
AI: "Sure. Please tell me the date, airline, timing, etc."&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic AI:
&lt;/h3&gt;

&lt;p&gt;You: "I need to be in Delhi next week for a conference."&lt;/p&gt;

&lt;p&gt;AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar for free days&lt;/li&gt;
&lt;li&gt;Suggests flight options&lt;/li&gt;
&lt;li&gt;Books your ticket&lt;/li&gt;
&lt;li&gt;Adds it to your calendar&lt;/li&gt;
&lt;li&gt;Sends you a reminder and even books your cab&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just AI that responds. &lt;strong&gt;It’s AI that acts on its own.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Agentic AI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI&lt;/strong&gt; is artificial intelligence that &lt;strong&gt;sets goals&lt;/strong&gt;, &lt;strong&gt;makes decisions&lt;/strong&gt;, &lt;strong&gt;takes action&lt;/strong&gt;, and &lt;strong&gt;learns&lt;/strong&gt; all on its own.&lt;/p&gt;

&lt;p&gt;It doesn’t wait for your prompt. It’s like &lt;strong&gt;hiring a junior employee&lt;/strong&gt; who knows what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional AI vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional AI&lt;/strong&gt; works based on prompts. You give it instructions, it gives an output.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt; works based on goals. You give it a goal, and it figures out how to reach it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between Traditional AI and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional AI&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Needs prompts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can act on goals&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision-making&lt;/td&gt;
&lt;td&gt;Basic logic&lt;/td&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Chatbot&lt;/td&gt;
&lt;td&gt;Calendar + Travel Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lifecycle of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" alt=" " width="270" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Perceive&lt;/strong&gt; – Collects data (emails, APIs, sensors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reason&lt;/strong&gt; – Understands the task and plans next steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; – Executes using APIs and tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; – Evaluates and improves its performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate&lt;/strong&gt; – Works with humans or other agents&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How Agentic AI Solves Customer Support Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Perceive: Reads an angry customer email&lt;/li&gt;
&lt;li&gt;Reason: Understands it’s about a delayed shipment&lt;/li&gt;
&lt;li&gt;Act: Sends an apology and discount coupon&lt;/li&gt;
&lt;li&gt;Learn: Tracks response from customer&lt;/li&gt;
&lt;li&gt;Collaborate: Notifies human agent if unresolved&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single-Agent System&lt;/strong&gt;: One agent handles everything.&lt;br&gt;&lt;br&gt;
Example: A budget manager bot that tracks, predicts, and alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Agent System&lt;/strong&gt;: Several agents with different responsibilities.&lt;br&gt;&lt;br&gt;
Example: Email agents where one reads, another replies, and another logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal-Oriented Agent&lt;/strong&gt;: Given a goal, it plans and acts.&lt;br&gt;&lt;br&gt;
Example: “Grow Instagram to 5K followers.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reactive Agent&lt;/strong&gt;: Reacts quickly but doesn’t plan ahead.&lt;br&gt;&lt;br&gt;
Example: The auto-braking system in cars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deliberative Agent&lt;/strong&gt;: Thinks and reasons before acting.&lt;br&gt;&lt;br&gt;
Example: Schedules meetings based on mood, urgency, and history.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Build Agentic AI
&lt;/h2&gt;

&lt;p&gt;To build an Agentic AI system, you begin with a &lt;strong&gt;frontend&lt;/strong&gt; that accepts input from users. The request is handled by a &lt;strong&gt;backend&lt;/strong&gt; which forwards the data to a &lt;strong&gt;language model (LLM)&lt;/strong&gt; such as GPT-4 or Claude. The LLM reasons about the task and initiates actions. These actions may include calling APIs or updating systems. Context or memory is stored using vector databases. Results and state changes are saved in a storage system like PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" alt=" " width="570" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Agentic AI Can Automate Resume Screening
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Recruiter uploads resumes on the web interface&lt;/li&gt;
&lt;li&gt;Backend forwards data to the LLM&lt;/li&gt;
&lt;li&gt;LLM ranks the candidates based on fit&lt;/li&gt;
&lt;li&gt;Memory layer remembers past hiring preferences&lt;/li&gt;
&lt;li&gt;Action layer sends top resumes to HR&lt;/li&gt;
&lt;li&gt;PostgreSQL stores rankings and history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Components Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: HTML, React
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python (Flask, FastAPI)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: GPT-4, Claude, LLaMA
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: FAISS, Pinecone
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt;: APIs, Zapier, CRMs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: PostgreSQL, Redis
&lt;/li&gt;
&lt;/ul&gt;
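&lt;p&gt;The hand-offs between these components can be sketched as a tiny perceive–reason–act loop in plain Python. Every function here is a stub standing in for the real LLM, APIs, and storage named above:&lt;/p&gt;

```python
# A toy perceive-reason-act loop; each step is a stand-in stub,
# not the real LLM, vector store, or API integrations.
def perceive(inbox):
    """Collect the next input (here, just pop from a list)."""
    return inbox.pop(0) if inbox else None

def reason(observation):
    """Decide on an action; a real system would call an LLM here."""
    if "delayed" in observation:
        return "send_apology"
    return "escalate"

def act(action, log):
    """Execute the chosen action and record it for the memory layer."""
    log.append(action)
    return action

inbox = ["order delayed by 3 days", "refund dispute"]
memory = []
while inbox:
    obs = perceive(inbox)
    act(reason(obs), memory)
print(memory)  # ['send_apology', 'escalate']
```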

&lt;h2&gt;
  
  
  How LangChain + Agentic AI Works
&lt;/h2&gt;

&lt;p&gt;This diagram shows how an &lt;strong&gt;Agentic AI system&lt;/strong&gt; works when you build it using &lt;strong&gt;LangChain&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" alt=" " width="412" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Input&lt;/strong&gt;: The user gives a request.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; “Remind me about my meeting and send a message if I’m late.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning / Planning&lt;/strong&gt;: The system now goes into &lt;strong&gt;thinking mode&lt;/strong&gt;. It uses a smart model (like GPT-4 or Claude) to figure out what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: Based on the plan, it performs the actual work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar&lt;/li&gt;
&lt;li&gt;Sends messages&lt;/li&gt;
&lt;li&gt;Searches the web&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uses Tools&lt;/strong&gt;: To complete tasks, the AI uses different tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Search&lt;/strong&gt; to gather new information
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Calls&lt;/strong&gt; to apps like your calendar or email
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases, Zapier, or CRMs&lt;/strong&gt; to interact with systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory / Storage&lt;/strong&gt;: After doing the task, it &lt;strong&gt;stores what happened&lt;/strong&gt; for future reference so it can learn and improve next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back to User or Move to Next Task&lt;/strong&gt;: It either&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates the user about the result
&lt;/li&gt;
&lt;li&gt;Or starts working on the next goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This full loop (User → Plan → Act → Tools → Back to User) is what makes Agentic AI powerful. It’s not just replying like a chatbot; it’s doing real work for you, like a smart digital assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Proactive and autonomous&lt;/li&gt;
&lt;li&gt;Learns and adapts over time&lt;/li&gt;
&lt;li&gt;Integrates with tools and systems&lt;/li&gt;
&lt;li&gt;Can collaborate with other agents or humans&lt;/li&gt;
&lt;li&gt;Reduces repetitive human work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Drawbacks of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Risk of incorrect actions due to bad data&lt;/li&gt;
&lt;li&gt;Hard to debug errors in multi-step logic&lt;/li&gt;
&lt;li&gt;Requires safeguards and human override&lt;/li&gt;
&lt;li&gt;Complexity in design and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What If Agentic AI Fails?
&lt;/h2&gt;

&lt;p&gt;Failures can occur. Here's how to make systems robust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Queues&lt;/strong&gt;: Split large tasks into traceable chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Tokens&lt;/strong&gt;: Avoid confusion between user sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Keep communication consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation Paths&lt;/strong&gt;: Alert humans when automation fails&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Failure Example:
&lt;/h3&gt;

&lt;p&gt;If a meeting booking fails due to calendar API error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry booking&lt;/li&gt;
&lt;li&gt;On failure again, send alert to user and log the error&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Is Agentic AI Used Today?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Salesforce – AI customer support agents
&lt;/li&gt;
&lt;li&gt;Hippocratic AI – Medical virtual assistants
&lt;/li&gt;
&lt;li&gt;Ema AI – Business workflow automation
&lt;/li&gt;
&lt;li&gt;Juna – Factory control agents
&lt;/li&gt;
&lt;li&gt;Jasper + HubSpot – AI-powered marketing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Agents vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;: Acts only after manual user input.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Gmail Smart Reply: you click it, it sends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between AI Agents and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AI Agents&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User initiated&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal planning&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step task&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning ability&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Evolution of AI:&lt;/strong&gt; AI has progressed in three major stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" alt=" " width="510" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive AI:&lt;/strong&gt; Forecasting the future&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Credit scoring, fraud detection&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generative AI:&lt;/strong&gt; Creating new content&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; ChatGPT, DALL·E, MidJourney&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI:&lt;/strong&gt; Thinking, planning, acting&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; An AI assistant managing tasks and meetings&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Agentic AI is not just a buzzword. It’s a &lt;strong&gt;career game-changer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're a student or developer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn the lifecycle of agentic systems
&lt;/li&gt;
&lt;li&gt;Build a real mini-project (e.g. with LangChain)
&lt;/li&gt;
&lt;li&gt;Write about it or share your GitHub
&lt;/li&gt;
&lt;li&gt;Talk about it in interviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ Written by Shaik Salma Aga&lt;/p&gt;




</description>
    </item>
    <item>
      <title>How I Built My Own RAG Chatbot with Local LLMs (And the Roadblocks That Taught Me More Than the Code)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sat, 31 May 2025 20:25:53 +0000</pubDate>
      <link>https://forem.com/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</link>
      <guid>https://forem.com/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</guid>
      <description>&lt;p&gt;A while back, I wrote a &lt;strong&gt;beginner-to-expert guide&lt;/strong&gt; on &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;. That article was all theory. How RAG works, the difference between &lt;strong&gt;sparse and dense embeddings&lt;/strong&gt;, and why it’s powerful.&lt;/p&gt;

&lt;p&gt;This time, I wanted to get my hands dirty. I wanted to build something real.&lt;/p&gt;

&lt;p&gt;So I built a working &lt;strong&gt;RAG chatbot&lt;/strong&gt;. Completely offline. Locally.&lt;/p&gt;

&lt;p&gt;Let me walk you through the full journey: what I built, how it works, and what went wrong (and how I fixed it).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why I Ran It Locally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This wasn’t about saving money or staying private. It was about &lt;strong&gt;learning&lt;/strong&gt;: raw, hands-on, deep learning.&lt;/p&gt;

&lt;p&gt;I didn’t want to just connect APIs and feel like a builder. I wanted to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Understand how text becomes vectors&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Debug retrieval when it breaks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Run a model myself and see how it responds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to learn the hard way, and &lt;strong&gt;local was the best way&lt;/strong&gt; to make sure I did.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;My Project: PDF Q&amp;amp;A Chatbot (All Offline)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I had one clear goal: &lt;strong&gt;Ask questions from a PDF and get meaningful answers without internet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used a document called &lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;. I asked questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the phases in AI development?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The chatbot searched the PDF, found the right section, fed it to a local LLM, and gave me a perfect answer.&lt;/p&gt;

&lt;p&gt;All offline.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;System Design Diagram: How Offline RAG Chatbot Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" alt=" " width="406" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends a question to the &lt;strong&gt;RAG chatbot&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The chatbot uses &lt;strong&gt;PyPDFLoader&lt;/strong&gt; to load the PDF.&lt;/li&gt;
&lt;li&gt;It splits the text using &lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Text chunks are converted to vectors via &lt;strong&gt;HuggingFace Embeddings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Vectors are stored and retrieved using &lt;strong&gt;FAISS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The top relevant chunks are passed to a local &lt;strong&gt;LLM&lt;/strong&gt; via &lt;strong&gt;Ollama&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The final answer is shown to the user.&lt;/li&gt;
&lt;/ol&gt;
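&lt;p&gt;The flow above can be mimicked end to end in a few lines of dependency-free Python. This toy version swaps the real components (HuggingFace embeddings, FAISS, Ollama) for word-count vectors and a brute-force search, purely to show the data flow:&lt;/p&gt;

```python
import math
import re

def embed(text):
    """Toy 'embedding': word counts (a stand-in for real dense vectors)."""
    vec = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 2-3: load the document text and split it into chunks
chunks = [
    "AI development has three phases: predictive, generative and agentic.",
    "NLP techniques include tokenization and parsing.",
]
# Steps 4-5: embed the chunks and retrieve the best match for the question
question = "What are the phases in AI development?"
best = max(chunks, key=lambda c: similarity(embed(question), embed(c)))
# Step 6 would hand `best` to the local LLM (via Ollama) as context
print(best)
```

&lt;p&gt;Real embeddings capture meaning rather than word overlap, but the retrieve-then-generate flow is exactly the same.&lt;/p&gt;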




&lt;h3&gt;
  
  
  &lt;strong&gt;Tech Stack I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PyPDFLoader:&lt;/strong&gt; Used for extracting raw text from the PDF so the bot can "read" it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecursiveCharacterTextSplitter:&lt;/strong&gt; It ensures that even long paragraphs are broken into manageable, overlapping pieces that preserve meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFaceEmbeddings:&lt;/strong&gt; Converts those text chunks into number lists (vectors) that reflect context, not just words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS:&lt;/strong&gt; A lightning-fast search tool that finds which vectors (chunks) are closest to the question vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama:&lt;/strong&gt; Runs lightweight models like &lt;code&gt;phi&lt;/code&gt; on your machine, no cloud needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain:&lt;/strong&gt; The backbone. It handles all connections from question to document to model and back.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The Hidden Struggles and My Fixes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empty Answers or Garbage Output&lt;/strong&gt;&lt;br&gt;
My initial PDF had just one sentence, not enough for meaningful retrieval.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I created a structured PDF (&lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;) with real content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrong Chunks Being Retrieved&lt;/strong&gt;&lt;br&gt;
Asked about AI phases, but got results about NLP techniques.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Added more chunk overlap, changed embedding model, and tagged the chunks with extra metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deprecation Warnings in LangChain&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;.run()&lt;/code&gt; method stopped working.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Switched to the &lt;code&gt;.invoke()&lt;/code&gt; method per latest LangChain docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama Crashes with Heavy Models&lt;/strong&gt;&lt;br&gt;
Running models like Mistral overloaded my RAM.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Downgraded to &lt;code&gt;phi&lt;/code&gt;, a lighter model that worked well locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Change After Updating PDF&lt;/strong&gt;&lt;br&gt;
I changed the PDF but still got answers from the old one.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I cleared the FAISS index and re-embedded everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short or Vague Queries Confused the Bot&lt;/strong&gt;&lt;br&gt;
“Phases?” returned irrelevant content.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I used prompt templates to expand such queries into full sentences automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Technical Bits Explained Simply&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chunking&lt;/strong&gt;&lt;br&gt;
Breaks large documents into overlapping sections so important parts aren’t lost during processing.&lt;/p&gt;
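&lt;p&gt;A bare-bones splitter shows the idea. The real &lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt; is smarter about sentence and paragraph boundaries; this sketch only illustrates why chunks overlap:&lt;/p&gt;

```python
def split_with_overlap(text, chunk_size=200, overlap=50):
    """Slice text into fixed-size chunks, each sharing `overlap` characters
    with the previous one so sentences at a boundary appear in both."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```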

&lt;p&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;br&gt;
Turns sentences into numbers that represent meaning. That way, "vacation" and "holiday" look nearly the same to the machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cosine Similarity&lt;/strong&gt;&lt;br&gt;
A math trick to check how similar two vectors (questions and chunks) are. Smaller angle = better match.&lt;/p&gt;
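&lt;p&gt;In code, that check is a few lines of arithmetic (a tiny, self-contained example on plain number lists):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical direction scores 1.0; a 45-degree angle scores about 0.707
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```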

&lt;p&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;br&gt;
A tool that finds which chunks are most similar to the question, very quickly.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain simplifies the complex plumbing. It:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Takes your question&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Converts it to a vector&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Finds the most relevant document chunks via FAISS&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Sends it all to the LLM&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Collects and returns the final answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All without you needing to manually stitch the logic together.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluation Techniques I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manually compared answers with the PDF&lt;/li&gt;
&lt;li&gt;Asked intentionally vague or tricky questions&lt;/li&gt;
&lt;li&gt;Checked that the answers didn’t hallucinate&lt;/li&gt;
&lt;li&gt;Made sure important info wasn’t skipped (avoided the “lost in the middle” issue)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Any Frontend?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not yet, but I’m planning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Streamlit&lt;/strong&gt;-based UI for chatting with the bot&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;FastAPI&lt;/strong&gt; backend to make it modular&lt;/li&gt;
&lt;li&gt;A desktop wrapper so anyone can use it easily&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What’s Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-PDF support&lt;/li&gt;
&lt;li&gt;Chunk summaries for quick previews&lt;/li&gt;
&lt;li&gt;Using &lt;strong&gt;ragas&lt;/strong&gt; for automated evaluation&lt;/li&gt;
&lt;li&gt;Feedback-based learning loop&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Interview Questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;How does &lt;strong&gt;chunk overlap&lt;/strong&gt; affect retrieval quality in RAG systems?&lt;/li&gt;
&lt;li&gt;What are the benefits of &lt;strong&gt;local embeddings&lt;/strong&gt; over API-based ones?&lt;/li&gt;
&lt;li&gt;How do you &lt;strong&gt;debug wrong or missing retrievals&lt;/strong&gt; in vector search?&lt;/li&gt;
&lt;li&gt;What’s the trade-off between &lt;strong&gt;dense and sparse embeddings&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;How do you handle &lt;strong&gt;stale or outdated indexes&lt;/strong&gt; in a vector DB like FAISS?&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Building this RAG chatbot wasn’t just about code; it was about transforming theory into practice. Every bug I fixed and every wrong answer I debugged helped me grow.&lt;/p&gt;

&lt;p&gt;If you’ve read about RAG and want to &lt;em&gt;really&lt;/em&gt; learn it, build something.&lt;/p&gt;

&lt;p&gt;Let’s keep learning, building, and breaking things together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 GitHub: &lt;a href="https://github.com/ShaikSalmaAga/rag-chatbot" rel="noopener noreferrer"&gt;https://github.com/ShaikSalmaAga/rag-chatbot&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastering Retrieval-Augmented Generation (RAG): From Beginner to Expert</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Tue, 27 May 2025 05:50:52 +0000</pubDate>
      <link>https://forem.com/salma_aga/-mastering-retrieval-augmented-generation-rag-from-beginner-to-expert-5fgi</link>
      <guid>https://forem.com/salma_aga/-mastering-retrieval-augmented-generation-rag-from-beginner-to-expert-5fgi</guid>
      <description>&lt;h2&gt;
  
  
  Why Should You Care About RAG?
&lt;/h2&gt;

&lt;p&gt;Imagine you work in the HR department of a company that has a 100-page PDF filled with employee policies. One day, an intern walks up to your desk and asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How many work-from-home days are allowed for interns?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You open the document, press &lt;strong&gt;Ctrl+F&lt;/strong&gt;, and type “work-from-home.” But the PDF uses the term &lt;strong&gt;“remote work flexibility.”&lt;/strong&gt; You scroll endlessly, read random sections, and still can’t find a clear answer. It’s frustrating.&lt;/p&gt;

&lt;p&gt;Now imagine a smart chatbot that can read the entire PDF, understand the meaning, and say in 3 seconds:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interns are eligible for 5 remote working days per month.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the power of &lt;strong&gt;RAG: Retrieval-Augmented Generation&lt;/strong&gt;. It gives &lt;strong&gt;real answers from real documents&lt;/strong&gt; — not guesses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Large Language Models Alone Aren’t Enough
&lt;/h2&gt;

&lt;p&gt;LLMs like &lt;strong&gt;GPT-4&lt;/strong&gt; are powerful but have key limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They sometimes &lt;strong&gt;hallucinate&lt;/strong&gt; — they make up answers that sound real but aren’t true.&lt;/li&gt;
&lt;li&gt;Their knowledge is frozen at a training cutoff. GPT-4, for example, was trained on data only up to &lt;strong&gt;2023&lt;/strong&gt;, so anything after that is unknown to it.&lt;/li&gt;
&lt;li&gt;They can’t access &lt;strong&gt;private documents&lt;/strong&gt; like your PDFs or internal policies.&lt;/li&gt;
&lt;li&gt;They don’t &lt;strong&gt;search&lt;/strong&gt;, they just generate responses from memory.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Better Way: Use RAG
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; connects a smart language model to external documents. Instead of guessing, it &lt;strong&gt;retrieves the correct information&lt;/strong&gt; and generates accurate responses.&lt;/p&gt;

&lt;p&gt;So if the intern asks the same question again, the system will search the HR policy and respond:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interns are eligible for 5 remote working days per month.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s fast, trustworthy, and grounded in &lt;strong&gt;real content&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Prepare the Data (Ingestion)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chunking
&lt;/h3&gt;

&lt;p&gt;Your document is like a big cake. Chunking is slicing it into small parts — 256 or 512 tokens — so it’s easier to search.&lt;/p&gt;
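&lt;p&gt;A minimal sketch of the idea, splitting on words rather than real tokens (in practice you’d count tokens with a tokenizer, and the &lt;code&gt;chunk_words&lt;/code&gt; helper here is just illustrative):&lt;/p&gt;

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word chunks (a stand-in for token chunks)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

# Demo: 120 "words" become three chunks, each sharing a 10-word overlap
# with the next so no sentence is cut off without context.
doc = " ".join(f"word{i}" for i in range(120))
pieces = chunk_words(doc, size=50, overlap=10)
```

&lt;p&gt;The overlap matters: without it, an answer that straddles a chunk boundary could be lost.&lt;/p&gt;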

&lt;h3&gt;
  
  
  Embedding
&lt;/h3&gt;

&lt;p&gt;Each chunk is turned into a vector (a list of numbers) using models like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI Embeddings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt; (Bidirectional Encoder Representations from Transformers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Because computers understand &lt;strong&gt;numbers&lt;/strong&gt;, not words. Embeddings help the machine capture the &lt;strong&gt;meaning&lt;/strong&gt; behind the text.&lt;/p&gt;

&lt;p&gt;Example: “holiday leave” and “paid vacation” are different phrases but mean the same thing. Embeddings can tell they’re related.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing in a Vector Database
&lt;/h3&gt;

&lt;p&gt;These vectors go into special databases built for fast search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chroma&lt;/strong&gt; – beginner friendly and local&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt; – cloud-based and scalable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt; – open-source tool by Facebook for high-speed search&lt;/li&gt;
&lt;/ul&gt;
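&lt;p&gt;Under the hood, all of these do some version of “compare the query vector against the stored vectors.” A toy in-memory stand-in makes that concrete (real stores like FAISS use optimized indexes instead of a linear scan, and the chunk ids here are made up):&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# A tiny "vector store": chunk id mapped to its embedding.
store = {
    "leave-policy":  [0.9, 0.1, 0.0],
    "expense-rules": [0.0, 0.2, 0.9],
    "wfh-policy":    [0.8, 0.3, 0.1],
}

def search(query_vec, store, k=2):
    """Return the k chunk ids most similar to the query vector."""
    ranked = sorted(store, key=lambda cid: cosine(query_vec, store[cid]), reverse=True)
    return ranked[:k]

top = search([1.0, 0.2, 0.0], store, k=2)
```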




&lt;h2&gt;
  
  
  What Are Embeddings and Why Do They Matter?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Embeddings&lt;/strong&gt; convert text into vectors so we can search and compare meaning — not just words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sparse Embeddings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tools: &lt;strong&gt;TF-IDF&lt;/strong&gt; (Term Frequency–Inverse Document Frequency), &lt;strong&gt;BM25&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Fast, matches exact terms&lt;/li&gt;
&lt;li&gt;Doesn’t understand deeper meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: If you ask about “holiday” and the doc says “vacation,” sparse embedding will &lt;strong&gt;miss&lt;/strong&gt; it.&lt;/p&gt;
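&lt;p&gt;You can see why with a tiny sketch: a sparse method rewards exact word overlap, so “holiday leave” scores zero against a sentence that only says “paid vacation” (this is a simplified overlap count, not real TF-IDF weighting):&lt;/p&gt;

```python
def keyword_score(query, doc):
    """Count exact word overlap, which is roughly what sparse methods reward."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d))

miss = keyword_score("holiday leave", "Employees get 30 days of paid vacation")
hit = keyword_score("vacation days", "Employees get 30 days of paid vacation")
```

&lt;p&gt;Here &lt;code&gt;miss&lt;/code&gt; is 0 even though the meaning matches, while &lt;code&gt;hit&lt;/code&gt; works only because the exact words appear.&lt;/p&gt;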

&lt;h3&gt;
  
  
  Dense Embeddings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tools: &lt;strong&gt;BERT&lt;/strong&gt;, &lt;strong&gt;Sentence-BERT&lt;/strong&gt;, &lt;strong&gt;OpenAI Embeddings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understands context and meaning&lt;/li&gt;
&lt;li&gt;Better match, even if exact words differ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Ask about “vacation policy,” and the doc says “30 days paid leave.” Dense embeddings will &lt;strong&gt;match&lt;/strong&gt; it.&lt;/p&gt;

&lt;p&gt;Dense embeddings are ideal for RAG because meaning matters more than words.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Retrieval (Find the Right Chunks)
&lt;/h2&gt;

&lt;p&gt;When a user asks something:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is &lt;strong&gt;converted into a vector&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compared with all document vectors&lt;/li&gt;
&lt;li&gt;The closest matches are selected&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Similarity Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cosine Similarity&lt;/strong&gt;: Measures the &lt;strong&gt;angle&lt;/strong&gt; between vectors. Smaller angle = more similar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Euclidean Distance&lt;/strong&gt;: Measures the &lt;strong&gt;distance&lt;/strong&gt; between points. Closer = more similar.&lt;/li&gt;
&lt;/ul&gt;
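&lt;p&gt;Both measures are only a few lines of math. A small sketch shows how they can disagree: two parallel vectors look identical to cosine similarity (same direction) while Euclidean distance still separates them (different magnitudes):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def euclidean_distance(a, b):
    """Straight-line distance: 0.0 means identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# [2, 4] points in the same direction as [1, 2] but is twice as long.
sim = cosine_similarity([1, 2], [2, 4])
dist = euclidean_distance([1, 2], [2, 4])
```

&lt;p&gt;Because embedding vectors are often normalized, cosine similarity is the usual default in RAG pipelines.&lt;/p&gt;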

&lt;h3&gt;
  
  
  Retrieval Methods
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Just pick the top-matching chunk and send to the model. Fast but might lack context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentence-Window Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Picks the match and adds surrounding sentences — so the model understands context better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ensemble Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Tries multiple chunk sizes (128, 256, 512), combines best chunks, and sorts them with a &lt;strong&gt;Re-Ranker&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
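&lt;p&gt;Sentence-window retrieval is simple to sketch: once the best-matching sentence is found, return it together with its neighbors so the LLM sees the surrounding context (the &lt;code&gt;sentence_window&lt;/code&gt; helper here is hypothetical, not from any library):&lt;/p&gt;

```python
def sentence_window(sentences, best_idx, window=1):
    """Return the best-matching sentence plus `window` neighbors on each side."""
    lo = max(best_idx - window, 0)
    hi = min(best_idx + window + 1, len(sentences))
    return " ".join(sentences[lo:hi])

sents = ["S0.", "S1.", "S2.", "S3.", "S4."]
ctx = sentence_window(sents, best_idx=2, window=1)
```

&lt;p&gt;The &lt;code&gt;max&lt;/code&gt;/&lt;code&gt;min&lt;/code&gt; clamps handle matches at the start or end of the document.&lt;/p&gt;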




&lt;h2&gt;
  
  
  Step 3: Re-Ranking
&lt;/h2&gt;

&lt;p&gt;You might retrieve 10 chunks, but you typically pass only the top 3 to 5 to the LLM. So we &lt;strong&gt;sort&lt;/strong&gt; them by importance first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types of Re-Ranking:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lexical&lt;/strong&gt;: Based on keywords (TF-IDF, BM25)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: Based on meaning (BERT, Cohere)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTR (Learning to Rank)&lt;/strong&gt;: ML model trained to choose best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: Combines keyword and meaning-based methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of re-ranking like a judge picking the best answers to pass to the LLM.&lt;/p&gt;
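&lt;p&gt;A hybrid re-ranker can be sketched as a weighted blend of a lexical score and a semantic score (the candidate scores and the 50/50 weighting below are illustrative assumptions, not from any specific library):&lt;/p&gt;

```python
def keyword_overlap(query, text):
    """Lexical score: count of exact word matches."""
    return len(set(query.lower().split()).intersection(set(text.lower().split())))

def hybrid_rerank(query, candidates, k=3, alpha=0.5):
    """candidates: (chunk_text, semantic_score) pairs.
    Blend lexical overlap with the semantic score, keep the top k chunks."""
    def combined(item):
        text, sem = item
        return alpha * keyword_overlap(query, text) + (1 - alpha) * sem
    ranked = sorted(candidates, key=combined, reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("remote work flexibility for interns", 0.9),
    ("office parking rules", 0.2),
    ("interns get 5 remote days per month", 0.8),
]
best = hybrid_rerank("remote days for interns", chunks, k=2)
```

&lt;p&gt;The irrelevant parking chunk drops out, and only the strongest candidates reach the LLM.&lt;/p&gt;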




&lt;h2&gt;
  
  
  Problems You Might Face
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lost in the Middle
&lt;/h3&gt;

&lt;p&gt;LLMs focus more on the &lt;strong&gt;start and end&lt;/strong&gt; of the input — often skipping the middle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Move key content to start/end, limit total chunks, and use better re-ranking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: If the answer is in paragraph 3 of 5, reorder the chunk or split it to push the key info up.&lt;/p&gt;
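&lt;p&gt;One common fix is to reorder the retrieved chunks so the strongest ones land at the start and end of the prompt, pushing the weakest into the middle where the model pays least attention. A minimal sketch of that reordering:&lt;/p&gt;

```python
def reorder_for_llm(chunks_best_first):
    """Given chunks sorted most-relevant-first, alternate them between the
    front and the back of the prompt so the best material sits at both ends."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ordered = reorder_for_llm(["best", "2nd", "3rd", "4th", "5th"])
```

&lt;p&gt;After reordering, the top chunk opens the prompt, the second-best closes it, and the weakest sits in the middle.&lt;/p&gt;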




&lt;h3&gt;
  
  
  Wrong Retrieval
&lt;/h3&gt;

&lt;p&gt;Sometimes irrelevant chunks get retrieved — leading to wrong answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve chunking (e.g., avoid breaking sentences)
&lt;/li&gt;
&lt;li&gt;Use better embeddings (dense vs sparse)
&lt;/li&gt;
&lt;li&gt;Add filters to improve search accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: A policy question brings in finance data? You likely need to refine your vector store or chunk size.&lt;/p&gt;
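&lt;p&gt;Metadata filtering is one such fix: narrow the search space &lt;em&gt;before&lt;/em&gt; the vector comparison even runs, so a policy question never sees finance chunks at all. A toy sketch, with a made-up &lt;code&gt;dept&lt;/code&gt; field standing in for whatever metadata your store tracks:&lt;/p&gt;

```python
def filtered_chunks(chunks, **filters):
    """Keep only chunks whose metadata matches every filter key/value pair."""
    def matches(chunk):
        return all(chunk["meta"].get(key) == val for key, val in filters.items())
    return [chunk for chunk in chunks if matches(chunk)]

docs = [
    {"text": "intern leave policy", "meta": {"dept": "hr"}},
    {"text": "Q3 budget summary", "meta": {"dept": "finance"}},
    {"text": "remote work policy", "meta": {"dept": "hr"}},
]
hr_only = filtered_chunks(docs, dept="hr")
```

&lt;p&gt;Most vector databases support this kind of pre-filtering natively, so you rarely have to implement it yourself.&lt;/p&gt;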




&lt;h2&gt;
  
  
  Fine-Tuning vs RAG
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;You &lt;strong&gt;retrain the LLM&lt;/strong&gt; to speak in a specific tone or style.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great for personalization or branding&lt;/li&gt;
&lt;li&gt;Expensive, slow, needs lots of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: You fine-tune a model to sound like Shakespeare.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;p&gt;You don’t touch the model. Just &lt;strong&gt;add your documents&lt;/strong&gt;, and the model uses them for answering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to update&lt;/li&gt;
&lt;li&gt;No retraining needed&lt;/li&gt;
&lt;li&gt;Works out-of-the-box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: You upload HR policies. Now the chatbot answers HR questions instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with RAG&lt;/strong&gt; — fine-tune only if your use case demands personality or tone changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Tools and Full Forms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt;: Retrieval-Augmented Generation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: Large Language Model
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt;: Bidirectional Encoder Representations from Transformers
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: Facebook AI Similarity Search
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TF-IDF&lt;/strong&gt;: Term Frequency-Inverse Document Frequency
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTR&lt;/strong&gt;: Learning to Rank
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLTK&lt;/strong&gt;: Natural Language Toolkit
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG is a &lt;strong&gt;game changer&lt;/strong&gt;. It connects LLMs to real, updated knowledge — making AI assistants smarter and more trustworthy.&lt;/p&gt;

&lt;p&gt;I’m currently preparing for &lt;strong&gt;software engineering interviews&lt;/strong&gt;, and AI is everywhere. I thought, if I’m learning this, why not help others too?&lt;/p&gt;

&lt;p&gt;That’s why I wrote this post: to make RAG simple and useful for anyone interested in AI.&lt;/p&gt;

&lt;p&gt;I'll be posting more content around AI, tools, and interview prep. &lt;strong&gt;Stay connected!&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5 Must-Know RAG Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is Retrieval-Augmented Generation (RAG), and how is it different from traditional LLMs?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the difference between sparse and dense embeddings? When should you use each?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explain the “Lost in the Middle” problem and how to handle it.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do cosine similarity and Euclidean distance help in finding relevant document chunks?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When should you choose fine-tuning over RAG, and what trade-offs come with it?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🖊️ &lt;em&gt;Written by Shaik Salma Aga&lt;/em&gt;  &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
