<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Salma Aga Shaik</title>
    <description>The latest articles on Forem by Salma Aga Shaik (@salma_aga).</description>
    <link>https://forem.com/salma_aga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3211368%2F9cfb9755-78f8-443d-9b01-43b5f8a5bc97.png</url>
      <title>Forem: Salma Aga Shaik</title>
      <link>https://forem.com/salma_aga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/salma_aga"/>
    <language>en</language>
    <item>
      <title>Understand Hadoop and Apache Spark</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:41:49 +0000</pubDate>
      <link>https://forem.com/salma_aga/understand-hadoop-and-apache-spark-f74</link>
      <guid>https://forem.com/salma_aga/understand-hadoop-and-apache-spark-f74</guid>
      <description>&lt;p&gt;Imagine a company that runs a very popular online platform. Every day, millions of users visit the website, make purchases, click on products, and generate application logs. All these activities produce &lt;strong&gt;a very large amount of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After some time, the company collects &lt;strong&gt;terabytes of data&lt;/strong&gt;. This data includes customer transactions, website clicks, machine logs, and system events.&lt;/p&gt;

&lt;p&gt;Now the company wants to &lt;strong&gt;analyze this data&lt;/strong&gt; to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which products are selling the most?&lt;/li&gt;
&lt;li&gt;What time do customers visit the website?&lt;/li&gt;
&lt;li&gt;Are there any system errors?&lt;/li&gt;
&lt;li&gt;How can the company improve its services?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first, the company tries to process the data using &lt;strong&gt;one computer&lt;/strong&gt;, but the data is too large. The computer becomes slow and cannot process the data efficiently.&lt;/p&gt;

&lt;p&gt;To solve this problem, the company decides to use a &lt;strong&gt;distributed system&lt;/strong&gt;, where many machines work together to store and process the data.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Hadoop&lt;/strong&gt; and &lt;strong&gt;Apache Spark&lt;/strong&gt; come into the picture.&lt;/p&gt;




&lt;h1&gt;
  
  
  Hadoop: Storing and Processing Large Data
&lt;/h1&gt;

&lt;p&gt;The company first starts using &lt;strong&gt;Hadoop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is a &lt;strong&gt;big data framework&lt;/strong&gt; that helps companies &lt;strong&gt;store and process large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One important part of Hadoop is &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of storing a large file on one machine, Hadoop &lt;strong&gt;splits the file into smaller blocks&lt;/strong&gt; and stores those blocks across many machines in the cluster. This allows the system to &lt;strong&gt;store huge amounts of data reliably&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop also uses a processing model called &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MapReduce processes the data step by step across the cluster. However, during processing it &lt;strong&gt;writes intermediate data to disk many times&lt;/strong&gt;, which makes the processing slower.&lt;/p&gt;

&lt;p&gt;Hadoop works well for &lt;strong&gt;batch processing&lt;/strong&gt;, where large data is processed in stages.&lt;/p&gt;




&lt;h1&gt;
  
  
  Spark: Faster Data Processing
&lt;/h1&gt;

&lt;p&gt;Later, the company learns about &lt;strong&gt;Apache Spark&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Spark is a &lt;strong&gt;fast distributed data processing engine&lt;/strong&gt; designed to process large datasets quickly.&lt;/p&gt;

&lt;p&gt;Like Hadoop, Spark also processes data across &lt;strong&gt;multiple machines in a cluster&lt;/strong&gt;. However, Spark has a major advantage.&lt;/p&gt;

&lt;p&gt;Spark performs &lt;strong&gt;in-memory computation&lt;/strong&gt;, which means it processes data in &lt;strong&gt;memory (RAM)&lt;/strong&gt; instead of repeatedly writing data to disk.&lt;/p&gt;

&lt;p&gt;Because memory is much faster than disk, Spark can process data &lt;strong&gt;much faster than Hadoop MapReduce&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Spark Works
&lt;/h1&gt;

&lt;p&gt;In a Spark system, many machines work together.&lt;/p&gt;

&lt;p&gt;At the center of the system is the &lt;strong&gt;Driver Program&lt;/strong&gt;. The driver acts like the &lt;strong&gt;manager&lt;/strong&gt; of the Spark application. It starts the job, creates the execution plan, and manages the processing.&lt;/p&gt;

&lt;p&gt;The actual data processing happens in &lt;strong&gt;Executors&lt;/strong&gt;. Executors run on worker machines in the cluster and perform the real computation.&lt;/p&gt;

&lt;p&gt;When a Spark job starts, the driver creates a plan called a &lt;strong&gt;DAG (Directed Acyclic Graph)&lt;/strong&gt;. This plan shows how the data will be processed step by step.&lt;/p&gt;

&lt;p&gt;Spark then divides the job into smaller tasks and sends those tasks to executors. The executors process the data in parallel and return the results to the driver.&lt;/p&gt;
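&lt;p&gt;The driver and executor roles can be imitated in plain Python with a thread pool: one coordinating function splits the data into partitions, workers compute partial results in parallel, and the coordinator merges them. This is only an analogy, not real Spark code:&lt;/p&gt;

```python
# Plain-Python analogy of the Spark driver/executor model (not real Spark:
# real executors are separate processes on worker machines; threads stand
# in for them here).
from concurrent.futures import ThreadPoolExecutor

def count_errors(partition):
    """Task run by a 'worker': count error lines in one partition."""
    return sum(1 for line in partition if "ERROR" in line)

def run_job(log_lines, num_partitions=4):
    """The 'driver': plan partitions, dispatch tasks, merge the results."""
    partitions = [log_lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = pool.map(count_errors, partitions)
    return sum(partial_counts)

logs = ["INFO ok", "ERROR disk", "INFO ok", "ERROR net", "ERROR auth"]
print(run_job(logs))  # 3
```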




&lt;h1&gt;
  
  
  Transformations and Actions in Spark
&lt;/h1&gt;

&lt;p&gt;Spark operations are divided into two types.&lt;/p&gt;

&lt;p&gt;The first type is &lt;strong&gt;Transformations&lt;/strong&gt;. These operations describe how a new dataset should be derived from an existing one, but they do not execute immediately. Examples include filtering rows or selecting columns.&lt;/p&gt;


&lt;p&gt;The second type is &lt;strong&gt;Actions&lt;/strong&gt;. Actions trigger the actual execution of the Spark job. Examples include counting records or saving results.&lt;/p&gt;

&lt;p&gt;Spark waits until an action is called before executing the full computation. This concept is called &lt;strong&gt;lazy evaluation&lt;/strong&gt;, which helps improve performance.&lt;/p&gt;
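&lt;p&gt;Lazy evaluation can be illustrated with plain Python generators, which also describe work without running it until something consumes the result. This is an analogy, not Spark itself:&lt;/p&gt;

```python
# Pure-Python analogy of Spark's lazy evaluation (not real Spark).
# Generator expressions, like transformations, only describe work;
# nothing runs until a terminal operation (the "action") consumes them.
data = [3, 7, 1, 9, 4, 8]

# "Transformations": build a lazy pipeline; no element is processed yet.
filtered = (x for x in data if x > 4)
doubled = (x * 2 for x in filtered)

# "Action": consuming the pipeline triggers the whole computation at once.
result = sum(doubled)
print(result)  # (7 + 9 + 8) * 2 = 48
```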




&lt;h1&gt;
  
  
  Where Spark Is Used
&lt;/h1&gt;

&lt;p&gt;Spark is widely used in &lt;strong&gt;data engineering and analytics pipelines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Data Sources&lt;br&gt;
→ Streaming systems or APIs&lt;br&gt;
→ Spark processing&lt;br&gt;
→ Data lake (Amazon S3 or HDFS)&lt;br&gt;
→ Data warehouse (Redshift or Snowflake)&lt;br&gt;
→ BI tools like Power BI or Tableau&lt;/p&gt;

&lt;p&gt;Spark processes and transforms the data so that companies can analyze it and generate insights.&lt;/p&gt;




&lt;h1&gt;
  
  
  Difference Between Hadoop and Spark
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Hadoop&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it is&lt;/td&gt;
&lt;td&gt;Hadoop is a big data framework used to store and process large data.&lt;/td&gt;
&lt;td&gt;Spark is a fast data processing engine used to process large data quickly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How it processes data&lt;/td&gt;
&lt;td&gt;Hadoop processes data using MapReduce and writes data to disk many times.&lt;/td&gt;
&lt;td&gt;Spark processes data mostly in memory (RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Hadoop is slower because it reads and writes data to disk frequently.&lt;/td&gt;
&lt;td&gt;Spark is faster because it processes data in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;Hadoop is mainly used for storing large data and batch processing.&lt;/td&gt;
&lt;td&gt;Spark is used for fast data processing and analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type of processing&lt;/td&gt;
&lt;td&gt;Hadoop mostly supports batch processing.&lt;/td&gt;
&lt;td&gt;Spark supports batch processing, streaming, machine learning, and SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of coding&lt;/td&gt;
&lt;td&gt;Hadoop MapReduce requires more code and is harder to write.&lt;/td&gt;
&lt;td&gt;Spark is easier to use because it offers high-level APIs in Python, Scala, and Java, as well as SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is used&lt;/td&gt;
&lt;td&gt;Hadoop is often used for distributed storage using HDFS.&lt;/td&gt;
&lt;td&gt;Spark is used for ETL pipelines, real-time analytics, and big data processing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop and Spark are both technologies used to process very large datasets using multiple machines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hadoop is mainly used for &lt;strong&gt;distributed storage and batch processing&lt;/strong&gt;, while Spark is designed for &lt;strong&gt;fast data processing using in-memory computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today, many companies use Spark with cloud platforms such as &lt;strong&gt;AWS EMR, AWS Glue, and Databricks&lt;/strong&gt; to build modern data engineering and analytics systems.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Modern Data Engineering Architecture Across AWS, GCP, and Azure</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:58:47 +0000</pubDate>
      <link>https://forem.com/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</link>
      <guid>https://forem.com/salma_aga/modern-data-engineering-architecture-across-aws-gcp-and-azure-14o3</guid>
      <description>&lt;p&gt;In modern data platforms, organizations build &lt;strong&gt;end-to-end data pipelines&lt;/strong&gt; to &lt;strong&gt;collect, process, store, and analyze large volumes of data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Although different cloud providers offer different services, the &lt;strong&gt;core architecture pattern remains the same&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering architecture&lt;/strong&gt; contains the following stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Generation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Lake Storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL Query Layer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Intelligence Visualization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;End-to-End Data Pipeline Architecture&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrmmg045ub72gtwa1tt.png" alt="Image" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g0itmz2dgxwg1jninzo.png" alt="Image" width="720" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsoa30hqzihlr3qyofnc.png" alt="Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqao5135srzre2tsqzlcp.png" alt="Image" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above represents a &lt;strong&gt;typical enterprise data pipeline architecture&lt;/strong&gt; used by modern companies.&lt;/p&gt;

&lt;p&gt;The goal of this architecture is to move data from &lt;strong&gt;operational systems&lt;/strong&gt; into &lt;strong&gt;analytics platforms&lt;/strong&gt; where it can generate &lt;strong&gt;business insights&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Cloud Data Engineering Architecture Comparison&lt;/strong&gt;
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Architecture Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;What Happens in This Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GCP Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Azure Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Data Sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is generated from &lt;strong&gt;applications, IoT devices, databases, logs, and user transactions&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;RDS databases&lt;/strong&gt;, server logs, IoT sensors&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Cloud SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;td&gt;Applications, &lt;strong&gt;Azure SQL&lt;/strong&gt;, logs, IoT devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Data Ingestion (Streaming)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Real-time data&lt;/strong&gt; is continuously collected and streamed into the data pipeline.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Amazon Kinesis&lt;/strong&gt; or &lt;strong&gt;Managed Kafka (MSK)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Pub/Sub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Event Hubs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Batch Data Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batch data from &lt;strong&gt;files, APIs, or databases&lt;/strong&gt; is periodically ingested.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Dataflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Data Processing (ETL / Big Data Processing)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data is &lt;strong&gt;cleaned, transformed, and enriched&lt;/strong&gt; using distributed processing frameworks.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon EMR running Apache Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dataproc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Databricks&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw and processed data is stored in &lt;strong&gt;scalable object storage systems&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Lake Storage&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Metadata &amp;amp; Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores &lt;strong&gt;metadata information&lt;/strong&gt; such as schema definitions and table structures.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Purview&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. SQL Query Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineers and analysts run &lt;strong&gt;SQL queries on large datasets&lt;/strong&gt; stored in the data lake.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Data Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Processed data is loaded into a &lt;strong&gt;data warehouse optimized for analytics queries&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;9. Workflow Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipelines are &lt;strong&gt;scheduled and automated&lt;/strong&gt; to manage dependencies.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Step Functions / Managed Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Composer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Data Factory Pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10. Monitoring &amp;amp; Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline performance and failures are tracked using &lt;strong&gt;monitoring tools&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Azure Monitor&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Visualization / BI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business teams analyze data using &lt;strong&gt;dashboards and reports&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon QuickSight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Looker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Power BI&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Data Pipeline Flow
&lt;/h1&gt;

&lt;p&gt;A typical &lt;strong&gt;data engineering pipeline&lt;/strong&gt; works like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: Applications, transaction systems, and log systems generate raw data.&lt;br&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: Streaming platforms like Apache Kafka or Amazon Kinesis capture real-time events.&lt;br&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: Processing engines such as Apache Spark perform data cleaning, transformation, and aggregation.&lt;br&gt;
&lt;strong&gt;Data Lake Storage&lt;/strong&gt;: Data is stored in scalable Data Lakes such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage.&lt;br&gt;
&lt;strong&gt;SQL Query Layer&lt;/strong&gt;: Tools like Amazon Athena, BigQuery, or Azure Synapse allow engineers to run SQL queries on big data.&lt;br&gt;
&lt;strong&gt;Data Warehouse Analytics&lt;/strong&gt;: Structured analytics data is stored in Amazon Redshift, BigQuery, or Synapse Analytics.&lt;br&gt;
&lt;strong&gt;BI Dashboards&lt;/strong&gt;: Visualization tools such as Power BI, Looker, or Amazon QuickSight create interactive dashboards and reports.&lt;/p&gt;
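&lt;p&gt;The "Data Processing" stage above can be sketched in plain Python; in a real pipeline the same clean-and-aggregate logic would run on Spark, Dataflow, or Databricks rather than on one machine:&lt;/p&gt;

```python
# Minimal sketch of the "Data Processing" stage (illustrative only;
# a production pipeline runs this logic in a distributed engine).
from collections import defaultdict

def transform(raw_events):
    """Clean raw events: drop malformed rows, normalize fields."""
    cleaned = []
    for e in raw_events:
        if "product" not in e or e.get("amount") is None:
            continue  # discard malformed records
        cleaned.append({"product": e["product"].strip().lower(),
                        "amount": float(e["amount"])})
    return cleaned

def aggregate(cleaned):
    """Aggregate revenue per product, ready for the warehouse layer."""
    totals = defaultdict(float)
    for e in cleaned:
        totals[e["product"]] += e["amount"]
    return dict(totals)

events = [{"product": " Laptop ", "amount": "999.0"},
          {"product": "mouse", "amount": 25.0},
          {"amount": 10.0},                       # malformed: no product
          {"product": "laptop", "amount": 450.0}]
print(aggregate(transform(events)))  # {'laptop': 1449.0, 'mouse': 25.0}
```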

</description>
    </item>
    <item>
      <title>Understanding Hadoop Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 15 Mar 2026 15:52:27 +0000</pubDate>
      <link>https://forem.com/salma_aga/understanding-hadoop-architecture-16al</link>
      <guid>https://forem.com/salma_aga/understanding-hadoop-architecture-16al</guid>
      <description>&lt;p&gt;Imagine a company that collects a &lt;strong&gt;large amount of data&lt;/strong&gt; every day, such as &lt;strong&gt;website logs, transactions, or user activity&lt;/strong&gt;. After some time, the data becomes &lt;strong&gt;too large for a single computer&lt;/strong&gt; to store and process. This is where &lt;strong&gt;Hadoop&lt;/strong&gt; helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hadoop&lt;/strong&gt; is a &lt;strong&gt;big data framework&lt;/strong&gt; designed to &lt;strong&gt;store and process very large datasets across many machines&lt;/strong&gt;. Instead of using one powerful computer, Hadoop uses &lt;strong&gt;multiple machines working together&lt;/strong&gt;, which are called &lt;strong&gt;nodes&lt;/strong&gt;. These machines together form a &lt;strong&gt;Hadoop cluster&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Storage Layer — &lt;strong&gt;HDFS&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The storage system used by Hadoop is called &lt;strong&gt;HDFS (Hadoop Distributed File System)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a &lt;strong&gt;large file&lt;/strong&gt; is stored in Hadoop, it is automatically &lt;strong&gt;split into smaller pieces called blocks&lt;/strong&gt;. These &lt;strong&gt;blocks&lt;/strong&gt; are then &lt;strong&gt;distributed across multiple machines&lt;/strong&gt; in the cluster.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Block 1 → Machine 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 2 → Machine 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block 3 → Machine 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is called &lt;strong&gt;distributed storage&lt;/strong&gt;, and it allows Hadoop to store &lt;strong&gt;very large datasets efficiently&lt;/strong&gt;.&lt;/p&gt;
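&lt;p&gt;A toy simulation of this block placement is shown below. It is illustrative only: real HDFS defaults to 128 MB blocks and also replicates each block, typically three times, for fault tolerance:&lt;/p&gt;

```python
# Toy simulation of HDFS block placement (no replication modeled here).
def place_blocks(file_size_mb, block_size_mb=128,
                 datanodes=("dn1", "dn2", "dn3")):
    """Split a file into blocks and assign them round-robin to DataNodes."""
    placement = []
    remaining, block_id = file_size_mb, 0
    while remaining > 0:
        size = min(block_size_mb, remaining)  # last block may be smaller
        node = datanodes[block_id % len(datanodes)]
        placement.append((block_id, size, node))
        remaining -= size
        block_id += 1
    return placement

for block_id, size_mb, node in place_blocks(300):
    print(f"Block {block_id} ({size_mb} MB) -> {node}")
# Block 0 (128 MB) -> dn1
# Block 1 (128 MB) -> dn2
# Block 2 (44 MB) -> dn3
```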




&lt;h1&gt;
  
  
  Hadoop Nodes
&lt;/h1&gt;

&lt;p&gt;In a Hadoop cluster, there are two important types of nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;NameNode (Master Node)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;NameNode&lt;/strong&gt; acts like the &lt;strong&gt;manager of the system&lt;/strong&gt;. It stores &lt;strong&gt;metadata&lt;/strong&gt;, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;file names&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;block locations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;which machine stores each block&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;NameNode does not store actual data&lt;/strong&gt;. It only &lt;strong&gt;manages the file system and keeps track of the data&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;DataNodes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DataNodes&lt;/strong&gt; are the machines that &lt;strong&gt;store the actual data blocks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DataNode 1 → Block 1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 2 → Block 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataNode 3 → Block 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is known as the &lt;strong&gt;storage layer&lt;/strong&gt; of Hadoop.&lt;/p&gt;




&lt;h1&gt;
  
  
  Processing Layer — &lt;strong&gt;MapReduce&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;After the data is stored, Hadoop needs to &lt;strong&gt;process the data&lt;/strong&gt;. This is done using &lt;strong&gt;MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; is a &lt;strong&gt;distributed data processing framework&lt;/strong&gt; that works in two main steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Map Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Map phase&lt;/strong&gt;, a &lt;strong&gt;large task is divided into smaller tasks&lt;/strong&gt;.&lt;br&gt;
Each machine processes a &lt;strong&gt;small part of the data&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reduce Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Reduce phase&lt;/strong&gt;, the &lt;strong&gt;results from all machines are combined&lt;/strong&gt; to produce the &lt;strong&gt;final output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This process allows Hadoop to &lt;strong&gt;process huge datasets&lt;/strong&gt; by using &lt;strong&gt;parallel processing&lt;/strong&gt; across many machines.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;Imagine a company wants to analyze &lt;strong&gt;millions of website log records&lt;/strong&gt; to see &lt;strong&gt;how many users visited from each country&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;log data is stored in HDFS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hadoop &lt;strong&gt;splits the logs into blocks&lt;/strong&gt; and stores them across &lt;strong&gt;multiple DataNodes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MapReduce processes the data in parallel&lt;/strong&gt; on different machines.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Reduce phase combines the results&lt;/strong&gt; and generates the &lt;strong&gt;final report&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
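&lt;p&gt;The four steps above can be sketched in pure Python; real Hadoop runs the map and reduce tasks on different machines, with a shuffle step in between:&lt;/p&gt;

```python
# Pure-Python sketch of the MapReduce flow for the country-count example
# (not real Hadoop: all three phases run locally here).
from collections import defaultdict

def map_phase(log_records):
    """Map: emit a (country, 1) pair for each log record."""
    return [(record["country"], 1) for record in log_records]

def shuffle(pairs):
    """Shuffle: group all values by key before reducing."""
    groups = defaultdict(list)
    for country, count in pairs:
        groups[country].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: combine per-key values into the final counts."""
    return {country: sum(counts) for country, counts in groups.items()}

logs = [{"country": "IN"}, {"country": "US"}, {"country": "IN"},
        {"country": "UK"}, {"country": "IN"}, {"country": "US"}]
print(reduce_phase(shuffle(map_phase(logs))))  # {'IN': 3, 'US': 2, 'UK': 1}
```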




&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn82eqotbsoliwv04ih1m.png" alt="Hadoop architecture diagram" width="800" height="533"&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hadoop architecture&lt;/strong&gt; works with two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HDFS → for distributed storage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MapReduce → for distributed processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By &lt;strong&gt;splitting data across multiple machines and processing it in parallel&lt;/strong&gt;, Hadoop allows organizations to &lt;strong&gt;store and analyze massive datasets efficiently&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>AWS S3 Storage Classes (Start to End)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 19:34:16 +0000</pubDate>
      <link>https://forem.com/salma_aga/aws-s3-storage-classes-start-to-end-258c</link>
      <guid>https://forem.com/salma_aga/aws-s3-storage-classes-start-to-end-258c</guid>
      <description>&lt;h2&gt;
  
  
  1) What is &lt;strong&gt;Amazon S3&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is an AWS service used to store files like &lt;strong&gt;images, videos, logs, backups, datasets, and reports&lt;/strong&gt; as &lt;strong&gt;objects&lt;/strong&gt; inside &lt;strong&gt;buckets&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; = main container (like a top-level folder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; = the actual file (data + metadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 is widely used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data lakes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups and disaster recovery&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application logs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Static website files&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics and machine learning datasets&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term archiving and compliance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2) Why does S3 have &lt;strong&gt;multiple storage classes&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;Not all data is used in the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some data is used &lt;strong&gt;daily&lt;/strong&gt; (hot data)&lt;/li&gt;
&lt;li&gt;Some data is used &lt;strong&gt;sometimes&lt;/strong&gt; (cold data)&lt;/li&gt;
&lt;li&gt;Some data is &lt;strong&gt;almost never&lt;/strong&gt; used (archive data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So AWS provides different &lt;strong&gt;S3 storage classes&lt;/strong&gt; to help you balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; – how much you pay for storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; – how fast you can read data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt; – how often data is accessible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt; – multi-AZ vs single-AZ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee&lt;/strong&gt; – extra cost when you download data in some classes&lt;/li&gt;
&lt;/ul&gt;
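&lt;p&gt;A hypothetical rule of thumb for this trade-off can be written as a small function. The thresholds below are illustrative, not official AWS guidance, though the class names are the real S3 API values:&lt;/p&gt;

```python
# Hypothetical storage-class picker; thresholds are made up for
# illustration, but the returned names match S3's StorageClass values.
def suggest_storage_class(accesses_per_month, pattern_known=True):
    if not pattern_known:
        return "INTELLIGENT_TIERING"   # let AWS move objects between tiers
    if accesses_per_month >= 10:
        return "STANDARD"              # hot data: fast, no retrieval fee
    if accesses_per_month >= 1:
        return "STANDARD_IA"           # cold, but needed quickly sometimes
    return "DEEP_ARCHIVE"              # almost never read: cheapest storage

print(suggest_storage_class(30))                      # STANDARD
print(suggest_storage_class(2))                       # STANDARD_IA
print(suggest_storage_class(0, pattern_known=False))  # INTELLIGENT_TIERING
```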




&lt;h2&gt;
  
  
  3) Key Terms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Simple Meaning&lt;/th&gt;
&lt;th&gt;Easy Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How safe your data is from being lost&lt;/td&gt;
&lt;td&gt;Even if disks fail, AWS still keeps your file safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11 nines durability (99.999999999%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely high safety&lt;/td&gt;
&lt;td&gt;“Almost never lost”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How often data is accessible&lt;/td&gt;
&lt;td&gt;99.99% means very little downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast you can access data&lt;/td&gt;
&lt;td&gt;Milliseconds = very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much data can be read/written per second&lt;/td&gt;
&lt;td&gt;Important for big analytics jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval fee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extra cost when you download data&lt;/td&gt;
&lt;td&gt;Some classes charge when you read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability Zone (AZ)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One data center inside a region&lt;/td&gt;
&lt;td&gt;Multi-AZ is safer than single AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4) The 8 S3 Storage Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1  &lt;strong&gt;S3 Standard&lt;/strong&gt; – Hot Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Used for &lt;strong&gt;frequently accessed&lt;/strong&gt; and &lt;strong&gt;business-critical&lt;/strong&gt; data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very fast access (milliseconds):&lt;/strong&gt; Suitable for real-time applications and user-facing systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability:&lt;/strong&gt; Designed to be available almost all the time for applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Data is safely stored across multiple data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No retrieval fee:&lt;/strong&gt; You don’t pay extra when reading or downloading data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website images and videos served to users&lt;/li&gt;
&lt;li&gt;Daily application logs used by engineers&lt;/li&gt;
&lt;li&gt;Active analytics datasets queried many times per day&lt;/li&gt;
&lt;li&gt;Frequently used ML training and inference data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Today’s sales data used every hour → &lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Standard = &lt;strong&gt;Hot + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.2  &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt; – AWS Decides Automatically
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data where you &lt;strong&gt;don’t know&lt;/strong&gt; how often it will be accessed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic movement between tiers:&lt;/strong&gt; AWS moves objects to cheaper tiers when access reduces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No performance impact:&lt;/strong&gt; Applications access data the same way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small monitoring fee:&lt;/strong&gt; Charged for AWS to track access patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes where new data is hot and old data becomes cold&lt;/li&gt;
&lt;li&gt;ML datasets where some features are used more than others&lt;/li&gt;
&lt;li&gt;Analytics history that changes in access frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Some months of logs are queried often, others not → &lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Intelligent = &lt;strong&gt;“I don’t know access pattern”&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.3  &lt;strong&gt;S3 Standard-IA&lt;/strong&gt; – Cold but Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For data accessed &lt;strong&gt;rarely&lt;/strong&gt;, but must be accessed &lt;strong&gt;immediately&lt;/strong&gt; when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower storage cost than Standard:&lt;/strong&gt; Helps save money for infrequently used data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still milliseconds when you retrieve data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval fee applies:&lt;/strong&gt; Extra cost when you download data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ durability:&lt;/strong&gt; Safe across multiple data centers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups used only during failures&lt;/li&gt;
&lt;li&gt;Disaster recovery data&lt;/li&gt;
&lt;li&gt;Old reports accessed occasionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Weekly backups restored only during failure → &lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; IA = &lt;strong&gt;Rare, but fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.4  &lt;strong&gt;S3 One Zone-IA&lt;/strong&gt; – Cheaper but Risky
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Same as Standard-IA, but stored in &lt;strong&gt;one Availability Zone only&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheaper than Standard-IA:&lt;/strong&gt; Cost saving for non-critical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single AZ risk:&lt;/strong&gt; If that AZ goes down, data can become unavailable, and if the AZ is destroyed, it can be lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast access:&lt;/strong&gt; Still millisecond latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval fee applies.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-creatable ETL outputs&lt;/li&gt;
&lt;li&gt;Temporary pipeline files&lt;/li&gt;
&lt;li&gt;Secondary backups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Temporary pipeline files → &lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; One Zone = &lt;strong&gt;Cheap + Risk&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.5 &lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt; – Archive + Fast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Instant Retrieval is a storage class for archived data that is rarely accessed, but when you need it, you can open it immediately. It is mainly used for long-term storage where data is kept for compliance or record-keeping, but still needs instant access sometimes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low storage cost&lt;/li&gt;
&lt;li&gt;Instant (milliseconds) access&lt;/li&gt;
&lt;li&gt;Retrieval fee applies&lt;/li&gt;
&lt;li&gt;Multi-AZ durability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance documents that must open quickly&lt;/li&gt;
&lt;li&gt;Audit logs needed during investigations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Legal docs opened only during audits → &lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Glacier Instant = &lt;strong&gt;Archive + Fast&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.6 &lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt; – Archive + Wait
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Flexible Retrieval is used for archived data that is almost never accessed, and when it is accessed, you are okay to wait some time before getting the data back. This class is mainly for long-term backups and historical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low cost for long-term storage&lt;/li&gt;
&lt;li&gt;Multiple retrieval speeds: expedited, standard, bulk&lt;/li&gt;
&lt;li&gt;Suitable for large archive restores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old backups&lt;/li&gt;
&lt;li&gt;Historical logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Flexible = &lt;strong&gt;Waiting is okay&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.7 &lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt; – Cheapest + Slowest
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Glacier Deep Archive is the lowest-cost storage class in Amazon S3. It is used for data that must be kept for many years and is almost never accessed. This is mainly for legal, regulatory, and compliance requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheapest storage class&lt;/li&gt;
&lt;li&gt;Retrieval time 12–48 hours&lt;/li&gt;
&lt;li&gt;Best for compliance and legal retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial records&lt;/li&gt;
&lt;li&gt;Government data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Deep Archive = &lt;strong&gt;Coldest + Slowest + Cheapest&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4.8 &lt;strong&gt;S3 Express One Zone&lt;/strong&gt; – Extra Fast, Single AZ
&lt;/h3&gt;

&lt;p&gt;S3 Express One Zone is a storage class designed for very high-performance workloads. It is used when applications need very low latency and very high request rates for reading and writing data. Data is stored in only one Availability Zone, so it is faster but less resilient compared to multi-AZ classes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ultra-fast performance for request-heavy workloads&lt;/li&gt;
&lt;li&gt;High throughput for many small reads/writes&lt;/li&gt;
&lt;li&gt;Stored in one AZ only (less resilient)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time analytics&lt;/li&gt;
&lt;li&gt;ML feature stores&lt;/li&gt;
&lt;li&gt;Hot ETL intermediate data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Pipeline reading millions of small files → &lt;strong&gt;Express One Zone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Express = &lt;strong&gt;Extra fast&lt;/strong&gt;, One Zone = &lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5) Comparison Table for All 8 S3 Storage Classes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Class&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Retrieval Speed&lt;/th&gt;
&lt;th&gt;Storage Cost&lt;/th&gt;
&lt;th&gt;Extra Cost&lt;/th&gt;
&lt;th&gt;Availability / Risk&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequently accessed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Multi-AZ, very safe&lt;/td&gt;
&lt;td&gt;Hot data, websites, active logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown / changing&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Monitoring fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Unpredictable workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent, fast access needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Backups, DR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrequent, non-critical&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Cheaper&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Single AZ risk&lt;/td&gt;
&lt;td&gt;Re-creatable data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rare, instant access needed&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Compliance archives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Flexible Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very rare access&lt;/td&gt;
&lt;td&gt;Minutes → Hours&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Old backups, logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Almost never accessed&lt;/td&gt;
&lt;td&gt;12–48 hours&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Retrieval fee&lt;/td&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;Legal &amp;amp; long-term records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very frequent, high-performance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ultra-fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Request-based pricing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Single AZ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance analytics, ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
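&lt;p&gt;In code, each class in the table above corresponds to a &lt;strong&gt;StorageClass&lt;/strong&gt; value in the S3 API. The mapping below is a reference sketch; the commented boto3 upload uses a hypothetical bucket and key, and would need AWS credentials to actually run.&lt;/p&gt;

```python
# Map each storage class (as named in the table above) to the
# StorageClass string used by the S3 API and accepted by boto3.
STORAGE_CLASS_API_VALUES = {
    "S3 Standard": "STANDARD",
    "S3 Intelligent-Tiering": "INTELLIGENT_TIERING",
    "S3 Standard-IA": "STANDARD_IA",
    "S3 One Zone-IA": "ONEZONE_IA",
    "S3 Glacier Instant Retrieval": "GLACIER_IR",
    "S3 Glacier Flexible Retrieval": "GLACIER",
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
    "S3 Express One Zone": "EXPRESS_ONEZONE",
}

# Sketch only: uploading a weekly backup as Standard-IA.
# "my-backup-bucket" and the key are hypothetical names.
# import boto3
# s3 = boto3.client("s3")
# with open("week-07.tar.gz", "rb") as body:
#     s3.put_object(
#         Bucket="my-backup-bucket",
#         Key="backups/week-07.tar.gz",
#         Body=body,
#         StorageClass=STORAGE_CLASS_API_VALUES["S3 Standard-IA"],
#     )
```

&lt;p&gt;If no &lt;strong&gt;StorageClass&lt;/strong&gt; is given, uploads default to S3 Standard.&lt;/p&gt;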




&lt;h2&gt;
  
  
  6) How to Choose Quickly
&lt;/h2&gt;

&lt;p&gt;Ask yourself these &lt;strong&gt;3 simple questions&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  i) How often will the data be accessed?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily or many times a day&lt;/strong&gt; → &lt;strong&gt;S3 Standard&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not sure / changes over time&lt;/strong&gt; → &lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rarely&lt;/strong&gt; → Use IA or Glacier classes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ii) When needed, how fast must I get the data?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant (milliseconds)&lt;/strong&gt; → Standard, Standard-IA, Glacier Instant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait minutes or hours&lt;/strong&gt; → Glacier Flexible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait 1–2 days&lt;/strong&gt; → Glacier Deep Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  iii) Is the data critical or can it be recreated?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical data&lt;/strong&gt; → Choose &lt;strong&gt;multi-AZ classes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-critical or re-creatable data&lt;/strong&gt; → Choose &lt;strong&gt;single-AZ classes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
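&lt;p&gt;The three questions above can be turned into a tiny decision helper. This is an illustrative sketch, not an AWS API; the function name and its inputs are assumptions made for this example.&lt;/p&gt;

```python
def choose_storage_class(access, retrieval_wait="instant", critical=True):
    """Pick an S3 storage class from the three questions above.

    access: "frequent", "unknown", or "rare"
    retrieval_wait: "instant", "hours", or "days" (matters only for rare data)
    critical: False means the data is re-creatable and may live in one AZ
    """
    # Question 1: how often is the data accessed?
    if access == "frequent":
        return "S3 Standard"
    if access == "unknown":
        return "S3 Intelligent-Tiering"
    # Question 2: for rare data, how long can retrieval wait?
    if retrieval_wait == "days":
        return "S3 Glacier Deep Archive"
    if retrieval_wait == "hours":
        return "S3 Glacier Flexible Retrieval"
    # Question 3: critical data stays multi-AZ; re-creatable data can be single-AZ.
    return "S3 Standard-IA" if critical else "S3 One Zone-IA"

print(choose_storage_class("rare", "days"))  # S3 Glacier Deep Archive
```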

&lt;h3&gt;
  
  
  Quick Mapping Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;App serving images every second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Standard&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs with changing access patterns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Intelligent-Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weekly backups&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Standard-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporary ETL output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;One Zone-IA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance docs needing instant access&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Instant&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large archive restores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Flexible&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-year legal retention&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Glacier Deep Archive&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-performance ML feature reads&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  7) How to Remember
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot&lt;/strong&gt; → Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unknown&lt;/strong&gt; → Intelligent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold&lt;/strong&gt; → IA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very Cold&lt;/strong&gt; → Glacier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coldest&lt;/strong&gt; → Deep Archive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-fast hot data&lt;/strong&gt; → Express One Zone&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8) What is Amazon S3 and What is a Bucket?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt; is a &lt;strong&gt;cloud storage service&lt;/strong&gt; provided by &lt;strong&gt;AWS&lt;/strong&gt;. It is used to store &lt;strong&gt;files and data&lt;/strong&gt; such as &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;backups&lt;/strong&gt;, &lt;strong&gt;datasets&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;Amazon S3 bucket&lt;/strong&gt; is the &lt;strong&gt;main container&lt;/strong&gt; where all your &lt;strong&gt;files (objects)&lt;/strong&gt; are stored. You cannot upload a file directly to S3 without a bucket. Every file must be inside a bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket&lt;/strong&gt; is like a &lt;strong&gt;main folder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object&lt;/strong&gt; is like a &lt;strong&gt;file inside the folder&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You create a bucket named &lt;strong&gt;company-data-bucket&lt;/strong&gt;.&lt;br&gt;
Inside this bucket, you store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;logs/app-logs-2026.json&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reports/sales-jan.csv&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;images/profile.png&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, &lt;strong&gt;company-data-bucket&lt;/strong&gt; is the &lt;strong&gt;bucket&lt;/strong&gt;, and each file is an &lt;strong&gt;object&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9) Basic Structure of Amazon S3
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning in Simple Words&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;top-level container&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;company-analytics-bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Object&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;actual file&lt;/strong&gt; stored&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;full path&lt;/strong&gt; of the file inside the bucket&lt;/td&gt;
&lt;td&gt;2026/jan/sales.csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Region&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;AWS location&lt;/strong&gt; where the bucket lives&lt;/td&gt;
&lt;td&gt;us-east-1, ap-south-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Important points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;bucket belongs to one AWS region&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your &lt;strong&gt;data is physically stored&lt;/strong&gt; in that region&lt;/li&gt;
&lt;li&gt;You can access the bucket from anywhere if &lt;strong&gt;permissions&lt;/strong&gt; allow it&lt;/li&gt;
&lt;/ul&gt;
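&lt;p&gt;Bucket, region, and key combine into the object's web address. A minimal sketch of the virtual-hosted-style URL (the function name is an assumption for this example):&lt;/p&gt;

```python
def object_url(bucket, region, key):
    """Build the virtual-hosted-style URL for an S3 object."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

print(object_url("company-analytics-bucket", "us-east-1", "2026/jan/sales.csv"))
# https://company-analytics-bucket.s3.us-east-1.amazonaws.com/2026/jan/sales.csv
```

&lt;p&gt;Whether the URL actually works depends on the bucket's &lt;strong&gt;permissions&lt;/strong&gt;, as noted above.&lt;/p&gt;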




&lt;h2&gt;
  
  
  10) Why Do We Need Amazon S3 Buckets?
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are used to store and manage &lt;strong&gt;almost all types of data&lt;/strong&gt; in the cloud.&lt;/p&gt;

&lt;p&gt;Common real-world use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data lakes&lt;/strong&gt; Store &lt;strong&gt;raw data&lt;/strong&gt;, &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;CSV&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, and &lt;strong&gt;Parquet&lt;/strong&gt; files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backups&lt;/strong&gt; Store &lt;strong&gt;database backups&lt;/strong&gt;, &lt;strong&gt;server backups&lt;/strong&gt;, and &lt;strong&gt;application backups&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application files&lt;/strong&gt; Store &lt;strong&gt;images&lt;/strong&gt;, &lt;strong&gt;videos&lt;/strong&gt;, and &lt;strong&gt;documents&lt;/strong&gt; used by web and mobile apps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytics and Big Data&lt;/strong&gt; Store data for &lt;strong&gt;Athena&lt;/strong&gt;, &lt;strong&gt;Glue&lt;/strong&gt;, &lt;strong&gt;EMR&lt;/strong&gt;, and &lt;strong&gt;Redshift Spectrum&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static website hosting&lt;/strong&gt; Store &lt;strong&gt;HTML&lt;/strong&gt;, &lt;strong&gt;CSS&lt;/strong&gt;, and &lt;strong&gt;JavaScript&lt;/strong&gt; files for static websites&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, &lt;strong&gt;Amazon S3 buckets&lt;/strong&gt; are the &lt;strong&gt;foundation of data storage&lt;/strong&gt; in AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  11) Amazon S3 Bucket Naming Rules
&lt;/h2&gt;

&lt;p&gt;S3 bucket names follow &lt;strong&gt;strict global rules&lt;/strong&gt;. These rules exist because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt; and must work with the &lt;strong&gt;internet DNS system&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 1: Globally Unique Name
&lt;/h3&gt;

&lt;p&gt;Every &lt;strong&gt;bucket name must be globally unique&lt;/strong&gt; across all AWS accounts and regions. If someone else has already created a bucket with that name, you cannot use it.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mybucket may already be taken&lt;/li&gt;
&lt;li&gt;mycompany-analytics-2026 is more likely to be available&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 2: Length Rules
&lt;/h3&gt;

&lt;p&gt;Bucket name length must be between &lt;strong&gt;3 and 63 characters&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 3: Allowed Characters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can use only: lowercase letters (a–z), numbers (0–9), hyphens, and dots&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You cannot use: uppercase letters, underscores, spaces, or special characters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Valid examples: my-data-bucket, company.logs.backup, analytics2026&lt;/p&gt;

&lt;p&gt;Invalid examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MyBucket&lt;/li&gt;
&lt;li&gt;my_bucket&lt;/li&gt;
&lt;li&gt;my bucket&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule 4: Start and End with Letter or Number
&lt;/h3&gt;

&lt;p&gt;A bucket name must &lt;strong&gt;start and end with a letter or number&lt;/strong&gt;. It cannot start or end with a hyphen or dot.&lt;/p&gt;




&lt;h3&gt;
  
  
  Rule 5: No IP Address Format
&lt;/h3&gt;

&lt;p&gt;Bucket names cannot look like an &lt;strong&gt;IP address&lt;/strong&gt; such as 192.168.1.1. This is because bucket names are used in &lt;strong&gt;URLs&lt;/strong&gt;.&lt;/p&gt;
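&lt;p&gt;Rules 2–5 can be checked locally before calling AWS (Rule 1, global uniqueness, can only be verified by AWS itself). A minimal sketch:&lt;/p&gt;

```python
import re

def is_valid_bucket_name(name):
    """Check Rules 2-5 for an S3 bucket name (Rule 1 needs AWS)."""
    # Rule 2: length must be between 3 and 63 characters.
    if len(name) not in range(3, 64):
        return False
    # Rule 3: only lowercase letters, digits, hyphens, and dots.
    if not re.fullmatch(r"[a-z0-9.-]+", name):
        return False
    # Rule 4: must start and end with a letter or number.
    if not (name[0].isalnum() and name[-1].isalnum()):
        return False
    # Rule 5: must not look like an IP address such as 192.168.1.1.
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", name):
        return False
    return True

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("MyBucket"))        # False
```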




&lt;h2&gt;
  
  
  12) Why These Rules Exist
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets are accessed using &lt;strong&gt;web URLs&lt;/strong&gt; like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://my-data-bucket.s3.amazonaws.com/file.csv" rel="noopener noreferrer"&gt;https://my-data-bucket.s3.amazonaws.com/file.csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make sure these URLs work correctly with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internet routing&lt;/li&gt;
&lt;li&gt;the DNS system&lt;/li&gt;
&lt;li&gt;SSL certificates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS enforces strict &lt;strong&gt;bucket naming rules&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  13) Important Features of Amazon S3 Buckets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Region
&lt;/h3&gt;

&lt;p&gt;When you create a &lt;strong&gt;bucket&lt;/strong&gt;, you select a &lt;strong&gt;region&lt;/strong&gt;. Your &lt;strong&gt;data stays in that region&lt;/strong&gt;. This helps with &lt;strong&gt;low latency&lt;/strong&gt;, &lt;strong&gt;cost control&lt;/strong&gt;, and &lt;strong&gt;legal compliance&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;p&gt;By default, &lt;strong&gt;buckets are private&lt;/strong&gt;. You control access using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM users and roles&lt;/li&gt;
&lt;li&gt;Bucket policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Public access is usually used only for &lt;strong&gt;public website content&lt;/strong&gt;.&lt;/p&gt;
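&lt;p&gt;As a sketch, a minimal bucket policy that allows public read of website content might look like this. The bucket name &lt;strong&gt;my-public-site&lt;/strong&gt; is hypothetical, and this pattern should only be used for genuinely public content:&lt;/p&gt;

```python
import json

# A standard public-read bucket policy: anyone may GET objects,
# but nothing else (no listing, writing, or deleting).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-public-site/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```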




&lt;h3&gt;
  
  
  Versioning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Versioning&lt;/strong&gt; keeps &lt;strong&gt;multiple versions&lt;/strong&gt; of the same file. If someone overwrites or deletes a file, older versions are still stored. This helps with &lt;strong&gt;data recovery&lt;/strong&gt; and &lt;strong&gt;mistake protection&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Encryption
&lt;/h3&gt;

&lt;p&gt;Amazon S3 supports &lt;strong&gt;encryption&lt;/strong&gt; to protect your data. Data can be encrypted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at rest, in transit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption is important for &lt;strong&gt;security&lt;/strong&gt; and &lt;strong&gt;compliance requirements&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Lifecycle Rules
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lifecycle rules&lt;/strong&gt; help you &lt;strong&gt;automate storage management&lt;/strong&gt;. You can move old data to &lt;strong&gt;cheaper storage classes&lt;/strong&gt; or &lt;strong&gt;delete data&lt;/strong&gt; after a fixed time. This helps reduce &lt;strong&gt;storage cost&lt;/strong&gt; automatically.&lt;/p&gt;
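&lt;p&gt;A lifecycle configuration is expressed as a set of rules. The sketch below shows the structure that boto3's put_bucket_lifecycle_configuration expects; the &lt;strong&gt;logs/&lt;/strong&gt; prefix and the day counts are illustrative assumptions:&lt;/p&gt;

```python
# Move logs to Standard-IA after 30 days, to Glacier Flexible
# Retrieval after 90 days, and delete them after one year.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="company-raw-logs", LifecycleConfiguration=lifecycle)
```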




&lt;h2&gt;
  
  
  14) Real-Life Example from Data Engineering
&lt;/h2&gt;

&lt;p&gt;In a real data engineering project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New logs come &lt;strong&gt;every day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Old logs are accessed &lt;strong&gt;rarely&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compliance rules require keeping data for &lt;strong&gt;many years&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may create different buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;company-raw-logs&lt;/strong&gt; for daily logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-processed-data&lt;/strong&gt; for transformed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company-archive-data&lt;/strong&gt; for long-term storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lifecycle rules can move old files automatically to &lt;strong&gt;cheaper storage classes&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  15) How to Remember Amazon S3 Bucket Rules
&lt;/h2&gt;

&lt;p&gt;Use the word &lt;strong&gt;BUCKET&lt;/strong&gt; as a memory trick:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; means Bucket is the main container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;U&lt;/strong&gt; means Unique globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt; means Characters allowed are lowercase letters, numbers, hyphens, and dots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K&lt;/strong&gt; means Keep name length between 3 and 63&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt; means End with a letter or number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt; means Tied to one AWS region&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Engineering Basics: From What is Data to Modern Lakehouse Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 05:19:08 +0000</pubDate>
      <link>https://forem.com/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</link>
      <guid>https://forem.com/salma_aga/-data-engineering-basics-from-what-is-data-to-modern-lakehouse-architecture-1l10</guid>
      <description>&lt;p&gt;This post explains &lt;strong&gt;data fundamentals&lt;/strong&gt;, &lt;strong&gt;databases&lt;/strong&gt;, &lt;strong&gt;data warehousing&lt;/strong&gt;, &lt;strong&gt;data lakes&lt;/strong&gt;, and &lt;strong&gt;modern lakehouse architecture&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is &lt;strong&gt;Data&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt; is &lt;strong&gt;raw facts or raw information&lt;/strong&gt; collected from &lt;strong&gt;applications, users, and machines&lt;/strong&gt;. On its own, data has little meaning. When we &lt;strong&gt;process, clean, and analyze data&lt;/strong&gt;, it becomes &lt;strong&gt;useful information&lt;/strong&gt; for &lt;strong&gt;business decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Customer name&lt;/strong&gt;, &lt;strong&gt;email&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order amount&lt;/strong&gt;, &lt;strong&gt;order time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Website clicks&lt;/strong&gt;, &lt;strong&gt;error logs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sensor readings&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of &lt;strong&gt;Data&lt;/strong&gt; (Structured, Semi-Structured, Unstructured)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Semi-Structured Data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Unstructured Data&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it means&lt;/td&gt;
&lt;td&gt;Data stored in &lt;strong&gt;rows and columns&lt;/strong&gt; with a &lt;strong&gt;fixed schema&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;some structure&lt;/strong&gt; (keys/tags), but no fixed table schema.&lt;/td&gt;
&lt;td&gt;Data with &lt;strong&gt;no predefined structure&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where it is stored&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Relational Databases&lt;/strong&gt;, &lt;strong&gt;Data Warehouses&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Data Lakes&lt;/strong&gt;, modern warehouses&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Object storage&lt;/strong&gt;, file systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How easy to query&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Very easy&lt;/strong&gt; with &lt;strong&gt;SQL&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;parsing/flattening&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;preprocessing/AI-ML&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Customer table, Orders table&lt;/td&gt;
&lt;td&gt;JSON from APIs, Web logs, Avro/Parquet&lt;/td&gt;
&lt;td&gt;Images, videos, PDFs, emails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Databases and Data Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt; are systems used to &lt;strong&gt;store and manage structured data&lt;/strong&gt; for applications.&lt;br&gt;
&lt;strong&gt;Data storage&lt;/strong&gt; includes databases plus &lt;strong&gt;file systems&lt;/strong&gt; and &lt;strong&gt;cloud object storage&lt;/strong&gt; (for raw files).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databases:&lt;/strong&gt; PostgreSQL, MySQL, Oracle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage:&lt;/strong&gt; AWS S3, Azure ADLS, Google GCS&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;SQL &amp;amp; Relational Databases&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is &lt;strong&gt;SQL&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt; is used to &lt;strong&gt;read and write data&lt;/strong&gt; in &lt;strong&gt;relational databases&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is a &lt;strong&gt;Relational Database (RDBMS)&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;RDBMS&lt;/strong&gt; stores data in &lt;strong&gt;tables with relationships&lt;/strong&gt; (primary keys and foreign keys).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;, MySQL, SQL Server&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;DDL&lt;/strong&gt; vs &lt;strong&gt;DML&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDL (Data Definition Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines or changes &lt;strong&gt;table structure&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;CREATE, ALTER, DROP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DML (Data Manipulation Language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads and modifies &lt;strong&gt;data inside tables&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;INSERT, UPDATE, DELETE, SELECT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example (DDL):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example (DML):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Salma'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;OLTP&lt;/strong&gt; vs &lt;strong&gt;OLAP&lt;/strong&gt; (Databases vs Analytics)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main purpose&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;daily transactions&lt;/strong&gt; for apps&lt;/td&gt;
&lt;td&gt;Run &lt;strong&gt;analytics and reporting&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Many small, fast writes/reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Large scans and aggregations&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Current operational data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Historical, aggregated data&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;Snowflake, BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Placing an order&lt;/td&gt;
&lt;td&gt;Yearly sales analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
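&lt;p&gt;The difference in query patterns can be sketched with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and a toy &lt;code&gt;orders&lt;/code&gt; table (the table and data here are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# Toy "orders" table to contrast the two query patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, customer TEXT, amount REAL, year INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Salma", 120.0, 2025), (2, "Ali", 80.0, 2025), (3, "Salma", 50.0, 2026)],
)

# OLTP-style: a small, fast point lookup (one order).
oltp = conn.execute("SELECT * FROM orders WHERE id = 2").fetchone()
print(oltp)  # (2, 'Ali', 80.0, 2025)

# OLAP-style: a full scan with aggregation (yearly sales analysis).
olap = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
print(olap)  # [(2025, 200.0), (2026, 50.0)]
```

&lt;p&gt;An OLTP system runs millions of queries like the first one per day; an OLAP system runs a smaller number of queries like the second one over far more rows.&lt;/p&gt;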




&lt;h2&gt;
  
  
  &lt;strong&gt;ACID Transactions&lt;/strong&gt; (Why Databases are Reliable)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A transaction is &lt;strong&gt;all-or-nothing&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data stays &lt;strong&gt;valid and correct&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parallel users &lt;strong&gt;do not interfere&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Once saved, data &lt;strong&gt;will not be lost&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If a payment fails halfway, &lt;strong&gt;Atomicity&lt;/strong&gt; ensures the whole transaction is rolled back.&lt;/p&gt;
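&lt;p&gt;A minimal sketch of atomicity, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; and a made-up two-account transfer: if the second step fails, the rollback also undoes the first step.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
conn.commit()

# Transfer 50 from A to B; simulate a failure halfway through.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("payment service failed")  # failure before the credit step
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # Atomicity: the debit is undone along with everything else

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 100.0, 'B': 0.0} -- nothing changed
```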




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehouse vs Data Lake&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data types&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Primarily structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Structured, semi-structured, unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-write&lt;/strong&gt; (define before load)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-read&lt;/strong&gt; (define at query time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher storage cost&lt;/td&gt;
&lt;td&gt;Lower storage cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main use&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BI reports, dashboards&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Raw storage, ML/AI, exploration&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;td&gt;Snowflake, Redshift, BigQuery&lt;/td&gt;
&lt;td&gt;AWS S3, Azure ADLS, Google GCS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Formats&lt;/strong&gt;: Avro vs Parquet vs ORC
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Storage Style&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Example use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Row-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Streaming, fast writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kafka pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics, fast reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BI queries in Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ORC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-based&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics with compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hive/Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Row-Based vs Column-Based Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Row-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Column-Based Storage&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How data is stored&lt;/td&gt;
&lt;td&gt;Entire &lt;strong&gt;rows together&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Same &lt;strong&gt;columns together&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLTP transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OLAP analytics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical systems&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift, Parquet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Fetch one customer record&lt;/td&gt;
&lt;td&gt;Aggregate one column across millions of rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
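&lt;p&gt;A toy sketch of the two layouts in plain Python (not a real storage engine) shows why each one favors a different access pattern:&lt;/p&gt;

```python
# The same three records laid out two ways.
row_store = [
    (1, "Salma", 120.0),
    (2, "Ali", 80.0),
    (3, "Mia", 50.0),
]

column_store = {
    "id": [1, 2, 3],
    "name": ["Salma", "Ali", "Mia"],
    "amount": [120.0, 80.0, 50.0],
}

# OLTP-style: fetch one whole record -- natural in the row layout,
# because all of a record's fields sit next to each other.
record = next(r for r in row_store if r[0] == 2)
print(record)  # (2, 'Ali', 80.0)

# OLAP-style: aggregate one column -- the column layout reads only
# "amount" and skips the ids and names entirely.
total = sum(column_store["amount"])
print(total)  # 250.0
```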




&lt;h2&gt;
  
  
  &lt;strong&gt;RDBMS (Row-Based) vs Columnar Databases&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;RDBMS (Row-Based)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Columnar Databases&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload&lt;/td&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writes&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reads (aggregations)&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Warehousing Concepts: Facts and Dimensions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Fact Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store &lt;strong&gt;measurable numbers&lt;/strong&gt; (metrics).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; sales_amount, quantity, revenue&lt;/p&gt;

&lt;h3&gt;
  
  
  What are &lt;strong&gt;Dimension Tables&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; store &lt;strong&gt;descriptive attributes&lt;/strong&gt; to analyze facts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; customer, product, date, location&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Types of Facts&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactional Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One row per transaction&lt;/td&gt;
&lt;td&gt;Each order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State at a point in time&lt;/td&gt;
&lt;td&gt;Daily inventory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accumulating Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracks process over time&lt;/td&gt;
&lt;td&gt;Order lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Characteristics of Fact vs Dimension Tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Fact Table&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Dimension Table&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it stores&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Metrics (numbers)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Descriptions (attributes)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;Very large&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keys&lt;/td&gt;
&lt;td&gt;Foreign keys to dimensions&lt;/td&gt;
&lt;td&gt;Primary keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Sales fact&lt;/td&gt;
&lt;td&gt;Customer dimension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
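&lt;p&gt;A minimal star-schema sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (the table and column names are made up for illustration): the fact table holds the numbers and a foreign key, the dimension holds the descriptions, and a typical warehouse query joins them to slice metrics by an attribute.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension: descriptive attributes, with a primary key.
conn.execute("CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, city TEXT)")
# Fact: metrics, plus a foreign key pointing to the dimension.
conn.execute("CREATE TABLE fact_sales (sale_id INT, customer_id INT, sales_amount REAL)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Salma", "Hyderabad"), (2, "Ali", "Mumbai")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(10, 1, 120.0), (11, 1, 30.0), (12, 2, 80.0)])

# Typical warehouse query: aggregate the fact, sliced by a dimension attribute.
result = conn.execute("""
    SELECT d.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(result)  # [('Hyderabad', 150.0), ('Mumbai', 80.0)]
```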




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Lakehouse Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Lakehouse layer adds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; for reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; for faster queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; for governance and discovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimizations&lt;/strong&gt; for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples of Lakehouse Technologies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake (Databricks)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Informatica?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Informatica&lt;/strong&gt; is an &lt;strong&gt;enterprise ETL tool&lt;/strong&gt; used to &lt;strong&gt;extract, transform, and load data&lt;/strong&gt; from source systems into &lt;strong&gt;data warehouses or data lakes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Move sales data from PostgreSQL → clean it → load into Snowflake.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final End-to-End Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases&lt;/strong&gt; run daily business transactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP systems (data warehouses)&lt;/strong&gt; support analytics and reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lakes&lt;/strong&gt; store raw data of all types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse architecture&lt;/strong&gt; combines low-cost storage with fast analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facts and dimensions&lt;/strong&gt; organize data for reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avro/Parquet/ORC&lt;/strong&gt; and &lt;strong&gt;row vs column storage&lt;/strong&gt; decide performance.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Traditional vs Modern Data Architecture</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sun, 22 Feb 2026 00:47:51 +0000</pubDate>
      <link>https://forem.com/salma_aga/traditional-vs-modern-data-architecture-37cn</link>
      <guid>https://forem.com/salma_aga/traditional-vs-modern-data-architecture-37cn</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;In many companies, data comes from different systems like ERP, CRM, application databases, and web logs. This data is used for reports, dashboards, and business decisions. To use this data properly, we need a data architecture.&lt;/p&gt;

&lt;p&gt;There are two main types of data architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Traditional Data Architecture (ETL + Data Warehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modern Data Architecture (ELT + Data Lake + Lakehouse)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This document explains both approaches. It also explains why we use tools like Data Lake, Data Warehouse, Spark, Databricks, Delta Lake, Iceberg, Snowflake, BigQuery, Redshift, ADLS, GCS, S3, and Datadog.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High-Level Data Flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Sources → Ingestion → Data Lake → Processing → Lakehouse Tables → Data Warehouse → BI &amp;amp; Reports → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data comes from source systems.&lt;/li&gt;
&lt;li&gt;Data is ingested (copied) into the platform.&lt;/li&gt;
&lt;li&gt;Raw data is stored in a data lake.&lt;/li&gt;
&lt;li&gt;Data is cleaned and transformed using processing tools.&lt;/li&gt;
&lt;li&gt;Clean and reliable tables are created.&lt;/li&gt;
&lt;li&gt;Final data is loaded into a data warehouse for reports.&lt;/li&gt;
&lt;li&gt;The full system is monitored using monitoring tools.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. Data Sources (Where data comes from)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ERP systems:&lt;/strong&gt; Finance, HR, inventory data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRM systems:&lt;/strong&gt; Customer and sales data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP databases:&lt;/strong&gt; Application transaction data (orders, payments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web logs:&lt;/strong&gt; Website or app activity (clicks, errors, requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;br&gt;
These systems run the business. They create the data that we later analyze.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When we use them:&lt;/strong&gt;&lt;br&gt;
All the time. These are live systems used daily by the business.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Traditional Data Architecture (ETL)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 What is Traditional Architecture?
&lt;/h3&gt;

&lt;p&gt;In traditional architecture, data is transformed &lt;strong&gt;before&lt;/strong&gt; it is loaded into the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → ETL Tool → Data Warehouse → BI/Reports&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk85ui4agzxowy3ixojn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What is ETL?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL = Extract → Transform → Load&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from source systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean the data, fix formats, remove duplicates, join tables, and calculate metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Put the clean data into the data warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Why Traditional Architecture was used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses were expensive.&lt;/li&gt;
&lt;li&gt;Storage and compute were limited.&lt;/li&gt;
&lt;li&gt;Only clean data was allowed in the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.4 Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not easy to scale for big data.&lt;/li&gt;
&lt;li&gt;Raw data is lost after transformation.&lt;/li&gt;
&lt;li&gt;Not flexible for machine learning and advanced analytics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Modern Data Architecture (ELT + Data Lake + Lakehouse)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What is Modern Architecture?
&lt;/h3&gt;

&lt;p&gt;In modern architecture, raw data is first stored in a data lake. Transformations happen later using powerful compute engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Sources → Ingestion → Data Lake → Transform (Spark/Databricks) → Lakehouse Tables → Data Warehouse → BI &amp;amp; ML → Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcrgjv0ieqyjjvnws6xz.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What is ELT?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ELT = Extract → Load → Transform&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Take data from sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Store raw data directly in the data lake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean and process data later using Spark or Databricks.&lt;/li&gt;
&lt;/ul&gt;
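&lt;p&gt;The three ELT steps can be sketched in plain Python, with a list standing in for the data lake and made-up order records as the source. The key point is that the raw data lands untouched; cleaning happens later, from the raw copy:&lt;/p&gt;

```python
import json

# Extract: raw records pulled from a source system (toy data).
raw_records = [
    '{"order_id": 1, "amount": "120.0", "status": "OK"}',
    '{"order_id": 2, "amount": "80.0", "status": "ok"}',
    'not valid json',  # real feeds contain bad rows too
]

# Load: land the raw lines as-is in the "lake" (here, just a list).
data_lake = list(raw_records)

# Transform (later, on demand): parse, clean, and normalize from the raw copy.
clean = []
for line in data_lake:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # quarantine bad rows instead of failing the whole pipeline
    clean.append({"order_id": rec["order_id"],
                  "amount": float(rec["amount"]),
                  "status": rec["status"].upper()})

print(clean)
```

&lt;p&gt;Because the raw lines are still in the lake, a new transformation (for a new use case) can always be run against them later.&lt;/p&gt;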

&lt;h3&gt;
  
  
  5.3 Why Modern Architecture is used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud storage is cheap and scalable.&lt;/li&gt;
&lt;li&gt;We can store raw data and use it later for new use cases.&lt;/li&gt;
&lt;li&gt;We can support both analytics and machine learning.&lt;/li&gt;
&lt;li&gt;Compute can scale up and down based on need.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Data Lake (S3, ADLS, GCS)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Lake?&lt;/strong&gt;&lt;br&gt;
A data lake is a storage system that stores raw data in any format (CSV, JSON, images, logs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheap storage&lt;/li&gt;
&lt;li&gt;Store raw data for future use&lt;/li&gt;
&lt;li&gt;Useful for big data and machine learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3 (AWS cloud)&lt;/li&gt;
&lt;li&gt;Azure ADLS (Azure cloud)&lt;/li&gt;
&lt;li&gt;Google GCS (GCP cloud)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Data Warehouse (Snowflake, BigQuery, Redshift, Synapse)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Data Warehouse?&lt;/strong&gt;&lt;br&gt;
A data warehouse stores clean, structured data for analytics and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast SQL queries&lt;/li&gt;
&lt;li&gt;Business reports and dashboards&lt;/li&gt;
&lt;li&gt;Used by analysts and managers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;AWS Redshift&lt;/li&gt;
&lt;li&gt;Azure Synapse&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Data Lakehouse (Delta Lake, Apache Iceberg)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Lakehouse?&lt;/strong&gt;&lt;br&gt;
A lakehouse combines the low-cost storage of a data lake with the reliability of a data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use Delta Lake and Iceberg:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transactions (safe updates and deletes)&lt;/li&gt;
&lt;li&gt;Schema changes without breaking pipelines&lt;/li&gt;
&lt;li&gt;Time travel (see old versions of data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where we use it:&lt;/strong&gt;&lt;br&gt;
On top of the data lake, usually with Databricks and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Processing Layer (Spark and Databricks)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Spark?&lt;/strong&gt;&lt;br&gt;
Spark is a fast distributed engine to process large data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Databricks?&lt;/strong&gt;&lt;br&gt;
Databricks is a platform that manages Spark and provides notebooks, clusters, and job scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To clean and transform large data&lt;/li&gt;
&lt;li&gt;To run batch and streaming jobs&lt;/li&gt;
&lt;li&gt;To build machine learning pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. File Formats (Avro, Parquet, ORC)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Avro:&lt;/strong&gt;&lt;br&gt;
Used for data movement and streaming. Good for schema evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parquet:&lt;/strong&gt;&lt;br&gt;
Column-based format. Very fast for analytics queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ORC:&lt;/strong&gt;&lt;br&gt;
Column-based format. Used in big data systems like Hive and Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OLTP:&lt;/strong&gt;&lt;br&gt;
Used by applications for daily transactions (orders, payments).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP:&lt;/strong&gt;&lt;br&gt;
Used for analytics and reporting (data warehouse queries).&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Monitoring with Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Datadog?&lt;/strong&gt;&lt;br&gt;
Datadog is a monitoring and observability tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor data pipelines&lt;/li&gt;
&lt;li&gt;Monitor Spark jobs&lt;/li&gt;
&lt;li&gt;Monitor servers and applications&lt;/li&gt;
&lt;li&gt;Get alerts when something fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When we use it:&lt;/strong&gt;&lt;br&gt;
In production environments to keep the system healthy and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. ETL vs ELT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ETL (Traditional)&lt;/th&gt;
&lt;th&gt;ELT (Modern)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;Before load&lt;/td&gt;
&lt;td&gt;After load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;Data Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Cases&lt;/td&gt;
&lt;td&gt;Reports&lt;/td&gt;
&lt;td&gt;Reports + ML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  14. Example End-to-End Use Case
&lt;/h2&gt;

&lt;p&gt;Data from ERP and CRM systems and web logs is ingested into a data lake on AWS S3. Raw data is stored in Parquet format. Spark on Databricks processes and cleans the data. Clean tables are stored using Delta Lake. Final analytics data is loaded into Snowflake. Business users use dashboards to view reports. Datadog monitors the pipelines and sends alerts when jobs fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Traditional architecture uses &lt;strong&gt;ETL + Data Warehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Modern architecture uses &lt;strong&gt;ELT + Data Lake + Lakehouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Data Lake stores raw data.&lt;/li&gt;
&lt;li&gt;Data Warehouse stores clean data for reporting.&lt;/li&gt;
&lt;li&gt;Spark and Databricks handle large-scale processing.&lt;/li&gt;
&lt;li&gt;Delta Lake and Iceberg make data lakes reliable.&lt;/li&gt;
&lt;li&gt;Datadog monitors the entire system.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Learned AI by Building an Offline PDF Chatbot with Local LLMs</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Mon, 23 Jun 2025 17:33:14 +0000</pubDate>
      <link>https://forem.com/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</link>
      <guid>https://forem.com/salma_aga/how-i-learned-ai-by-building-an-offline-pdf-chatbot-with-local-llms-52lk</guid>
      <description>&lt;p&gt;Hey everyone! I’m &lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;, and I love learning by building. Instead of just reading theory, I built something that helped me &lt;strong&gt;understand AI practically&lt;/strong&gt; and also &lt;strong&gt;prepare better for my interviews&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me walk you through what I built, how it works, and how you can try it too.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal: Learning AI by Building, Not Just Reading
&lt;/h2&gt;

&lt;p&gt;I didn’t just want to use AI tools. I wanted to &lt;strong&gt;build one from scratch&lt;/strong&gt; and see what happens under the hood.&lt;/p&gt;

&lt;p&gt;I was exploring concepts like embeddings, vector search, and local LLMs, but theory alone wasn’t sticking. So I built this project, &lt;strong&gt;an Offline PDF Analyzer&lt;/strong&gt;, to learn how documents are split, embedded, and searched, and how local models generate smart responses.&lt;/p&gt;

&lt;p&gt;This project became my practical journey into AI and now it helps others too, especially those preparing for interviews.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upload one or more PDFs through a simple web UI.&lt;/li&gt;
&lt;li&gt;Ask your questions in simple English.&lt;/li&gt;
&lt;li&gt;The system reads and understands the content, then gives you a relevant answer from the document.&lt;/li&gt;
&lt;li&gt;Everything runs &lt;strong&gt;locally&lt;/strong&gt;: no internet or API keys needed.&lt;/li&gt;
&lt;li&gt;It can also count how many questions are in the PDF, which is useful for exam prep.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Upload a PDF on Machine Learning and ask: “What is the difference between supervised and unsupervised learning?”&lt;br&gt;
You get a clear, to-the-point answer pulled directly from the relevant section of the document instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Below is the complete flow of how the &lt;strong&gt;Offline PDF Analyzer&lt;/strong&gt; works behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PDF Upload&lt;/strong&gt;: The user uploads one or more PDF files through the UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Extraction&lt;/strong&gt;: The app reads all pages using &lt;code&gt;PyMuPDF&lt;/code&gt; and extracts clean text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Long text is split into overlapping chunks using &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; to preserve context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Each chunk is converted into a vector (a list of numbers) using &lt;code&gt;OllamaEmbeddings&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS Vector Search&lt;/strong&gt;: When a question is asked, the most similar chunks are retrieved with a fast vector similarity search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer Generation&lt;/strong&gt;: The selected chunks are passed to a &lt;strong&gt;local LLM&lt;/strong&gt; (like &lt;code&gt;phi&lt;/code&gt;, &lt;code&gt;mistral&lt;/code&gt;, or &lt;code&gt;llama2&lt;/code&gt;) to generate the final answer.&lt;/li&gt;
&lt;/ol&gt;
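&lt;p&gt;Step 3 (chunking with overlap) can be sketched in a few lines of plain Python; this is a simplified stand-in for &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, with made-up chunk sizes:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks so context at the boundaries
    is not lost between neighboring chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "Supervised learning uses labeled data; unsupervised learning finds structure."
chunks = chunk_text(text)
print(len(chunks))   # 3
print(chunks[0])     # first 40 characters of the text
```

&lt;p&gt;Each chunk shares its last &lt;code&gt;overlap&lt;/code&gt; characters with the start of the next one, so a sentence cut at a boundary is still fully present in one of the two chunks.&lt;/p&gt;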




&lt;h2&gt;
  
  
  Choose Your Local AI Model
&lt;/h2&gt;

&lt;p&gt;You can select models like &lt;strong&gt;phi&lt;/strong&gt;, &lt;strong&gt;mistral&lt;/strong&gt;, or &lt;strong&gt;llama2&lt;/strong&gt; all running locally on your laptop using &lt;strong&gt;Ollama&lt;/strong&gt; for fast and efficient results.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Design Diagram: How PDF Analyzer Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpq0cd7izfmujj2a9snj.png" alt=" " width="542" height="339"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt;: For building a user-friendly frontend with just a few lines of Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMuPDF (fitz)&lt;/strong&gt;: To extract text from all pages of uploaded PDFs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: To handle end-to-end chaining from query to retrieval to LLM response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;: Breaks the text into chunks with overlaps, so context is preserved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: Runs local LLMs (phi, mistral, llama2) directly on your machine without internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: A super-fast vector search library to retrieve relevant chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: For the backend logic, caching, state management, and pre-processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges I Faced &amp;amp; How I Solved Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Wrong Answers from Wrong Sections
&lt;/h3&gt;

&lt;p&gt;In the beginning, the app pulled answers from the wrong part of the PDF; they didn’t match the question and made things confusing.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I adjusted the chunk overlap size, used better metadata like page numbers and source file names, and added tagging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Answers Coming from the Previous PDF
&lt;/h3&gt;

&lt;p&gt;Even after uploading a different PDF, it still showed answers from the old one. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added &lt;strong&gt;file hashing&lt;/strong&gt; to detect newly uploaded PDFs. If the incoming file is different from the previous one, the system discards the old data and processes the new file from scratch.&lt;/p&gt;
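&lt;p&gt;Here’s a minimal sketch of that file-hashing check in plain Python (the function names are illustrative, not the app’s real code):&lt;/p&gt;

```python
import hashlib

def file_hash(data):
    """Return a fingerprint for an uploaded file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def should_rebuild_index(new_bytes, last_hash):
    """True when the upload differs from the previously indexed PDF."""
    return file_hash(new_bytes) != last_hash

# Example: a second upload with different content triggers a rebuild
last = file_hash(b"old pdf bytes")
print(should_rebuild_index(b"old pdf bytes", last))  # False: same file, keep index
print(should_rebuild_index(b"new pdf bytes", last))  # True: new file, re-embed
```

&lt;p&gt;Hashing the raw bytes means even a renamed re-upload of the same file is recognized as unchanged, so the index is only rebuilt when the content actually differs.&lt;/p&gt;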

&lt;h3&gt;
  
  
  3. Short Queries Gave Confusing Answers
&lt;/h3&gt;

&lt;p&gt;If I typed "types?" or "examples?", the app didn’t understand what I meant. &lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: I added logic to automatically expand short questions into full ones. For example, if someone types &lt;strong&gt;"types?"&lt;/strong&gt;, it becomes &lt;strong&gt;"What are the different types mentioned in the document?"&lt;/strong&gt; so the model understands better.&lt;/p&gt;
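&lt;p&gt;That expansion step can be as simple as a lookup table; the keywords and templates below are illustrative, not the app’s actual list:&lt;/p&gt;

```python
# Map terse keywords to full questions the LLM can work with.
EXPANSIONS = {
    "types": "What are the different types mentioned in the document?",
    "examples": "What examples are given in the document?",
    "summary": "Can you summarize the document?",
}

def expand_query(query):
    """Expand a terse query like 'types?' into a full question; pass others through."""
    key = query.strip().rstrip("?").lower()
    return EXPANSIONS.get(key, query)

print(expand_query("types?"))  # What are the different types mentioned in the document?
```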

&lt;h3&gt;
  
  
  4. No Info on Where the Answer Came From
&lt;/h3&gt;

&lt;p&gt;I wasn’t sure if the answer was right because it didn’t show where in the PDF it found the info.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Now it shows the &lt;strong&gt;PDF name&lt;/strong&gt; and &lt;strong&gt;page number&lt;/strong&gt; where the answer came from, and you can &lt;strong&gt;click to see more details&lt;/strong&gt; if you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Techniques I Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@st.cache_data&lt;/code&gt;: To avoid reloading the same PDF again and again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Hashing&lt;/strong&gt;: So that the app resets only when a new PDF is uploaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session State&lt;/strong&gt;: Used in Streamlit to store user-uploaded files and questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex Matching&lt;/strong&gt;: To support question formats like “How many questions are in this PDF?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Help the model understand and answer better when the user's question is short or unclear.&lt;/li&gt;
&lt;/ul&gt;
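&lt;p&gt;As an example of the regex technique from the list above, a query like “How many questions are in this PDF?” can be answered by counting question-shaped lines in the extracted text. The pattern here is an illustrative sketch, not the app’s actual one:&lt;/p&gt;

```python
import re

def count_questions(text):
    """Count lines that look like questions (optionally numbered, ending in '?')."""
    pattern = re.compile(r"^\s*(?:\d+[\.\)]\s*)?.+\?\s*$", re.MULTILINE)
    return len(pattern.findall(text))

sample = "1. What is RAG?\nRAG stands for retrieval.\n2. Why use FAISS?\n"
print(count_questions(sample))  # 2
```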




&lt;h2&gt;
  
  
  Any Frontend?
&lt;/h2&gt;

&lt;p&gt;Yes! I made a clean and user-friendly interface using &lt;strong&gt;Streamlit&lt;/strong&gt; that makes it easy to upload PDFs and get answers quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose your preferred LLM (phi / mistral / llama2)&lt;/li&gt;
&lt;li&gt;Upload one or more PDFs&lt;/li&gt;
&lt;li&gt;Ask your question&lt;/li&gt;
&lt;li&gt;See the answer + source (page number + filename)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No delays, no registration: everything happens on your own system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Here’s what I want to add next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF Summarizer&lt;/strong&gt;: Get a quick summary of the whole PDF.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export Chat History&lt;/strong&gt;: Save your Q&amp;amp;A for later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find All Questions&lt;/strong&gt;: List all questions found inside the PDF.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tech Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Breaking a big document into small, readable parts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: Turning text into numbers so that the model understands meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: Finds the best match for your question from the chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLMs&lt;/strong&gt;: Small AI models running on your laptop (no internet needed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: Connects everything (PDFs, questions, answers) in one neat pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions You Can Expect
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;How does chunk overlap affect retrieval quality?&lt;/li&gt;
&lt;li&gt;What’s the role of FAISS in a RAG pipeline?&lt;/li&gt;
&lt;li&gt;Why are prompt templates useful in real-world applications?&lt;/li&gt;
&lt;li&gt;How do you make vector indexes update-safe when files change?&lt;/li&gt;
&lt;li&gt;What are the trade-offs of using local LLMs vs cloud APIs?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project started as a way to &lt;strong&gt;learn AI deeply&lt;/strong&gt; by building something useful. It taught me how to use embeddings, vector search, local LLMs, and chaining tools all while helping me with interview prep.&lt;/p&gt;

&lt;p&gt;If you want to learn by doing: start small, build real, and break things.&lt;/p&gt;

&lt;p&gt;Let’s keep learning. Let’s keep building.&lt;br&gt;&lt;br&gt;
✍️&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/ShaikSalmaAga/offline-pdf-analyzer" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Agentic AI Changed My Job Prep Journey</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Tue, 03 Jun 2025 05:02:15 +0000</pubDate>
      <link>https://forem.com/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</link>
      <guid>https://forem.com/salma_aga/-title-why-agentic-ai-changed-my-job-prep-journey--1hnk</guid>
      <description>&lt;p&gt;Hi everyone! I'm &lt;strong&gt;Salma&lt;/strong&gt;, a student and software engineer preparing for full-time roles. While applying for jobs and preparing for interviews, I realized something big:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Just knowing how to code is no longer enough."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today’s tech world is changing fast. We see &lt;strong&gt;AI everywhere&lt;/strong&gt;, and one term you’ll hear again and again is &lt;strong&gt;Agentic AI&lt;/strong&gt;. Some people know what it is, many don’t. But if you’re a student or professional looking for a job, understanding Agentic AI gives you a huge advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Imagine you're building your own travel assistant app.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Traditional AI (like ChatGPT):
&lt;/h3&gt;

&lt;p&gt;You: "Book a flight to Delhi."&lt;br&gt;&lt;br&gt;
AI: "Sure. Please tell me the date, airline, timing, etc."&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic AI:
&lt;/h3&gt;

&lt;p&gt;You: "I need to be in Delhi next week for a conference."&lt;/p&gt;

&lt;p&gt;AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar for free days&lt;/li&gt;
&lt;li&gt;Suggests flight options&lt;/li&gt;
&lt;li&gt;Books your ticket&lt;/li&gt;
&lt;li&gt;Adds it to your calendar&lt;/li&gt;
&lt;li&gt;Sends you a reminder and even books your cab&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just AI that responds. &lt;strong&gt;It’s AI that acts on its own.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Agentic AI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI&lt;/strong&gt; is artificial intelligence that &lt;strong&gt;sets goals&lt;/strong&gt;, &lt;strong&gt;makes decisions&lt;/strong&gt;, &lt;strong&gt;takes action&lt;/strong&gt;, and &lt;strong&gt;learns&lt;/strong&gt; all on its own.&lt;/p&gt;

&lt;p&gt;It doesn’t wait for your prompt. It’s like &lt;strong&gt;hiring a junior employee&lt;/strong&gt; who knows what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional AI vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional AI&lt;/strong&gt; works based on prompts. You give it instructions, it gives an output.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt; works based on goals. You give it a goal, and it figures out how to reach it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between Traditional AI and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional AI&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Needs prompts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can act on goals&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision-making&lt;/td&gt;
&lt;td&gt;Basic logic&lt;/td&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example&lt;/td&gt;
&lt;td&gt;Chatbot&lt;/td&gt;
&lt;td&gt;Calendar + Travel Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lifecycle of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5l10k5a1lu9pxg8xrwg.png" alt=" " width="270" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Perceive&lt;/strong&gt; – Collects data (emails, APIs, sensors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reason&lt;/strong&gt; – Understands the task and plans next steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; – Executes using APIs and tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; – Evaluates and improves its performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate&lt;/strong&gt; – Works with humans or other agents&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How Agentic AI Solves Customer Support Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Perceive: Reads an angry customer email&lt;/li&gt;
&lt;li&gt;Reason: Understands it’s about a delayed shipment&lt;/li&gt;
&lt;li&gt;Act: Sends an apology and discount coupon&lt;/li&gt;
&lt;li&gt;Learn: Tracks response from customer&lt;/li&gt;
&lt;li&gt;Collaborate: Notifies human agent if unresolved&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single-Agent System&lt;/strong&gt;: One agent handles everything.&lt;br&gt;&lt;br&gt;
Example: A budget manager bot that tracks, predicts, and alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Agent System&lt;/strong&gt;: Several agents with different responsibilities.&lt;br&gt;&lt;br&gt;
Example: Email agents where one reads, another replies, and another logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal-Oriented Agent&lt;/strong&gt;: Given a goal, it plans and acts.&lt;br&gt;&lt;br&gt;
Example: “Grow Instagram to 5K followers.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reactive Agent&lt;/strong&gt;: Reacts quickly but doesn’t plan ahead.&lt;br&gt;&lt;br&gt;
Example: The auto-braking system in cars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deliberative Agent&lt;/strong&gt;: Thinks and reasons before acting.&lt;br&gt;&lt;br&gt;
Example: Schedules meetings based on mood, urgency, and history.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Build Agentic AI
&lt;/h2&gt;

&lt;p&gt;To build an Agentic AI system, you begin with a &lt;strong&gt;frontend&lt;/strong&gt; that accepts input from users. The request is handled by a &lt;strong&gt;backend&lt;/strong&gt; which forwards the data to a &lt;strong&gt;language model (LLM)&lt;/strong&gt; such as GPT-4 or Claude. The LLM reasons about the task and initiates actions. These actions may include calling APIs or updating systems. Context or memory is stored using vector databases. Results and state changes are saved in a storage system like PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrx9w5rw8rz5o6m50dbk.png" alt=" " width="570" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Agentic AI Can Automate Resume Screening
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Recruiter uploads resumes on the web interface&lt;/li&gt;
&lt;li&gt;Backend forwards data to the LLM&lt;/li&gt;
&lt;li&gt;LLM ranks the candidates based on fit&lt;/li&gt;
&lt;li&gt;Memory layer remembers past hiring preferences&lt;/li&gt;
&lt;li&gt;Action layer sends top resumes to HR&lt;/li&gt;
&lt;li&gt;PostgreSQL stores rankings and history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Components Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: HTML, React
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python (Flask, FastAPI)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: GPT-4, Claude, LLaMA
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: FAISS, Pinecone
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt;: APIs, Zapier, CRMs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: PostgreSQL, Redis
&lt;/li&gt;
&lt;/ul&gt;
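&lt;p&gt;The hand-offs between these components can be sketched as a tiny perceive–reason–act loop in plain Python. Every function here is a stub standing in for the real LLM, APIs, and storage named above:&lt;/p&gt;

```python
# A toy perceive-reason-act loop; each step is a stand-in stub,
# not the real LLM, vector store, or API integrations.
def perceive(inbox):
    """Collect the next input (here, just pop from a list)."""
    return inbox.pop(0) if inbox else None

def reason(observation):
    """Decide on an action; a real system would call an LLM here."""
    if "delayed" in observation:
        return "send_apology"
    return "escalate"

def act(action, log):
    """Execute the chosen action and record it for the memory layer."""
    log.append(action)
    return action

inbox = ["order delayed by 3 days", "refund dispute"]
memory = []
while inbox:
    obs = perceive(inbox)
    act(reason(obs), memory)
print(memory)  # ['send_apology', 'escalate']
```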

&lt;h2&gt;
  
  
  How LangChain + Agentic AI Works
&lt;/h2&gt;

&lt;p&gt;This diagram shows how an &lt;strong&gt;Agentic AI system&lt;/strong&gt; works when you build it using &lt;strong&gt;LangChain&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficn2x7q8a8aj0oyasqdu.png" alt=" " width="412" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Input&lt;/strong&gt;: The user gives a request.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; “Remind me about my meeting and send a message if I’m late.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning / Planning&lt;/strong&gt;: The system now goes into &lt;strong&gt;thinking mode&lt;/strong&gt;. It uses a smart model (like GPT-4 or Claude) to figure out what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: Based on the plan, it performs the actual work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your calendar&lt;/li&gt;
&lt;li&gt;Sends messages&lt;/li&gt;
&lt;li&gt;Searches the web&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uses Tools&lt;/strong&gt;: To complete tasks, the AI uses different tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Search&lt;/strong&gt; to gather new information
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Calls&lt;/strong&gt; to apps like your calendar or email
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases, Zapier, or CRMs&lt;/strong&gt; to interact with systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory / Storage&lt;/strong&gt;: After doing the task, it &lt;strong&gt;stores what happened&lt;/strong&gt; for future reference so it can learn and improve next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back to User or Move to Next Task&lt;/strong&gt;: It either&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates the user about the result
&lt;/li&gt;
&lt;li&gt;Or starts working on the next goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This full loop (User → Plan → Act → Tools → Back to User) is what makes Agentic AI powerful. It’s not just replying like a chatbot; it’s doing real work for you, like a smart digital assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Proactive and autonomous&lt;/li&gt;
&lt;li&gt;Learns and adapts over time&lt;/li&gt;
&lt;li&gt;Integrates with tools and systems&lt;/li&gt;
&lt;li&gt;Can collaborate with other agents or humans&lt;/li&gt;
&lt;li&gt;Reduces repetitive human work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Drawbacks of Agentic AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Risk of incorrect actions due to bad data&lt;/li&gt;
&lt;li&gt;Hard to debug errors in multi-step logic&lt;/li&gt;
&lt;li&gt;Requires safeguards and human override&lt;/li&gt;
&lt;li&gt;Complexity in design and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What If Agentic AI Fails?
&lt;/h2&gt;

&lt;p&gt;Failures can occur. Here's how to make systems robust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Queues&lt;/strong&gt;: Split large tasks into traceable chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Tokens&lt;/strong&gt;: Avoid confusion between user sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Templates&lt;/strong&gt;: Keep communication consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation Paths&lt;/strong&gt;: Alert humans when automation fails&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Failure Example:
&lt;/h3&gt;

&lt;p&gt;If a meeting booking fails due to calendar API error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry booking&lt;/li&gt;
&lt;li&gt;On failure again, send alert to user and log the error&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Is Agentic AI Used Today?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Salesforce – AI customer support agents
&lt;/li&gt;
&lt;li&gt;Hippocratic AI – Medical virtual assistants
&lt;/li&gt;
&lt;li&gt;Ema AI – Business workflow automation
&lt;/li&gt;
&lt;li&gt;Juna – Factory control agents
&lt;/li&gt;
&lt;li&gt;Jasper + HubSpot – AI-powered marketing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Agents vs Agentic AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;: Acts only after manual user input.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Gmail Smart Reply: you click it, it sends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difference Between AI Agents and Agentic AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AI Agents&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User initiated&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal planning&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step task&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning ability&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Evolution of AI:&lt;/strong&gt; AI has progressed in three major stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme7ccflglnohcgnyo2y8.png" alt=" " width="510" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive AI:&lt;/strong&gt; Forecasting the future&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Credit scoring, fraud detection&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generative AI:&lt;/strong&gt; Creating new content&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; ChatGPT, DALL·E, MidJourney&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic AI:&lt;/strong&gt; Thinking, planning, acting&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; An AI assistant managing tasks and meetings&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Agentic AI is not just a buzzword. It’s a &lt;strong&gt;career game-changer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're a student or developer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn the lifecycle of agentic systems
&lt;/li&gt;
&lt;li&gt;Build a real mini-project (e.g. with LangChain)
&lt;/li&gt;
&lt;li&gt;Write about it or share your GitHub
&lt;/li&gt;
&lt;li&gt;Talk about it in interviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ Written by Shaik Salma Aga&lt;/p&gt;




</description>
    </item>
    <item>
      <title>How I Built My Own RAG Chatbot with Local LLMs (And the Roadblocks That Taught Me More Than the Code)</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Sat, 31 May 2025 20:25:53 +0000</pubDate>
      <link>https://forem.com/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</link>
      <guid>https://forem.com/salma_aga/how-i-built-my-own-rag-chatbot-with-local-llms-and-the-roadblocks-that-taught-me-more-than-the-3kmd</guid>
      <description>&lt;p&gt;A while back, I wrote a &lt;strong&gt;beginner-to-expert guide&lt;/strong&gt; on &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;. That article was all theory. How RAG works, the difference between &lt;strong&gt;sparse and dense embeddings&lt;/strong&gt;, and why it’s powerful.&lt;/p&gt;

&lt;p&gt;This time, I wanted to get my hands dirty. I wanted to build something real.&lt;/p&gt;

&lt;p&gt;So I built a working &lt;strong&gt;RAG chatbot&lt;/strong&gt;. Completely offline. Locally.&lt;/p&gt;

&lt;p&gt;Let me walk you through the full journey: what I built, how it works, and what went wrong (and how I fixed it).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why I Ran It Locally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This wasn’t about saving money or staying private. It was about &lt;strong&gt;learning&lt;/strong&gt;: raw, hands-on, deep learning.&lt;/p&gt;

&lt;p&gt;I didn’t want to just connect APIs and feel like a builder. I wanted to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Understand how text becomes vectors&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Debug retrieval when it breaks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Run a model myself and see how it responds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to learn the hard way, and &lt;strong&gt;local was the best way&lt;/strong&gt; to make sure I did.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;My Project: PDF Q&amp;amp;A Chatbot (All Offline)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I had one clear goal: &lt;strong&gt;Ask questions from a PDF and get meaningful answers without internet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used a document called &lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;. I asked questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the phases in AI development?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The chatbot searched the PDF, found the right section, fed it to a local LLM, and gave me a perfect answer.&lt;/p&gt;

&lt;p&gt;All offline.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;System Design Diagram: How Offline RAG Chatbot Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds1qkv6dfh3bp2y9eqix.png" alt=" " width="406" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends a question to the &lt;strong&gt;RAG chatbot&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The chatbot uses &lt;strong&gt;PyPDFLoader&lt;/strong&gt; to load the PDF.&lt;/li&gt;
&lt;li&gt;It splits the text using &lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Text chunks are converted to vectors via &lt;strong&gt;HuggingFace Embeddings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Vectors are stored and retrieved using &lt;strong&gt;FAISS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The top relevant chunks are passed to a local &lt;strong&gt;LLM&lt;/strong&gt; via &lt;strong&gt;Ollama&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The final answer is shown to the user.&lt;/li&gt;
&lt;/ol&gt;
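&lt;p&gt;The flow above can be mimicked end to end in a few lines of dependency-free Python. This toy version swaps the real components (HuggingFace embeddings, FAISS, Ollama) for word-count vectors and a brute-force search, purely to show the data flow:&lt;/p&gt;

```python
import math
import re

def embed(text):
    """Toy 'embedding': word counts (a stand-in for real dense vectors)."""
    vec = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 2-3: load the document text and split it into chunks
chunks = [
    "AI development has three phases: predictive, generative and agentic.",
    "NLP techniques include tokenization and parsing.",
]
# Steps 4-5: embed the chunks and retrieve the best match for the question
question = "What are the phases in AI development?"
best = max(chunks, key=lambda c: similarity(embed(question), embed(c)))
# Step 6 would hand `best` to the local LLM (via Ollama) as context
print(best)
```

&lt;p&gt;Real embeddings capture meaning rather than word overlap, but the retrieve-then-generate flow is exactly the same.&lt;/p&gt;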




&lt;h3&gt;
  
  
  &lt;strong&gt;Tech Stack I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PyPDFLoader:&lt;/strong&gt; Used for extracting raw text from the PDF so the bot can "read" it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecursiveCharacterTextSplitter:&lt;/strong&gt; It ensures that even long paragraphs are broken into manageable, overlapping pieces that preserve meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFaceEmbeddings:&lt;/strong&gt; Converts those text chunks into number lists (vectors) that reflect context, not just words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS:&lt;/strong&gt; A lightning-fast search tool that finds which vectors (chunks) are closest to the question vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama:&lt;/strong&gt; Runs lightweight models like &lt;code&gt;phi&lt;/code&gt; on your machine, no cloud needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain:&lt;/strong&gt; The backbone. It handles all connections from question to document to model and back.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The Hidden Struggles and My Fixes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empty Answers or Garbage Output&lt;/strong&gt;&lt;br&gt;
My initial PDF had just one sentence, not enough for meaningful retrieval.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I created a structured PDF (&lt;code&gt;Evolution_of_AI.pdf&lt;/code&gt;) with real content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrong Chunks Being Retrieved&lt;/strong&gt;&lt;br&gt;
Asked about AI phases, but got results about NLP techniques.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Added more chunk overlap, changed embedding model, and tagged the chunks with extra metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deprecation Warnings in LangChain&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;.run()&lt;/code&gt; method stopped working.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Switched to the &lt;code&gt;.invoke()&lt;/code&gt; method per latest LangChain docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama Crashes with Heavy Models&lt;/strong&gt;&lt;br&gt;
Running models like Mistral overloaded my RAM.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Downgraded to &lt;code&gt;phi&lt;/code&gt;, a lighter model that worked well locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Change After Updating PDF&lt;/strong&gt;&lt;br&gt;
I changed the PDF but still got answers from the old one.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I cleared the FAISS index and re-embedded everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short or Vague Queries Confused the Bot&lt;/strong&gt;&lt;br&gt;
“Phases?” returned irrelevant content.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; I used prompt templates to expand such queries into full sentences automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Technical Bits Explained Simply&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chunking&lt;/strong&gt;&lt;br&gt;
Breaks large documents into overlapping sections so important parts aren’t lost during processing.&lt;/p&gt;
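&lt;p&gt;A bare-bones splitter shows the idea. The real &lt;strong&gt;RecursiveCharacterTextSplitter&lt;/strong&gt; is smarter about sentence and paragraph boundaries; this sketch only illustrates why chunks overlap:&lt;/p&gt;

```python
def split_with_overlap(text, chunk_size=200, overlap=50):
    """Slice text into fixed-size chunks, each sharing `overlap` characters
    with the previous one so sentences at a boundary appear in both."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```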

&lt;p&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;br&gt;
Turns sentences into numbers that represent meaning. That way, "vacation" and "holiday" look nearly the same to the machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cosine Similarity&lt;/strong&gt;&lt;br&gt;
A math trick to check how similar two vectors (questions and chunks) are. Smaller angle = better match.&lt;/p&gt;
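&lt;p&gt;In code, that check is a few lines of arithmetic (a tiny, self-contained example on plain number lists):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical direction scores 1.0; a 45-degree angle scores about 0.707
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```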

&lt;p&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;br&gt;
A tool that finds which chunks are most similar to the question, very quickly.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain simplifies the complex plumbing. It:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Takes your question&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Converts it to a vector&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Finds the most relevant document chunks via FAISS&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Sends it all to the LLM&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;• Collects and returns the final answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All without you needing to manually stitch the logic together.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluation Techniques I Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manually compared answers with the PDF&lt;/li&gt;
&lt;li&gt;Asked intentionally vague or tricky questions&lt;/li&gt;
&lt;li&gt;Checked that the answers didn’t hallucinate&lt;/li&gt;
&lt;li&gt;Made sure important info wasn’t skipped (avoided the “lost in the middle” issue)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Any Frontend?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not yet, but I’m planning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Streamlit&lt;/strong&gt;-based UI for chatting with the bot&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;FastAPI&lt;/strong&gt; backend to make it modular&lt;/li&gt;
&lt;li&gt;A desktop wrapper so anyone can use it easily&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What’s Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-PDF support&lt;/li&gt;
&lt;li&gt;Chunk summaries for quick previews&lt;/li&gt;
&lt;li&gt;Using &lt;strong&gt;ragas&lt;/strong&gt; for automated evaluation&lt;/li&gt;
&lt;li&gt;Feedback-based learning loop&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Interview Questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;How does &lt;strong&gt;chunk overlap&lt;/strong&gt; affect retrieval quality in RAG systems?&lt;/li&gt;
&lt;li&gt;What are the benefits of &lt;strong&gt;local embeddings&lt;/strong&gt; over API-based ones?&lt;/li&gt;
&lt;li&gt;How do you &lt;strong&gt;debug wrong or missing retrievals&lt;/strong&gt; in vector search?&lt;/li&gt;
&lt;li&gt;What’s the trade-off between &lt;strong&gt;dense and sparse embeddings&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;How do you handle &lt;strong&gt;stale or outdated indexes&lt;/strong&gt; in a vector DB like FAISS?&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Building this RAG chatbot wasn’t just about code; it was about transforming theory into practice. Every bug I fixed and every wrong answer I debugged helped me grow.&lt;/p&gt;

&lt;p&gt;If you’ve read about RAG and want to &lt;em&gt;really&lt;/em&gt; learn it, build something.&lt;/p&gt;

&lt;p&gt;Let’s keep learning, building, and breaking things together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shaik Salma Aga&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 GitHub: &lt;a href="https://github.com/ShaikSalmaAga/rag-chatbot" rel="noopener noreferrer"&gt;https://github.com/ShaikSalmaAga/rag-chatbot&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastering Retrieval-Augmented Generation (RAG): From Beginner to Expert</title>
      <dc:creator>Salma Aga Shaik</dc:creator>
      <pubDate>Tue, 27 May 2025 05:50:52 +0000</pubDate>
      <link>https://forem.com/salma_aga/-mastering-retrieval-augmented-generation-rag-from-beginner-to-expert-5fgi</link>
      <guid>https://forem.com/salma_aga/-mastering-retrieval-augmented-generation-rag-from-beginner-to-expert-5fgi</guid>
      <description>&lt;h2&gt;
  
  
  Why Should You Care About RAG?
&lt;/h2&gt;

&lt;p&gt;Imagine you work in the HR department of a company that has a 100-page PDF filled with employee policies. One day, an intern walks up to your desk and asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How many work-from-home days are allowed for interns?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You open the document, press &lt;strong&gt;Ctrl+F&lt;/strong&gt;, and type “work-from-home.” But the PDF uses the term &lt;strong&gt;“remote work flexibility.”&lt;/strong&gt; You scroll endlessly, read random sections, and still can’t find a clear answer. It’s frustrating.&lt;/p&gt;

&lt;p&gt;Now imagine a smart chatbot that can read the entire PDF, understand the meaning, and say in 3 seconds:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interns are eligible for 5 remote working days per month.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the power of &lt;strong&gt;RAG: Retrieval-Augmented Generation&lt;/strong&gt;. It gives &lt;strong&gt;real answers from real documents&lt;/strong&gt; — not guesses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Large Language Models Alone Aren’t Enough
&lt;/h2&gt;

&lt;p&gt;LLMs like &lt;strong&gt;GPT-4&lt;/strong&gt; are powerful but have key limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They sometimes &lt;strong&gt;hallucinate&lt;/strong&gt; — they make up answers that sound real but aren’t true.&lt;/li&gt;
&lt;li&gt;Their knowledge is frozen at a training cutoff. GPT-4, for example, was trained on data only up to &lt;strong&gt;2023&lt;/strong&gt;, so anything after that is unknown to it.&lt;/li&gt;
&lt;li&gt;They can’t access &lt;strong&gt;private documents&lt;/strong&gt; like your PDFs or internal policies.&lt;/li&gt;
&lt;li&gt;They don’t &lt;strong&gt;search&lt;/strong&gt;, they just generate responses from memory.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Better Way: Use RAG
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; connects a smart language model to external documents. Instead of guessing, it &lt;strong&gt;retrieves the correct information&lt;/strong&gt; and generates accurate responses.&lt;/p&gt;

&lt;p&gt;So if the intern asks the same question again, the system will search the HR policy and respond:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interns are eligible for 5 remote working days per month.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s fast, trustworthy, and grounded in &lt;strong&gt;real content&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Prepare the Data (Ingestion)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chunking
&lt;/h3&gt;

&lt;p&gt;Your document is like a big cake. Chunking is slicing it into small parts — 256 or 512 tokens — so it’s easier to search.&lt;/p&gt;
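&lt;p&gt;A minimal sketch of the idea, splitting on words rather than real tokens (in practice you’d count tokens with a tokenizer, and the &lt;code&gt;chunk_words&lt;/code&gt; helper here is just illustrative):&lt;/p&gt;

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word chunks (a stand-in for token chunks)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

# Demo: 120 "words" become three chunks, each sharing a 10-word overlap
# with the next so no sentence is cut off without context.
doc = " ".join(f"word{i}" for i in range(120))
pieces = chunk_words(doc, size=50, overlap=10)
```

&lt;p&gt;The overlap matters: without it, an answer that straddles a chunk boundary could be lost.&lt;/p&gt;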

&lt;h3&gt;
  
  
  Embedding
&lt;/h3&gt;

&lt;p&gt;Each chunk is turned into a vector (a list of numbers) using models like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI Embeddings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt; (Bidirectional Encoder Representations from Transformers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Because computers understand &lt;strong&gt;numbers&lt;/strong&gt;, not words. Embeddings help the machine capture the &lt;strong&gt;meaning&lt;/strong&gt; behind the text.&lt;/p&gt;

&lt;p&gt;Example: “holiday leave” and “paid vacation” are different phrases but mean the same thing. Embeddings can tell they’re related.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing in a Vector Database
&lt;/h3&gt;

&lt;p&gt;These vectors go into special databases built for fast search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chroma&lt;/strong&gt; – beginner friendly and local&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt; – cloud-based and scalable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt; – open-source tool by Facebook for high-speed search&lt;/li&gt;
&lt;/ul&gt;
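&lt;p&gt;Under the hood, all of these do some version of “compare the query vector against the stored vectors.” A toy in-memory stand-in makes that concrete (real stores like FAISS use optimized indexes instead of a linear scan, and the chunk ids here are made up):&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# A tiny "vector store": chunk id mapped to its embedding.
store = {
    "leave-policy":  [0.9, 0.1, 0.0],
    "expense-rules": [0.0, 0.2, 0.9],
    "wfh-policy":    [0.8, 0.3, 0.1],
}

def search(query_vec, store, k=2):
    """Return the k chunk ids most similar to the query vector."""
    ranked = sorted(store, key=lambda cid: cosine(query_vec, store[cid]), reverse=True)
    return ranked[:k]

top = search([1.0, 0.2, 0.0], store, k=2)
```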




&lt;h2&gt;
  
  
  What Are Embeddings and Why Do They Matter?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Embeddings&lt;/strong&gt; convert text into vectors so we can search and compare meaning — not just words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sparse Embeddings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tools: &lt;strong&gt;TF-IDF&lt;/strong&gt; (Term Frequency–Inverse Document Frequency), &lt;strong&gt;BM25&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Fast, matches exact terms&lt;/li&gt;
&lt;li&gt;Doesn’t understand deeper meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: If you ask about “holiday” and the doc says “vacation,” sparse embedding will &lt;strong&gt;miss&lt;/strong&gt; it.&lt;/p&gt;
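&lt;p&gt;You can see why with a tiny sketch: a sparse method rewards exact word overlap, so “holiday leave” scores zero against a sentence that only says “paid vacation” (this is a simplified overlap count, not real TF-IDF weighting):&lt;/p&gt;

```python
def keyword_score(query, doc):
    """Count exact word overlap, which is roughly what sparse methods reward."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d))

miss = keyword_score("holiday leave", "Employees get 30 days of paid vacation")
hit = keyword_score("vacation days", "Employees get 30 days of paid vacation")
```

&lt;p&gt;Here &lt;code&gt;miss&lt;/code&gt; is 0 even though the meaning matches, while &lt;code&gt;hit&lt;/code&gt; works only because the exact words appear.&lt;/p&gt;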

&lt;h3&gt;
  
  
  Dense Embeddings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tools: &lt;strong&gt;BERT&lt;/strong&gt;, &lt;strong&gt;Sentence-BERT&lt;/strong&gt;, &lt;strong&gt;OpenAI Embeddings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understands context and meaning&lt;/li&gt;
&lt;li&gt;Better match, even if exact words differ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Ask about “vacation policy,” and the doc says “30 days paid leave.” Dense embeddings will &lt;strong&gt;match&lt;/strong&gt; it.&lt;/p&gt;

&lt;p&gt;Dense embeddings are ideal for RAG because meaning matters more than words.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Retrieval (Find the Right Chunks)
&lt;/h2&gt;

&lt;p&gt;When a user asks something:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is &lt;strong&gt;converted into a vector&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compared with all document vectors&lt;/li&gt;
&lt;li&gt;The closest matches are selected&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Similarity Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cosine Similarity&lt;/strong&gt;: Measures the &lt;strong&gt;angle&lt;/strong&gt; between vectors. Smaller angle = more similar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Euclidean Distance&lt;/strong&gt;: Measures the &lt;strong&gt;distance&lt;/strong&gt; between points. Closer = more similar.&lt;/li&gt;
&lt;/ul&gt;
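&lt;p&gt;Both measures are only a few lines of math. A small sketch shows how they can disagree: two parallel vectors look identical to cosine similarity (same direction) while Euclidean distance still separates them (different magnitudes):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def euclidean_distance(a, b):
    """Straight-line distance: 0.0 means identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# [2, 4] points in the same direction as [1, 2] but is twice as long.
sim = cosine_similarity([1, 2], [2, 4])
dist = euclidean_distance([1, 2], [2, 4])
```

&lt;p&gt;Because embedding vectors are often normalized, cosine similarity is the usual default in RAG pipelines.&lt;/p&gt;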

&lt;h3&gt;
  
  
  Retrieval Methods
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Just pick the top-matching chunk and send to the model. Fast but might lack context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentence-Window Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Picks the match and adds surrounding sentences — so the model understands context better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ensemble Retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Tries multiple chunk sizes (128, 256, 512), combines best chunks, and sorts them with a &lt;strong&gt;Re-Ranker&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
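&lt;p&gt;Sentence-window retrieval is simple to sketch: once the best-matching sentence is found, return it together with its neighbors so the LLM sees the surrounding context (the &lt;code&gt;sentence_window&lt;/code&gt; helper here is hypothetical, not from any library):&lt;/p&gt;

```python
def sentence_window(sentences, best_idx, window=1):
    """Return the best-matching sentence plus `window` neighbors on each side."""
    lo = max(best_idx - window, 0)
    hi = min(best_idx + window + 1, len(sentences))
    return " ".join(sentences[lo:hi])

sents = ["S0.", "S1.", "S2.", "S3.", "S4."]
ctx = sentence_window(sents, best_idx=2, window=1)
```

&lt;p&gt;The &lt;code&gt;max&lt;/code&gt;/&lt;code&gt;min&lt;/code&gt; clamps handle matches at the start or end of the document.&lt;/p&gt;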




&lt;h2&gt;
  
  
  Step 3: Re-Ranking
&lt;/h2&gt;

&lt;p&gt;You might retrieve 10 chunks, but you typically pass only the top 3 to 5 to the LLM. So we &lt;strong&gt;sort&lt;/strong&gt; them by importance first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types of Re-Ranking:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lexical&lt;/strong&gt;: Based on keywords (TF-IDF, BM25)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: Based on meaning (BERT, Cohere)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTR (Learning to Rank)&lt;/strong&gt;: ML model trained to choose best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: Combines keyword and meaning-based methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of re-ranking like a judge picking the best answers to pass to the LLM.&lt;/p&gt;
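&lt;p&gt;A hybrid re-ranker can be sketched as a weighted blend of a lexical score and a semantic score (the candidate scores and the 50/50 weighting below are illustrative assumptions, not from any specific library):&lt;/p&gt;

```python
def keyword_overlap(query, text):
    """Lexical score: count of exact word matches."""
    return len(set(query.lower().split()).intersection(set(text.lower().split())))

def hybrid_rerank(query, candidates, k=3, alpha=0.5):
    """candidates: (chunk_text, semantic_score) pairs.
    Blend lexical overlap with the semantic score, keep the top k chunks."""
    def combined(item):
        text, sem = item
        return alpha * keyword_overlap(query, text) + (1 - alpha) * sem
    ranked = sorted(candidates, key=combined, reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("remote work flexibility for interns", 0.9),
    ("office parking rules", 0.2),
    ("interns get 5 remote days per month", 0.8),
]
best = hybrid_rerank("remote days for interns", chunks, k=2)
```

&lt;p&gt;The irrelevant parking chunk drops out, and only the strongest candidates reach the LLM.&lt;/p&gt;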




&lt;h2&gt;
  
  
  Problems You Might Face
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lost in the Middle
&lt;/h3&gt;

&lt;p&gt;LLMs focus more on the &lt;strong&gt;start and end&lt;/strong&gt; of the input — often skipping the middle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Move key content to start/end, limit total chunks, and use better re-ranking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: If the answer is in paragraph 3 of 5, reorder the chunk or split it to push the key info up.&lt;/p&gt;
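&lt;p&gt;One common fix is to reorder the retrieved chunks so the strongest ones land at the start and end of the prompt, pushing the weakest into the middle where the model pays least attention. A minimal sketch of that reordering:&lt;/p&gt;

```python
def reorder_for_llm(chunks_best_first):
    """Given chunks sorted most-relevant-first, alternate them between the
    front and the back of the prompt so the best material sits at both ends."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ordered = reorder_for_llm(["best", "2nd", "3rd", "4th", "5th"])
```

&lt;p&gt;After reordering, the top chunk opens the prompt, the second-best closes it, and the weakest sits in the middle.&lt;/p&gt;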




&lt;h3&gt;
  
  
  Wrong Retrieval
&lt;/h3&gt;

&lt;p&gt;Sometimes irrelevant chunks get retrieved — leading to wrong answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve chunking (e.g., avoid breaking sentences)
&lt;/li&gt;
&lt;li&gt;Use better embeddings (dense vs sparse)
&lt;/li&gt;
&lt;li&gt;Add filters to improve search accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: A policy question brings in finance data? You likely need to refine your vector store or chunk size.&lt;/p&gt;
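&lt;p&gt;Metadata filtering is one such fix: narrow the search space &lt;em&gt;before&lt;/em&gt; the vector comparison even runs, so a policy question never sees finance chunks at all. A toy sketch, with a made-up &lt;code&gt;dept&lt;/code&gt; field standing in for whatever metadata your store tracks:&lt;/p&gt;

```python
def filtered_chunks(chunks, **filters):
    """Keep only chunks whose metadata matches every filter key/value pair."""
    def matches(chunk):
        return all(chunk["meta"].get(key) == val for key, val in filters.items())
    return [chunk for chunk in chunks if matches(chunk)]

docs = [
    {"text": "intern leave policy", "meta": {"dept": "hr"}},
    {"text": "Q3 budget summary", "meta": {"dept": "finance"}},
    {"text": "remote work policy", "meta": {"dept": "hr"}},
]
hr_only = filtered_chunks(docs, dept="hr")
```

&lt;p&gt;Most vector databases support this kind of pre-filtering natively, so you rarely have to implement it yourself.&lt;/p&gt;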




&lt;h2&gt;
  
  
  Fine-Tuning vs RAG
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;You &lt;strong&gt;retrain the LLM&lt;/strong&gt; to speak in a specific tone or style.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great for personalization or branding&lt;/li&gt;
&lt;li&gt;Expensive, slow, needs lots of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: You fine-tune a model to sound like Shakespeare.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;p&gt;You don’t touch the model. Just &lt;strong&gt;add your documents&lt;/strong&gt;, and the model uses them for answering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to update&lt;/li&gt;
&lt;li&gt;No retraining needed&lt;/li&gt;
&lt;li&gt;Works out-of-the-box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: You upload HR policies. Now the chatbot answers HR questions instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with RAG&lt;/strong&gt; — fine-tune only if your use case demands personality or tone changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Tools and Full Forms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt;: Retrieval-Augmented Generation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: Large Language Model
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt;: Bidirectional Encoder Representations from Transformers
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt;: Facebook AI Similarity Search
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TF-IDF&lt;/strong&gt;: Term Frequency-Inverse Document Frequency
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTR&lt;/strong&gt;: Learning to Rank
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLTK&lt;/strong&gt;: Natural Language Toolkit
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG is a &lt;strong&gt;game changer&lt;/strong&gt;. It connects LLMs to real, updated knowledge — making AI assistants smarter and more trustworthy.&lt;/p&gt;

&lt;p&gt;I’m currently preparing for &lt;strong&gt;software engineering interviews&lt;/strong&gt;, and AI is everywhere. I thought, if I’m learning this, why not help others too?&lt;/p&gt;

&lt;p&gt;That’s why I wrote this post: to make RAG simple and useful for anyone interested in AI.&lt;/p&gt;

&lt;p&gt;I'll be posting more content around AI, tools, and interview prep. &lt;strong&gt;Stay connected!&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5 Must-Know RAG Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is Retrieval-Augmented Generation (RAG), and how is it different from traditional LLMs?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the difference between sparse and dense embeddings? When should you use each?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explain the “Lost in the Middle” problem and how to handle it.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do cosine similarity and Euclidean distance help in finding relevant document chunks?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When should you choose fine-tuning over RAG, and what trade-offs come with it?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🖊️ &lt;em&gt;Written by Shaik Salma Aga&lt;/em&gt;  &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
