“How I Built an End-to-End ETL Pipeline Using Databricks & Delta Lake”

Khandare Shubham — Fri, 19 Dec 2025 19:15:53 +0000

In this project, I built an end-to-end ETL pipeline using Databricks and Delta Lake,
following the Bronze–Silver–Gold architecture.

The goal was to simulate a real-world data engineering pipeline with incremental
processing, workflow orchestration, and analytics-ready datasets.

Tech Stack

Databricks (Free Edition)
Apache Spark (PySpark)
Delta Lake
SQL
GitHub

Architecture Overview

The pipeline follows the Bronze–Silver–Gold data architecture:

Bronze Layer: Raw data ingestion (append-only)
Silver Layer: Cleaned and deduplicated data with incremental updates using Delta MERGE
Gold Layer: Aggregated business metrics optimized for analytics

Bronze Layer

The Bronze layer ingests raw CSV data into Delta tables in append mode.
This layer acts as the source of truth and allows full reprocessing if downstream
transformations fail.
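
A minimal sketch of an append-only Bronze load is shown here. The function name, the `_ingested_at` audit column, and the table and path names are assumptions for illustration, not taken from the repository.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; not imported at runtime
    from pyspark.sql import SparkSession


def ingest_to_bronze(spark: "SparkSession", source_path: str, bronze_table: str) -> None:
    """Append one batch of raw CSV rows to a Bronze Delta table."""
    raw = (
        spark.read
        .option("header", "true")  # first line of the file holds column names
        .csv(source_path)          # no schema inference: every column stays a string
    )
    (
        # Hypothetical audit column recording when the row was loaded.
        raw.selectExpr("*", "current_timestamp() AS _ingested_at")
        .write.format("delta")
        .mode("append")            # never overwrite: Bronze keeps the full history
        .saveAsTable(bronze_table)
    )
```

In a Databricks notebook, where `spark` is already defined, a call might look like `ingest_to_bronze(spark, "/Volumes/main/default/raw/sales.csv", "bronze_sales")`, with both the path and the table name being placeholders. Keeping every column as a string at this stage leaves type casting to the Silver layer, so a malformed value never blocks ingestion.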

Silver Layer

The Silver layer performs data cleaning and deduplication.
Incremental updates are applied with Delta Lake MERGE, which updates rows whose key already
exists and inserts the rest, so re-running the same batch does not create duplicate records.
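
The upsert can be sketched roughly as follows. The helper names, the `silver_updates` view name, and the example key `order_id` are assumptions; the sketch also assumes the Silver table already exists with the same columns as the incoming batch.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; not imported at runtime
    from pyspark.sql import DataFrame, SparkSession


def build_merge_sql(target: str, source_view: str, key_columns: list) -> str:
    """Build a Delta MERGE statement that upserts source rows into target."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in key_columns)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "   # key exists: refresh the row
        "WHEN NOT MATCHED THEN INSERT *"    # new key: add the row
    )


def upsert_to_silver(
    spark: "SparkSession", batch: "DataFrame", silver_table: str, key_columns: list
) -> None:
    """Clean one batch and merge it into the Silver table."""
    cleaned = (
        batch.dropna(subset=key_columns)  # rows without a key cannot be matched
        .dropDuplicates(key_columns)      # MERGE needs at most one source row per key
    )
    cleaned.createOrReplaceTempView("silver_updates")  # hypothetical view name
    spark.sql(build_merge_sql(silver_table, "silver_updates", key_columns))
```

For a single key, `build_merge_sql("silver_sales", "silver_updates", ["order_id"])` yields a statement that matches on `t.order_id = s.order_id`. Note that `dropDuplicates` keeps an arbitrary row per key; if the latest record must win, the batch would need to be ordered and filtered explicitly before the merge.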

Gold Layer

The Gold layer contains aggregated business metrics such as:

Daily sales KPIs
Customer-level metrics
Product-level metrics

Gold tables are rebuilt in overwrite mode on every run, so each result depends only on the
current Silver data and repeated runs produce the same output.
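
As an illustration, a daily sales table could be rebuilt along these lines. The column names `order_date` and `amount`, the table names, and the function names are assumptions, since the post does not list the schema.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; not imported at runtime
    from pyspark.sql import SparkSession


def daily_sales_query(silver_table: str) -> str:
    """Return the aggregation query for daily sales KPIs (hypothetical columns)."""
    return (
        "SELECT order_date, "
        "COUNT(*) AS order_count, "
        "SUM(amount) AS total_sales "
        f"FROM {silver_table} "
        "GROUP BY order_date"
    )


def rebuild_gold_daily_sales(
    spark: "SparkSession", silver_table: str, gold_table: str
) -> None:
    """Recompute the daily KPIs from Silver and replace the Gold table."""
    (
        spark.sql(daily_sales_query(silver_table))
        .write.format("delta")
        .mode("overwrite")  # full rebuild: the old contents are replaced atomically
        .saveAsTable(gold_table)
    )
```

Customer-level and product-level tables would follow the same pattern with a different `GROUP BY` column. Because Delta keeps earlier table versions, an overwrite can still be inspected or rolled back through table history.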

Workflow Orchestration

The entire pipeline is orchestrated using Databricks Workflows.
Tasks are executed in sequence from Bronze to Silver, followed by parallel Gold aggregations.
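
The dependency structure described above might be expressed as a job definition like the one below. The field names (`tasks`, `task_key`, `depends_on`, `notebook_task`) follow the Databricks Jobs API, while the job name, task keys, and notebook paths are placeholders. The `execution_waves` helper is not part of Databricks; it is a small illustration of the order the dependencies imply.

```python
import json

# Hypothetical job definition: Bronze, then Silver, then three Gold tasks.
job = {
    "name": "sales_etl_pipeline",
    "tasks": [
        {"task_key": "bronze_ingest",
         "notebook_task": {"notebook_path": "/Workspace/etl/01_bronze"}},
        {"task_key": "silver_merge",
         "depends_on": [{"task_key": "bronze_ingest"}],
         "notebook_task": {"notebook_path": "/Workspace/etl/02_silver"}},
        {"task_key": "gold_daily_sales",
         "depends_on": [{"task_key": "silver_merge"}],
         "notebook_task": {"notebook_path": "/Workspace/etl/03_gold_daily"}},
        {"task_key": "gold_customer_metrics",
         "depends_on": [{"task_key": "silver_merge"}],
         "notebook_task": {"notebook_path": "/Workspace/etl/03_gold_customer"}},
        {"task_key": "gold_product_metrics",
         "depends_on": [{"task_key": "silver_merge"}],
         "notebook_task": {"notebook_path": "/Workspace/etl/03_gold_product"}},
    ],
}


def execution_waves(tasks: list) -> list:
    """Group task keys into waves; tasks in one wave have no mutual dependencies."""
    remaining = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    done = set()
    waves = []
    while remaining:
        # A task is ready once everything it depends on has finished.
        ready = sorted(k for k, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown task_key")
        waves.append(ready)
        done.update(ready)
        for key in ready:
            del remaining[key]
    return waves


if __name__ == "__main__":
    print(json.dumps(job, indent=2))
    print(execution_waves(job["tasks"]))
```

The three Gold tasks share a single dependency on `silver_merge`, which is what lets them run in parallel once Silver completes.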

Source Code

The complete source code is available on GitHub:
https://github.com/shubhkhandare/databricks-etl-sales

Conclusion

This project helped me understand how production-style ETL pipelines are designed
using Databricks and Delta Lake, including incremental processing and workflow orchestration.
