Forem: Anthony Gicheru

Slowly Changing Dimensions Explained: How Data Warehouses Keep History Accurate

Anthony Gicheru — Sun, 17 May 2026 07:11:45 +0000

1. Why Slowly Changing Dimensions Matter

In data engineering, not all data changes the same way.

Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.

But other data changes slowly.

A customer changes their address.
A product changes category.
An employee moves to a new department.
A supplier changes region.
A user upgrades from a free plan to a premium plan.

These changes do not happen every second, but when they happen, they matter a lot.

Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.

That is a problem.

The business may ask:

“How much revenue did we make from Nairobi customers last year?”

But if you overwrote the customer’s location, your report may give the wrong answer.

This is the exact problem Slowly Changing Dimensions solve.

Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.

2. What Is a Slowly Changing Dimension?

A Slowly Changing Dimension, often shortened to SCD, is a technique used in data warehousing to manage changes in dimension tables over time.

A dimension table stores descriptive information.

For example:

customer_id	customer_name	city	customer_type
101	Mary Wanjiku	Nairobi	Regular

This is not a transaction. It describes the customer.

Now imagine Mary moves from Nairobi to Kisumu.

The question becomes:

Should we overwrite Nairobi with Kisumu, or should we keep a history of both?

That decision is what SCD is all about.

SCD gives data teams a structured way to decide how changes should be stored.

Sometimes we only care about the latest value.

Sometimes we want to preserve the original value.

Sometimes we need the full history of every meaningful change.

And sometimes we only need a simple previous-and-current comparison.

3. How Slowly Changing Dimensions Work

In a data warehouse, data is usually organized into fact tables and dimension tables.

Fact Tables

Fact tables store business events.

Examples:

Sales
Orders
Payments
Website clicks
Deliveries

A sales fact table might look like this:

sale_id	customer_key	product_key	amount	sale_date
5001	1	20	3000	2025-01-10

Dimension Tables

Dimension tables describe the facts.

A customer dimension might look like this:

customer_key	customer_id	name	city	customer_type
1	101	Mary Wanjiku	Nairobi	Regular

The fact table tells us what happened.

The dimension table tells us who, what, where, or how it happened.

The challenge is that dimension data changes.

When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.

There are different SCD types, but the most commonly used are:

SCD Type 0
SCD Type 1
SCD Type 2
SCD Type 3

Let’s go through them practically.

4. SCD Type 0: Keep the Original Value

SCD Type 0 means a value does not change in the data warehouse.

Once the value is loaded, it stays the same, even if the source system changes later.

In simple terms, Type 0 says:

“Keep the original value as it was first recorded.”

In real data warehouse work, Type 0 often applies to fields that should represent the original state of something, although many teams never call it “SCD Type 0” explicitly.

They may simply say:

“This field should never be updated.”

Or:

“Preserve the original value.”

So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.

Good examples of Type 0 fields are:

Original signup date
First purchase date
Original registration country
Original acquisition channel
Original product launch date
Original employee hire date

For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.

customer_id	name	original_signup_date	original_country	acquisition_channel
101	Mary Wanjiku	2025-01-01	Kenya	Instagram Ads

Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.

But the original acquisition channel should still remain Instagram Ads.

Why?

Because it tells the business how Mary was first acquired.

If we overwrite that value, we lose the ability to answer questions like:

“Which marketing channel originally brought us our best customers?”

That is where Type 0 is useful.

It protects values that describe the original state of a record.
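To make the Type 0 rule concrete, here is a minimal sketch of a load step that skips protected columns. The function name apply_source_update and the TYPE_0_COLUMNS set are assumptions for illustration, and the incoming "Email" value is made up.

```python
# Hypothetical list of protected columns, mirroring the example table above.
TYPE_0_COLUMNS = {"original_signup_date", "original_country", "acquisition_channel"}


def apply_source_update(stored: dict, incoming: dict) -> dict:
    """Return the stored row updated from the source, leaving Type 0 columns untouched."""
    updated = dict(stored)
    for column, value in incoming.items():
        if column in TYPE_0_COLUMNS:
            continue  # Type 0: keep whatever was first loaded
        updated[column] = value
    return updated


stored = {
    "customer_id": 101,
    "name": "Mary Wanjiku",
    "original_country": "Kenya",
    "acquisition_channel": "Instagram Ads",
}
# The source system now reports a different channel (made-up value).
incoming = {"customer_id": 101, "name": "Mary Wanjiku", "acquisition_channel": "Email"}

result = apply_source_update(stored, incoming)
print(result["acquisition_channel"])  # Instagram Ads
```

The source said "Email", but the warehouse still answers "Instagram Ads", which is the behavior Type 0 is meant to guarantee.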

5. SCD Type 1: Overwrite the Old Value

SCD Type 1 is the simplest approach.

When a value changes, you overwrite the old value with the new one.

Before:

customer_id	name	city
101	Mary Wanjiku	Nairobi

After Mary moves:

customer_id	name	city
101	Mary Wanjiku	Kisumu

The old city is gone.

When Type 1 Makes Sense

SCD Type 1 is useful when history does not matter.

For example:

Fixing a spelling mistake
Correcting wrong data
Updating an email address
Updating a phone number
Correcting a product name typo

If the original value was wrong, you usually do not want to preserve it.

Example

Imagine a customer’s name was loaded as:

Mary Wanjikuuu

Then later corrected to:

Mary Wanjiku

You do not need historical tracking for the typo. You just update the record.

That is SCD Type 1.

The Risk

The risk with Type 1 is that it destroys history.

If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.

So Type 1 is simple, but dangerous when historical accuracy matters.
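A Type 1 change is just an in-place update. The sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse, with a dim_customer table shaped like the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (101, 'Mary Wanjiku', 'Nairobi')")

# Type 1: overwrite in place. No new row, and no record of the old value.
conn.execute("UPDATE dim_customer SET city = ? WHERE customer_id = ?", ("Kisumu", 101))

rows = conn.execute("SELECT customer_id, name, city FROM dim_customer").fetchall()
print(rows)  # [(101, 'Mary Wanjiku', 'Kisumu')]
```

After the update there is still one row, and nothing in the table can tell you Mary ever lived in Nairobi.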

6. SCD Type 2: Keep Full History

SCD Type 2 is the most widely used SCD technique when analytics needs accurate history.

Instead of overwriting the old record, you create a new row when important attributes change.

Before Mary moves:

customer_key	customer_id	name	city	start_date	end_date	is_current
1	101	Mary Wanjiku	Nairobi	2024-01-01	NULL	true

After Mary moves to Kisumu:

customer_key	customer_id	name	city	start_date	end_date	is_current
1	101	Mary Wanjiku	Nairobi	2024-01-01	2025-03-15	false
2	101	Mary Wanjiku	Kisumu	2025-03-15	NULL	true

Notice something important.

The customer_id stays the same because it represents the real-world customer.

But the customer_key changes because each historical version gets its own unique warehouse key.

This is usually called a surrogate key.

Why This Matters

Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.

If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.

This allows historical reports to stay accurate.

Example Fact Table

sale_id	customer_key	amount	sale_date
5001	1	3000	2025-02-10
5002	2	4500	2025-04-20

The first sale belongs to Mary when she was in Nairobi.

The second sale belongs to Mary when she was in Kisumu.

That is the power of SCD Type 2.
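The expire-and-insert step can be sketched in a few lines of plain Python. This is a simplified illustration, not a production loader: apply_type2_change is a hypothetical helper, it tracks only the city column, and it assigns surrogate keys with a naive max + 1.

```python
from datetime import date


def apply_type2_change(rows, customer_id, new_city, change_date):
    """Expire the current version of a customer and append a new current version."""
    current = next(
        (r for r in rows if r["customer_id"] == customer_id and r["is_current"]), None
    )
    if current is None:
        raise KeyError(f"no current row for customer_id={customer_id}")
    if current["city"] == new_city:
        return rows  # nothing changed, so no new version is created

    # Close the old version on the change date.
    current["end_date"] = change_date
    current["is_current"] = False

    # Append the new version with a fresh surrogate key (max + 1 is a simplification).
    rows.append({
        **current,
        "customer_key": max(r["customer_key"] for r in rows) + 1,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })
    return rows


dim_customer = [{
    "customer_key": 1, "customer_id": 101, "name": "Mary Wanjiku", "city": "Nairobi",
    "start_date": date(2024, 1, 1), "end_date": None, "is_current": True,
}]
apply_type2_change(dim_customer, 101, "Kisumu", date(2025, 3, 15))
for row in dim_customer:
    print(row["customer_key"], row["city"], row["end_date"], row["is_current"])
```

The result matches the "after" table above: the Nairobi row is closed, and the Kisumu row is the only current one.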

7. SCD Type 3: Store Limited History in Columns

SCD Type 3 keeps limited history by adding extra columns.

For example:

customer_id	name	current_city	previous_city
101	Mary Wanjiku	Kisumu	Nairobi

This lets you see the current value and one previous value.

When Type 3 Makes Sense

SCD Type 3 is useful when you only care about a small amount of history.

For example:

Previous region and current region
Previous plan and current plan
Previous department and current department

But it does not scale well if changes happen many times.

What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?

You would need more columns:

previous_city_1
previous_city_2
previous_city_3

That becomes messy quickly.

So Type 3 is useful, but only for very specific cases.
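A small sketch shows both how Type 3 works and where it runs out of room. The helper name apply_type3_change is an assumption for illustration.

```python
def apply_type3_change(row: dict, new_city: str) -> dict:
    """Shift current_city into previous_city, then store the new value."""
    if row["current_city"] == new_city:
        return row
    return {**row, "previous_city": row["current_city"], "current_city": new_city}


row = {"customer_id": 101, "name": "Mary Wanjiku", "current_city": "Nairobi", "previous_city": None}
row = apply_type3_change(row, "Kisumu")
row = apply_type3_change(row, "Nakuru")
print(row["current_city"], row["previous_city"])  # Nakuru Kisumu
```

After the second move, Nairobi is gone. With one previous column you only ever remember one step back.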

8. Practical Example in a Data Warehousing Project

Let’s say you are building a sales analytics warehouse for an e-commerce company.

You have data coming from:

PostgreSQL for application data
Kafka for order events
Airflow for orchestration
dbt for transformations
Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse

Your source customer table in PostgreSQL looks like this:

customer_id	name	city	customer_type	acquisition_channel	updated_at
101	Mary Wanjiku	Nairobi	Regular	Instagram Ads	2025-01-01

Later, the same customer changes:

customer_id	name	city	customer_type	acquisition_channel	updated_at
101	Mary Wanjiku	Kisumu	Premium	Instagram Ads	2025-03-15

Now the data team must decide:

Do we overwrite the old record?

Or do we preserve the old version?

And what do we do with the original acquisition channel?

In this example:

acquisition_channel can be treated as Type 0 because it represents how Mary was originally acquired.
city and customer_type can be treated as Type 2 because they affect historical reporting.

For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.

A Type 2 customer dimension may look like this:

customer_key	customer_id	name	city	customer_type	acquisition_channel	valid_from	valid_to	is_current
1	101	Mary Wanjiku	Nairobi	Regular	Instagram Ads	2025-01-01	2025-03-15	false
2	101	Mary Wanjiku	Kisumu	Premium	Instagram Ads	2025-03-15	NULL	true

Now your reports can answer questions like:

How many sales came from Nairobi customers in February?
How much revenue came from Premium customers after March?
What was the customer type at the time of purchase?
How many customers upgraded from Regular to Premium?

Without SCD Type 2, these questions become difficult or inaccurate.

9. A Simple SCD Type 2 Flow

A typical SCD Type 2 pipeline works like this:

Step 1: Load the Latest Source Data

You extract the latest customer data from the source system.

This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.

Step 2: Compare Source Data With Current Dimension Records

You compare the incoming record with the current active record in the warehouse.

For example, compare:

source.city
source.customer_type

against:

dim_customer.city
dim_customer.customer_type

Step 3: Detect Changes

If nothing changed, do nothing.

If important attributes changed, expire the old record.

For example:

customer_key	customer_id	city	valid_to	is_current
1	101	Nairobi	2025-03-15	false

Step 4: Insert a New Current Record

Then insert the new version:

customer_key	customer_id	city	valid_from	valid_to	is_current
2	101	Kisumu	2025-03-15	NULL	true

Step 5: Use the Correct Dimension Version in Fact Tables

When loading fact data, join the fact date to the correct dimension record using the validity period.

For example:

SELECT
    f.order_id,
    d.customer_key,
    f.order_date,
    f.amount
FROM staging_orders f
JOIN dim_customer d
    ON f.customer_id = d.customer_id
   AND f.order_date >= d.valid_from
   AND (
        f.order_date < d.valid_to
        OR d.valid_to IS NULL
   );

This ensures the order connects to the correct version of the customer.
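To see the join behave, here is a runnable sketch using Python's built-in sqlite3 module as a stand-in for the warehouse. Dates are stored as ISO text, which compares correctly as strings. Order 5003 is an extra made-up row placed exactly on the change date to show which version a boundary order lands on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT
);
CREATE TABLE staging_orders (
    order_id INTEGER, customer_id INTEGER, order_date TEXT, amount INTEGER
);

INSERT INTO dim_customer VALUES (1, 101, 'Nairobi', '2025-01-01', '2025-03-15');
INSERT INTO dim_customer VALUES (2, 101, 'Kisumu',  '2025-03-15', NULL);

INSERT INTO staging_orders VALUES (5001, 101, '2025-02-10', 3000);
INSERT INTO staging_orders VALUES (5002, 101, '2025-04-20', 4500);
INSERT INTO staging_orders VALUES (5003, 101, '2025-03-15', 1200);
""")

# Same join condition as above: valid_from is inclusive, valid_to is exclusive.
rows = conn.execute("""
    SELECT f.order_id, d.customer_key, d.city
    FROM staging_orders f
    JOIN dim_customer d
      ON f.customer_id = d.customer_id
     AND f.order_date >= d.valid_from
     AND (f.order_date < d.valid_to OR d.valid_to IS NULL)
    ORDER BY f.order_id
""").fetchall()
print(rows)
# [(5001, 1, 'Nairobi'), (5002, 2, 'Kisumu'), (5003, 2, 'Kisumu')]
```

Because valid_to is exclusive, the order placed on 2025-03-15 matches only the Kisumu version. That is what stops a boundary date from matching two rows.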

10. Common Mistakes Beginners Make

Mistake 1: Using Type 1 When History Matters

This is probably the most common mistake.

Beginners often overwrite dimension records because it feels simple.

But later, when the business asks historical questions, the warehouse cannot answer correctly.

For example:

“What was revenue by customer region last year?”

If you overwrote all customer regions with the latest value, the report will be wrong.

How to Avoid It

Before choosing Type 1, ask:

“Will the business ever need to know what this value was in the past?”

If yes, consider Type 2.

Mistake 2: Tracking Every Column as Type 2

Not every change deserves a new historical version.

For example, do you really need a new customer dimension row when the phone number changes?

Maybe not.

If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.

How to Avoid It

Classify columns carefully.

For example:

Column	SCD Type
customer_name typo fix	Type 1
city	Type 2
customer_type	Type 2
phone_number	Type 1
email	Type 1 or Type 2 depending on business need

The decision should be based on reporting needs, not just technical preference.
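One way to keep that classification out of people's heads is to write it down as data and let the pipeline read it. The sketch below is an assumption about how that could look: SCD_POLICY and plan_action are hypothetical names, and the phone numbers are made up.

```python
# Hypothetical policy table: column name -> SCD type.
SCD_POLICY = {
    "name": 1,
    "phone_number": 1,
    "city": 2,
    "customer_type": 2,
    "acquisition_channel": 0,
}


def plan_action(current: dict, incoming: dict) -> str:
    """Decide how to apply a source row: 'none', 'overwrite', or 'new_version'."""
    changed = {c for c in SCD_POLICY if current.get(c) != incoming.get(c)}
    if any(SCD_POLICY[c] == 2 for c in changed):
        return "new_version"  # at least one Type 2 column changed
    if any(SCD_POLICY[c] == 1 for c in changed):
        return "overwrite"    # only Type 1 columns changed
    return "none"             # nothing changed, or only Type 0 columns differ


current = {
    "name": "Mary Wanjiku",
    "phone_number": "0700 000 001",
    "city": "Nairobi",
    "customer_type": "Regular",
    "acquisition_channel": "Instagram Ads",
}

phone_change = plan_action(current, {**current, "phone_number": "0700 000 002"})
city_change = plan_action(current, {**current, "city": "Kisumu"})
channel_change = plan_action(current, {**current, "acquisition_channel": "Email"})
print(phone_change, city_change, channel_change)  # overwrite new_version none
```

A new phone number overwrites in place, a new city creates a new version, and a changed acquisition channel is ignored.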

Mistake 3: Not Using a Surrogate Key

A big mistake is using the source system ID as the primary key for the dimension table.

For example, using customer_id as the only key.

That becomes a problem in Type 2 because the same customer can have multiple historical versions.

Better Approach

Use:

customer_id as the business key
customer_key as the warehouse surrogate key

Example:

customer_key	customer_id	city	is_current
1	101	Nairobi	false
2	101	Kisumu	true

The surrogate key uniquely identifies each version.

Mistake 4: Forgetting the Current Flag

Without an is_current column, it becomes harder to query the latest version of each record.

You would have to check for:

valid_to IS NULL

That works, but is_current makes queries easier and clearer.

A good SCD Type 2 table usually has:

valid_from
valid_to
is_current

Some teams also add:

created_at
updated_at
record_hash

Mistake 5: Poor Date Handling

SCD Type 2 depends heavily on dates.

If your valid_from and valid_to values are wrong, your historical joins will be wrong.

Common problems include:

Overlapping date ranges
Gaps between versions
Incorrect timezone handling
Using load date instead of actual business effective date
Not handling late-arriving data

How to Avoid It

Be very intentional about what your dates mean.

For example:

valid_from: when this version became valid
valid_to: when this version stopped being valid
loaded_at: when the data entered the warehouse

These are not always the same thing.

11. Best Practices for SCD

Use SCD Type 2 for Business-Critical History

If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.

Examples:

Customer region
Customer plan
Product category
Sales territory
Employee department
Account status

These are usually worth tracking with Type 2.

Use Hashing to Detect Changes

Instead of comparing many columns one by one, you can create a hash from the important attributes.

Example:

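MD5(CONCAT(COALESCE(city, ''), '|', COALESCE(customer_type, ''), '|', COALESCE(region, '')))

The delimiter keeps values such as ('ab', 'c') and ('a', 'bc') from producing the same input, and COALESCE keeps a single NULL from turning the whole hash into NULL in warehouses where CONCAT propagates NULLs. Exact function names and argument limits vary by warehouse.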

Then compare the source hash with the current dimension hash.

If the hash changes, something important changed.

This makes SCD pipelines easier to maintain, especially when there are many columns.
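The same idea in Python, as a minimal sketch. TRACKED_COLUMNS and record_hash are assumed names, the region value is made up, and the function treats None and an empty string as the same value, which may or may not suit your data.

```python
import hashlib

TRACKED_COLUMNS = ("city", "customer_type", "region")


def record_hash(row: dict) -> str:
    """Hash the tracked attributes, joined with a delimiter so values cannot blur together."""
    parts = ["" if row.get(col) is None else str(row[col]) for col in TRACKED_COLUMNS]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


current = {"city": "Nairobi", "customer_type": "Regular", "region": "Central"}
incoming = {**current, "city": "Kisumu"}

print(record_hash(current) == record_hash(dict(current)))  # True: same values, same hash
print(record_hash(current) == record_hash(incoming))       # False: city changed

# Without the delimiter these two rows would both hash the string "abc".
row_a = {"city": "ab", "customer_type": "c", "region": None}
row_b = {"city": "a", "customer_type": "bc", "region": None}
print(record_hash(row_a) == record_hash(row_b))             # False
```

If a tracked value can itself contain the delimiter, pick a delimiter that cannot appear in the data or escape it first.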

Keep Your SCD Logic Clear

SCD logic can become confusing quickly.

Use clear column names like:

valid_from
valid_to
is_current

Avoid unclear names like:

start
end
flag

Your future self and your teammates will thank you.

Document Which Columns Are Tracked

Do not leave SCD behavior hidden inside SQL code only.

Document which columns are:

Type 0
Type 1
Type 2
Type 3
Ignored
Used for change detection

This is especially important in team environments.

Avoid Duplicates in Current Records

For a Type 2 dimension, each business key should have only one current record.

This should be true:

One customer_id = one current record

You can test this with SQL:

SELECT
    customer_id,
    COUNT(*) AS current_records
FROM dim_customer
WHERE is_current = true
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this query returns rows, your dimension has a problem.

Test for Overlapping Validity Periods

For each business key, the date ranges should not overlap.

Bad example:

customer_id	city	valid_from	valid_to
101	Nairobi	2025-01-01	2025-04-01
101	Kisumu	2025-03-15	NULL

These overlap between March 15 and April 1.

That means a sale on March 20 could match both records.

That is dangerous.
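A check for this can be sketched in Python. It assumes the half-open convention used in the join earlier (valid_from inclusive, valid_to exclusive) and compares each version only with the next one for the same customer. find_overlaps is a hypothetical helper name.

```python
from datetime import date
from itertools import groupby


def find_overlaps(rows):
    """Return (customer_id, earlier_city, later_city) for adjacent versions that overlap."""
    overlaps = []
    ordered = sorted(rows, key=lambda r: (r["customer_id"], r["valid_from"]))
    for customer_id, group in groupby(ordered, key=lambda r: r["customer_id"]):
        versions = list(group)
        for earlier, later in zip(versions, versions[1:]):
            # An open-ended (None) valid_to on a non-final version overlaps what follows.
            if earlier["valid_to"] is None or earlier["valid_to"] > later["valid_from"]:
                overlaps.append((customer_id, earlier["city"], later["city"]))
    return overlaps


bad = [
    {"customer_id": 101, "city": "Nairobi", "valid_from": date(2025, 1, 1), "valid_to": date(2025, 4, 1)},
    {"customer_id": 101, "city": "Kisumu", "valid_from": date(2025, 3, 15), "valid_to": None},
]
good = [
    {**bad[0], "valid_to": date(2025, 3, 15)},
    bad[1],
]

print(find_overlaps(bad))   # [(101, 'Nairobi', 'Kisumu')]
print(find_overlaps(good))  # []
```

Running a check like this after every load catches the problem before a report does.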

12. When to Use Slowly Changing Dimensions

Use SCD when dimension data changes and those changes affect analysis.

Good use cases include:

Customer address history
Product category history
Subscription plan changes
Employee department changes
Supplier region changes
Account status changes
Sales territory changes

SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.

For example:

“Show revenue by the customer’s region at the time of purchase.”

That is a classic SCD Type 2 problem.

13. When SCD May Not Be the Best Choice

SCD is powerful, but not every situation needs it.

Do Not Use Type 2 for Every Small Change

If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.

For example:

Last login timestamp
Profile picture URL
Phone number
Minor spelling corrections
Temporary status fields

For these, Type 1 may be enough.

Be Careful With Very High-Volume Changes

If a dimension changes too frequently, Type 2 can grow very fast.

At that point, you may need a different modeling approach, such as:

Event sourcing
Audit tables
Snapshot tables
Data lake versioning
Change Data Capture history

SCD is best for slowly changing descriptive attributes, not every event that happens in the system.

14. SCD in Modern Data Tools

SCD is not limited to traditional warehouses.

You can implement SCD patterns in many modern data stacks.

In dbt

dbt supports snapshots, which are commonly used to implement SCD Type 2.

A dbt snapshot can track when records change and automatically create historical versions, marking each version's validity period with dbt_valid_from and dbt_valid_to columns.

In Airflow

Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.

In Spark

Spark is useful when you are handling large-scale dimension updates.

You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.

In Kafka and CDC

Kafka can stream changes from source systems.

For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.

From there, you can process those changes and update your dimension tables.

In Warehouses

SCD can be implemented in:

Snowflake
BigQuery
Redshift
PostgreSQL
Databricks
SQL Server

The concept stays the same, even if the syntax differs.

15. Final Summary

Slowly Changing Dimensions help data engineers manage changes in dimension data over time.

They solve an important problem:

How do we keep analytics accurate when descriptive data changes?

The most common SCD types are:

Type	Meaning	Best For
Type 0	Keep the original value unchanged	Original values that should not change
Type 1	Overwrite old values	Corrections and non-historical changes
Type 2	Keep full history using new rows	Historical reporting
Type 3	Keep limited history in columns	Simple previous/current comparisons

In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.

Without SCD, reports can quietly become wrong.

A customer’s current city may overwrite their past city.
A product’s new category may rewrite old sales history.
An employee’s new department may change old performance reports.

That is why SCD matters.

The practical takeaway is simple:

Whenever a dimension value changes, ask whether the business needs to remember the old value.

If the value should never change, Type 0 may be the right choice.

If the old value does not matter, Type 1 may be enough.

If the business needs full history, Type 2 is usually the best choice.

If the business only needs a simple previous-and-current comparison, Type 3 may work.

Why Your Code Breaks in Production (and How Docker Fixes It)

Anthony Gicheru — Tue, 12 May 2026 05:28:52 +0000

1. Why This Matters

You write your code.
You test it locally.
Everything works perfectly.

Then it goes to production… and breaks.

You spend hours debugging, only to realize:
nothing is wrong with your code — the environment is the problem.

In data engineering, this happens all the time:

A Spark job runs locally but fails in production
Airflow works on Ubuntu but breaks on macOS
Kafka pipelines behave differently across environments

At its core, the issue is simple:

Your environment is not consistent.

Containerization solves this by packaging everything your application needs into a single, portable unit that runs the same way anywhere.

2. Core Concept — What is Containerization?

Let’s simplify it with an analogy.

Analogy: A Fully Equipped House

Imagine being placed in an empty field with nothing around you.

No food.
No water.
No electricity.
No shelter.

You might survive for a while, but functioning properly would be difficult.

Now imagine being placed inside a fully equipped house.

Everything you need is already there:

food
water
electricity
furniture
internet
a bed

No matter where that house is moved, you can still live comfortably because your essentials move with you.

Applications work the same way.

An application needs certain things to function:

libraries
runtime versions
system tools
environment variables
dependencies

Without them, the application breaks.

Containerization solves this problem by packaging the application together with everything it needs to run.

Think of a container as:

a fully equipped house for your application.

Inside the container, the app already has:

its dependencies
configurations
runtime environment
required tools

So whether the container runs on:

your laptop
a cloud server
a teammate’s machine

…the application still behaves the same way.

The Mental Model

Containerization gives your application its own portable environment with everything it needs to survive and run consistently.

3. Docker Basics

Key Components

Image - A blueprint/template
Container - A running instance of that image
Dockerfile - Instructions to build the image

Let’s Make It Real

Here’s the smallest possible Docker setup for a Python app.

app.py

print("Hello from Docker!")

Dockerfile

FROM python:3.10-slim

WORKDIR /app
COPY app.py .

CMD ["python", "app.py"]

Build and Run

docker build -t my-python-app .
docker run my-python-app

Notice what we didn’t do:

Install Python manually
Manage versions
Configure anything

The environment is fully defined in the Dockerfile.

4. Why Docker is Useful in Data Engineering

In real-world data systems, you work with tools like:

Apache Airflow
Spark / PySpark
PostgreSQL or another data warehouse
Reporting tools or dashboards

Each of these has:

Different dependencies
Different configurations
Different runtime requirements
Different ports
Different environment variables

Without Docker, they often conflict.

For example:

Airflow may require specific Python packages
PySpark may need Java and Spark installed
PostgreSQL may need database credentials and storage
Dashboard tools may need access to the processed data

With Docker:

each tool runs in its own isolated environment — no conflicts, no surprises.

This is especially useful in batch data pipelines because the entire workflow can be reproduced across different machines and environments.

5. Docker Compose — Managing Multiple Containers

Real systems are rarely just one container.

A Dockerized data engineering pipeline may include:

An Airflow webserver
An Airflow scheduler
A PostgreSQL database
A Spark / PySpark processing service
Shared folders for DAGs, logs, scripts, and data

Running each service manually quickly becomes painful.

Docker vs Docker Compose

Docker - docker run starts one container at a time
Docker Compose - runs an entire system made up of multiple containers

The Key Insight

Without Docker Compose:

Multiple terminals
Manual startup order
Constant configuration issues
Harder networking between services

With Docker Compose:

one command starts everything.

Example: Multi-Service Setup

A simplified Docker Compose setup for a batch pipeline may include Airflow and PostgreSQL.

docker-compose.yml

services:
  airflow-webserver:
    image: apache/airflow:3.2.1
    container_name: airflow_webserver
    command: airflow api-server  # Airflow 3 replaced the webserver command with api-server
    ports:
      - "8080:8080"
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./jobs:/opt/airflow/jobs
    depends_on:
      - postgres

  airflow-scheduler:
    image: apache/airflow:3.2.1
    container_name: airflow_scheduler
    command: airflow scheduler
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./jobs:/opt/airflow/jobs
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    container_name: postgres_db
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5433:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

8. Common Mistakes

Using localhost inside containers

This breaks almost everyone at first.

Inside a container:

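localhost refers to the container itself, not your machine. To reach another service in the same Compose project, use its service name as the host (for example, postgres:5432, as in the connection string above).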

Forgetting environment variables

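A missing variable often does not stop a container from starting. The service falls back to a default value and fails later, far from the real cause.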
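One defensive habit is to fail loudly at startup instead. Here is a minimal sketch: require_env is a hypothetical helper, and the PIPELINE_DB_HOST and PIPELINE_DB_PASSWORD variable names are made up for the example.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value


os.environ["PIPELINE_DB_HOST"] = "postgres"   # set here only so the example runs
os.environ.pop("PIPELINE_DB_PASSWORD", None)  # make sure this one is missing

host = require_env("PIPELINE_DB_HOST")
try:
    require_env("PIPELINE_DB_PASSWORD")
    error = None
except RuntimeError as exc:
    error = str(exc)

print(host)   # postgres
print(error)  # Required environment variable PIPELINE_DB_PASSWORD is not set
```

Calling a helper like this at the top of a job means a missing setting shows up as a clear error in the first second, not as a strange failure ten minutes in.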

Not persisting data

Containers are temporary. Anything written inside a container is lost when the container is removed, so without a volume your database files disappear with it.

  volumes:
    - postgres_data:/var/lib/postgresql/data

Rebuilding unnecessarily

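Poor Dockerfile structure can slow builds significantly. For example, copying your source code before installing dependencies invalidates the cached dependency layer on every code change, so each build reinstalls everything.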

9. Best Practices

Use lightweight images

  FROM python:3.10-slim

Add a .dockerignore

  node_modules
  .git
  .env

Avoid latest in production

Use fixed versions to keep builds predictable.

Separate dev and production setups

They have different requirements.

Use Docker Compose for local development

It helps simulate real systems easily.

Use clear service names

Examples:

kafka
postgres
airflow

This simplifies networking and debugging.

10. Conclusion

Containerization changes how you think about environments.

Docker packages your application into a portable unit.
Docker Compose runs entire systems with one command.
Your pipelines become reproducible and consistent.

The real shift is this:

You stop debugging environments — and start defining them as code.

And once you reach that point:

You’re no longer just writing code — you’re building systems.

Data Warehouses, Data Marts, Data Lakes, and Lakehouses - Explained Like You’re Building Them in Real Life

Anthony Gicheru — Sun, 03 May 2026 08:10:27 +0000

If you’ve been around data engineering long enough, you’ve probably heard these terms thrown around in meetings:

“Just dump it in the data lake”
“We’ll expose it through the warehouse”
“That goes into the mart”
“We’re moving to a lakehouse architecture”

And honestly… it can sound like four different ways of saying the same thing.

They’re not.

Each one solves a slightly different problem in the data ecosystem. Once you understand the “why” behind each, the architecture suddenly feels a lot less like buzzwords and more like a clean system design.

Let’s break it down in a practical, engineer-first way.

1. The Big Picture (Why all these systems exist)

In most companies, data doesn’t come from one place — it flows in from everywhere:

User clicks from web/mobile apps
Payments and transactions
Logs from servers
Third-party APIs (Stripe, Shopify, etc.)
IoT or streaming data (Kafka, sensors, etc.)

Now here’s the problem:

Raw data is messy. Business users don’t want messy.

So we build systems that progressively refine data from:

Raw → Clean → Structured → Business-ready

That’s where these four concepts come in:

Data Lake → store everything raw
Data Warehouse → structured analytics-ready data
Data Mart → department-specific slices of warehouse data
Lakehouse → hybrid of lake + warehouse

2. Data Lake — “Store everything first, figure it out later”

A data lake is basically a massive storage system where you dump raw data in its original format.

Think of it like:

A giant warehouse where you throw every box in as-is, without opening it.
Or even better: a farm storage system where everything is stored right after harvest, unprocessed and mixed together.

Characteristics:

Stores structured, semi-structured, and unstructured data
Cheap storage (usually object storage like S3)
Schema is applied when reading, not writing (schema-on-read)

Example tools:

Amazon S3
Azure Data Lake Storage
Google Cloud Storage

Example:

You might store:

/events/clicks/2026/05/01.json
/logs/api/2026/05/01.log
/payments/stripe/2026/05/01.parquet

No transformations. No enforcement. Just storage.
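Schema-on-read, listed in the characteristics above, means the structure is applied by whoever reads the data. Here is a small sketch in plain Python. The raw lines, the SCHEMA mapping, and read_with_schema are all made up for illustration.

```python
import json

# Raw lines as they might sit in the lake; nothing was validated on write.
raw_lines = [
    '{"order_id": 101, "user_id": 55, "amount": 250, "timestamp": "2026-05-01T10:00:00Z"}',
    '{"order_id": "102", "user_id": 56, "amount": "99.5"}',
    "not even json",
]

# The reader decides which fields it wants and what types they should have.
SCHEMA = {"order_id": int, "user_id": int, "amount": float}


def read_with_schema(lines):
    """Apply SCHEMA while reading; collect lines that do not fit instead of failing."""
    good, bad = [], []
    for line in lines:
        try:
            record = json.loads(line)
            good.append({field: cast(record[field]) for field, cast in SCHEMA.items()})
        except (ValueError, KeyError, TypeError):
            bad.append(line)
    return good, bad


good, bad = read_with_schema(raw_lines)
print(good)
print(bad)  # ['not even json']
```

The lake accepted all three lines without complaint. The mess only becomes visible, and gets handled, at read time.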

Here’s the catch:

If you’re not careful, a data lake becomes a data swamp — lots of data, zero usability.

3. Data Warehouse — “Clean, structured, and business-ready”

A data warehouse is where data goes after it has been cleaned, transformed, and modeled for analytics.

Think of it like:

A well-organized supermarket where everything is cleaned, packaged, labeled, and placed on the right shelves.
You don’t pick raw potatoes from the soil — you get them washed, sorted, and priced.

Characteristics:

Mostly structured data (many modern warehouses can also store semi-structured types such as JSON)
Schema-on-write (you define structure before loading)
Optimized for analytics queries (OLAP systems)
Highly curated and trustworthy

Example tools:

Amazon Redshift
Snowflake
Google BigQuery

Typical workflow:

Extract data from sources
Transform (clean, join, aggregate)
Load into warehouse tables

Example SQL model:

CREATE TABLE sales_fact (
    order_id INT,
    customer_id INT,
    product_id INT,
    amount DECIMAL(10,2),
    order_date DATE
);

Now business analysts can run queries like:

SELECT product_id, SUM(amount)
FROM sales_fact
GROUP BY product_id;
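Schema-on-write, listed in the characteristics above, is the opposite of the lake: the table definition rejects bad rows when they are written. The sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The NOT NULL and CHECK constraints are additions for the demonstration, and real warehouses differ in which constraints they enforce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        order_id    INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        amount      REAL    NOT NULL CHECK (amount >= 0),
        order_date  TEXT    NOT NULL
    )
""")

# A well-formed row loads without trouble.
conn.execute("INSERT INTO sales_fact VALUES (1, 101, 20, 3000, '2025-01-10')")

try:
    # Missing amount: the table definition rejects the row at write time.
    conn.execute("INSERT INTO sales_fact VALUES (2, 101, 20, NULL, '2025-01-11')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(rejected, count)  # True 1
```

Only the valid row made it in, so analysts querying the table never see the broken one.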

4. Data Marts — “Department-specific mini warehouses”

A data mart is a subset of a data warehouse focused on a specific business domain.

Think of it like:

A grocery store or specialty shop — like a bakery, butcher, or vegetable shop.
It doesn’t sell everything. It only sells what its customers actually need.

Characteristics:

Smaller scope than a warehouse
Built for a specific team (finance, marketing, sales)
Faster queries for targeted use cases

Example:

A marketing data mart might include:

Campaign performance
Customer acquisition metrics
Ad spend data

Example structure:

CREATE TABLE marketing_campaign_performance AS
SELECT
    campaign_id,
    SUM(clicks) AS total_clicks,
    SUM(impressions) AS total_impressions
FROM ad_events
GROUP BY campaign_id;

Why it exists:

Instead of everyone querying a massive warehouse, teams get pre-optimized datasets.

5. Data Lakehouse — “Best of both worlds”

Now this is where things get interesting.

A lakehouse combines:

The flexibility of a data lake
The structure and performance of a data warehouse

Think of it like:

A modern retail system where the warehouse and supermarket are combined into one smart facility.
Raw goods arrive, but they are immediately tracked, organized, and made queryable without losing flexibility.

Characteristics:

Uses low-cost storage (like a lake)
Adds structure, ACID transactions, and governance
Supports both analytics and ML workloads

Example tools:

Apache Spark + Delta Lake
Apache Iceberg
Apache Hudi

Why it matters:

In traditional setups:

Data lakes = flexible but messy
Warehouses = clean but expensive and rigid

Lakehouses try to remove that tradeoff.

6. How They Work Together (Real Architecture Flow)

A modern data pipeline often looks like this:

[ Data Sources ]
      ↓
   DATA LAKE (raw storage)
      ↓
ETL / ELT pipelines (Airflow, Spark)
      ↓
DATA WAREHOUSE (modeled data)
      ↓
DATA MARTS (team-specific views)
      ↓
Dashboards / BI tools

Or in a lakehouse setup:

[ Data Sources ]
      ↓
DATA LAKEHOUSE (single system)
      ↓
BI + ML + Analytics directly

7. Practical Example (Mini Pipeline)

Let’s say we’re processing e-commerce data.

Step 1: Raw data in S3 (Data Lake)

{
  "order_id": 101,
  "user_id": 55,
  "amount": 250,
  "timestamp": "2026-05-01T10:00:00Z"
}

Step 2: Spark transformation

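from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl").getOrCreate()

df = spark.read.json("s3://datalake/raw/orders/")

# Drop incomplete rows, rename amount, and derive the order_date column that Step 4 aggregates on
clean_df = (
    df.dropna()
      .withColumnRenamed("amount", "order_amount")
      .withColumn("order_date", F.to_date("timestamp"))
)

clean_df.write.mode("overwrite").parquet("s3://warehouse/sales_fact/")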

Step 3: Load into warehouse (Redshift example)

COPY sales_fact
FROM 's3://warehouse/sales_fact/'
IAM_ROLE 'arn:aws:iam::123456:role/RedshiftRole'
FORMAT AS PARQUET;

Step 4: Create a data mart

CREATE TABLE sales_summary AS
SELECT
    DATE(order_date) AS date,
    SUM(order_amount) AS revenue
FROM sales_fact
GROUP BY DATE(order_date);
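
If you don’t have a Redshift cluster handy, you can still try the Step 4 aggregation locally. Here is a minimal sketch that uses Python’s built-in sqlite3 module as a stand-in for the warehouse. The sample rows, the sale_date alias, and the exact column types are assumptions for illustration, not part of the real pipeline.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption for local testing).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT, order_amount REAL)"
)

# Hypothetical sample rows; in the real flow these arrive via the COPY in Step 3.
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        (101, "2026-05-01", 250.0),
        (102, "2026-05-01", 100.0),
        (103, "2026-05-02", 75.0),
    ],
)

# Same aggregation shape as the sales_summary mart: one revenue row per day.
conn.execute(
    """
    CREATE TABLE sales_summary AS
    SELECT DATE(order_date) AS sale_date, SUM(order_amount) AS revenue
    FROM sales_fact
    GROUP BY DATE(order_date)
    """
)

summary = conn.execute(
    "SELECT sale_date, revenue FROM sales_summary ORDER BY sale_date"
).fetchall()
print(summary)
```

The point is only the shape of the mart: a small, pre-aggregated table that a dashboard can read without touching the full fact table.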

8. Common Pitfalls (Where most teams mess up)

1. Turning the data lake into a swamp

Dumping everything without metadata or structure leads to chaos.

2. Over-modeling too early

Trying to build perfect schemas upfront slows everything down.

3. Duplicating logic across marts

You end up with inconsistent metrics like “Revenue_v1”, “Revenue_final”, “Revenue_real_final”.

4. No governance layer

Without access control and cataloging, nobody trusts the data.

9. Best Practices

1. Use layered architecture

Raw (lake)
Cleaned (staging)
Modeled (warehouse)
Aggregated (marts)

2. Standardize transformations

Use tools like:

dbt
Apache Airflow
Spark jobs with clear ownership

3. Define a single source of truth

One metric definition per business KPI. No duplicates.
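
One way to picture this is a single shared function that every mart calls. This is a rough sketch in plain Python, and the business rule (only completed orders count as revenue), the field names, and the two mart functions are all hypothetical.

```python
# Single, shared definition of the "revenue" KPI.
# Hypothetical rule for illustration: only completed orders count.
def revenue(orders):
    """Sum order amounts for completed orders only."""
    return sum(o["amount"] for o in orders if o["status"] == "completed")


# Two marts reuse the same definition instead of re-implementing it.
def finance_mart(orders):
    return {"total_revenue": revenue(orders)}


def marketing_mart(orders, campaign):
    campaign_orders = [o for o in orders if o["campaign"] == campaign]
    return {"campaign": campaign, "revenue": revenue(campaign_orders)}


orders = [
    {"amount": 250, "status": "completed", "campaign": "spring"},
    {"amount": 100, "status": "refunded", "campaign": "spring"},
    {"amount": 75, "status": "completed", "campaign": "summer"},
]

print(finance_mart(orders))
print(marketing_mart(orders, "spring"))
```

If the rule for revenue changes, it changes in one place and both marts pick it up.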

4. Treat data like software

Version it, test it, document it.

5. Monitor everything

Pipeline failures
Data freshness
Schema changes
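
Two of these checks are easy to sketch in plain Python. The example below is a simplified illustration, not a monitoring tool: the 24-hour freshness threshold is an assumption, and the expected columns are taken from the sample order record shown earlier.

```python
from datetime import datetime, timedelta, timezone

# Columns from the sample order record; treat this as the expected schema.
EXPECTED_COLUMNS = frozenset({"order_id", "user_id", "amount", "timestamp"})


def is_fresh(last_loaded_at, now, max_age=timedelta(hours=24)):
    """Return True if the latest load is no older than max_age (assumed: 24h)."""
    return now - last_loaded_at <= max_age


def schema_drift(record, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for one record."""
    actual = set(record)
    return expected - actual, actual - expected


now = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2026, 5, 2, 0, 0, tzinfo=timezone.utc)

print(is_fresh(last_load, now))
print(schema_drift({"order_id": 101, "user_id": 55, "amount": 250, "ts": "..."}))
```

In practice these checks would run on a schedule and raise an alert rather than print.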

10. Conclusion — The mental model that matters

If you strip away the jargon, it’s really simple:

Data Lake → store everything
Data Warehouse → clean and organize it
Data Mart → tailor it for teams
Lakehouse → unify storage and analytics

The real skill in data engineering isn’t memorizing definitions.

It’s knowing:

When to keep data raw, when to structure it, and when to specialize it.

Once that clicks, designing data systems becomes a lot more intuitive — and honestly, more fun to build.

Refactoring Airflow Pipelines: From PythonOperator to TaskFlow

Anthony Gicheru — Fri, 24 Apr 2026 10:52:39 +0000

Actually Embracing TaskFlow After a Year of Doing It the “Old Way”

1. Introduction: This Isn’t New… But It Feels New

If you’ve been using Airflow for a while, like I have, you probably didn’t start with the TaskFlow API.

You likely started with the classic Airflow 2.x style:

PythonOperator
**kwargs
ti.xcom_push() and ti.xcom_pull()
Explicit task chaining with >>

I spent over a year building pipelines this way. And to be clear, it works. It’s stable, production-ready, and widely used.

But here’s the interesting part:

The TaskFlow API has existed since Airflow 2.0. I just didn’t fully adopt it.

Honestly, I ignored TaskFlow for a long time because I thought it was just ‘syntactic sugar’. That’s more common than people admit.

Most production systems and tutorials still rely on operators, so you naturally stay in that pattern. It’s only later, when readability and maintainability start to matter, that TaskFlow becomes interesting.

And once it clicks, it changes how you think about Airflow.

2. Core Concepts: Same Engine, Different Experience

TaskFlow doesn’t replace Airflow concepts; it abstracts them.

You still work with:

Tasks
DAGs
Scheduling
XComs

The difference is how you express them.

Traditional Approach

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(**kwargs):
    data = [1, 2, 3]
    kwargs['ti'].xcom_push(key='data', value=data)

def transform(**kwargs):
    ti = kwargs['ti']
    data = ti.xcom_pull(task_ids='extract', key='data')
    return [x * 2 for x in data]

with DAG('traditional_dag', start_date=datetime(2023, 1, 1)) as dag:

    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)

    t1 >> t2

This works, but it introduces a lot of orchestration boilerplate into your business logic.

TaskFlow Approach

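# airflow.sdk is the Airflow 3.x import path; on Airflow 2.x use:
# from airflow.decorators import dag, task
from airflow.sdk import dag, task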
from datetime import datetime

@dag(start_date=datetime(2023, 1, 1), schedule='@daily', catchup=False)
def taskflow_dag():

    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(data):
        return [x * 2 for x in data]

    transform(extract())

dag = taskflow_dag()

This feels simpler because it is.

TaskFlow removes explicit XCom handling and lets function returns define data flow.

3. The Real Shift: From Wiring Tasks to Modeling Data Flow

With the traditional approach, your mental model looks like this:

Task A → XCom → Task B → XCom → Task C

With TaskFlow, it becomes:

data = extract()
result = transform(data)

Same execution engine. Different abstraction.

The shift is from task orchestration to data flow composition.

4. XComs: Manual vs Automatic

Manual XComs

def extract(**kwargs):
    kwargs['ti'].xcom_push(key='data', value=[1, 2, 3])

def transform(**kwargs):
    ti = kwargs['ti']
    data = ti.xcom_pull(task_ids='extract', key='data')

You manage everything explicitly.

TaskFlow XComs

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

Airflow handles:

serialization
storage
retrieval

You focus on logic.
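
To build some intuition for what “automatic” means here, this is a toy model in plain Python. It is not how Airflow is implemented: the dictionary standing in for XCom storage and the toy_task decorator are made up for illustration, and unlike real TaskFlow (where calling a task inside a DAG only builds a reference), this toy runs each function immediately.

```python
import functools

# Toy stand-in for XCom storage: a dict keyed by task name.
# Illustration only; Airflow stores XComs in its metadata database by default.
xcom_store = {}


def toy_task(func):
    """Run func and record its return value under the function's name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        xcom_store[func.__name__] = result  # the implicit "push"
        return result

    return wrapper


@toy_task
def extract():
    return [1, 2, 3]


@toy_task
def transform(data):
    return [x * 2 for x in data]


doubled = transform(extract())
print(doubled)
print(xcom_store)
```

The takeaway is that return values still get stored somewhere. The decorator just does the pushing for you.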

When You Still Need Control

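TaskFlow still allows explicit control when needed. Calling a @task function already returns an XComArg, so no wrapping is required; the explicit XComArg(...) form takes an operator and is mainly useful for referencing the output of a classic operator:

from airflow.models.xcom_arg import XComArg

@task
def extract():
    return {"numbers": [1, 2, 3]}

@task
def transform(data):
    return [x * 2 for x in data["numbers"]]

extracted = extract()  # already an XComArg pointing at extract's return value

# Explicit equivalent of transform(extracted): build the XComArg from the operator
transform(XComArg(extracted.operator))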

5. Real-World Example: Gas Prices ETL Refactor

I didn’t build two versions of this pipeline at once.

I originally built it using the traditional Airflow 2.x approach and later refactored it using TaskFlow.

That’s when the difference became clear.

Pipeline Overview

API → Extract gas prices → Transform → Store in PostgreSQL

GitHub Reference

Full project: GitHub link to the project

It includes both the original DAG and the TaskFlow refactor.

Traditional Version

def fetch_gas_prices(**kwargs):
    kwargs['ti'].xcom_push(key='raw_gas_data', value=decoded_data)

def transform_gas_prices(**kwargs):
    raw_data = kwargs['ti'].xcom_pull(
        task_ids='fetch_gas_prices',
        key='raw_gas_data'
    )

This approach tightly couples logic with Airflow internals.

Data must be serialized manually:

json_data = df.to_json(orient='records')

TaskFlow Version

@task
def fetch_gas_prices():
    return decoded_data

@task
def transform_gas_prices(raw_data: str):
    return df.to_json(orient='records')

And the pipeline becomes:

raw = fetch_gas_prices()
cleaned = transform_gas_prices(raw)
store_gas_prices(cleaned)

This reads like standard Python.

What Changed

The logic stayed the same. The structure changed completely.

Instead of manually managing XComs, data flows naturally between functions.

Before vs After

Aspect	Traditional	TaskFlow
Task definition	PythonOperator	@task
Data passing	Manual XCom	Automatic
Readability	Medium	High
Boilerplate	High	Low
Mental model	Wiring tasks	Data flow

6. Lessons From the Refactor

1. TaskFlow doesn’t remove XComs

It only hides them.

You still need to respect serialization limits:

return big_dataframe  # still not ideal

2. Passing data is easier, but not always better

TaskFlow makes it easy to pass data between tasks, but large payloads should still live in external storage.
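
A common pattern is to write the payload somewhere else and pass only a small reference between tasks. The sketch below shows the idea with plain functions and a temporary directory standing in for external storage such as S3. The function names, the file name, and the price-to-cents transformation are hypothetical; in a real DAG these would be @task functions.

```python
import json
import tempfile
from pathlib import Path


def extract_to_storage(rows, directory):
    """Write rows to a file and return only its path (small and XCom-friendly)."""
    path = Path(directory) / "gas_prices_raw.json"  # hypothetical file name
    path.write_text(json.dumps(rows))
    return str(path)


def transform_from_storage(path):
    """Load rows from the given path and convert each price to whole cents."""
    rows = json.loads(Path(path).read_text())
    return [
        {"station": row["station"], "price_cents": round(row["price"] * 100)}
        for row in rows
    ]


# A temporary directory stands in for external storage in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    ref = extract_to_storage([{"station": "A", "price": 1.5}], tmp)
    result = transform_from_storage(ref)

print(result)
```

Only the path string travels between the two steps, so the size of the dataset no longer matters to XCom.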

3. Refactoring was mostly structural

Most of the work was:

removing **kwargs
replacing XCom logic with returns
simplifying task boundaries

4. The biggest change is mental

The shift was not technical; it was conceptual.

From:

How do I connect tasks?

To:

How does data flow through this pipeline?

7. Pitfalls to Avoid

Don’t push large objects through XCom
Don’t mix styles without intention
Don’t overuse TaskFlow just because it’s cleaner
Don’t forget serialization still exists

8. Conclusion

TaskFlow isn’t new, but adopting it after using the traditional approach makes its benefits clearer.

It moves you from writing orchestration-heavy DAGs to writing clean Python workflows.

And that shift improves:

readability
maintainability
reasoning about pipelines

Key Takeaways

TaskFlow simplifies DAG structure without changing Airflow’s core engine
XComs still exist but are abstracted
The real improvement is cleaner data flow modeling
Refactoring old DAGs is one of the best ways to understand it

Data Pipelines Explained Simply (and How to Build Them with Python)

Anthony Gicheru — Fri, 17 Apr 2026 07:34:55 +0000

Data pipelines are the backbone of modern data-driven organizations. They automate the movement, transformation, and storage of data - from raw sources to actionable insights.

Python has become the go-to language for building scalable pipelines because of its rich ecosystem, flexibility, and ease of use.

This guide walks through the fundamentals, tools, and best practices for building robust data pipelines using Python.

Understanding Data Pipelines

Imagine you need to supply clean water to a village. The process involves collecting water from different sources (rivers, wells, rain), purifying it, transporting it, and storing it so people can access it whenever they need it.

A data pipeline works in a very similar way.

It automates the journey of raw data from multiple sources (like databases, APIs, or IoT devices), transforming it into clean, usable data stored in a destination (like a data warehouse) for analysis.

Components of a Data Pipeline

Let’s break it down using the same analogy:

1. Collecting Water (Data Ingestion)

Just like gathering water from lakes or wells, a pipeline starts by extracting data from sources such as databases, APIs, spreadsheets, or sensors.

The goal here is simple: get all the data into one system, no matter how scattered it is.

2. Filtering and Purifying (Data Transformation)

Raw water isn’t clean—and neither is raw data.

At this stage, the pipeline:

Removes duplicates
Handles missing values
Standardizes formats
Enriches data

This is where messy data becomes usable.
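
Here is a small sketch of those four steps in plain Python, so it runs without any extra libraries. The field names, the choice to fill missing values with defaults, and the "large order" threshold are all assumptions made for the example.

```python
def transform(records):
    """Deduplicate, fill missing values, standardize formats, and enrich."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:  # remove duplicates (by id)
            continue
        seen.add(rec["id"])
        # Handle missing values and standardize the format in one step.
        city = (rec.get("city") or "unknown").strip().title()
        amount = float(rec.get("amount") or 0)
        cleaned.append(
            {
                "id": rec["id"],
                "city": city,
                "amount": amount,
                "is_large_order": amount >= 100,  # enrichment (hypothetical threshold)
            }
        )
    return cleaned


raw = [
    {"id": 1, "city": " nairobi ", "amount": "250"},
    {"id": 1, "city": " nairobi ", "amount": "250"},
    {"id": 2, "city": None, "amount": None},
]

print(transform(raw))
```

With larger datasets you would usually reach for pandas or PySpark, but the steps stay the same.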

3. Transporting Through Pipes (Data Movement)

Once cleaned, water flows through pipes. In data pipelines, this represents the movement of data between systems.

This can involve:

ETL processes
Message queues (like Kafka)
Cloud data transfer services

The goal is to move data efficiently without delays or bottlenecks.

4. Storing in Tanks (Data Storage)

Clean water is stored in tanks. Similarly, processed data is stored in:

Data warehouses (like Snowflake)
Data lakes (like AWS S3)
Databases

This is where data becomes ready for use.

5. Accessing on Demand (Data Consumption)

Finally, people use the water.

In the same way, data is consumed through:

Dashboards
APIs
Machine learning models

This is where insights actually happen.

Essential Python Libraries and Tools

Python supports every stage of a pipeline:

Data Ingestion

requests - API calls
pandas - handling CSV/JSON files

Transformation

pandas - cleaning and aggregation
PySpark - large-scale distributed processing

Storage

SQLAlchemy - database interaction
boto3 - AWS S3 integration

Orchestration

Apache Airflow - workflow scheduling and automation
Dagster - modern pipeline orchestration with observability

Best Practices

Error Handling

Implement retries and proper logging to avoid silent failures.
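
As a rough idea of what that can look like, here is a small retry helper built on the standard library. The helper name, the default of three attempts, and the deliberately flaky step are hypothetical; orchestrators like Airflow also offer their own retry settings.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def run_with_retries(step, max_attempts=3, delay_seconds=0.0):
    """Call step(); on failure, log the error and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            logger.exception("Attempt %d/%d failed", attempt, max_attempts)
            if attempt == max_attempts:
                raise  # surface the error instead of failing silently
            time.sleep(delay_seconds)


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network error")
    return {"status": "ok"}


result = run_with_retries(flaky_extract)
print(result, calls)
```

Every failed attempt is logged with its traceback, and the final failure is re-raised so the pipeline run is marked as failed.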

Monitoring

Track pipeline health using tools like Airflow’s UI.

Documentation

Keep clear documentation for:

Code
Dependencies
Workflow logic

Testing

Test each stage of the pipeline using:

Unit tests
Sample datasets
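
For example, a single transformation step can be tested with a tiny hand-written dataset. This sketch uses the built-in unittest module, and the remove_duplicates function and sample rows are made up for the example.

```python
import unittest


def remove_duplicates(rows, key):
    """Keep the first row seen for each value of key."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


class RemoveDuplicatesTest(unittest.TestCase):
    def test_keeps_first_occurrence(self):
        sample = [
            {"order_id": 1, "amount": 10},
            {"order_id": 1, "amount": 99},
            {"order_id": 2, "amount": 20},
        ]
        result = remove_duplicates(sample, key="order_id")
        self.assertEqual(result, [sample[0], sample[2]])

    def test_empty_input(self):
        self.assertEqual(remove_duplicates([], key="order_id"), [])


# Run the tests programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RemoveDuplicatesTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Small sample datasets like this make it obvious what each stage is supposed to do, which also helps with documentation.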

Popular Frameworks for Advanced Use Cases

Apache Airflow - Best for complex workflows with dependencies
Dagster - Strong focus on testing and data asset visibility
Prefect - Simplifies building fault-tolerant pipelines
Luigi - Good for batch processing and dependency management

ETL vs ELT: Which One Should You Use and Why?

Anthony Gicheru — Sun, 12 Apr 2026 21:36:22 +0000

When I first started learning data engineering, ETL and ELT honestly felt like the same thing with just swapped letters. Everyone kept mentioning them like they were obvious concepts, but I had to sit down and really break them apart before it made sense.

If you’re in the same place, don’t worry, you’re not alone.

Let’s make it simple.

First things first: what do ETL and ELT even mean?

Both ETL and ELT are ways of moving and processing data from one place to another.

ETL (Extract, Transform, Load)

Extract data from a source (like an API or database)
Transform it before storing it (cleaning, filtering, joining, etc.)
Load the final cleaned data into a target system (like a data warehouse)

The key idea: you clean the data before storing it.

ELT (Extract, Load, Transform)

Extract data from the source
Load it directly into the storage system first
Transform it inside the database/warehouse later

The key idea: you store raw data first, then clean it inside the system.

So what’s the real difference?

The biggest difference is where the transformation happens.

ETL → Transform happens outside the warehouse
ELT → Transform happens inside the warehouse

That one shift changes a lot more than you’d think.

When ETL makes sense

ETL is usually used when:

You have smaller datasets
You need strict data control before loading
Your target system can’t handle heavy transformation workloads
Data quality must be enforced early

Think of it like cleaning your room before putting things in storage.

You don’t want messy data entering your system at all.

When ELT makes sense

ELT is more common in modern systems, especially with cloud platforms.

It works well when:

You have large volumes of data
You’re using powerful cloud warehouses (like Snowflake or BigQuery)
You want flexibility in how data is transformed
You want to keep raw data for future use

Think of it like dumping everything into a warehouse first, then organizing it later when needed.

A simple real-world example

Imagine you’re building a dashboard for an e-commerce app.

With ETL:

You:

Pull order data
Clean it (remove duplicates, fix missing values)
Then load it into your database ready for reporting

Everything is neat before it even arrives.

With ELT:

You:

Pull raw order data
Load everything into a data warehouse
Later write SQL transformations to clean and structure it

This gives you more flexibility if business rules change later.
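
Here is the same idea as a small runnable sketch, using Python’s built-in sqlite3 module as a stand-in for the warehouse. The sample orders, the table names, and the cleaning rule (drop duplicates and rows with a missing amount) are assumptions for illustration; real cleaning rules will differ.

```python
import sqlite3

# Hypothetical raw extract with one duplicate and one missing amount.
raw_orders = [
    (1, 250.0),
    (1, 250.0),
    (2, None),
    (3, 75.0),
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# ETL: clean in Python first, then load only the cleaned rows.
etl_rows = sorted({row for row in raw_orders if row[1] is not None})
conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", etl_rows)

# ELT: load everything as-is, then clean with SQL inside the database.
conn.execute("CREATE TABLE orders_raw (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", raw_orders)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT DISTINCT order_id, amount
    FROM orders_raw
    WHERE amount IS NOT NULL
    """
)

etl_result = conn.execute(
    "SELECT order_id, amount FROM orders_etl ORDER BY order_id"
).fetchall()
elt_result = conn.execute(
    "SELECT order_id, amount FROM orders_elt ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM orders_raw").fetchone()[0]

print(etl_result, elt_result, raw_count)
```

Both approaches end with the same clean table. The difference is that the ELT version still has orders_raw in the database, so the cleaning rule can be changed and re-run later without extracting the data again.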

My key takeaway

When I first learned this, I thought ETL was “old” and ELT was “new,” but that’s not really true.

They both still matter.

Here’s a simple way I now remember it:

ETL = Clean first, store later
ELT = Store first, clean later

Common mistakes beginners make

A few things that confused me at the start:

Thinking ELT means “no cleaning” (it still involves transformation!)
Mixing up where SQL transformations happen
Assuming one is always better than the other (it depends on the system)

So… which one should YOU use?

There’s no universal winner.

If you’re working with traditional systems → ETL is common
If you’re in modern cloud data engineering → ELT is more popular

Most real companies actually use a mix of both, depending on the pipeline.

To make this even more practical, here are some common tools used in real ETL and ELT workflows:

ETL Tools (Transformation happens before loading)

Apache Airflow – an orchestrator for scheduling ETL workflows (the transformations run in the tasks it triggers)
Informatica PowerCenter – widely used in enterprise ETL pipelines
Talend – data integration and transformation platform (its open-source edition, Talend Open Studio, was discontinued in 2024)
Apache NiFi – good for real-time data flow and routing
SSIS (SQL Server Integration Services) – Microsoft-based ETL tool

These tools usually handle data cleaning and transformation before sending data to a warehouse.

ELT Tools (Transformation happens after loading)

Snowflake – modern cloud data warehouse with strong ELT support
Google BigQuery – popular for serverless ELT workflows
Amazon Redshift – widely used in AWS-based data stacks
dbt (Data Build Tool) – one of the most popular tools for transformations inside the warehouse
Databricks (Apache Spark) – used for large-scale ELT processing

In ELT setups, tools like dbt handle transformation using SQL after data is loaded.

Final thoughts

Once I understood this difference, a lot of other concepts like data pipelines, warehouses, and analytics started to make way more sense.

If you’re learning data engineering right now, don’t rush it. Build a small pipeline, try both approaches, and you’ll see the difference quickly.

That’s where it really clicks.