<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anthony Gicheru</title>
    <description>The latest articles on Forem by Anthony Gicheru (@anthony-gicheru).</description>
    <link>https://forem.com/anthony-gicheru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186529%2Fed9dc374-bfac-4eee-bce7-90ea63105510.jpeg</url>
      <title>Forem: Anthony Gicheru</title>
      <link>https://forem.com/anthony-gicheru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anthony-gicheru"/>
    <language>en</language>
    <item>
      <title>Slowly Changing Dimensions Explained: How Data Warehouses Keep History Accurate</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 17 May 2026 07:11:45 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/slowly-changing-dimensions-explained-how-data-warehouses-keep-history-accurate-2mim</link>
      <guid>https://forem.com/anthony-gicheru/slowly-changing-dimensions-explained-how-data-warehouses-keep-history-accurate-2mim</guid>
      <description>&lt;h2&gt;
  
  
  1. Why Slowly Changing Dimensions Matter
&lt;/h2&gt;

&lt;p&gt;In data engineering, not all data changes the same way.&lt;/p&gt;

&lt;p&gt;Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.&lt;/p&gt;

&lt;p&gt;But other data changes slowly.&lt;/p&gt;

&lt;p&gt;A customer changes their address.&lt;br&gt;
A product changes category.&lt;br&gt;
An employee moves to a new department.&lt;br&gt;
A supplier changes region.&lt;br&gt;
A user upgrades from a free plan to a premium plan.&lt;/p&gt;

&lt;p&gt;These changes do not happen every second, but when they happen, they matter a lot.&lt;/p&gt;

&lt;p&gt;Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.&lt;/p&gt;

&lt;p&gt;That is a problem.&lt;/p&gt;

&lt;p&gt;The business may ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How much revenue did we make from Nairobi customers last year?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But if you overwrote the customer’s location, your report may give the wrong answer.&lt;/p&gt;

&lt;p&gt;This is the exact problem Slowly Changing Dimensions solve.&lt;/p&gt;

&lt;p&gt;Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Is a Slowly Changing Dimension?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Slowly Changing Dimension&lt;/strong&gt;, often shortened to &lt;strong&gt;SCD&lt;/strong&gt;, is a technique used in data warehousing to manage changes in dimension tables over time.&lt;/p&gt;

&lt;p&gt;A dimension table stores descriptive information.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not a transaction. It describes the customer.&lt;/p&gt;

&lt;p&gt;Now imagine Mary moves from Nairobi to Kisumu.&lt;/p&gt;

&lt;p&gt;The question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should we overwrite Nairobi with Kisumu, or should we keep a history of both?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That decision is what SCD is all about.&lt;/p&gt;

&lt;p&gt;This is where Slowly Changing Dimensions become useful.&lt;/p&gt;

&lt;p&gt;They give data teams a structured way to decide how changes should be stored.&lt;/p&gt;

&lt;p&gt;Sometimes we only care about the latest value.&lt;/p&gt;

&lt;p&gt;Sometimes we want to preserve the original value.&lt;/p&gt;

&lt;p&gt;Sometimes we need the full history of every meaningful change.&lt;/p&gt;

&lt;p&gt;And sometimes we only need a simple previous-and-current comparison.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. How Slowly Changing Dimensions Work
&lt;/h2&gt;

&lt;p&gt;In a data warehouse, data is usually organized into &lt;strong&gt;fact tables&lt;/strong&gt; and &lt;strong&gt;dimension tables&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fact Tables
&lt;/h3&gt;

&lt;p&gt;Fact tables store business events.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales&lt;/li&gt;
&lt;li&gt;Orders&lt;/li&gt;
&lt;li&gt;Payments&lt;/li&gt;
&lt;li&gt;Website clicks&lt;/li&gt;
&lt;li&gt;Deliveries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sales fact table might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;product_key&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;sale_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5001&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Dimension tables describe the facts.&lt;/p&gt;

&lt;p&gt;A customer dimension might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fact table tells us &lt;strong&gt;what happened&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The dimension table tells us &lt;strong&gt;who, what, where, or how&lt;/strong&gt; it happened.&lt;/p&gt;

&lt;p&gt;The challenge is that dimension data changes.&lt;/p&gt;

&lt;p&gt;When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.&lt;/p&gt;

&lt;p&gt;There are different SCD types, but the most commonly used are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCD Type 0&lt;/li&gt;
&lt;li&gt;SCD Type 1&lt;/li&gt;
&lt;li&gt;SCD Type 2&lt;/li&gt;
&lt;li&gt;SCD Type 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go through them practically.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. SCD Type 0: Keep the Original Value
&lt;/h2&gt;

&lt;p&gt;SCD Type 0 means a value does not change in the data warehouse.&lt;/p&gt;

&lt;p&gt;Once the value is loaded, it stays the same, even if the source system changes later.&lt;/p&gt;

&lt;p&gt;In simple terms, Type 0 says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Keep the original value as it was first recorded.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In real data warehouse work, Type 0 appears often for fields that should represent the original state of something. But many teams do not always call it “SCD Type 0” explicitly.&lt;/p&gt;

&lt;p&gt;They may simply say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This field should never be updated.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Preserve the original value.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.&lt;/p&gt;

&lt;p&gt;Good examples of Type 0 fields are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original signup date&lt;/li&gt;
&lt;li&gt;First purchase date&lt;/li&gt;
&lt;li&gt;Original registration country&lt;/li&gt;
&lt;li&gt;Original acquisition channel&lt;/li&gt;
&lt;li&gt;Original product launch date&lt;/li&gt;
&lt;li&gt;Original employee hire date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;original_signup_date&lt;/th&gt;
&lt;th&gt;original_country&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;Kenya&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.&lt;/p&gt;

&lt;p&gt;But the original acquisition channel should still remain &lt;strong&gt;Instagram Ads&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because it tells the business how Mary was first acquired.&lt;/p&gt;

&lt;p&gt;If we overwrite that value, we lose the ability to answer questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which marketing channel originally brought us our best customers?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is where Type 0 is useful.&lt;/p&gt;

&lt;p&gt;It protects values that describe the original state of a record.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. SCD Type 1: Overwrite the Old Value
&lt;/h2&gt;

&lt;p&gt;SCD Type 1 is the simplest approach.&lt;/p&gt;

&lt;p&gt;When a value changes, you overwrite the old value with the new one.&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After Mary moves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The old city is gone.&lt;/p&gt;
&lt;h3&gt;
  
  
  When Type 1 Makes Sense
&lt;/h3&gt;

&lt;p&gt;SCD Type 1 is useful when history does not matter.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixing a spelling mistake&lt;/li&gt;
&lt;li&gt;Correcting wrong data&lt;/li&gt;
&lt;li&gt;Updating an email address&lt;/li&gt;
&lt;li&gt;Updating a phone number&lt;/li&gt;
&lt;li&gt;Correcting a product name typo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the original value was wrong, you usually do not want to preserve it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Imagine a customer’s name was loaded as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mary Wanjikuuu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then later corrected to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mary Wanjiku
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You do not need historical tracking for the typo. You just update the record.&lt;/p&gt;

&lt;p&gt;That is SCD Type 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Risk
&lt;/h3&gt;

&lt;p&gt;The risk with Type 1 is that it destroys history.&lt;/p&gt;

&lt;p&gt;If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.&lt;/p&gt;

&lt;p&gt;So Type 1 is simple, but dangerous when historical accuracy matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. SCD Type 2: Keep Full History
&lt;/h2&gt;

&lt;p&gt;SCD Type 2 is the most important and most commonly used SCD technique in analytics.&lt;/p&gt;

&lt;p&gt;Instead of overwriting the old record, you create a new row when important attributes change.&lt;/p&gt;

&lt;p&gt;Before Mary moves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2024-01-01&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After Mary moves to Kisumu:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2024-01-01&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice something important.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;customer_id&lt;/code&gt; stays the same because it represents the real-world customer.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;customer_key&lt;/code&gt; changes because each historical version gets its own unique warehouse key.&lt;/p&gt;

&lt;p&gt;This is usually called a &lt;strong&gt;surrogate key&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.&lt;/p&gt;

&lt;p&gt;If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.&lt;/p&gt;

&lt;p&gt;This allows historical reports to stay accurate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Fact Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;sale_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5001&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;2025-02-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5002&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4500&lt;/td&gt;
&lt;td&gt;2025-04-20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first sale belongs to Mary when she was in Nairobi.&lt;/p&gt;

&lt;p&gt;The second sale belongs to Mary when she was in Kisumu.&lt;/p&gt;

&lt;p&gt;That is the power of SCD Type 2.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. SCD Type 3: Store Limited History in Columns
&lt;/h2&gt;

&lt;p&gt;SCD Type 3 keeps limited history by adding extra columns.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;current_city&lt;/th&gt;
&lt;th&gt;previous_city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This lets you see the current value and one previous value.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Type 3 Makes Sense
&lt;/h3&gt;

&lt;p&gt;SCD Type 3 is useful when you only care about a small amount of history.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Previous region and current region&lt;/li&gt;
&lt;li&gt;Previous plan and current plan&lt;/li&gt;
&lt;li&gt;Previous department and current department&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not scale well if changes happen many times.&lt;/p&gt;

&lt;p&gt;What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?&lt;/p&gt;

&lt;p&gt;You would need more columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;previous_city_1
previous_city_2
previous_city_3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That becomes messy quickly.&lt;/p&gt;

&lt;p&gt;So Type 3 is useful, but only for very specific cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Practical Example in a Data Warehousing Project
&lt;/h2&gt;

&lt;p&gt;Let’s say you are building a sales analytics warehouse for an e-commerce company.&lt;/p&gt;

&lt;p&gt;You have data coming from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL for application data&lt;/li&gt;
&lt;li&gt;Kafka for order events&lt;/li&gt;
&lt;li&gt;Airflow for orchestration&lt;/li&gt;
&lt;li&gt;dbt for transformations&lt;/li&gt;
&lt;li&gt;Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your source customer table in PostgreSQL looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;updated_at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Later, the same customer changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;updated_at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now the data team must decide:&lt;/p&gt;

&lt;p&gt;Do we overwrite the old record?&lt;/p&gt;

&lt;p&gt;Or do we preserve the old version?&lt;/p&gt;

&lt;p&gt;And what do we do with the original acquisition channel?&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acquisition_channel&lt;/code&gt; can be treated as Type 0 because it represents how Mary was originally acquired.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;city&lt;/code&gt; and &lt;code&gt;customer_type&lt;/code&gt; can be treated as Type 2 because they affect historical reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.&lt;/p&gt;

&lt;p&gt;A Type 2 customer dimension may look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now your reports can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many sales came from Nairobi customers in February?&lt;/li&gt;
&lt;li&gt;How much revenue came from Premium customers after March?&lt;/li&gt;
&lt;li&gt;What was the customer type at the time of purchase?&lt;/li&gt;
&lt;li&gt;How many customers upgraded from Regular to Premium?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without SCD Type 2, these questions become difficult or inaccurate.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. A Simple SCD Type 2 Flow
&lt;/h2&gt;

&lt;p&gt;A typical SCD Type 2 pipeline works like this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the Latest Source Data
&lt;/h3&gt;

&lt;p&gt;You extract the latest customer data from the source system.&lt;/p&gt;

&lt;p&gt;This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Compare Source Data With Current Dimension Records
&lt;/h3&gt;

&lt;p&gt;You compare the incoming record with the current active record in the warehouse.&lt;/p&gt;

&lt;p&gt;For example, compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source.city
source.customer_type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;against:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dim_customer.city
dim_customer.customer_type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Detect Changes
&lt;/h3&gt;

&lt;p&gt;If nothing changed, do nothing.&lt;/p&gt;

&lt;p&gt;If important attributes changed, expire the old record.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 4: Insert a New Current Record
&lt;/h3&gt;

&lt;p&gt;Then insert the new version:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 5: Use the Correct Dimension Version in Fact Tables
&lt;/h3&gt;

&lt;p&gt;When loading fact data, join the fact date to the correct dimension record using the validity period.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_orders&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_from&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt;
        &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the order connects to the correct version of the customer.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Common Mistakes Beginners Make
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Using Type 1 When History Matters
&lt;/h3&gt;

&lt;p&gt;This is probably the most common mistake.&lt;/p&gt;

&lt;p&gt;Beginners often overwrite dimension records because it feels simple.&lt;/p&gt;

&lt;p&gt;But later, when the business asks historical questions, the warehouse cannot answer correctly.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What was revenue by customer region last year?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you overwrote all customer regions with the latest value, the report will be wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Before choosing Type 1, ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Will the business ever need to know what this value was in the past?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If yes, consider Type 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 2: Tracking Every Column as Type 2
&lt;/h3&gt;

&lt;p&gt;Not every change deserves a new historical version.&lt;/p&gt;

&lt;p&gt;For example, do you really need a new customer dimension row when the phone number changes?&lt;/p&gt;

&lt;p&gt;Maybe not.&lt;/p&gt;

&lt;p&gt;If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Classify columns carefully.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;SCD Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;customer_name typo fix&lt;/td&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;city&lt;/td&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;customer_type&lt;/td&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phone_number&lt;/td&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;email&lt;/td&gt;
&lt;td&gt;Type 1 or Type 2 depending on business need&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decision should be based on reporting needs, not just technical preference.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 3: Not Using a Surrogate Key
&lt;/h3&gt;

&lt;p&gt;A big mistake is using the source system ID as the primary key for the dimension table.&lt;/p&gt;

&lt;p&gt;For example, using &lt;code&gt;customer_id&lt;/code&gt; as the only key.&lt;/p&gt;

&lt;p&gt;That becomes a problem in Type 2 because the same customer can have multiple historical versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Approach
&lt;/h3&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customer_id&lt;/code&gt; as the business key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_key&lt;/code&gt; as the warehouse surrogate key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The surrogate key uniquely identifies each version.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 4: Forgetting the Current Flag
&lt;/h3&gt;

&lt;p&gt;Without an &lt;code&gt;is_current&lt;/code&gt; column, it becomes harder to query the latest version of each record.&lt;/p&gt;

&lt;p&gt;You would have to check for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works, but &lt;code&gt;is_current&lt;/code&gt; makes queries easier and clearer.&lt;/p&gt;

&lt;p&gt;A good SCD Type 2 table usually has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_from
valid_to
is_current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some teams also add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;created_at
updated_at
record_hash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Mistake 5: Poor Date Handling
&lt;/h3&gt;

&lt;p&gt;SCD Type 2 depends heavily on dates.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; values are wrong, your historical joins will be wrong.&lt;/p&gt;

&lt;p&gt;Common problems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overlapping date ranges&lt;/li&gt;
&lt;li&gt;Gaps between versions&lt;/li&gt;
&lt;li&gt;Incorrect timezone handling&lt;/li&gt;
&lt;li&gt;Using load date instead of actual business effective date&lt;/li&gt;
&lt;li&gt;Not handling late-arriving data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Be very intentional about what your dates mean.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;valid_from&lt;/code&gt;: when this version became valid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;valid_to&lt;/code&gt;: when this version stopped being valid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loaded_at&lt;/code&gt;: when the data entered the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not always the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Best Practices for SCD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use SCD Type 2 for Business-Critical History
&lt;/h3&gt;

&lt;p&gt;If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer region&lt;/li&gt;
&lt;li&gt;Customer plan&lt;/li&gt;
&lt;li&gt;Product category&lt;/li&gt;
&lt;li&gt;Sales territory&lt;/li&gt;
&lt;li&gt;Employee department&lt;/li&gt;
&lt;li&gt;Account status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are usually worth tracking with Type 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Use Hashing to Detect Changes
&lt;/h3&gt;

&lt;p&gt;Instead of comparing many columns one by one, you can create a hash from the important attributes.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MD5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then compare the source hash with the current dimension hash.&lt;/p&gt;

&lt;p&gt;If the hash changes, something important changed.&lt;/p&gt;

&lt;p&gt;This makes SCD pipelines easier to maintain, especially when there are many columns.&lt;/p&gt;




&lt;h3&gt;
  
  
  Keep Your SCD Logic Clear
&lt;/h3&gt;

&lt;p&gt;SCD logic can become confusing quickly.&lt;/p&gt;

&lt;p&gt;Use clear column names like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_from
valid_to
is_current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Avoid unclear names like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start
end
flag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your future self and your teammates will thank you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Document Which Columns Are Tracked
&lt;/h3&gt;

&lt;p&gt;Do not leave SCD behavior hidden inside SQL code only.&lt;/p&gt;

&lt;p&gt;Document which columns are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type 0&lt;/li&gt;
&lt;li&gt;Type 1&lt;/li&gt;
&lt;li&gt;Type 2&lt;/li&gt;
&lt;li&gt;Type 3&lt;/li&gt;
&lt;li&gt;Ignored&lt;/li&gt;
&lt;li&gt;Used for change detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially important in team environments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Avoid Duplicates in Current Records
&lt;/h3&gt;

&lt;p&gt;For a Type 2 dimension, each business key should have only one current record.&lt;/p&gt;

&lt;p&gt;This should be true:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One customer_id = one current record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can test this with SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_records&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this query returns rows, your dimension has a problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  Test for Overlapping Validity Periods
&lt;/h3&gt;

&lt;p&gt;For each business key, the date ranges should not overlap.&lt;/p&gt;

&lt;p&gt;Bad example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2025-04-01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These overlap between March 15 and April 1.&lt;/p&gt;

&lt;p&gt;That means a sale on March 20 could match both records.&lt;/p&gt;

&lt;p&gt;That is dangerous.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. When to Use Slowly Changing Dimensions
&lt;/h2&gt;

&lt;p&gt;Use SCD when dimension data changes and those changes affect analysis.&lt;/p&gt;

&lt;p&gt;Good use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer address history&lt;/li&gt;
&lt;li&gt;Product category history&lt;/li&gt;
&lt;li&gt;Subscription plan changes&lt;/li&gt;
&lt;li&gt;Employee department changes&lt;/li&gt;
&lt;li&gt;Supplier region changes&lt;/li&gt;
&lt;li&gt;Account status changes&lt;/li&gt;
&lt;li&gt;Sales territory changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Show revenue by the customer’s region at the time of purchase.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a classic SCD Type 2 problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. When SCD May Not Be the Best Choice
&lt;/h2&gt;

&lt;p&gt;SCD is powerful, but not every situation needs it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Not Use Type 2 for Every Small Change
&lt;/h3&gt;

&lt;p&gt;If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Last login timestamp&lt;/li&gt;
&lt;li&gt;Profile picture URL&lt;/li&gt;
&lt;li&gt;Phone number&lt;/li&gt;
&lt;li&gt;Minor spelling corrections&lt;/li&gt;
&lt;li&gt;Temporary status fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these, Type 1 may be enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Be Careful With Very High-Volume Changes
&lt;/h3&gt;

&lt;p&gt;If a dimension changes too frequently, Type 2 can grow very fast.&lt;/p&gt;

&lt;p&gt;At that point, you may need a different modeling approach, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event sourcing&lt;/li&gt;
&lt;li&gt;Audit tables&lt;/li&gt;
&lt;li&gt;Snapshot tables&lt;/li&gt;
&lt;li&gt;Data lake versioning&lt;/li&gt;
&lt;li&gt;Change Data Capture history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SCD is best for slowly changing descriptive attributes, not every event that happens in the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. SCD in Modern Data Tools
&lt;/h2&gt;

&lt;p&gt;SCD is not limited to traditional warehouses.&lt;/p&gt;

&lt;p&gt;You can implement SCD patterns in many modern data stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  In dbt
&lt;/h3&gt;

&lt;p&gt;dbt supports snapshots, which are commonly used to implement SCD Type 2.&lt;/p&gt;

&lt;p&gt;A dbt snapshot can track when records change and automatically create historical versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Airflow
&lt;/h3&gt;

&lt;p&gt;Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Spark
&lt;/h3&gt;

&lt;p&gt;Spark is useful when you are handling large-scale dimension updates.&lt;/p&gt;

&lt;p&gt;You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Kafka and CDC
&lt;/h3&gt;

&lt;p&gt;Kafka can stream changes from source systems.&lt;/p&gt;

&lt;p&gt;For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.&lt;/p&gt;

&lt;p&gt;From there, you can process those changes and update your dimension tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Warehouses
&lt;/h3&gt;

&lt;p&gt;SCD can be implemented in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;BigQuery&lt;/li&gt;
&lt;li&gt;Redshift&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;li&gt;SQL Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concept stays the same, even if the syntax differs.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Final Summary
&lt;/h2&gt;

&lt;p&gt;Slowly Changing Dimensions help data engineers manage changes in dimension data over time.&lt;/p&gt;

&lt;p&gt;They solve an important problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we keep analytics accurate when descriptive data changes?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The most common SCD types are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Type 0&lt;/td&gt;
&lt;td&gt;Keep the original value unchanged&lt;/td&gt;
&lt;td&gt;Original values that should not change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;td&gt;Overwrite old values&lt;/td&gt;
&lt;td&gt;Corrections and non-historical changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;td&gt;Keep full history using new rows&lt;/td&gt;
&lt;td&gt;Historical reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 3&lt;/td&gt;
&lt;td&gt;Keep limited history in columns&lt;/td&gt;
&lt;td&gt;Simple previous/current comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.&lt;/p&gt;

&lt;p&gt;Without SCD, reports can quietly become wrong.&lt;/p&gt;

&lt;p&gt;A customer’s current city may overwrite their past city.&lt;br&gt;
A product’s new category may rewrite old sales history.&lt;br&gt;
An employee’s new department may change old performance reports.&lt;/p&gt;

&lt;p&gt;That is why SCD matters.&lt;/p&gt;

&lt;p&gt;The practical takeaway is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whenever a dimension value changes, ask whether the business needs to remember the old value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the value should never change, Type 0 may be the right choice.&lt;/p&gt;

&lt;p&gt;If the old value does not matter, Type 1 may be enough.&lt;/p&gt;

&lt;p&gt;If the business needs full history, Type 2 is usually the best choice.&lt;/p&gt;

&lt;p&gt;If the business only needs a simple previous-and-current comparison, Type 3 may work.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datawarehouse</category>
      <category>dataengineering</category>
      <category>scd</category>
    </item>
    <item>
      <title>Why Your Code Breaks in Production (and How Docker Fixes It)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Tue, 12 May 2026 05:28:52 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</link>
      <guid>https://forem.com/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</guid>
      <description>&lt;h2&gt;
  
  
  1. Why This Matters
&lt;/h2&gt;

&lt;p&gt;You write your code.&lt;br&gt;
You test it locally.&lt;br&gt;
Everything works perfectly.&lt;/p&gt;

&lt;p&gt;Then it goes to production… and breaks.&lt;/p&gt;

&lt;p&gt;You spend hours debugging, only to realize:&lt;br&gt;
&lt;strong&gt;nothing is wrong with your code — the environment is the problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In data engineering, this happens all the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Spark job runs locally but fails in production&lt;/li&gt;
&lt;li&gt;Airflow works on Ubuntu but breaks on macOS&lt;/li&gt;
&lt;li&gt;Kafka pipelines behave differently across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At its core, the issue is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your environment is not consistent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Containerization solves this&lt;/strong&gt; by packaging everything your application needs into a single, portable unit that runs the same way anywhere.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Core Concept — What is Containerization?
&lt;/h2&gt;

&lt;p&gt;Let’s simplify it with an analogy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Analogy: A Fully Equipped House
&lt;/h2&gt;

&lt;p&gt;Imagine being placed in an empty field with nothing around you.&lt;/p&gt;

&lt;p&gt;No food.&lt;br&gt;
No water.&lt;br&gt;
No electricity.&lt;br&gt;
No shelter.&lt;/p&gt;

&lt;p&gt;You might survive for a while, but functioning properly would be difficult.&lt;/p&gt;

&lt;p&gt;Now imagine being placed inside a fully equipped house.&lt;/p&gt;

&lt;p&gt;Everything you need is already there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;food&lt;/li&gt;
&lt;li&gt;water&lt;/li&gt;
&lt;li&gt;electricity&lt;/li&gt;
&lt;li&gt;furniture&lt;/li&gt;
&lt;li&gt;internet&lt;/li&gt;
&lt;li&gt;a bed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter where that house is moved, you can still live comfortably because your essentials move with you.&lt;/p&gt;

&lt;p&gt;Applications work the same way.&lt;/p&gt;

&lt;p&gt;An application needs certain things to function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;libraries&lt;/li&gt;
&lt;li&gt;runtime versions&lt;/li&gt;
&lt;li&gt;system tools&lt;/li&gt;
&lt;li&gt;environment variables&lt;/li&gt;
&lt;li&gt;dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, the application breaks.&lt;/p&gt;

&lt;p&gt;Containerization solves this problem by packaging the application together with everything it needs to run.&lt;/p&gt;

&lt;p&gt;Think of a container as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a fully equipped house for your application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" alt="Diagram comparing a Docker container to a fully equipped house containing everything an application needs to run consistently across different environments." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside the container, the app already has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;its dependencies&lt;/li&gt;
&lt;li&gt;configurations&lt;/li&gt;
&lt;li&gt;runtime environment&lt;/li&gt;
&lt;li&gt;required tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So whether the container runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your laptop&lt;/li&gt;
&lt;li&gt;a cloud server&lt;/li&gt;
&lt;li&gt;a teammate’s machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…the application still behaves the same way.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Mental Model
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Containerization gives your application its own portable environment with everything it needs to survive and run consistently.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  3. Docker Basics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt; - A blueprint/template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt; - A running instance of that image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dockerfile&lt;/strong&gt; - Instructions to build the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" alt="A clean modern DevOps-style diagram showing the relationship between a Dockerfile, a Docker Image, and a running Docker Container." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Let’s Make It Real
&lt;/h3&gt;

&lt;p&gt;Here’s the smallest possible Docker setup for a Python app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from Docker!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build and Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; my-python-app &lt;span class="nb"&gt;.&lt;/span&gt;
docker run my-python-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notice what we didn’t do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Python manually&lt;/li&gt;
&lt;li&gt;Manage versions&lt;/li&gt;
&lt;li&gt;Configure anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The environment is fully defined in the Dockerfile.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Why Docker is Useful in Data Engineering
&lt;/h2&gt;

&lt;p&gt;In real-world data systems, you work with tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark / PySpark&lt;/li&gt;
&lt;li&gt;PostgreSQL or another data warehouse&lt;/li&gt;
&lt;li&gt;Reporting tools or dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different dependencies&lt;/li&gt;
&lt;li&gt;Different configurations&lt;/li&gt;
&lt;li&gt;Different runtime requirements&lt;/li&gt;
&lt;li&gt;Different ports&lt;/li&gt;
&lt;li&gt;Different environment variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Docker, they often conflict.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow may require specific Python packages&lt;/li&gt;
&lt;li&gt;PySpark may need Java and Spark installed&lt;/li&gt;
&lt;li&gt;PostgreSQL may need database credentials and storage&lt;/li&gt;
&lt;li&gt;Dashboard tools may need access to the processed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;each tool runs in its own isolated environment — no conflicts, no surprises.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is especially useful in batch data pipelines because the entire workflow can be reproduced across different machines and environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Docker Compose — Managing Multiple Containers
&lt;/h2&gt;

&lt;p&gt;Real systems are never just one container.&lt;/p&gt;

&lt;p&gt;A Dockerized data engineering pipeline may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Airflow webserver&lt;/li&gt;
&lt;li&gt;An Airflow scheduler&lt;/li&gt;
&lt;li&gt;A PostgreSQL database&lt;/li&gt;
&lt;li&gt;A Spark / PySpark processing service&lt;/li&gt;
&lt;li&gt;Shared folders for DAGs, logs, scripts, and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running each service manually quickly becomes painful.&lt;/p&gt;




&lt;h3&gt;
  
  
  Docker vs Docker Compose
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; - runs one container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; - runs an entire system made up of multiple containers&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;Without Docker Compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple terminals&lt;/li&gt;
&lt;li&gt;Manual startup order&lt;/li&gt;
&lt;li&gt;Constant configuration issues&lt;/li&gt;
&lt;li&gt;Harder networking between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker Compose:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one command starts everything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" alt="Docker Compose orchestration diagram showing multiple services being started and managed from one docker-compose.yml file." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Example: Multi-Service Setup
&lt;/h3&gt;

&lt;p&gt;A simplified Docker Compose setup for a batch pipeline may include Airflow and PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_webserver&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow webserver&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;airflow-scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_scheduler&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow scheduler&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres_db&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5433:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Using &lt;code&gt;localhost&lt;/code&gt; inside containers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breaks almost everyone at first.&lt;/p&gt;

&lt;p&gt;Inside a container:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;localhost&lt;/code&gt; refers to the container itself, not your machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forgetting environment variables&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Missing configs often cause silent failures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Not persisting data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers are temporary. Without volumes, your data disappears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rebuilding unnecessarily&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor Dockerfile structure can slow builds significantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use lightweight images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;  FROM python:3.10-slim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a &lt;code&gt;.dockerignore&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  node_modules
  .git
  .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoid &lt;code&gt;latest&lt;/code&gt; in production&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use fixed versions to keep builds predictable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Separate dev and production setups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They have different requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use Docker Compose for local development&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It helps simulate real systems easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use clear service names&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kafka&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airflow&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This simplifies networking and debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;Containerization changes how you think about environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker packages your application into a portable unit.&lt;/li&gt;
&lt;li&gt;Docker Compose runs entire systems with one command.&lt;/li&gt;
&lt;li&gt;Your pipelines become reproducible and consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real shift is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You stop debugging environments — and start defining them as code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once you reach that point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You’re no longer just writing code — you’re building systems.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>containers</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>Data Warehouses, Data Marts, Data Lakes, and Lakehouses - Explained Like You’re Building Them in Real Life</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 03 May 2026 08:10:27 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</link>
      <guid>https://forem.com/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</guid>
      <description>&lt;p&gt;If you’ve been around data engineering long enough, you’ve probably heard these terms thrown around in meetings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Just dump it in the data lake”&lt;/li&gt;
&lt;li&gt;“We’ll expose it through the warehouse”&lt;/li&gt;
&lt;li&gt;“That goes into the mart”&lt;/li&gt;
&lt;li&gt;“We’re moving to a lakehouse architecture”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly… it can sound like four different ways of saying the same thing.&lt;/p&gt;

&lt;p&gt;They’re not.&lt;/p&gt;

&lt;p&gt;Each one solves a slightly different problem in the data ecosystem. Once you understand the “why” behind each, the architecture suddenly feels a lot less like buzzwords and more like a clean system design.&lt;/p&gt;

&lt;p&gt;Let’s break it down in a practical, engineer-first way.&lt;/p&gt;




&lt;h1&gt;
  
  
  1. The Big Picture (Why all these systems exist)
&lt;/h1&gt;

&lt;p&gt;In most companies, data doesn’t come from one place — it flows in from everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User clicks from web/mobile apps&lt;/li&gt;
&lt;li&gt;Payments and transactions&lt;/li&gt;
&lt;li&gt;Logs from servers&lt;/li&gt;
&lt;li&gt;Third-party APIs (Stripe, Shopify, etc.)&lt;/li&gt;
&lt;li&gt;IoT or streaming data (Kafka, sensors, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now here’s the problem:&lt;/p&gt;

&lt;p&gt;Raw data is messy. Business users don’t want messy.&lt;/p&gt;

&lt;p&gt;So we build systems that progressively refine data from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Raw → Clean → Structured → Business-ready&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s where these four concepts come in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything raw&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → structured analytics-ready data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → department-specific slices of warehouse data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → hybrid of lake + warehouse&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. Data Lake — “Store everything first, figure it out later”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is basically a massive storage system where you dump raw data in its original format.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A giant warehouse where you throw every box in as-is, without opening it.&lt;br&gt;
Or even better: a farm storage system where everything is stored right after harvest, unprocessed and mixed together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stores structured, semi-structured, and unstructured data&lt;/li&gt;
&lt;li&gt;Cheap storage (usually object storage like S3)&lt;/li&gt;
&lt;li&gt;Schema is applied &lt;strong&gt;when reading&lt;/strong&gt;, not writing (schema-on-read)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;You might store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/events/clicks/2026/05/01.json
/logs/api/2026/05/01.log
/payments/stripe/2026/05/01.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No transformations. No enforcement. Just storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s the catch:
&lt;/h3&gt;

&lt;p&gt;If you’re not careful, a data lake becomes a &lt;strong&gt;data swamp&lt;/strong&gt; — lots of data, zero usability.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. Data Warehouse — “Clean, structured, and business-ready”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; is where data goes after it has been cleaned, transformed, and modeled for analytics.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A well-organized supermarket where everything is cleaned, packaged, labeled, and placed on the right shelves.&lt;br&gt;
You don’t pick raw potatoes from the soil — you get them washed, sorted, and priced.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Structured data only&lt;/li&gt;
&lt;li&gt;Schema-on-write (you define structure before loading)&lt;/li&gt;
&lt;li&gt;Optimized for analytics queries (OLAP systems)&lt;/li&gt;
&lt;li&gt;Highly curated and trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Typical workflow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract data from sources&lt;/li&gt;
&lt;li&gt;Transform (clean, join, aggregate)&lt;/li&gt;
&lt;li&gt;Load into warehouse tables&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example SQL model:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now business analysts can run queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  4. Data Marts — “Department-specific mini warehouses”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data mart&lt;/strong&gt; is a subset of a data warehouse focused on a specific business domain.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A grocery store or specialty shop — like a bakery, butcher, or vegetable shop.&lt;br&gt;
It doesn’t sell everything. It only sells what its customers actually need.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller scope than a warehouse&lt;/li&gt;
&lt;li&gt;Built for a specific team (finance, marketing, sales)&lt;/li&gt;
&lt;li&gt;Faster queries for targeted use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;A marketing data mart might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Campaign performance&lt;/li&gt;
&lt;li&gt;Customer acquisition metrics&lt;/li&gt;
&lt;li&gt;Ad spend data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example structure:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;marketing_campaign_performance&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_impressions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ad_events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it exists:
&lt;/h3&gt;

&lt;p&gt;Instead of everyone querying a massive warehouse, teams get &lt;strong&gt;pre-optimized datasets&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Data Lakehouse — “Best of both worlds”
&lt;/h1&gt;

&lt;p&gt;Now this is where things get interesting.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;lakehouse&lt;/strong&gt; combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The flexibility of a data lake&lt;/li&gt;
&lt;li&gt;The structure and performance of a data warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A modern retail system where the warehouse and supermarket are combined into one smart facility.&lt;br&gt;
Raw goods arrive, but they are immediately tracked, organized, and made queryable without losing flexibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses low-cost storage (like a lake)&lt;/li&gt;
&lt;li&gt;Adds structure, ACID transactions, and governance&lt;/li&gt;
&lt;li&gt;Supports both analytics and ML workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark + Delta Lake&lt;/li&gt;
&lt;li&gt;Apache Iceberg&lt;/li&gt;
&lt;li&gt;Apache Hudi&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it matters:
&lt;/h3&gt;

&lt;p&gt;In traditional setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes = flexible but messy&lt;/li&gt;
&lt;li&gt;Warehouses = clean but expensive and rigid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lakehouses try to remove that tradeoff.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. How They Work Together (Real Architecture Flow)
&lt;/h1&gt;

&lt;p&gt;A modern data pipeline often looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
   DATA LAKE (raw storage)
      ↓
ETL / ELT pipelines (Airflow, Spark)
      ↓
DATA WAREHOUSE (modeled data)
      ↓
DATA MARTS (team-specific views)
      ↓
Dashboards / BI tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in a lakehouse setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
DATA LAKEHOUSE (single system)
      ↓
BI + ML + Analytics directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  7. Practical Example (Mini Pipeline)
&lt;/h1&gt;

&lt;p&gt;Let’s say we’re processing e-commerce data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Raw data in S3 (Data Lake)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-01T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Spark transformation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://datalake/raw/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://warehouse/sales_fact/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Load into warehouse (Redshift example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://warehouse/sales_fact/'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456:role/RedshiftRole'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Create a data mart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  8. Common Pitfalls (Where most teams mess up)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Turning the data lake into a swamp
&lt;/h3&gt;

&lt;p&gt;Dumping everything without metadata or structure leads to chaos.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Over-modeling too early
&lt;/h3&gt;

&lt;p&gt;Trying to build perfect schemas upfront slows everything down.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Duplicating logic across marts
&lt;/h3&gt;

&lt;p&gt;You end up with inconsistent metrics like “Revenue_v1”, “Revenue_final”, “Revenue_real_final”.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No governance layer
&lt;/h3&gt;

&lt;p&gt;Without access control and cataloging, nobody trusts the data.&lt;/p&gt;




&lt;h1&gt;
  
  
  9. Best Practices
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Use layered architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw (lake)&lt;/li&gt;
&lt;li&gt;Cleaned (staging)&lt;/li&gt;
&lt;li&gt;Modeled (warehouse)&lt;/li&gt;
&lt;li&gt;Aggregated (marts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Standardize transformations
&lt;/h3&gt;

&lt;p&gt;Use tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark jobs with clear ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Define a single source of truth
&lt;/h3&gt;

&lt;p&gt;One metric definition per business KPI. No duplicates.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Treat data like software
&lt;/h3&gt;

&lt;p&gt;Version it, test it, document it.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Monitor everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures&lt;/li&gt;
&lt;li&gt;Data freshness&lt;/li&gt;
&lt;li&gt;Schema changes&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  10. Conclusion — The mental model that matters
&lt;/h1&gt;

&lt;p&gt;If you strip away the jargon, it’s really simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → clean and organize it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → tailor it for teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → unify storage and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real skill in data engineering isn’t memorizing definitions.&lt;/p&gt;

&lt;p&gt;It’s knowing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When to keep data raw, when to structure it, and when to specialize it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once that clicks, designing data systems becomes a lot more intuitive — and honestly, more fun to build.&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>datamarts</category>
      <category>datalake</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Refactoring Airflow Pipelines: From PythonOperator to TaskFlow</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:52:39 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</link>
      <guid>https://forem.com/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</guid>
      <description>&lt;h1&gt;
  
  
  Actually Embracing TaskFlow After a Year of Doing It the “Old Way”
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Introduction: This Isn’t New… But It &lt;em&gt;Feels&lt;/em&gt; New
&lt;/h2&gt;

&lt;p&gt;If you’ve been using Airflow for a while-like I have-you probably didn’t start with the TaskFlow API.&lt;/p&gt;

&lt;p&gt;You likely started with the classic Airflow 2.x style:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PythonOperator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;**kwargs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ti.xcom_push()&lt;/code&gt; and &lt;code&gt;ti.xcom_pull()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Explicit task chaining with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent over a year building pipelines this way. And to be clear-it works. It’s stable, production-ready, and widely used.&lt;/p&gt;

&lt;p&gt;But here’s the interesting part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TaskFlow API has existed since Airflow 2.0. I just didn’t fully adopt it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Honestly, I ignored TaskFlow for a long time because I thought it was just ‘syntactic sugar’. That’s more common than people admit.&lt;/p&gt;

&lt;p&gt;Most production systems and tutorials still rely on operators, so you naturally stay in that pattern. It’s only later-when readability and maintainability start to matter-that TaskFlow becomes interesting.&lt;/p&gt;

&lt;p&gt;And once it clicks, it changes how you think about Airflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Concepts: Same Engine, Different Experience
&lt;/h2&gt;

&lt;p&gt;TaskFlow doesn’t replace Airflow concepts-it abstracts them.&lt;/p&gt;

&lt;p&gt;You still work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks&lt;/li&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;XComs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is &lt;em&gt;how&lt;/em&gt; you express them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traditional_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it introduces a lot of orchestration boilerplate into your business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feels simpler because it is.&lt;/p&gt;

&lt;p&gt;TaskFlow removes explicit XCom handling and lets function returns define data flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Real Shift: From Wiring Tasks to Modeling Data Flow
&lt;/h2&gt;

&lt;p&gt;With the traditional approach, your mental model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task A - XCom - Task B - XCom - Task C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With TaskFlow, it becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" alt="Airflow DAG: Traditional vs Taskflow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same execution engine. Different abstraction.&lt;/p&gt;

&lt;p&gt;The shift is from &lt;em&gt;task orchestration&lt;/em&gt; to &lt;em&gt;data flow composition&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. XComs: Manual vs Automatic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You manage everything explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serialization&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You focus on logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  When You Still Need Control
&lt;/h3&gt;

&lt;p&gt;TaskFlow still allows explicit control when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models.xcom_arg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XComArg&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;XComArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Real-World Example: Gas Prices ETL Refactor
&lt;/h2&gt;

&lt;p&gt;I didn’t build two versions of this pipeline at once.&lt;/p&gt;

&lt;p&gt;I originally built it using the traditional Airflow 2.x approach and later refactored it using TaskFlow.&lt;/p&gt;

&lt;p&gt;That’s when the difference became clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API - Extract gas prices - Transform - Store in PostgreSQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Reference
&lt;/h3&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/Anthony-Gicheru/Gas-Prices-ETL-with-Apache-Airflow" rel="noopener noreferrer"&gt;Github Link to the project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It includes both the original DAG and the TaskFlow refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach tightly couples logic with Airflow internals.&lt;/p&gt;

&lt;p&gt;Data must be serialized manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TaskFlow Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the pipeline becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads like standard Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Changed
&lt;/h3&gt;

&lt;p&gt;The logic stayed the same. The structure changed completely.&lt;/p&gt;

&lt;p&gt;Instead of manually managing XComs, data flows naturally between functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task definition&lt;/td&gt;
&lt;td&gt;PythonOperator&lt;/td&gt;
&lt;td&gt;@task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data passing&lt;/td&gt;
&lt;td&gt;Manual XCom&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Readability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Wiring tasks&lt;/td&gt;
&lt;td&gt;Data flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Lessons From the Refactor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TaskFlow doesn’t remove XComs
&lt;/h3&gt;

&lt;p&gt;It only hides them.&lt;/p&gt;

&lt;p&gt;You still need to respect serialization limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;big_dataframe&lt;/span&gt;  &lt;span class="c1"&gt;# still not ideal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Passing data is easier-but not always better
&lt;/h3&gt;

&lt;p&gt;TaskFlow makes it easy to pass data between tasks, but large payloads should still live in external storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Refactoring was mostly structural
&lt;/h3&gt;

&lt;p&gt;Most of the work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing &lt;code&gt;**kwargs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;replacing XCom logic with returns&lt;/li&gt;
&lt;li&gt;simplifying task boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The biggest change is mental
&lt;/h3&gt;

&lt;p&gt;The shift was not technical-it was conceptual.&lt;br&gt;
From:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I connect tasks?&lt;br&gt;
to:&lt;/p&gt;
&lt;h2&gt;
  
  
  How does data flow through this pipeline?
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t push large objects through XCom&lt;/li&gt;
&lt;li&gt;Don’t mix styles without intention&lt;/li&gt;
&lt;li&gt;Don’t overuse TaskFlow just because it’s cleaner&lt;/li&gt;
&lt;li&gt;Don’t forget serialization still exists&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;TaskFlow isn’t new-but adopting it after using the traditional approach makes its benefits clearer.&lt;/p&gt;

&lt;p&gt;It moves you from writing orchestration-heavy DAGs to writing clean Python workflows.&lt;/p&gt;

&lt;p&gt;And that shift improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;readability&lt;/li&gt;
&lt;li&gt;maintainability&lt;/li&gt;
&lt;li&gt;reasoning about pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TaskFlow simplifies DAG structure without changing Airflow’s core engine&lt;/li&gt;
&lt;li&gt;XComs still exist but are abstracted&lt;/li&gt;
&lt;li&gt;The real improvement is cleaner data flow modeling&lt;/li&gt;
&lt;li&gt;Refactoring old DAGs is one of the best ways to understand it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>apacheairflow</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Data Pipelines Explained Simply (and How to Build Them with Python)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:34:55 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</link>
      <guid>https://forem.com/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</guid>
      <description>&lt;p&gt;Data pipelines are the backbone of modern data-driven organizations. They automate the movement, transformation, and storage of data - from raw sources to actionable insights.&lt;/p&gt;

&lt;p&gt;Python has become the go-to language for building scalable pipelines because of its rich ecosystem, flexibility, and ease of use.&lt;/p&gt;

&lt;p&gt;This guide walks through the fundamentals, tools, and best practices for building robust data pipelines using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Data Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you need to supply clean water to a village. The process involves collecting water from different sources (rivers, wells, rain), purifying it, transporting it, and storing it so people can access it whenever they need it.&lt;/p&gt;

&lt;p&gt;A data pipeline works in a very similar way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" alt="A data pipeline represented as a water system, showing how raw data flows through ingestion, transformation, storage, and finally consumption." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It automates the journey of raw, unstructured data from multiple sources (like databases, APIs, or IoT devices) and transforms it into clean, usable data stored in a destination (like a data warehouse) for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Components of a Data Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down using the same analogy:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Collecting Water (Data Ingestion)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Just like gathering water from lakes or wells, a pipeline starts by extracting data from sources such as databases, APIs, spreadsheets, or sensors.&lt;/p&gt;

&lt;p&gt;The goal here is simple: get all the data into one system, no matter how scattered it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Filtering and Purifying (Data Transformation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Raw water isn’t clean—and neither is raw data.&lt;/p&gt;

&lt;p&gt;At this stage, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes duplicates&lt;/li&gt;
&lt;li&gt;Handles missing values&lt;/li&gt;
&lt;li&gt;Standardizes formats&lt;/li&gt;
&lt;li&gt;Enriches data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where messy data becomes usable.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transporting Through Pipes (Data Movement)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once cleaned, water flows through pipes. In data pipelines, this represents the movement of data between systems.&lt;/p&gt;

&lt;p&gt;This can involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL processes&lt;/li&gt;
&lt;li&gt;Message queues (like Kafka)&lt;/li&gt;
&lt;li&gt;Cloud data transfer services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to move data efficiently without delays or bottlenecks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Storing in Tanks (Data Storage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clean water is stored in tanks. Similarly, processed data is stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses (like Snowflake)&lt;/li&gt;
&lt;li&gt;Data lakes (like AWS S3)&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where data becomes ready for use.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Accessing on Demand (Data Consumption)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, people use the water.&lt;/p&gt;

&lt;p&gt;In the same way, data is consumed through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Machine learning models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where insights actually happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Essential Python Libraries and Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python supports every stage of a pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests&lt;/code&gt; - API calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - handling CSV/JSON files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Transformation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - cleaning and aggregation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PySpark&lt;/code&gt; - large-scale distributed processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SQLAlchemy&lt;/code&gt; - database interaction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boto3&lt;/code&gt; - AWS S3 integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Orchestration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Apache Airflow&lt;/code&gt; - workflow scheduling and automation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dagster&lt;/code&gt; - modern pipeline orchestration with observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Implement retries and proper logging to avoid silent failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track pipeline health using tools like Airflow’s UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Documentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keep clear documentation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Dependencies&lt;/li&gt;
&lt;li&gt;Workflow logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Test each stage of the pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Sample datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Popular Frameworks for Advanced Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; - Best for complex workflows with dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt; - Strong focus on testing and data asset visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt; - Simplifies building fault-tolerant pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luigi&lt;/strong&gt; - Good for batch processing and dependency management&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>etl</category>
      <category>python</category>
      <category>datapipeline</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</link>
      <guid>https://forem.com/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</guid>
      <description>&lt;p&gt;When I first started learning data engineering, ETL and ELT honestly felt like the same thing with just swapped letters. Everyone kept mentioning them like they were obvious concepts, but I had to sit down and really break them apart before it made sense.&lt;/p&gt;

&lt;p&gt;If you’re in the same place, don’t worry, you’re not alone.&lt;/p&gt;

&lt;p&gt;Let’s make it simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  First things first: what do ETL and ELT even mean?
&lt;/h2&gt;

&lt;p&gt;Both ETL and ELT are ways of moving and processing data from one place to another.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL (Extract, Transform, Load)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from a source (like an API or database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it before storing it (cleaning, filtering, joining, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; the final cleaned data into a target system (like a data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you clean the data before storing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" alt="ELT" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT (Extract, Load, Transform)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from the source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; it directly into the storage system first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it inside the database/warehouse later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you store raw data first, then clean it inside the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" alt="ETL" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So what’s the real difference?
&lt;/h2&gt;

&lt;p&gt;The biggest difference is &lt;strong&gt;where the transformation happens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL → Transform happens outside the warehouse&lt;/li&gt;
&lt;li&gt;ELT → Transform happens inside the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one shift changes a lot more than you’d think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ETL makes sense
&lt;/h2&gt;

&lt;p&gt;ETL is usually used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have smaller datasets&lt;/li&gt;
&lt;li&gt;You need strict data control before loading&lt;/li&gt;
&lt;li&gt;Your system can’t handle heavy processing&lt;/li&gt;
&lt;li&gt;Data quality must be enforced early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like cleaning your room before putting things in storage.&lt;/p&gt;

&lt;p&gt;You don’t want messy data entering your system at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ELT makes sense
&lt;/h2&gt;

&lt;p&gt;ELT is more common in modern systems, especially with cloud platforms.&lt;/p&gt;

&lt;p&gt;It works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have large volumes of data&lt;/li&gt;
&lt;li&gt;You’re using powerful cloud warehouses (like Snowflake or BigQuery)&lt;/li&gt;
&lt;li&gt;You want flexibility in how data is transformed&lt;/li&gt;
&lt;li&gt;You want to keep raw data for future use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like dumping everything into a warehouse first, then organizing it later when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple real-world example
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a dashboard for an e-commerce app.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ETL:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull order data&lt;/li&gt;
&lt;li&gt;Clean it (remove duplicates, fix missing values)&lt;/li&gt;
&lt;li&gt;Then load it into your database ready for reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is neat before it even arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ELT:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull raw order data&lt;/li&gt;
&lt;li&gt;Load everything into a data warehouse&lt;/li&gt;
&lt;li&gt;Later write SQL transformations to clean and structure it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you more flexibility if business rules change later.&lt;/p&gt;

&lt;h2&gt;
  
  
  My key takeaway
&lt;/h2&gt;

&lt;p&gt;When I first learned this, I thought ETL was “old” and ELT was “new,” but that’s not really true.&lt;/p&gt;

&lt;p&gt;They both still matter.&lt;/p&gt;

&lt;p&gt;Here’s a simple way I now remember it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL = Clean first, store later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ELT = Store first, clean later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common mistakes beginners make
&lt;/h2&gt;

&lt;p&gt;A few things that confused me at the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thinking ELT means “no cleaning” (it still involves transformation!)&lt;/li&gt;
&lt;li&gt;Mixing up where SQL transformations happen&lt;/li&gt;
&lt;li&gt;Assuming one is always better than the other (it depends on the system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So… which one should YOU use?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re working with traditional systems → ETL is common&lt;/li&gt;
&lt;li&gt;If you’re in modern cloud data engineering → ELT is more popular&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most real companies actually use a &lt;strong&gt;mix of both&lt;/strong&gt;, depending on the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  To make this even more practical, here are some common tools used in real ETL and ELT workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL Tools (Transformation happens before loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; – for scheduling and orchestrating ETL workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; – widely used in enterprise ETL pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talend&lt;/strong&gt; – open-source tool for data integration and transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache NiFi&lt;/strong&gt; – good for real-time data flow and routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSIS (SQL Server Integration Services)&lt;/strong&gt; – Microsoft-based ETL tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools usually handle data cleaning and transformation before sending data to a warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Tools (Transformation happens after loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; – modern cloud data warehouse with strong ELT support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt; – popular for serverless ELT workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift&lt;/strong&gt; – widely used in AWS-based data stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt; – one of the most popular tools for transformations inside the warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks (Apache Spark)&lt;/strong&gt; – used for large-scale ELT processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ELT setups, tools like &lt;strong&gt;dbt&lt;/strong&gt; handle transformation using SQL after data is loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Once I understood this difference, a lot of other concepts like data pipelines, warehouses, and analytics started to make way more sense.&lt;/p&gt;

&lt;p&gt;If you’re learning data engineering right now, don’t rush it. Build a small pipeline, try both approaches, and you’ll see the difference quickly.&lt;/p&gt;

&lt;p&gt;That’s where it really clicks.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
    </item>
  </channel>
</rss>
