<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anthony Gicheru</title>
    <description>The latest articles on Forem by Anthony Gicheru (@anthony-gicheru).</description>
    <link>https://forem.com/anthony-gicheru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186529%2Fed9dc374-bfac-4eee-bce7-90ea63105510.jpeg</url>
      <title>Forem: Anthony Gicheru</title>
      <link>https://forem.com/anthony-gicheru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anthony-gicheru"/>
    <language>en</language>
    <item>
      <title>Refactoring Airflow Pipelines: From PythonOperator to TaskFlow</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:52:39 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</link>
      <guid>https://forem.com/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</guid>
      <description>&lt;h1&gt;
  
  
  Actually Embracing TaskFlow After a Year of Doing It the “Old Way”
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Introduction: This Isn’t New… But It &lt;em&gt;Feels&lt;/em&gt; New
&lt;/h2&gt;

&lt;p&gt;If you’ve been using Airflow for a while, as I have, you probably didn’t start with the TaskFlow API.&lt;/p&gt;

&lt;p&gt;You likely started with the classic Airflow 2.x style:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PythonOperator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;**kwargs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ti.xcom_push()&lt;/code&gt; and &lt;code&gt;ti.xcom_pull()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Explicit task chaining with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent over a year building pipelines this way. And to be clear, it works. It’s stable, production-ready, and widely used.&lt;/p&gt;

&lt;p&gt;But here’s the interesting part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TaskFlow API has existed since Airflow 2.0. I just didn’t fully adopt it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Honestly, I ignored TaskFlow for a long time because I thought it was just ‘syntactic sugar’. That’s more common than people admit.&lt;/p&gt;

&lt;p&gt;Most production systems and tutorials still rely on operators, so you naturally stay in that pattern. It’s only later, when readability and maintainability start to matter, that TaskFlow becomes interesting.&lt;/p&gt;

&lt;p&gt;And once it clicks, it changes how you think about Airflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Concepts: Same Engine, Different Experience
&lt;/h2&gt;

&lt;p&gt;TaskFlow doesn’t replace Airflow concepts; it abstracts them.&lt;/p&gt;

&lt;p&gt;You still work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks&lt;/li&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;XComs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is &lt;em&gt;how&lt;/em&gt; you express them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traditional_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it introduces a lot of orchestration boilerplate into your business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feels simpler because it is.&lt;/p&gt;

&lt;p&gt;TaskFlow removes explicit XCom handling and lets function returns define data flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Real Shift: From Wiring Tasks to Modeling Data Flow
&lt;/h2&gt;

&lt;p&gt;With the traditional approach, your mental model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task A - XCom - Task B - XCom - Task C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With TaskFlow, it becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" alt="Airflow DAG: Traditional vs Taskflow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same execution engine. Different abstraction.&lt;/p&gt;

&lt;p&gt;The shift is from &lt;em&gt;task orchestration&lt;/em&gt; to &lt;em&gt;data flow composition&lt;/em&gt;.&lt;/p&gt;
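&lt;p&gt;To make the idea of data flow composition concrete, here is a toy sketch in plain Python. It is not Airflow code: &lt;code&gt;toy_task&lt;/code&gt;, &lt;code&gt;Ref&lt;/code&gt;, and &lt;code&gt;edges&lt;/code&gt; are hypothetical names made up for illustration. It only shows the general trick of recording dependencies when one decorated function is called with the result of another.&lt;/p&gt;

```python
# Toy model only: this is not Airflow code and does not use Airflow's API.
# It shows how calling decorated functions can record "who feeds whom".

edges = []  # (upstream_name, downstream_name) pairs collected while building the graph


class Ref:
    """Placeholder for a task's future output (loosely analogous to an XComArg)."""

    def __init__(self, task_name):
        self.task_name = task_name


def toy_task(func):
    def build(*args):
        # Any Ref passed as an argument means "this task depends on that one".
        for arg in args:
            if isinstance(arg, Ref):
                edges.append((arg.task_name, func.__name__))
        # Nothing is executed here; we only describe the graph.
        return Ref(func.__name__)
    return build


@toy_task
def extract():
    return [1, 2, 3]


@toy_task
def transform(data):
    return [x * 2 for x in data]


@toy_task
def load(rows):
    print(rows)


load(transform(extract()))
print(edges)  # [('extract', 'transform'), ('transform', 'load')]
```

&lt;p&gt;The nested call reads like ordinary Python, yet the only thing it produces is a list of edges. That is roughly the experience TaskFlow gives you: you write function calls, and the dependency graph falls out of them.&lt;/p&gt;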

&lt;h2&gt;
  
  
  4. XComs: Manual vs Automatic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You manage everything explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serialization&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You focus on logic.&lt;/p&gt;
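&lt;p&gt;If it helps to picture what is being handled for you, the following is a deliberately simplified model in plain Python. It assumes JSON serialization and uses a dictionary in place of Airflow’s metadata database; &lt;code&gt;xcom_store&lt;/code&gt; and &lt;code&gt;run_task&lt;/code&gt; are hypothetical names, not Airflow APIs. The one detail taken from Airflow is the key &lt;code&gt;return_value&lt;/code&gt;, which is the default XCom key for a task’s return value.&lt;/p&gt;

```python
import json

# Toy model only: a dict stands in for Airflow's metadata database.
xcom_store = {}


def run_task(task_id, func, *upstream_ids):
    # Retrieval: load and deserialize each upstream task's stored return value.
    inputs = [json.loads(xcom_store[(up, "return_value")]) for up in upstream_ids]
    result = func(*inputs)
    # Serialization and storage: save the return value under the default key.
    xcom_store[(task_id, "return_value")] = json.dumps(result)
    return result


def extract():
    return [1, 2, 3]


def transform(data):
    return [x * 2 for x in data]


run_task("extract", extract)
doubled = run_task("transform", transform, "extract")
print(doubled)  # [2, 4, 6]
```

&lt;p&gt;Notice that &lt;code&gt;extract&lt;/code&gt; and &lt;code&gt;transform&lt;/code&gt; contain no storage code at all. Everything that touches the store lives in the wrapper, which is the part TaskFlow takes off your hands.&lt;/p&gt;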

&lt;h3&gt;
  
  
  When You Still Need Control
&lt;/h3&gt;

&lt;p&gt;TaskFlow still allows explicit control when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models.xcom_arg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XComArg&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;XComArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Real-World Example: Gas Prices ETL Refactor
&lt;/h2&gt;

&lt;p&gt;I didn’t build two versions of this pipeline at once.&lt;/p&gt;

&lt;p&gt;I originally built it using the traditional Airflow 2.x approach and later refactored it using TaskFlow.&lt;/p&gt;

&lt;p&gt;That’s when the difference became clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API - Extract gas prices - Transform - Store in PostgreSQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Reference
&lt;/h3&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/Anthony-Gicheru/Gas-Prices-ETL-with-Apache-Airflow" rel="noopener noreferrer"&gt;GitHub link to the project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It includes both the original DAG and the TaskFlow refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach tightly couples logic with Airflow internals.&lt;/p&gt;

&lt;p&gt;Data must be serialized manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TaskFlow Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the pipeline becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads like standard Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Changed
&lt;/h3&gt;

&lt;p&gt;The logic stayed the same. The structure changed completely.&lt;/p&gt;

&lt;p&gt;Instead of manually managing XComs, data flows naturally between functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task definition&lt;/td&gt;
&lt;td&gt;PythonOperator&lt;/td&gt;
&lt;td&gt;@task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data passing&lt;/td&gt;
&lt;td&gt;Manual XCom&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Readability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Wiring tasks&lt;/td&gt;
&lt;td&gt;Data flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Lessons From the Refactor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TaskFlow doesn’t remove XComs
&lt;/h3&gt;

&lt;p&gt;It only hides them.&lt;/p&gt;

&lt;p&gt;You still need to respect serialization limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;big_dataframe&lt;/span&gt;  &lt;span class="c1"&gt;# still not ideal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Passing data is easier, but not always better
&lt;/h3&gt;

&lt;p&gt;TaskFlow makes it easy to pass data between tasks, but large payloads should still live in external storage.&lt;/p&gt;
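&lt;p&gt;One common way to follow that advice is to pass a reference to the data rather than the data itself. The sketch below uses plain Python and a temporary local file so it can run anywhere. The function names and sample values are made up, and in a real deployment the file would normally sit in shared storage (object storage or a database), since workers may not share a local disk.&lt;/p&gt;

```python
import csv
import json
import os
import tempfile


def extract_to_file(rows):
    """Write rows to a CSV file and return only its path (a small string)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["station", "price"])
        writer.writerows(rows)
    return path


def transform_from_file(path):
    """Read the file back and return a small summary instead of the full data."""
    with open(path, newline="") as handle:
        prices = [float(row["price"]) for row in csv.DictReader(handle)]
    return {"rows": len(prices), "max_price": max(prices)}


rows = [("A", 3.10), ("B", 3.45), ("C", 2.99)]  # made-up sample values
path = extract_to_file(rows)
summary = transform_from_file(path)
os.remove(path)  # clean up the temporary file

# What would travel between tasks is the path and the summary, not the rows.
print(len(json.dumps(path)), summary)
```

&lt;p&gt;With this shape, each function’s return value stays small no matter how many rows the file holds, which keeps the XCom payload well within sensible limits.&lt;/p&gt;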

&lt;h3&gt;
  
  
  3. Refactoring was mostly structural
&lt;/h3&gt;

&lt;p&gt;Most of the work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing &lt;code&gt;**kwargs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;replacing XCom logic with returns&lt;/li&gt;
&lt;li&gt;simplifying task boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The biggest change is mental
&lt;/h3&gt;

&lt;p&gt;The shift was not technical; it was conceptual. It moved from asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I connect tasks?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does data flow through this pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t push large objects through XCom&lt;/li&gt;
&lt;li&gt;Don’t mix styles without intention&lt;/li&gt;
&lt;li&gt;Don’t overuse TaskFlow just because it’s cleaner&lt;/li&gt;
&lt;li&gt;Don’t forget serialization still exists&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;TaskFlow isn’t new, but adopting it after using the traditional approach makes its benefits clearer.&lt;/p&gt;

&lt;p&gt;It moves you from writing orchestration-heavy DAGs to writing clean Python workflows.&lt;/p&gt;

&lt;p&gt;And that shift improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;readability&lt;/li&gt;
&lt;li&gt;maintainability&lt;/li&gt;
&lt;li&gt;reasoning about pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TaskFlow simplifies DAG structure without changing Airflow’s core engine&lt;/li&gt;
&lt;li&gt;XComs still exist but are abstracted&lt;/li&gt;
&lt;li&gt;The real improvement is cleaner data flow modeling&lt;/li&gt;
&lt;li&gt;Refactoring old DAGs is one of the best ways to understand it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>apacheairflow</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Data Pipelines Explained Simply (and How to Build Them with Python)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:34:55 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</link>
      <guid>https://forem.com/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</guid>
      <description>&lt;p&gt;Data pipelines are the backbone of modern data-driven organizations. They automate the movement, transformation, and storage of data, from raw sources to actionable insights.&lt;/p&gt;

&lt;p&gt;Python has become the go-to language for building scalable pipelines because of its rich ecosystem, flexibility, and ease of use.&lt;/p&gt;

&lt;p&gt;This guide walks through the fundamentals, tools, and best practices for building robust data pipelines using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Data Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you need to supply clean water to a village. The process involves collecting water from different sources (rivers, wells, rain), purifying it, transporting it, and storing it so people can access it whenever they need it.&lt;/p&gt;

&lt;p&gt;A data pipeline works in a very similar way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" alt="A data pipeline represented as a water system, showing how raw data flows through ingestion, transformation, storage, and finally consumption." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It automates the journey of raw, often messy data from multiple sources (like databases, APIs, or IoT devices) and transforms it into clean, usable data stored in a destination (like a data warehouse) for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Components of a Data Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down using the same analogy:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Collecting Water (Data Ingestion)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Just like gathering water from lakes or wells, a pipeline starts by extracting data from sources such as databases, APIs, spreadsheets, or sensors.&lt;/p&gt;

&lt;p&gt;The goal here is simple: get all the data into one system, no matter how scattered it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Filtering and Purifying (Data Transformation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Raw water isn’t clean, and neither is raw data.&lt;/p&gt;

&lt;p&gt;At this stage, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes duplicates&lt;/li&gt;
&lt;li&gt;Handles missing values&lt;/li&gt;
&lt;li&gt;Standardizes formats&lt;/li&gt;
&lt;li&gt;Enriches data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where messy data becomes usable.&lt;/p&gt;
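
&lt;p&gt;To make the first three steps concrete, here is a minimal sketch in plain Python (standard library only, so it runs without installing anything). The sample rows, the &lt;code&gt;transform&lt;/code&gt; function, and the default of &lt;code&gt;0.0&lt;/code&gt; for a missing amount are all assumptions made for illustration; enrichment is left out. In pandas, the rough equivalents would be &lt;code&gt;drop_duplicates()&lt;/code&gt;, &lt;code&gt;fillna()&lt;/code&gt;, and string methods such as &lt;code&gt;str.strip()&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical sample rows standing in for freshly ingested data.
raw_rows = [
    {"order_id": "1", "country": " ke ", "amount": "100.5"},
    {"order_id": "1", "country": " ke ", "amount": "100.5"},  # exact duplicate
    {"order_id": "2", "country": "UG", "amount": ""},         # missing amount
    {"order_id": "3", "country": "tz", "amount": "40"},
]


def transform(rows, default_amount=0.0):
    """Deduplicate, fill missing values, and standardize formats."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Remove duplicates: keep only the first row seen for each order_id.
        if row["order_id"] in seen_ids:
            continue
        seen_ids.add(row["order_id"])

        # Handle missing values: an empty amount falls back to the default.
        amount = float(row["amount"]) if row["amount"] else default_amount

        cleaned.append({
            "order_id": int(row["order_id"]),
            # Standardize formats: trimmed, upper-case country codes.
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return cleaned


clean_rows = transform(raw_rows)
for row in clean_rows:
    print(row)
```

&lt;p&gt;Whether a missing amount should become zero, be dropped, or fail the run is a business decision; the default here is only a placeholder.&lt;/p&gt;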

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transporting Through Pipes (Data Movement)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once cleaned, water flows through pipes. In data pipelines, this represents the movement of data between systems.&lt;/p&gt;

&lt;p&gt;This can involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL processes&lt;/li&gt;
&lt;li&gt;Message queues (like Kafka)&lt;/li&gt;
&lt;li&gt;Cloud data transfer services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to move data efficiently, keeping delays and bottlenecks to a minimum.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Storing in Tanks (Data Storage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clean water is stored in tanks. Similarly, processed data is stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses (like Snowflake)&lt;/li&gt;
&lt;li&gt;Data lakes (like AWS S3)&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where data becomes ready for use.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Accessing on Demand (Data Consumption)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, people use the water.&lt;/p&gt;

&lt;p&gt;In the same way, data is consumed through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Machine learning models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where insights actually happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Essential Python Libraries and Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python supports every stage of a pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests&lt;/code&gt; - API calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - handling CSV/JSON files&lt;/li&gt;
&lt;/ul&gt;
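
&lt;p&gt;As a rough illustration of ingestion, the sketch below parses a CSV payload and a JSON payload into one list of records. It uses only the standard library so it runs offline; the payloads and the function names &lt;code&gt;ingest_csv&lt;/code&gt; and &lt;code&gt;ingest_json&lt;/code&gt; are hypothetical. In a real pipeline the text would typically come from a file or from an API response.&lt;/p&gt;

```python
import csv
import io
import json

# Hypothetical payloads: in a real pipeline these would come from a file on
# disk or from an HTTP response body (for example, fetched with requests).
CSV_TEXT = "order_id,amount\n1,100.5\n2,40\n"
JSON_TEXT = '[{"order_id": 3, "amount": 12.0}]'


def ingest_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def ingest_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


# Gather every source into one list so later stages see a single dataset.
records = ingest_csv(CSV_TEXT) + ingest_json(JSON_TEXT)
print(len(records), "records ingested")
```

&lt;p&gt;Note that the CSV values arrive as strings while the JSON values arrive as numbers. Ironing out that kind of mismatch is the job of the transformation stage.&lt;/p&gt;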

&lt;h3&gt;
  
  
  &lt;strong&gt;Transformation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - cleaning and aggregation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PySpark&lt;/code&gt; - large-scale distributed processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SQLAlchemy&lt;/code&gt; - database interaction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boto3&lt;/code&gt; - AWS S3 integration&lt;/li&gt;
&lt;/ul&gt;
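
&lt;p&gt;Here is a small sketch of the storage step. To keep it runnable without any setup, it uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module with an in-memory database as a stand-in for a real database or warehouse; the &lt;code&gt;orders&lt;/code&gt; table and the sample rows are assumptions for illustration.&lt;/p&gt;

```python
import sqlite3

# Hypothetical cleaned rows produced by an earlier transformation step.
clean_rows = [(1, "KE", 100.5), (2, "UG", 0.0), (3, "TZ", 40.0)]

# An in-memory SQLite database stands in for a real warehouse or database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)

# A parameterized bulk insert; the with-block commits on success and
# rolls back if any row fails.
with conn:
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)

# Read the data back to confirm what was stored.
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()
print(row_count, total_amount)
conn.close()
```

&lt;p&gt;Using placeholders (&lt;code&gt;?&lt;/code&gt;) instead of building SQL strings by hand keeps the load step safe from malformed or malicious values.&lt;/p&gt;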

&lt;h3&gt;
  
  
  &lt;strong&gt;Orchestration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Apache Airflow&lt;/code&gt; - workflow scheduling and automation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dagster&lt;/code&gt; - modern pipeline orchestration with observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Retry transient failures (such as a network timeout) and log every error so that nothing fails silently.&lt;/p&gt;
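
&lt;p&gt;One way this could look in plain Python is sketched below. The helper &lt;code&gt;run_with_retries&lt;/code&gt; and the deliberately flaky &lt;code&gt;flaky_extract&lt;/code&gt; step are hypothetical names invented for this example; orchestrators such as Airflow provide their own retry settings, which you would normally use instead.&lt;/p&gt;

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def run_with_retries(task, attempts=3, delay_seconds=0.0):
    """Call task(); on failure, log the error and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            # logger.exception records the traceback, so the failure shows up in the logs.
            logger.exception("Attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise  # Out of retries: surface the error to the caller.
            time.sleep(delay_seconds)


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("source temporarily unavailable")
    return ["row-1", "row-2"]


result = run_with_retries(flaky_extract)
print(result)
```

&lt;p&gt;Re-raising after the final attempt matters: a pipeline that swallows the last error looks healthy while producing nothing.&lt;/p&gt;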

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track pipeline health (run status, duration, and failures) using tools like Airflow’s UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Documentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keep clear documentation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Dependencies&lt;/li&gt;
&lt;li&gt;Workflow logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Test each stage of the pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Sample datasets&lt;/li&gt;
&lt;/ul&gt;
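
&lt;p&gt;As a minimal sketch, a unit test over a sample dataset might look like this with the built-in &lt;code&gt;unittest&lt;/code&gt; module. The function under test, &lt;code&gt;standardize_country&lt;/code&gt;, is a made-up transformation used only to show the pattern.&lt;/p&gt;

```python
import unittest


def standardize_country(code):
    """Hypothetical transformation under test: trim and upper-case a country code."""
    return code.strip().upper()


class StandardizeCountryTest(unittest.TestCase):
    def test_sample_dataset(self):
        # A small sample dataset: (raw input, expected output) pairs.
        samples = [(" ke ", "KE"), ("UG", "UG"), ("tz\n", "TZ")]
        for raw, expected in samples:
            with self.subTest(raw=raw):
                self.assertEqual(standardize_country(raw), expected)


# Run the tests programmatically so the snippet works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StandardizeCountryTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print("all passed:", outcome.wasSuccessful())
```

&lt;p&gt;Keeping each stage a small function that takes data in and returns data out is what makes this kind of test easy to write.&lt;/p&gt;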

&lt;h2&gt;
  
  
  &lt;strong&gt;Popular Frameworks for Advanced Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; - Best for complex workflows with dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt; - Strong focus on testing and data asset visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt; - Simplifies building fault-tolerant pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luigi&lt;/strong&gt; - Good for batch processing and dependency management&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>etl</category>
      <category>python</category>
      <category>datapipeline</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://forem.com/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</link>
      <guid>https://forem.com/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</guid>
      <description>&lt;p&gt;When I first started learning data engineering, ETL and ELT honestly felt like the same thing with just swapped letters. Everyone kept mentioning them like they were obvious concepts, but I had to sit down and really break them apart before they made sense.&lt;/p&gt;

&lt;p&gt;If you’re in the same place, don’t worry, you’re not alone.&lt;/p&gt;

&lt;p&gt;Let’s make it simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  First things first: what do ETL and ELT even mean?
&lt;/h2&gt;

&lt;p&gt;Both ETL and ELT are ways of moving and processing data from one place to another.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL (Extract, Transform, Load)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from a source (like an API or database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it before storing it (cleaning, filtering, joining, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; the final cleaned data into a target system (like a data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you clean the data before storing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" alt="ELT" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT (Extract, Load, Transform)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from the source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; it directly into the storage system first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it inside the database/warehouse later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you store raw data first, then clean it inside the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" alt="ETL" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So what’s the real difference?
&lt;/h2&gt;

&lt;p&gt;The biggest difference is &lt;strong&gt;where the transformation happens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL → Transform happens outside the warehouse&lt;/li&gt;
&lt;li&gt;ELT → Transform happens inside the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one shift changes a lot more than you’d think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ETL makes sense
&lt;/h2&gt;

&lt;p&gt;ETL is usually used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have smaller datasets&lt;/li&gt;
&lt;li&gt;You need strict data control before loading&lt;/li&gt;
&lt;li&gt;Your target system (the database or warehouse) can’t handle heavy processing&lt;/li&gt;
&lt;li&gt;Data quality must be enforced early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like cleaning your room before putting things in storage.&lt;/p&gt;

&lt;p&gt;You don’t want messy data entering your system at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ELT makes sense
&lt;/h2&gt;

&lt;p&gt;ELT is more common in modern systems, especially with cloud platforms.&lt;/p&gt;

&lt;p&gt;It works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have large volumes of data&lt;/li&gt;
&lt;li&gt;You’re using powerful cloud warehouses (like Snowflake or BigQuery)&lt;/li&gt;
&lt;li&gt;You want flexibility in how data is transformed&lt;/li&gt;
&lt;li&gt;You want to keep raw data for future use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like dumping everything into a warehouse first, then organizing it later when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple real-world example
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a dashboard for an e-commerce app.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ETL:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull order data&lt;/li&gt;
&lt;li&gt;Clean it (remove duplicates, fix missing values)&lt;/li&gt;
&lt;li&gt;Then load it into your database ready for reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is neat before it even arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ELT:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull raw order data&lt;/li&gt;
&lt;li&gt;Load everything into a data warehouse&lt;/li&gt;
&lt;li&gt;Later write SQL transformations to clean and structure it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you more flexibility if business rules change later.&lt;/p&gt;
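
&lt;p&gt;To see the two approaches side by side, here’s a small sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module, with an in-memory database standing in for the warehouse. The order rows, the table names, and the rule of replacing a missing amount with &lt;code&gt;0.0&lt;/code&gt; are all made up for this example.&lt;/p&gt;

```python
import sqlite3

# Hypothetical raw order rows: (order_id, amount). One duplicate, one missing amount.
raw_orders = [(1, 100.5), (1, 100.5), (2, None), (3, 40.0)]

conn = sqlite3.connect(":memory:")  # stands in for the database or warehouse

# --- ETL: clean in Python first, then load only the finished rows. ---
seen = set()
cleaned = []
for order_id, amount in raw_orders:
    if order_id in seen:
        continue  # drop duplicates
    seen.add(order_id)
    cleaned.append((order_id, amount if amount is not None else 0.0))

conn.execute("CREATE TABLE etl_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO etl_orders VALUES (?, ?)", cleaned)

# --- ELT: load the raw rows untouched, then clean with SQL inside the database. ---
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_orders)
conn.execute(
    """
    CREATE TABLE elt_orders AS
    SELECT DISTINCT order_id, COALESCE(amount, 0.0) AS amount
    FROM raw_orders
    """
)

query = "SELECT order_id, amount FROM {} ORDER BY order_id"
etl_result = conn.execute(query.format("etl_orders")).fetchall()
elt_result = conn.execute(query.format("elt_orders")).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(etl_result == elt_result, raw_count)
```

&lt;p&gt;Both routes end with the same clean table, but only the ELT route still has the four raw rows sitting in &lt;code&gt;raw_orders&lt;/code&gt;. If the cleaning rule changes, you rewrite the SQL and rebuild, without going back to the source.&lt;/p&gt;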

&lt;h2&gt;
  
  
  My key takeaway
&lt;/h2&gt;

&lt;p&gt;When I first learned this, I thought ETL was “old” and ELT was “new,” but that’s not really true.&lt;/p&gt;

&lt;p&gt;They both still matter.&lt;/p&gt;

&lt;p&gt;Here’s a simple way I now remember it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL = Clean first, store later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ELT = Store first, clean later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common mistakes beginners make
&lt;/h2&gt;

&lt;p&gt;A few things that confused me at the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thinking ELT means “no cleaning” (it still involves transformation!)&lt;/li&gt;
&lt;li&gt;Mixing up where SQL transformations happen&lt;/li&gt;
&lt;li&gt;Assuming one is always better than the other (it depends on the system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So… which one should YOU use?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re working with traditional systems → ETL is common&lt;/li&gt;
&lt;li&gt;If you’re in modern cloud data engineering → ELT is more popular&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, many companies use a &lt;strong&gt;mix of both&lt;/strong&gt;, depending on the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common tools used in real ETL and ELT workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL Tools (Transformation happens before loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; – for scheduling and orchestrating ETL workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; – widely used in enterprise ETL pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talend&lt;/strong&gt; – data integration and transformation platform (its free Open Studio edition was discontinued in 2024)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache NiFi&lt;/strong&gt; – good for real-time data flow and routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSIS (SQL Server Integration Services)&lt;/strong&gt; – Microsoft-based ETL tool&lt;/li&gt;
&lt;/ul&gt;

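&lt;p&gt;These tools usually handle data cleaning and transformation before sending data to a warehouse. Airflow is the exception: it schedules and coordinates the steps rather than transforming the data itself.&lt;/p&gt;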

&lt;h3&gt;
  
  
  ELT Tools (Transformation happens after loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; – modern cloud data warehouse with strong ELT support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt; – popular for serverless ELT workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift&lt;/strong&gt; – widely used in AWS-based data stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt; – one of the most popular tools for transformations inside the warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks (Apache Spark)&lt;/strong&gt; – used for large-scale ELT processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ELT setups, tools like &lt;strong&gt;dbt&lt;/strong&gt; handle transformation using SQL after data is loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Once I understood this difference, a lot of other concepts like data pipelines, warehouses, and analytics started to make way more sense.&lt;/p&gt;

&lt;p&gt;If you’re learning data engineering right now, don’t rush it. Build a small pipeline, try both approaches, and you’ll see the difference quickly.&lt;/p&gt;

&lt;p&gt;That’s where it really clicks.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
    </item>
  </channel>
</rss>
