<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Egor Tarasenko</title>
    <description>The latest articles on Forem by Egor Tarasenko (@egor_tarasenko).</description>
    <link>https://forem.com/egor_tarasenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3675147%2Fb0f749ac-c27f-4377-a596-3fd7b6f9eecc.jpg</url>
      <title>Forem: Egor Tarasenko</title>
      <link>https://forem.com/egor_tarasenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/egor_tarasenko"/>
    <language>en</language>
    <item>
      <title>Building a Flight Data Pipeline Without Trusting AI</title>
      <dc:creator>Egor Tarasenko</dc:creator>
      <pubDate>Sun, 29 Mar 2026 22:54:12 +0000</pubDate>
      <link>https://forem.com/egor_tarasenko/building-a-flight-data-pipeline-without-trusting-ai-2766</link>
      <guid>https://forem.com/egor_tarasenko/building-a-flight-data-pipeline-without-trusting-ai-2766</guid>
      <description>&lt;p&gt;A few months ago I was cleaning up my GitHub and found this repo -- &lt;a href="https://github.com/dat-a-ish/data-gov" rel="noopener noreferrer"&gt;https://github.com/dat-a-ish/data-gov&lt;/a&gt;. It was built at that weird inflection point -- AI tools existed, Copilot was already in my editor, but I still rewrote every suggestion by hand because I didn't fully trust what came out. So the code in here is genuinely mine, typed line by line, and that's exactly what makes it useful as a learning resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4ax0lkutal0hqkw41sd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4ax0lkutal0hqkw41sd.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The irony? The project pulls real flight data from the Israeli Government API -- Ben Gurion Airport departures and arrivals. With the current situation in the Middle East making flights in and out of Tel Aviv a daily news story, a pipeline that tracks exactly that data feels more relevant than ever. But jokes aside -- this repo is not about geopolitics. It's a study in how a production-grade batch data pipeline actually fits together, from raw API ingestion all the way to a live dashboard.&lt;/p&gt;

&lt;p&gt;My honest recommendation: clone it, read it, and then build your own version without AI help. Not because AI is bad -- but because the only way to truly understand how these tools fit together is to wire them up yourself at least once.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does This Project Actually Do?
&lt;/h2&gt;

&lt;p&gt;The pipeline collects real-time flight data from Israel's open government API, enriches it with geolocation data, transforms it through multiple modeling layers, validates it, and serves it on a Streamlit dashboard. Every stage is orchestrated by Apache Airflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7sv5r2ueos1for5ymnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7sv5r2ueos1for5ymnf.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Apache Airflow 2.8.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;dbt 1.7.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL 15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Streamlit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Quality&lt;/td&gt;
&lt;td&gt;Great Expectations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geospatial&lt;/td&gt;
&lt;td&gt;GeoPandas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infra&lt;/td&gt;
&lt;td&gt;Docker, Terraform, AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Architecture at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eg6vdbk84khpj0t0bq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eg6vdbk84khpj0t0bq0.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline has four stages that flow into each other using Airflow's Dataset-based triggering -- no fixed cron dependencies, just data contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flight_sensor -&amp;gt; flights_load -&amp;gt; flights_geocode -&amp;gt; flights_dbt -&amp;gt; dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage is a separate DAG. When one finishes and emits a Dataset event, the next one wakes up automatically. This is a clean way to build loosely coupled, event-driven batch pipelines in Airflow without hardcoding schedules everywhere.&lt;/p&gt;
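&lt;p&gt;To make the triggering concrete, here is a minimal sketch of the producer/consumer pattern using Airflow 2.8's Dataset API. The dataset and DAG names follow the article; the task bodies are placeholders:&lt;/p&gt;

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

flight_load = Dataset("flight_load")

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def flights_load():
    # Emitting the Dataset on success is the producer side of the contract.
    @task(outlets=[flight_load])
    def load():
        ...  # fetch from the API, write to the passive table
    load()

# No cron at all: this DAG wakes whenever the flight_load Dataset is updated.
@dag(schedule=[flight_load], start_date=datetime(2024, 1, 1), catchup=False)
def flights_geocode():
    @task
    def geocode():
        ...  # enrich city/country pairs with coordinates
    geocode()

flights_load()
flights_geocode()
```

&lt;p&gt;Passing a list of Datasets as &lt;code&gt;schedule&lt;/code&gt; is what replaces the cron expression on the consumer side.&lt;/p&gt;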




&lt;h2&gt;
  
  
  Part 1: Data Ingestion -- &lt;code&gt;base/pipelines/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is where raw data enters the system. The &lt;code&gt;flights.py&lt;/code&gt; pipeline hits the government API, pulls the latest flight records, and loads them into PostgreSQL using a passive table pattern.&lt;/p&gt;

&lt;p&gt;The idea is simple but powerful: you never write directly to the production table. Instead you write to a &lt;code&gt;_passive&lt;/code&gt; staging table, validate it, and only then atomically swap it with the production table. Zero downtime, easy rollback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The core atomic swap
&lt;/span&gt;&lt;span class="nf"&gt;exchange_tables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flight_source_passive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flight_source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;create_or_replace_synonym_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flight_source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
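&lt;p&gt;A runnable miniature of the same swap, using SQLite from the standard library so you can try it without Postgres. The repo targets PostgreSQL; the table and function names follow the snippet above, everything else here is assumed:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # manage tx manually
cur = con.cursor()

cur.execute("CREATE TABLE flight_source (flight_code TEXT)")
cur.execute("CREATE TABLE flight_source_passive (flight_code TEXT)")
cur.executemany("INSERT INTO flight_source_passive VALUES (?)",
                [("LY001",), ("LY315",)])

# Validate the passive table before it can ever reach readers.
(n,) = cur.execute("SELECT COUNT(*) FROM flight_source_passive").fetchone()
assert n != 0, "refusing to promote an empty load"

# exchange_tables: both renames succeed together or not at all.
cur.execute("BEGIN")
cur.execute("ALTER TABLE flight_source RENAME TO flight_source_old")
cur.execute("ALTER TABLE flight_source_passive RENAME TO flight_source")
cur.execute("COMMIT")
cur.execute("DROP TABLE flight_source_old")

# create_or_replace_synonym_view: readers only ever query the view.
cur.execute("DROP VIEW IF EXISTS flight_source_v")
cur.execute("CREATE VIEW flight_source_v AS SELECT * FROM flight_source")
print(cur.execute("SELECT COUNT(*) FROM flight_source_v").fetchone()[0])  # 2
```

&lt;p&gt;In PostgreSQL the same idea is two &lt;code&gt;ALTER TABLE ... RENAME&lt;/code&gt; statements inside a single transaction, which is what makes the swap atomic for readers.&lt;/p&gt;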



&lt;p&gt;The geocoding pipeline (&lt;code&gt;flights_geocode.py&lt;/code&gt;) enriches the raw data by converting city/country pairs to latitude/longitude using GeoPandas and OpenStreetMap. This runs as a separate DAG triggered automatically after ingestion completes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Orchestration -- &lt;code&gt;airflow/dags/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Four DAGs, each with a single responsibility:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flight_sensor&lt;/strong&gt; -- runs continuously, polls the source API for new data. When it detects a change, it fires a trigger to &lt;code&gt;flights_load&lt;/code&gt;. This is your entry point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flights_load&lt;/strong&gt; -- the main ingestion DAG. Runs hourly, or on-demand from the sensor. Fetches data, loads to the passive table, runs dbt tests, runs Great Expectations validation, then emits a &lt;code&gt;Dataset('flight_load')&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flights_geocode&lt;/strong&gt; -- triggered by &lt;code&gt;flight_load&lt;/code&gt;. Enriches the dataset with coordinates. Emits &lt;code&gt;Dataset('flights_geocode')&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt_dags&lt;/strong&gt; -- dynamically generates two dbt DAGs from config. &lt;code&gt;flights_dbt&lt;/code&gt; builds the transformation models and is triggered when either &lt;code&gt;flight_load&lt;/code&gt; or &lt;code&gt;flights_geocode&lt;/code&gt; completes. &lt;code&gt;clean_up_database_dbt&lt;/code&gt; runs daily and drops any database objects not touched in the current run -- a clean, automated way to handle schema drift.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0kekpmxjbtoh0b21bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0kekpmxjbtoh0b21bo.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing worth studying: the &lt;code&gt;dag_template.py&lt;/code&gt; custom decorator. It wraps every DAG definition with consistent defaults -- retry logic, alerting, tags -- so you don't repeat yourself across DAG files.&lt;/p&gt;
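&lt;p&gt;The repo's decorator is worth reading in full; as a rough, hypothetical sketch of the pattern (names and defaults invented, the return value stands in for a real DAG object):&lt;/p&gt;

```python
import functools
from datetime import timedelta

# Shared defaults every DAG should get without repeating them per file.
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}

def dag_template(**overrides):
    def decorate(build_dag):
        @functools.wraps(build_dag)
        def wrapper(*args, **kwargs):
            # Per-DAG overrides win over the shared defaults.
            settings = {**DEFAULT_ARGS, **overrides}
            return build_dag(*args, default_args=settings, **kwargs)
        return wrapper
    return decorate

@dag_template(tags=["flights"])
def make_dag(name, default_args=None):
    return {"dag_id": name, **(default_args or {})}

print(make_dag("flights_load")["retries"])  # 2
```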




&lt;h2&gt;
  
  
  Part 3: Transformation -- &lt;code&gt;base/dbt/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is where raw flight records become usable analytical models. The dbt project implements three modeling layers -- a classic medallion-style architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silver -- Normalized (3NF / Inmon)&lt;/strong&gt;&lt;br&gt;
Breaks the raw table into clean, normalized entities: &lt;code&gt;flights&lt;/code&gt;, &lt;code&gt;airports&lt;/code&gt;, &lt;code&gt;cities&lt;/code&gt;, &lt;code&gt;countries&lt;/code&gt;, &lt;code&gt;flight_statuses&lt;/code&gt;, &lt;code&gt;flight_companies&lt;/code&gt;. Good for integrity, joins, and historical tracking. Slowly Changing Dimensions (SCD Type 2) are implemented here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gold Star Schema -- Dimensional (Kimball)&lt;/strong&gt;&lt;br&gt;
Fact and dimension tables optimized for analytical queries: &lt;code&gt;fact_flights_amount&lt;/code&gt;, &lt;code&gt;dim_airports&lt;/code&gt;, &lt;code&gt;dim_terminals&lt;/code&gt;, &lt;code&gt;dim_time&lt;/code&gt;. The classic BI-ready format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gold OBT -- One Big Table&lt;/strong&gt;&lt;br&gt;
A fully denormalized &lt;code&gt;flights_obt&lt;/code&gt; table. Fast for dashboards, no joins needed. This is what Streamlit reads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5bd7rdy56ao9eg6vg3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5bd7rdy56ao9eg6vg3u.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The progression from raw -&amp;gt; silver -&amp;gt; gold -&amp;gt; OBT is something every data engineer should build manually at least once. It makes the trade-offs between normalization and query performance very concrete.&lt;/p&gt;
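&lt;p&gt;A toy version of that trade-off, with an invented two-table schema (the real models live in &lt;code&gt;base/dbt/&lt;/code&gt;): the normalized tables keep integrity, while the OBT pre-joins them once so every dashboard query is join-free.&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE airports  (airport_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE flights   (flight_code TEXT, airport_id INTEGER);
    INSERT INTO airports VALUES (1, 'Ben Gurion');
    INSERT INTO flights  VALUES ('LY001', 1), ('LY315', 1);

    -- The OBT materializes the join once, at transform time.
    CREATE TABLE flights_obt AS
    SELECT f.flight_code, a.name AS airport_name
    FROM flights f JOIN airports a USING (airport_id);
""")
print(cur.execute("SELECT airport_name FROM flights_obt").fetchone()[0])
```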


&lt;h2&gt;
  
  
  Part 4: Data Quality -- &lt;code&gt;base/gx/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Two layers of validation run automatically after every ingestion:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;dbt tests&lt;/strong&gt; -- schema tests on the raw source (not null, unique, accepted values). These run before the table swap, so bad data never reaches production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great Expectations&lt;/strong&gt; -- more complex expectation suites on the loaded data: column value ranges, row count thresholds, distribution checks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If either layer fails, the DAG fails cleanly and the passive table is never promoted.&lt;/p&gt;
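&lt;p&gt;To make the gate concrete, here is a minimal stand-in for these expectations in plain Python -- not the Great Expectations API, just the shape of the checks it runs (column names and accepted values are invented):&lt;/p&gt;

```python
rows = [
    {"flight_code": "LY001", "status": "LANDED"},
    {"flight_code": "LY315", "status": "DEPARTED"},
]

def expect_not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def expect_accepted_values(rows, column, accepted):
    return all(r[column] in accepted for r in rows)

checks = [
    expect_not_null(rows, "flight_code"),
    expect_accepted_values(rows, "status", {"LANDED", "DEPARTED", "DELAYED"}),
    len(rows) != 0,  # row-count threshold
]

# Promote the passive table only if every expectation passes.
promote = all(checks)
print(promote)  # True
```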


&lt;h2&gt;
  
  
  Part 5: Visualization -- &lt;code&gt;streamlit/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;A multi-page Streamlit app that reads directly from the &lt;code&gt;flights_obt&lt;/code&gt; view in PostgreSQL. The main page shows a live map of flights; other pages break down data by airline, status, terminal, and time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24ugicix58lnmqaz0e60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24ugicix58lnmqaz0e60.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting design choice here: Streamlit reads from a view (&lt;code&gt;flight_source_v&lt;/code&gt;), not the underlying table. The atomic swap in the ingestion layer updates the view pointer, so the dashboard always reads a consistent snapshot without downtime or stale reads.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 6: Infrastructure -- &lt;code&gt;prod_deploy/&lt;/code&gt; and &lt;code&gt;.github/workflows/&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Three separate CI/CD pipelines -- one each for Airflow, dbt, and Streamlit -- triggered only when relevant directories change. Each pipeline lints the code, builds a Docker image, pushes it to AWS ECR, and deploys to EC2 via SSH.&lt;/p&gt;
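&lt;p&gt;The directory filtering is plain GitHub Actions configuration. A hedged sketch of what the Streamlit pipeline's trigger might look like (paths and branch name assumed, not taken from the repo):&lt;/p&gt;

```yaml
# Hypothetical trigger: run only when the Streamlit app changes.
on:
  push:
    branches: [main]
    paths:
      - "streamlit/**"
```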

&lt;p&gt;Terraform manages ECR repositories and the EC2 instance. The local development setup mirrors production using Docker Compose, so what runs locally runs in prod.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Local setup takes about five minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dat-a-ish/data-gov
docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; local_deploy/docker-compose.yml up

&lt;span class="c"&gt;# Airflow:   http://localhost:8080&lt;/span&gt;
&lt;span class="c"&gt;# dbt docs:  http://localhost&lt;/span&gt;
&lt;span class="c"&gt;# Streamlit: http://localhost:8501&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to run just the ingestion without Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; local_deploy/docker-compose-database.yml up
python &lt;span class="nt"&gt;-m&lt;/span&gt; base.pipelines.flights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And dbt alone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;base/dbt/data_gov_il
dbt run &lt;span class="nt"&gt;-t&lt;/span&gt; dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why Build This Without AI?
&lt;/h2&gt;

&lt;p&gt;Not as a purist exercise -- just because the concepts here (atomic swaps, passive tables, dataset-triggered DAGs, dbt layer separation, GX validation) become intuitive only when you've had to debug them yourself. When something breaks at 2am and your DAG is stuck, you need to know the system, not just remember that AI generated it for you.&lt;/p&gt;

&lt;p&gt;Build it, break it, fix it. Then use AI to build the next one 10x faster.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! If you found this useful, follow me here on dev.to for more Data &amp;amp; AI Engineering content.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>apacheairflow</category>
      <category>dbt</category>
      <category>python</category>
    </item>
  </channel>
</rss>
