<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Denzel Kanyeki</title>
    <description>The latest articles on Forem by Denzel Kanyeki (@dkkinyua).</description>
    <link>https://forem.com/dkkinyua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1171630%2F5287df81-0524-44a4-adf7-f27b37582ef8.jpeg</url>
      <title>Forem: Denzel Kanyeki</title>
      <link>https://forem.com/dkkinyua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dkkinyua"/>
    <language>en</language>
    <item>
      <title>Building a Reddit Sentiment Pipeline using Python, PostgreSQL, VADER, Airflow, Grafana, Prometheus and StatsD</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 22 Sep 2025 09:39:34 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-reddit-sentiment-pipeline-using-python-postgresql-vader-airflow-grafana-prometheus-5gdn</link>
      <guid>https://forem.com/dkkinyua/building-a-reddit-sentiment-pipeline-using-python-postgresql-vader-airflow-grafana-prometheus-5gdn</guid>
      <description>&lt;p&gt;Data is everywhere, but making sense of it requires collecting, cleaning, storing, and analyzing it effectively. In this blog, I’ll walk you through how I built a &lt;strong&gt;Reddit Sentiment Analysis pipeline&lt;/strong&gt; that fetches Reddit posts, analyzes their sentiment, stores the results in &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and visualizes insights in &lt;strong&gt;Grafana&lt;/strong&gt;, all orchestrated with &lt;strong&gt;Apache Airflow&lt;/strong&gt; and containerized using &lt;strong&gt;Docker Compose&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to check out the repository, here is the &lt;a href="https://github.com/dkkinyua/RedditSentimentAnalysis" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extract Reddit posts from specific subreddits using the &lt;code&gt;praw&lt;/code&gt; library (Python Reddit API Wrapper).
&lt;/li&gt;
&lt;li&gt;Clean and analyze sentiment with &lt;strong&gt;VADER Sentiment Analyzer&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Load processed data into &lt;strong&gt;Postgres&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Orchestrate everything with &lt;strong&gt;Airflow DAGs&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Visualize insights on a &lt;strong&gt;Grafana Dashboard&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Get &lt;strong&gt;email notifications&lt;/strong&gt; when DAGs fail or succeed via SMTP.&lt;/li&gt;
&lt;li&gt;Monitor metrics and service health using &lt;strong&gt;Prometheus&lt;/strong&gt;, &lt;strong&gt;StatsD&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The snapshot below shows the project architecture:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gs848bbqxm2kbr92naw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gs848bbqxm2kbr92naw.jpg" alt="Architecture" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Extract Reddit Data
&lt;/h3&gt;

&lt;p&gt;Using the &lt;code&gt;praw&lt;/code&gt; Python library, I pull the &lt;strong&gt;top 50 posts&lt;/strong&gt; from subreddits like &lt;code&gt;r/kenya&lt;/code&gt; and &lt;code&gt;r/nairobi&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;reddit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;praw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Reddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRET_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data pipeline by u/user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;

          &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
          &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
          &lt;span class="n"&gt;subs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kenya&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nairobi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
              &lt;span class="n"&gt;subreddit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reddit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subreddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subreddit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selftext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selftext&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_utc&lt;/span&gt;
                  &lt;span class="p"&gt;})&lt;/span&gt;

          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Transform &amp;amp; Analyze
&lt;/h3&gt;

&lt;p&gt;Posts are cleaned (special characters removed, &lt;code&gt;N/A&lt;/code&gt; placeholders filtered out) and passed into the &lt;strong&gt;VADER Sentiment Analyzer&lt;/strong&gt;, which returns:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pos&lt;/code&gt; → Positive score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;neu&lt;/code&gt; → Neutral score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;neg&lt;/code&gt; → Negative score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compound&lt;/code&gt; → Overall sentiment
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then classify sentiment into &lt;strong&gt;Positive, Neutral, or Negative&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_text&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;fetch posts from Postgres, analyze sentiment from posts, and save results back to a different psql table.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Impoting these modules inside this task helps with run time
&lt;/span&gt;        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vaderSentiment.vaderSentiment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentimentIntensityAnalyzer&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;utils.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_sentiment&lt;/span&gt;

        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentimentIntensityAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Load from DB
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit_posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Clean text off the N/A tags
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Apply sentiment
&lt;/span&gt;            &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;polarity_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_sentiment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Save results
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit_sentiment_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error analyzing or saving data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
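
&lt;p&gt;The &lt;code&gt;get_sentiment&lt;/code&gt; helper referenced above maps the &lt;code&gt;compound&lt;/code&gt; score to a label. Here is a minimal sketch, assuming the conventional VADER cutoffs of ±0.05:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_sentiment(compound: float) -&gt; str:
    # conventional VADER thresholds: &gt;= 0.05 positive, &lt;= -0.05 negative
    if compound &gt;= 0.05:
        return 'Positive'
    elif compound &lt;= -0.05:
        return 'Negative'
    return 'Neutral'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;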



&lt;h3&gt;
  
  
  Step 3: Load into PostgreSQL
&lt;/h3&gt;

&lt;p&gt;The raw posts from PRAW are stored in a &lt;code&gt;reddit_posts&lt;/code&gt; table, and the processed results in a &lt;code&gt;reddit_sentiment_analysis&lt;/code&gt; table. This separation of concerns follows a medallion architecture: a &lt;strong&gt;bronze layer&lt;/strong&gt; holds raw data without any transformations (preventing data loss), a &lt;strong&gt;silver layer&lt;/strong&gt; holds semi-structured data, and a &lt;strong&gt;gold layer&lt;/strong&gt; serves as the "single source of truth" for decision-making, ready for dashboards in Grafana.&lt;/p&gt;
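
&lt;p&gt;As a rough sketch of how the layers map onto Postgres (the &lt;code&gt;reddit&lt;/code&gt; schema name comes from the code above; the layer-to-table mapping is my summary of the setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sqlalchemy import create_engine, text

engine = create_engine(DB_URL)  # DB_URL as used in the DAG tasks
with engine.begin() as conn:
    # one schema holds both tables used in this project
    conn.execute(text('CREATE SCHEMA IF NOT EXISTS reddit'))

# bronze: reddit.reddit_posts               -- raw PRAW output, no transformations
# gold:   reddit.reddit_sentiment_analysis  -- scored data, ready for Grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;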

&lt;h3&gt;
  
  
  Step 4: Orchestration with Airflow
&lt;/h3&gt;

&lt;p&gt;Two DAGs coordinate the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;reddit_dag.py&lt;/code&gt; → Extract &amp;amp; store raw posts.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;analyze_dag.py&lt;/code&gt; → Waits for the first DAG via Airflow's &lt;code&gt;ExternalTaskSensor&lt;/code&gt;, which pokes &lt;code&gt;reddit_dag&lt;/code&gt; until it completes, then runs sentiment analysis and sends email notifications.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read more about &lt;code&gt;ExternalTaskSensor&lt;/code&gt; in the &lt;a href="https://airflow.apache.org/docs/apache-airflow/2.1.2/_api/airflow/sensors/external_task/index.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
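
&lt;p&gt;For illustration, here is a minimal sketch of how the sensor can be wired into &lt;code&gt;analyze_dag&lt;/code&gt; (the task id and timings are assumptions based on the description above, not the project's exact settings):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from airflow.sensors.external_task import ExternalTaskSensor

# pokes reddit_dag until its run completes, then lets analyze_dag proceed
wait_for_reddit = ExternalTaskSensor(
    task_id='wait_for_reddit_dag',
    external_dag_id='reddit_dag',
    external_task_id=None,  # None waits for the whole DAG run
    poke_interval=60,       # seconds between pokes
    timeout=60 * 60,        # fail after one hour of waiting
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the sensor matches on identical logical dates by default, so both DAGs need aligned schedules (or an &lt;code&gt;execution_delta&lt;/code&gt;).&lt;/p&gt;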

&lt;h3&gt;
  
  
  Step 5: Visualize in Grafana
&lt;/h3&gt;

&lt;p&gt;Grafana queries Postgres directly and visualizes metrics like sentiment trends, subreddit comparisons, and post volume. It also visualizes Airflow metrics collected via StatsD and Prometheus.&lt;/p&gt;

&lt;p&gt;StatsD is integrated into this project as a metrics collector for Airflow. Airflow emits metrics such as &lt;code&gt;dag_processing.processes&lt;/code&gt; and &lt;code&gt;executor.open_slots&lt;/code&gt;, which capture scheduler and worker health, DAG processing times, and task execution details. These metrics are forwarded to monitoring backends and visualized in Grafana dashboards for observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1b85dkbfa41hpl55agy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1b85dkbfa41hpl55agy.jpg" alt="Airflow x StatsD" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is used as the metrics storage and querying system. The Prometheus service is configured in the &lt;code&gt;docker-compose.yml&lt;/code&gt; and &lt;code&gt;prometheus.yml&lt;/code&gt; files; it scrapes the metrics that Airflow exposes at the &lt;code&gt;/metrics&lt;/code&gt; endpoint via a Prometheus exporter. Prometheus then makes these metrics queryable using PromQL and visualizable in Grafana.&lt;/p&gt;
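
&lt;p&gt;For reference, a minimal scrape job in &lt;code&gt;prometheus.yml&lt;/code&gt; could look like the following (the exporter host and port are assumptions that depend on the service names in your &lt;code&gt;docker-compose.yml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
  - job_name: 'airflow'
    scrape_interval: 15s
    static_configs:
      # assumed service name/port of the StatsD exporter container
      - targets: ['statsd-exporter:9102']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;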

&lt;h3&gt;
  
  
  To add Prometheus as a data source on Grafana and visualize metrics:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Head over to &lt;code&gt;localhost:3000&lt;/code&gt; to access the Grafana service
&lt;/h4&gt;

&lt;p&gt;Use &lt;code&gt;admin&lt;/code&gt; as both the username and password; you will then be prompted to change your password for security reasons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkou0shxm1rwgvvbb6dlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkou0shxm1rwgvvbb6dlf.png" alt="login page" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Adding Prometheus as a data source
&lt;/h4&gt;

&lt;p&gt;On the left side, click on 'Add Data Source' and select Prometheus as the data source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyay38baevy4tgqx6gr1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyay38baevy4tgqx6gr1z.png" alt="datasource page" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Import a custom dashboard
&lt;/h4&gt;

&lt;p&gt;To use a custom dashboard, click on the &lt;strong&gt;Import Dashboard&lt;/strong&gt; option and paste the contents of &lt;code&gt;dashboard.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad1ctsoxewjb6udw9mpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad1ctsoxewjb6udw9mpg.png" alt="Import dashboard" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the metrics dashboard showing the scheduler heartbeat rate, DAGs loaded, DAG processing duration, queued tasks, and the time taken by the executor to run tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09fmighwjhdqk1ldzu9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09fmighwjhdqk1ldzu9i.png" alt="Metrics dashboard" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights generated from this project
&lt;/h2&gt;

&lt;p&gt;The dashboard below visualizes the data generated from this project based on sentiment scores. Here is the &lt;a href="https://deecodes.grafana.net/public-dashboards/8fe4cbfc00dd48e08038c5ae040c8dfb" rel="noopener noreferrer"&gt;link&lt;/a&gt; to this dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp9pyw9o5q16saznvxb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp9pyw9o5q16saznvxb6.png" alt="findings" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6syjw1tau3cbsfjkelh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6syjw1tau3cbsfjkelh.png" alt="Findings 2" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;66% of the posts show positive sentiment, while 23% show negative sentiment and 7% show neutral sentiment.&lt;/li&gt;
&lt;li&gt;r/Nairobi shows more positive-sentiment posts than r/Kenya, along with a higher number of posts from users.&lt;/li&gt;
&lt;li&gt;r/Nairobi has more posts than r/Kenya, with 659 posts compared to r/Kenya's 616.&lt;/li&gt;
&lt;li&gt;The highest sentiment was recorded on 15 September 2025 (0.483), and the lowest on 12 September 2025 (0.213).&lt;/li&gt;
&lt;li&gt;Upvote score does not correlate with sentiment score: a low upvote score does not imply low sentiment, and vice versa.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project highlights how well-designed data pipelines can unlock valuable insights from unstructured data sources like Reddit, while following best practices in data engineering and analytics.&lt;/p&gt;

&lt;p&gt;Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at &lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt; or visit any of my social media platforms linked on my GitHub page.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>data</category>
      <category>programming</category>
    </item>
    <item>
      <title>Docker for Data Engineers: The Complete Beginner’s Guide</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Wed, 03 Sep 2025 11:33:43 +0000</pubDate>
      <link>https://forem.com/dkkinyua/docker-for-data-engineers-the-complete-beginners-guide-4ik0</link>
      <guid>https://forem.com/dkkinyua/docker-for-data-engineers-the-complete-beginners-guide-4ik0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine being able to package your entire data pipeline into a neat, portable box that runs anywhere without the dreaded &lt;em&gt;“but it works on my machine”&lt;/em&gt; excuse. That’s the magic of Docker.&lt;/p&gt;

&lt;p&gt;For data engineers, whose workflows often span multiple environments, including local machines, cloud servers, and clusters, Docker provides consistency, scalability, and speed. But before we dive in, let’s get an idea of what Docker is and its history.&lt;/p&gt;

&lt;p&gt;Docker is a virtualization tool first released in 2013 by Solomon Hykes at dotCloud, a Platform as a Service (PaaS) company. It quickly gained popularity for making containerization mainstream. Instead of running heavy, resource-consuming virtual machines, Docker introduced lightweight, portable containers. Fast forward to today, and Docker has become a cornerstone of modern data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functions of Docker in Data Engineering
&lt;/h2&gt;

&lt;p&gt;Below are the functions that make Docker indispensable to data engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Portability – Package your pipelines and dependencies into a single container that runs anywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency – No more environment conflicts between dev, staging, and production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Isolation – Each container runs independently, so one failing service won’t break the rest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficiency – Containers use fewer resources compared to virtual machines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability – Run multiple containers simultaneously and scale your pipelines with ease.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Docker vs Virtual Machines (VMs)
&lt;/h2&gt;

&lt;p&gt;The table below shows the main differences between a Virtual Machine and Docker.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Docker (Containers)&lt;/th&gt;
&lt;th&gt;Virtual Machines (VMs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Startup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Minutes, as it has to boot a full OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low. Shares host OS kernel&lt;/td&gt;
&lt;td&gt;High. Each VM runs its own OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Process-level isolation&lt;/td&gt;
&lt;td&gt;Full OS-level isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs anywhere with Docker installed&lt;/td&gt;
&lt;td&gt;Limited, requires hypervisor support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High, lightweight and fast&lt;/td&gt;
&lt;td&gt;Lower, more overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natively runs only on Linux distros e.g. Ubuntu, Arch Linux; Windows and macOS need Docker Desktop&lt;/td&gt;
&lt;td&gt;Supports multiple Operating Systems (OS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Virtualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtualizes the OS Application layer&lt;/td&gt;
&lt;td&gt;Virtualizes the OS Application and Kernel layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;QUICK NOTE&lt;/strong&gt;: The kernel allows for communication between the machine/hardware and the application layer; see the Traditional OS architecture chart below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdr2z136na4qujrrhlsy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdr2z136na4qujrrhlsy.jpg" alt="Traditional OS Architecture" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Compatibility, Its Limitations, and the Solution
&lt;/h2&gt;

&lt;p&gt;Initially, Docker was designed to run on Linux machines, hence it does not run natively on Windows and macOS. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd9vhnszmm874kgvdds6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd9vhnszmm874kgvdds6.jpg" alt="Incompatibility Architecture diagram" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To sort this issue out, Docker later introduced Docker Desktop, which adds a hypervisor layer (its own lightweight Linux OS kernel) to support Windows and Mac machines, enabling them to build and run containers as if they were on Linux. It bridges the gap seamlessly for developers and data engineers. On Windows, this is powered by WSL2 (Windows Subsystem for Linux 2) or Hyper-V; on macOS, Docker Desktop uses a LinuxKit VM. &lt;/p&gt;

&lt;p&gt;The chart below shows how the hypervisor layer helped Windows and Mac machines run Docker smoothly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgz88h62x2f12mkm3oz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgz88h62x2f12mkm3oz.jpg" alt="Compatibility diagram" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Images and Docker Containers
&lt;/h2&gt;

&lt;p&gt;A Docker image is a read-only template containing an application and all its necessary components, including the application code, libraries, dependencies, and configuration files. It serves as a blueprint for creating Docker containers, which are isolated, runnable instances of the application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8rddwwyvuhiyc6lmzx6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8rddwwyvuhiyc6lmzx6.jpg" alt="Docker image" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Docker container is a lightweight, isolated, and executable software package that bundles an application and all its dependencies, including the runtime, libraries, and configuration files, into a single unit. It is a running instance of a Docker image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F522ibx3c7a8l80wblb19.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F522ibx3c7a8l80wblb19.jpg" alt="Docker containers" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker and Docker Compose
&lt;/h2&gt;

&lt;p&gt;Running one container is cool, but data engineers usually need multiple services for their pipelines to run effectively e.g., PostgreSQL + Airflow + Spark. That’s where Docker Compose comes in handy.&lt;/p&gt;

&lt;p&gt;With a simple &lt;code&gt;docker-compose.yml&lt;/code&gt; file, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Define multiple services e.g., &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;spark&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure networks and volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start everything with a single command, &lt;code&gt;docker compose up&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of a &lt;code&gt;docker-compose.yml&lt;/code&gt; file with defined services &lt;code&gt;producer&lt;/code&gt;, &lt;code&gt;trainer&lt;/code&gt;, &lt;code&gt;fraud_consumer&lt;/code&gt; and &lt;code&gt;transaction_consumer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TOPIC=${TOPIC}&lt;/span&gt;

    &lt;span class="na"&gt;transaction_consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/transaction_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;

    &lt;span class="na"&gt;fraud_consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/fraud_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;

    &lt;span class="na"&gt;trainer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/model.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Public and Private Docker Registries
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Docker Registry&lt;/strong&gt; is where images are stored and shared.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public Registry&lt;/strong&gt;: &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; is the most popular public Docker registry in the world. You can pull official images like &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, or &lt;code&gt;spark&lt;/code&gt; using &lt;code&gt;docker pull IMAGE_NAME:TAG&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  docker pull postgres:13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Private Registry&lt;/strong&gt;: Companies often host private registries for security and control. Examples: AWS Elastic Container Registry (ECR), Google Container Registry (GCR), or self-hosted Harbor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a data engineer, you might pull public images from Docker Hub but pull/push your team’s custom images to a private registry.&lt;/p&gt;
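
&lt;p&gt;As a sketch, publishing a custom image to a private registry generally looks like this (the registry address below is a hypothetical AWS ECR URL):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# tag the local image with the registry address, then push it
docker tag my-python-app 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:v1
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;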

&lt;h2&gt;
  
  
  Port Binding
&lt;/h2&gt;

&lt;p&gt;Containers run in isolation. To make a service inside a container accessible from your machine, you bind container ports to host ports.&lt;/p&gt;

&lt;p&gt;For example, PostgreSQL runs on port 5432. To access it from your laptop, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="nt"&gt;--name&lt;/span&gt; my_postgres postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first &lt;code&gt;5432&lt;/code&gt; is the host port (your machine).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second &lt;code&gt;5432&lt;/code&gt; is the container port (inside the container).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, you can connect to &lt;code&gt;localhost:5432&lt;/code&gt; on your machine, and Docker routes traffic into the container.&lt;/p&gt;
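
&lt;p&gt;As a quick sanity check, here is a minimal sketch that connects from the host machine (assuming &lt;code&gt;psycopg2-binary&lt;/code&gt; is installed and the container above is running):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

# localhost:5432 is the host side of the -p 5432:5432 binding
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    user='postgres',    # default user in the official image
    password='secret',  # matches POSTGRES_PASSWORD in the run command
    dbname='postgres',
)
with conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone()[0])
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;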

&lt;h2&gt;
  
  
  Building custom images using Dockerfile
&lt;/h2&gt;

&lt;p&gt;Sometimes the official images aren’t enough; you’ll need to build your own image. That’s where a Dockerfile comes in.&lt;/p&gt;

&lt;p&gt;The example shows a custom Python image with Pandas installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# start from official Python image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10&lt;/span&gt;

&lt;span class="c"&gt;# set working directory&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# copy project files&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app&lt;/span&gt;

&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas

&lt;span class="c"&gt;# run the script&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "main.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, to build and run our Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; my-python-app &lt;span class="nb"&gt;.&lt;/span&gt;

docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; pandas-prep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To learn more about Docker commands, visit this &lt;a href="https://docs.docker.com/get-started/docker_cheatsheet.pdf" rel="noopener noreferrer"&gt;cheat sheet&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mini Project: PostgreSQL and pgAdmin4 User Interface with Docker and Docker Compose
&lt;/h2&gt;

&lt;p&gt;Let's practice what we've learnt by building a simple Docker project. We’ll run a &lt;strong&gt;PostgreSQL database&lt;/strong&gt; and &lt;strong&gt;pgAdmin&lt;/strong&gt;, a web user interface that allows data engineers to access PostgreSQL from the web.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file:
&lt;/h3&gt;

&lt;p&gt;Let's define our services in our &lt;code&gt;docker-compose.yml&lt;/code&gt; file. Since we are not building a custom Dockerfile in this mini project, we can pull images from Docker Hub using the &lt;code&gt;image&lt;/code&gt; parameter inside the &lt;code&gt;postgres&lt;/code&gt; and &lt;code&gt;pgadmin&lt;/code&gt; services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo_db&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pg_data:/var/lib/postgresql/data&lt;/span&gt;

  &lt;span class="na"&gt;pgadmin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dpage/pgadmin4&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_EMAIL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin@admin.com&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5000:80"&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pg_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Always place your environment variables inside a &lt;code&gt;.env&lt;/code&gt; file to avoid leaking sensitive data. For this mini project the hard-coded credentials are fine, but do use a &lt;code&gt;.env&lt;/code&gt; file in future projects. To reference an environment variable inside a &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
        &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres_db&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_USER=${POSTGRES_USER}&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD=${POSTGRES_PASSWORD}&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_DB=${POSTGRES_DB}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
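
&lt;p&gt;The matching &lt;code&gt;.env&lt;/code&gt; file sits next to &lt;code&gt;docker-compose.yml&lt;/code&gt; and is picked up automatically by Docker Compose (the values below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POSTGRES_USER=admin
POSTGRES_PASSWORD=secret
POSTGRES_DB=demo_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;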



&lt;h3&gt;
  
  
  b. Start the services
&lt;/h3&gt;

&lt;p&gt;To start our services, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="c"&gt;# -d runs the services in the background&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Postgres will now be available on &lt;code&gt;localhost:5432&lt;/code&gt; and pgAdmin UI will be available on &lt;code&gt;localhost:5000&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Access pgAdmin
&lt;/h3&gt;

&lt;p&gt;You can now access your pgAdmin instance on &lt;code&gt;localhost:5000&lt;/code&gt;.&lt;br&gt;
Let's log in with the credentials we specified in our &lt;code&gt;docker-compose.yml&lt;/code&gt; file, which are &lt;code&gt;PGADMIN_DEFAULT_EMAIL&lt;/code&gt; and &lt;code&gt;PGADMIN_DEFAULT_PASSWORD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffa2dtk8lomuinrm5w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffa2dtk8lomuinrm5w6.png" alt="pgAdmin dashboard" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now also add a new server in pgAdmin with the following credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host: &lt;code&gt;postgres&lt;/code&gt;, the service name from Docker Compose&lt;/li&gt;
&lt;li&gt;Username: &lt;code&gt;admin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Password: &lt;code&gt;secret&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3b59iixit8yhabe7u6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3b59iixit8yhabe7u6t.png" alt="psql database" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You now have a working PostgreSQL database with a nice web UI, all running inside Docker containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For data engineers, Docker is like a Swiss Army knife: it simplifies workflows, ensures consistency, and allows rapid experimentation without breaking your system.&lt;/p&gt;

&lt;p&gt;Whether you’re spinning up a PostgreSQL database, deploying Airflow pipelines, or running Spark clusters, Docker makes it smooth and repeatable. So next time you hear &lt;em&gt;“it worked on my machine”&lt;/em&gt;, you’ll know Docker could’ve saved the day. Where did you first use Docker, and how has it helped you containerize your applications/pipelines? Please share in the comment section!&lt;/p&gt;

&lt;p&gt;Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at &lt;em&gt;&lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;&lt;/em&gt; or visit any of my social media platforms linked on my &lt;a href="https://github.com/dkkinyua" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; page.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building a Fraud Detection Pipeline using Python, PostgreSQL, Apache Kafka, PySpark, Grafana and Scikit-learn</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 18 Aug 2025 12:49:14 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-fraud-detection-pipeline-using-python-postgresql-apache-kafka-pyspark-grafana-and-10d5</link>
      <guid>https://forem.com/dkkinyua/building-a-fraud-detection-pipeline-using-python-postgresql-apache-kafka-pyspark-grafana-and-10d5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Fraud doesn’t happen once in a while; it happens every second. Every card swipe, every online purchase, every wire transfer has a chance of being fraudulent. And by the time fraud is detected in batch reports, the damage is usually done.&lt;/p&gt;

&lt;p&gt;I wanted to explore a question: what if fraud could be detected as it happens? Could a pipeline be built that not only processes financial transactions in real time, but also applies machine learning to flag suspicious activity on the fly?&lt;/p&gt;

&lt;p&gt;In this project, I built a fraud detection system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Produces synthetic transaction data using Faker. Read more &lt;a href="https://mailchimp.com/resources/what-is-synthetic-data/" rel="noopener noreferrer"&gt;here&lt;/a&gt; about why synthetic data is often preferred over real data for model training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trains an Isolation Forest machine learning model on transactional data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses Streamlit to visualize model performance and provide explainability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streams transactions into PostgreSQL, which integrates with Grafana for real-time dashboards and monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is fully containerized with Docker for deployment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;The flowchart below shows the architecture used in this project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jv3t0b4r0nr9k9tk26v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jv3t0b4r0nr9k9tk26v.jpg" alt="Project Architecture" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kafka Producer
&lt;/h3&gt;

&lt;p&gt;The Kafka producer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses Faker to generate synthetic users and transactions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Faker&lt;/span&gt;

&lt;span class="n"&gt;faker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;email&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how data is structured in our Kafka topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c5223ee0-c3b4-4149-ac2c-93daaedb16a7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8026c2d5-aaf6-4197-9242-0ecf633853f2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-08-06T06:12:33.763770"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Marie"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Walton"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Malaysia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Malaysia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;14129.53&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"device"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tablet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_fraud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simulates both legitimate and fraudulent events: fraud rules include high transaction amounts, location mismatches, and unusual devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streams messages to the Kafka topic (&lt;code&gt;transactions-data&lt;/code&gt;) with keys as hostnames and values as JSON payloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configured with &lt;code&gt;KafkaProducer&lt;/code&gt; and SASL/SCRAM authentication for secure cloud Kafka hosted on Redpanda, as sketched below.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
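
&lt;p&gt;The full producer isn't reproduced in the post, but the sketch below shows the general shape of the rule-based labelling and the SASL/SCRAM producer configuration using &lt;code&gt;kafka-python&lt;/code&gt;. The broker address, credentials, and thresholds are placeholders, not the repo's actual values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch: rule-based fraud labelling plus a SASL/SCRAM-authenticated
# producer. Broker, credentials and thresholds below are placeholders.
import json
import random
import socket

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="YOUR_REDPANDA_BROKER:9092",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-256",
    sasl_plain_username="YOUR_USERNAME",
    sasl_plain_password="YOUR_PASSWORD",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def label_transaction(tx, home_location):
    # Mirror the rules above: high amounts, location mismatches and
    # unusual devices are treated as fraud signals.
    high_amount = tx["amount"] &amp;gt; 10000
    location_mismatch = tx["transaction_location"] != home_location
    unusual_device = tx["device"] not in ("Mobile", "Desktop", "Tablet")
    return int(high_amount or location_mismatch or unusual_device)

tx = {
    "amount": round(random.uniform(5, 15000), 2),
    "transaction_location": random.choice(["Malaysia", "Kenya", "Brazil"]),
    "device": random.choice(["Mobile", "Desktop", "Tablet", "Smartwatch"]),
}
tx["is_fraud"] = label_transaction(tx, home_location="Malaysia")

# Keys are hostnames, values are JSON payloads like the one shown above.
producer.send("transactions-data", key=socket.gethostname(), value=tx)
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;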

&lt;h3&gt;
  
  
  2. Kafka Consumers
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a. &lt;code&gt;transaction_consumer&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Built on Spark Structured Streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reads data continuously via &lt;code&gt;readStream()&lt;/code&gt; from the Kafka topic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Defines a schema for the transaction JSON payload (sketched after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Transforms raw Kafka records into a structured DataFrame with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;transaction_id&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;device&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Writes all non-fraud transactions via &lt;code&gt;writeStream()&lt;/code&gt; into the database table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shop.transactions&lt;/code&gt;, used for downstream analytics and model training.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
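
&lt;p&gt;&lt;code&gt;transform_data()&lt;/code&gt; itself isn't shown in the post, so here is a minimal sketch of how a Structured Streaming job typically parses this topic. The schema mirrors the JSON payload shown earlier; the broker address is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of the read-and-parse step. The broker address is a
# placeholder; the schema follows the payload structure shown earlier.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructType

spark = SparkSession.builder.appName("TransactionConsumer").getOrCreate()

schema = (
    StructType()
    .add("user_id", StringType())
    .add("transaction_id", StringType())
    .add("timestamp", StringType())
    .add("first_name", StringType())
    .add("last_name", StringType())
    .add("location", StringType())
    .add("transaction_location", StringType())
    .add("amount", DoubleType())
    .add("device", StringType())
    .add("is_fraud", IntegerType())
)

def transform_data():
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "YOUR_BROKER:9092")
        .option("subscribe", "transactions-data")
        .option("startingOffsets", "latest")
        .load()
    )
    # Kafka values arrive as bytes: cast to string, parse the JSON,
    # then keep only non-fraud rows for shop.transactions.
    return (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), schema).alias("data"))
        .select("data.*")
        .filter(col("is_fraud") == 0)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;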

&lt;h4&gt;
  
  
  b. &lt;code&gt;fraud_consumer&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Also powered by Spark Structured Streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reads from the same Kafka topic but keeps only fraudulent events (&lt;code&gt;is_fraud&lt;/code&gt; = 1) for counterchecking; see the sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Converts timestamps into SQL timestamp type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Persists flagged frauds into the database table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shop.fraud_transactions&lt;/code&gt;, used for fraud analysis, validation, and dashboards.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
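
&lt;p&gt;Under the same assumptions as the sketch above, the fraud-side filter-and-cast step would look roughly like this (&lt;code&gt;parsed_df&lt;/code&gt; stands for the parsed stream before any fraud filter is applied):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of the fraud_consumer's filter-and-cast step.
from pyspark.sql.functions import col, to_timestamp

def filter_frauds(parsed_df):
    # parsed_df: the parsed stream from the previous sketch,
    # before the is_fraud == 0 filter is applied
    return (
        parsed_df.filter(col("is_fraud") == 1)
        .withColumn("timestamp", to_timestamp(col("timestamp")))
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;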

&lt;p&gt;For &lt;code&gt;transaction_consumer&lt;/code&gt;, the code snippet below shows how data is written into Postgres using &lt;code&gt;writeStream()&lt;/code&gt; from PySpark's Structured Streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
transform_data() reads streaming data from the topic and transforms it.
write_as_batch() streams data in micro-batches suitable for writing to Postgres.

&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_transactions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/fraud_detection/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;foreachBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_as_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error streaming data to shop.transactions: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
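
&lt;p&gt;&lt;code&gt;write_as_batch()&lt;/code&gt; is referenced above but not shown. A common way to implement such a micro-batch writer is Spark's JDBC sink; in the sketch below, the connection URL and credentials are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of write_as_batch(): each micro-batch is appended to
# Postgres via Spark's JDBC sink. URL and credentials are placeholders.
def write_as_batch(batch_df, batch_id):
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/YOUR_DB")
        .option("dbtable", "shop.transactions")
        .option("user", "YOUR_USER")
        .option("password", "YOUR_PASSWORD")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;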



&lt;h3&gt;
  
  
  3. Database and visualization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL acts as the sink for both consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Grafana connects to PostgreSQL for real-time analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transaction trends&lt;/li&gt;
&lt;li&gt;Fraud vs non-fraud comparisons&lt;/li&gt;
&lt;li&gt;Top 5 fraudulent transactions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;A Streamlit application provides ML model performance visualization.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Below are snapshots of the Grafana dashboard and Streamlit visualization. To access the dashboard, please follow this &lt;a href="https://deecodes.grafana.net/public-dashboards/ac5cf512ae80435c8bdb1adad4d1a27c" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk3ea5t3r5wcg3yux21i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk3ea5t3r5wcg3yux21i.png" alt="Grafana dashboard" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkfc3c8tlxmvrym8xi44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkfc3c8tlxmvrym8xi44.png" alt="Streamlit" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8svmoqijw7odvltjaej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8svmoqijw7odvltjaej.png" alt="Streamlit" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Machine Learning, Teaching the System to Spot the Odd Ones Out
&lt;/h3&gt;

&lt;p&gt;Fraudulent behavior often hides in subtle patterns. To detect anomalies, an Isolation Forest model is trained offline on features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transaction amount&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Device frequency (how common a device is across transactions)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Location change (user’s home vs transaction location)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trained model is saved to &lt;code&gt;jobs/isolation_forest.pkl&lt;/code&gt;, and fraud predictions are visualized in Streamlit, providing explainability alongside Grafana’s real-time metrics.&lt;/p&gt;

&lt;p&gt;The code snippet below shows how the model is trained offline by reading data from a PostgreSQL table and saving the fitted model to a &lt;code&gt;.pkl&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# imports 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;device_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction_location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
        &lt;span class="n"&gt;rare_devices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;n_estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;contamination&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="c1"&gt;# expecting that 10% is fraudulent 
&lt;/span&gt;
        &lt;span class="n"&gt;iso_forest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IsolationForest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contamination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contamination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;iso_forest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs/isolation_forest.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model already exists at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Overwriting...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iso_forest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Job loaded into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error training model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
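
&lt;p&gt;Once the model is saved, scoring new transactions comes down to loading the &lt;code&gt;.pkl&lt;/code&gt; file and mapping scikit-learn's outputs back to fraud flags: &lt;code&gt;IsolationForest.predict()&lt;/code&gt; returns -1 for anomalies and 1 for inliers. A minimal sketch, assuming a &lt;code&gt;features_df&lt;/code&gt; built with the same three columns as in &lt;code&gt;train_model()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged scoring sketch. features_df is assumed to have the same
# columns as in train_model(): amount, device_freq, location_change.
import joblib

def score_transactions(features_df):
    model = joblib.load("jobs/isolation_forest.pkl")
    preds = model.predict(features_df)  # -1 = anomaly, 1 = inlier
    return (preds == -1).astype(int)    # 1 = predicted fraud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;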



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Fraud detection is more than a data problem; it’s a streaming, scaling, real-time challenge.&lt;/p&gt;

&lt;p&gt;This project showed how Kafka + Spark can handle the firehose of financial events, while PostgreSQL + Grafana turn raw data into live insights. On top of that, machine learning with Isolation Forest gives the system predictive power to flag anomalies before they cause damage.&lt;/p&gt;

&lt;p&gt;What excites me most is how modular this architecture is. Want to replace Isolation Forest with a deep learning model? Easy. Want to scale Kafka across clusters? Done. Want to plug Grafana into Prometheus for infrastructure monitoring? Straightforward.&lt;/p&gt;

&lt;p&gt;With open-source tools, fraud detection at scale is no longer locked away in banks’ back rooms; anyone can build it, understand it, and improve it.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/FraudDetectionSystem" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re curious, clone the repo, run it locally with Docker, and see fraud detection in action in real time. I’d appreciate your feedback!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Real-Time Data Pipeline using Binance Websocket API, PySpark, Kafka and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 04 Aug 2025 15:04:10 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-real-time-data-pipeline-using-binance-websocket-api-pyspark-kafka-and-grafana-59l4</link>
      <guid>https://forem.com/dkkinyua/building-a-real-time-data-pipeline-using-binance-websocket-api-pyspark-kafka-and-grafana-59l4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Ever wondered how data is consumed in real time via WebSockets? In this project, I built a real-time streaming pipeline that captures live market data from Binance WebSockets, processes and stores it, and visualizes it on a Grafana dashboard that auto-refreshes every 30 seconds. The flow is designed with scalability, containerization, and observability in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyr2oyke5hwzvis00sx7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyr2oyke5hwzvis00sx7.jpg" alt="Workflow diagram" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Producer&lt;/strong&gt;: The producer connects to the Binance WebSocket API i.e., 1-minute Kline and order book streams, and pushes the data to Kafka topics hosted on Confluent Cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Topics&lt;/strong&gt;: Hosted on Confluent Cloud, topics named &lt;em&gt;kline_data&lt;/em&gt; and &lt;em&gt;order_book_data&lt;/em&gt; temporarily store the real-time streaming data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Consumer with PySpark&lt;/strong&gt;: The Spark jobs consume data from the Kafka topics using &lt;strong&gt;PySpark Structured Streaming&lt;/strong&gt;, process it, and write it to a PostgreSQL database using &lt;code&gt;readStream()&lt;/code&gt; and &lt;code&gt;writeStream()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL Database&lt;/strong&gt;: The Postgres database stores the cleaned and structured data for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Grafana Dashboard&lt;/strong&gt;: Connects to the PostgreSQL database as a data source and displays real-time insights with a 30-second auto-refresh interval.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Kafka producer and consumer are also containerized using Docker for efficient real-time data streaming, with the scripts running every minute.&lt;/p&gt;

&lt;p&gt;For the project's code and more information, visit the &lt;a href="https://github.com/dkkinyua/BinanceWebsocketPipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kafka hosted on Confluent Cloud for message streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker for the containerization of all services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Binance WebSockets as the real-time data source&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PySpark Structured Streaming for transformation and ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL for data storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana Cloud for visualization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Why use websockets instead of the Binance REST APIs?
&lt;/h3&gt;

&lt;p&gt;When building trading bots, real-time dashboards, or streaming data pipelines from cryptocurrency exchanges like Binance, developers and engineers often face a key question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should I use the Binance REST API or the WebSocket API?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The differences between the two are highlighted below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;REST API&lt;/th&gt;
&lt;th&gt;WebSocket API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Communication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Client initiates one request per data need using &lt;code&gt;requests&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;A persistent, real-time two-way connection using &lt;code&gt;websocket-client&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher latency (request → wait → response); the client waits for each response from the server&lt;/td&gt;
&lt;td&gt;Ultra-low latency as data is pushed as it changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inefficient for frequent updates&lt;/td&gt;
&lt;td&gt;Highly Efficient as there's a single connection for continuous updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One-time data queries, low-frequency polling&lt;/td&gt;
&lt;td&gt;Real-time market data, price tracking, live feeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless (no ongoing link)&lt;/td&gt;
&lt;td&gt;Stateful (persistent connection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High if polled frequently&lt;/td&gt;
&lt;td&gt;Low, since updates are sent only when needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Freshness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snapshot of current state only&lt;/td&gt;
&lt;td&gt;Stream of live updates as events occur&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
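
&lt;p&gt;To make the contrast concrete, polling the spot REST &lt;code&gt;bookTicker&lt;/code&gt; endpoint looks like the snippet below; every iteration is a full request/response round trip, which is exactly the overhead the WebSocket stream avoids:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# REST-style polling: one full HTTP round trip per snapshot.
import time

import requests

URL = "https://api.binance.com/api/v3/ticker/bookTicker"

while True:
    resp = requests.get(URL, params={"symbol": "BTCUSDT"}, timeout=10)
    print(resp.json())  # a single snapshot, already stale on arrival
    time.sleep(5)       # lower the interval and network load grows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;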

&lt;p&gt;For more details on the WebSocket API, visit this &lt;a href="https://developers.binance.com/docs/derivatives/usds-margined-futures/websocket-api-general-info" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have published another blog where I used the Binance REST APIs to extract, transform, and load data into Postgres and Cassandra databases using Change Data Capture (CDC); you can access it &lt;a href="https://dev.to/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka producer and consumer
&lt;/h3&gt;

&lt;p&gt;Our Kafka producers and consumers are built in Python using the &lt;code&gt;confluent-kafka&lt;/code&gt; package, whose &lt;code&gt;confluent_kafka.Producer&lt;/code&gt; and &lt;code&gt;confluent_kafka.Consumer&lt;/code&gt; clients connect to Confluent Cloud to produce data to and consume data from our Kafka topics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's produce data to a Kafka topic, &lt;strong&gt;order_book_data&lt;/strong&gt;.
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;First, install the &lt;code&gt;confluent-kafka&lt;/code&gt; and other necessary packages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;confluent-kafka websocket-client rel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Import the &lt;code&gt;confluent_kafka.Producer&lt;/code&gt; to create a Producer client.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Producer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebSocketApp&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_BOOTSTRAP_SERVER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.username&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;security.protocol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.mechanisms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To stream data from WebSockets, use the &lt;code&gt;websocket-client&lt;/code&gt; package. Define the helper functions &lt;code&gt;on_open&lt;/code&gt;, &lt;code&gt;on_close&lt;/code&gt;, &lt;code&gt;on_error&lt;/code&gt;, and &lt;code&gt;on_message&lt;/code&gt;. The &lt;code&gt;on_message&lt;/code&gt; function consumes data from the WebSocket and sends the resulting payload to the Kafka topic, as shown below.
For more information on consuming data from WebSockets, visit the &lt;a href="https://pypi.org/project/websocket-client/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# helper functions
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order book websocket open for connections...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_status_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order book connection closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_status_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;There is an error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestbidprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestbidqty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestaskprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestaskqty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eventtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactiontime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_book_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book_data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data sent to order_book_data topic successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error producing data to topic: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
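
&lt;p&gt;If you want positive confirmation instead of the optimistic print above, &lt;code&gt;confluent-kafka&lt;/code&gt; supports per-message delivery callbacks, which &lt;code&gt;producer.poll(0)&lt;/code&gt; services asynchronously. An optional addition to the snippet above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Optional delivery callback: poll(0) services it asynchronously, so the
# print fires only after the broker has acknowledged the message.
def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

# reuses producer and book_data from the on_message snippet above
producer.produce(
    'order_book_data',
    json.dumps(book_data).encode('utf-8'),
    callback=delivery_report,
)
producer.poll(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;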



&lt;ul&gt;
&lt;li&gt;Let's define a function &lt;code&gt;get_data()&lt;/code&gt; which configures &lt;code&gt;WebSocketApp&lt;/code&gt;. The &lt;strong&gt;Registered Event Listener&lt;/strong&gt; &lt;code&gt;rel&lt;/code&gt; package is a cross-platform asynchronous event dispatcher primarily designed for network applications and is used with &lt;code&gt;websocket-client&lt;/code&gt; to handle events as they occur during connection.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@bookTicker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# explained in the README 
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/stream?streams=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebSocketApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_open&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_close&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_forever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reconnect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;NOTE: Using &lt;code&gt;streams = '/'.join([f"{symbol}@bookTicker" for symbol in symbols])&lt;/code&gt; is good practice: it combines all symbols into a single multi-stream request instead of one request per symbol, which the WebSocket API supports and which is far more efficient. The request URL then looks like this: &lt;code&gt;wss://fstream.binance.com/stream?streams=btcusdt@bookTicker/dotusdt@bookTicker/avaxusdt@bookTicker...&lt;/code&gt;. Check the Binance &lt;a href="https://developers.binance.com/docs/derivatives/usds-margined-futures/websocket-api-general-info" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more information.&lt;/p&gt;
&lt;/blockquote&gt;
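
&lt;p&gt;For example, with three symbols the combined stream path expands like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# How the combined stream path is built for multiple symbols.
symbols = ["btcusdt", "ethusdt", "solusdt"]
streams = '/'.join(f"{symbol}@bookTicker" for symbol in symbols)
print(streams)
# btcusdt@bookTicker/ethusdt@bookTicker/solusdt@bookTicker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;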

&lt;ul&gt;
&lt;li&gt;Let's use &lt;code&gt;if __name__ == '__main__'&lt;/code&gt; to run the producer whenever the script is executed directly. &lt;code&gt;producer.flush()&lt;/code&gt; flushes all pending messages to the topic when the connection terminates, preventing data loss.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flushing all pending messages...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is received in our topics as shown in the snapshots below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqrtlp2j92p7ymj3rmpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqrtlp2j92p7ymj3rmpd.png" alt="Kline snapshot" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi736tfpgrp299pwmftcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi736tfpgrp299pwmftcc.png" alt="Order Book Data snapshot" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's consume data from the Kafka topic and stream data to a Postgres database using PySpark's Structured Streaming.
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Install the necessary packages required.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Import necessary modules and functions, and configure Spark for our application using &lt;code&gt;SparkSession&lt;/code&gt; from &lt;code&gt;pyspark.sql&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BooleanType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkPostgresConsumer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local[*]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spark.ui.port&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4041&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.jars.packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql:postgresql:42.7.7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a schema and its data types for our DataFrame. This lets Spark know beforehand which fields and data types the incoming messages contain.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;klineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BooleanType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numTrades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closeTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define our streaming DataFrame using &lt;code&gt;readStream&lt;/code&gt;, and configure the Kafka connection settings.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_TOPIC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startingOffsets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.security.protocol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.mechanism&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.jaas.config&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;org.apache.kafka.common.security.plain.PlainLoginModule required username=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; password=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The data coming from the topic is not in our ideal structure, as the payload we want is inside the &lt;code&gt;value&lt;/code&gt; column. We need to cast it to a string and parse it as JSON.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code 
&lt;/span&gt;    &lt;span class="n"&gt;json_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAST (value AS STRING) as parsed_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parsed_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now we can run our transformations on &lt;code&gt;json_df&lt;/code&gt;, the DataFrame holding the parsed JSON data, using Spark's SQL functions (&lt;code&gt;col&lt;/code&gt;, &lt;code&gt;cast&lt;/code&gt;, &lt;code&gt;alias&lt;/code&gt;) inside our Python code. After running the transformations, return the transformed DataFrame for micro-batching.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numTrades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_trades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;klineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isKlineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closeTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starttime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a helper function &lt;code&gt;write_each_batch()&lt;/code&gt; which writes data into the Postgres database in micro-batches, which is ideal for streaming.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_each_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;websocket.kline_1m_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error writing batch data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a function &lt;code&gt;write_to_db&lt;/code&gt; which streams data into the Postgres database using &lt;code&gt;writeStream&lt;/code&gt; with &lt;code&gt;foreachBatch(write_each_batch)&lt;/code&gt;, appending to the table if it already exists in our database.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/binance_streaming/kline_stream_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d_%H-%M-%S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;foreachBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_each_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into Postgres successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading data to db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is now streaming into Postgres successfully. Run a &lt;code&gt;SELECT * FROM websocket.kline_1m_data&lt;/code&gt; query to check.&lt;/p&gt;
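
&lt;p&gt;If you prefer verifying from Python instead of a SQL client, below is a minimal sketch using pandas and SQLAlchemy. Note that it assumes a standard &lt;code&gt;postgresql://&lt;/code&gt; connection string in a hypothetical &lt;code&gt;PSQL_URI&lt;/code&gt; environment variable, not the JDBC-style &lt;code&gt;DB_URL&lt;/code&gt; used by Spark above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# minimal verification sketch; PSQL_URI is a hypothetical env var holding
# a 'postgresql://user:pass@host:5432/db' string (not the JDBC DB_URL above)
import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.getenv("PSQL_URI"))

# peek at a few rows written by the streaming job
check_df = pd.read_sql("SELECT * FROM websocket.kline_1m_data LIMIT 10", con=engine)
print(check_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;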

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4kmph49qabuq75evl5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4kmph49qabuq75evl5k.png" alt="Database snapshot" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Containerization using Docker and Docker Compose.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt; is an open-source containerization platform that lets you package your application and its dependencies into containers: lightweight, standalone units that can run anywhere, regardless of the host system.&lt;/p&gt;

&lt;p&gt;Think of a container like a "mini virtual machine", but faster and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Compose&lt;/strong&gt; is a tool for defining and running multi-container Docker applications.&lt;/p&gt;

&lt;p&gt;You need to install Docker for this project. Follow this &lt;a href="https://docs.docker.com/engine/install/" rel="noopener noreferrer"&gt;link&lt;/a&gt; to install Docker.&lt;/p&gt;

&lt;p&gt;For this project, different services have been defined into containers to run individually, which are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kline-producer&lt;/code&gt;: Producer for the 1-minute klines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kline-consumer&lt;/code&gt;: Consumer for the 1-minute klines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;book-producer&lt;/code&gt;: Producer for the order book data&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;book-consumer&lt;/code&gt;: Consumer for the order book data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;First, define a &lt;code&gt;Dockerfile&lt;/code&gt; in the root of your project. A Dockerfile contains the instructions Docker needs to build the image your project runs in.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# imports the Python image from docker hub&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10-slim&lt;/span&gt;

&lt;span class="c"&gt;# installs java for pyspark to run smoothly&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;openjdk-17-jre-headless &lt;span class="se"&gt;\
&lt;/span&gt;curl &lt;span class="se"&gt;\
&lt;/span&gt;gnupg &lt;span class="se"&gt;\
&lt;/span&gt;apt-transport-https &lt;span class="se"&gt;\
&lt;/span&gt;ca-certificates &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;apt-get clean &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# sets java environment vars&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="$JAVA_HOME/bin:$PATH"&lt;/span&gt;

&lt;span class="c"&gt;# copies all contents into a folder app/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app/&lt;/span&gt;

&lt;span class="c"&gt;# sets app/ as the current working directory, same as cd in linux&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# installs all python packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file in the project root, which defines our different services.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kline-producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/kline_producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;

    &lt;span class="na"&gt;kline-consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/kline_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;

    &lt;span class="na"&gt;book-producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/book_producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;

    &lt;span class="na"&gt;book-consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/book_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To boot up our containers, run the following command in your terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If you need to stop the containers at any time, press &lt;code&gt;Ctrl + C&lt;/code&gt; in the terminal where Compose is running, or run the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below are some snapshots from Docker Desktop and the terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzw8tgzhejshj75nkcj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzw8tgzhejshj75nkcj8.png" alt="Docker desktop snapshot" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gt4u0r4s0co8b7zixjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gt4u0r4s0co8b7zixjd.png" alt="terminal snapshot" width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization in Grafana
&lt;/h3&gt;

&lt;p&gt;The data is then visualized on a Grafana dashboard with a 30-second auto-refresh interval so the dashboard always displays fresh data.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uk1xu6a1q5ko1hjdcw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uk1xu6a1q5ko1hjdcw3.png" alt="PDF snapshot" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2wz4k13894sg1edguwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2wz4k13894sg1edguwo.png" alt="Live snapshot" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project showcases how to build a real-time, containerized data pipeline using Binance WebSockets, Kafka, PySpark, PostgreSQL, and Grafana. By leveraging WebSockets for low-latency streaming and Docker for seamless deployment, we created a scalable and efficient system for live crypto data processing and visualization.&lt;/p&gt;

&lt;p&gt;Be sure to check out the GitHub repository for the full codebase, and don’t forget to share your feedback or fork the project to make it your own.&lt;/p&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>data</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Using Data Engineering to Track Food Prices and Inflation in Kenya from 2006 to 2025</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Wed, 23 Jul 2025 09:48:24 +0000</pubDate>
      <link>https://forem.com/dkkinyua/using-data-engineering-to-track-food-prices-and-inflation-in-kenya-from-2006-to-2025-47ja</link>
      <guid>https://forem.com/dkkinyua/using-data-engineering-to-track-food-prices-and-inflation-in-kenya-from-2006-to-2025-47ja</guid>
      <description>&lt;p&gt;Rising food prices continue to be a critical issue affecting many households in Kenya. As a data engineer passionate about impactful projects, I built a data pipeline to collect, model, and analyze food price and inflation trends in Kenya, shedding light on various patterns from 2006 to 2024.&lt;/p&gt;

&lt;p&gt;By building this pipeline, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor inflation’s impact on staple food prices in various areas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provide data-backed insights to policy makers, NGOs, and supply chain analysts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve food security planning through trend forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Objectives
&lt;/h2&gt;

&lt;p&gt;To build a scalable data pipeline and star-schema data warehouse that analyzes historical food prices across Kenyan markets and derives insights into inflation trends, particularly for food categories such as vegetables, fruits, meat and eggs, and milk and dairy, among others.&lt;/p&gt;

&lt;p&gt;The GitHub repository for this project can be accessed &lt;a href="https://github.com/dkkinyua/FoodPricesMonitoring" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Source, Food Prices&lt;/strong&gt;: &lt;a href="https://data.humdata.org/dataset/kenya-food-prices" rel="noopener noreferrer"&gt;Humanitarian Data Exchange (HDX)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Source, Inflation&lt;/strong&gt;: &lt;a href="https://api.worldbank.org/v2/country/ke/indicator/FP.CPI.TOTL" rel="noopener noreferrer"&gt;World Bank Indicator API v2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline&lt;/strong&gt;: Python + &lt;code&gt;pandas&lt;/code&gt; for extraction &amp;amp; transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL with dimensional modeling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration and Automation&lt;/strong&gt;: Apache Airflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slideshow presentation&lt;/strong&gt;: MS Powerpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;The flowchart below shows the project architecture flow from start to finish, with the tech stack used in every step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4kfhsty5p1ppektfi1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4kfhsty5p1ppektfi1v.png" alt="Project Architecture Diagram" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Architecture
&lt;/h2&gt;

&lt;p&gt;The database design is a star schema: one central fact table surrounded by dimension tables. The fact table contains the measurements plus the foreign keys that map to each dimension table, while the dimension tables hold the descriptive attributes, i.e. the units of analysis of the data. For more information on different schemas, visit &lt;a href="https://medium.com/@StefanoMeloccaro/data-warehouse-schemas-explained-star-snowflake-galaxy-d2c508d9ac5f" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog.&lt;/p&gt;
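
&lt;p&gt;As a minimal sketch of what this looks like in practice, the fact table references each dimension through a foreign key. The &lt;code&gt;fact_prices&lt;/code&gt; name and exact column set below are illustrative (the real DDL lives in the repository), and the connection reuses the &lt;code&gt;PSQL_URI&lt;/code&gt; environment variable from the loading step later in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# illustrative star-schema DDL; table/column names are a sketch, not the repo's exact schema
import os

from sqlalchemy import create_engine, text

engine = create_engine(os.getenv("PSQL_URI"))

ddl = """
CREATE TABLE IF NOT EXISTS foodprices.dim_market (
    market_id INT PRIMARY KEY,
    province  TEXT,
    county    TEXT,
    latitude  FLOAT,
    longitude FLOAT
);

-- the fact table stores the measurements plus foreign keys into each dimension
CREATE TABLE IF NOT EXISTS foodprices.fact_prices (
    date         DATE,
    price        FLOAT,
    usdprice     FLOAT,
    market_id    INT REFERENCES foodprices.dim_market (market_id),
    commodity_id INT
);
"""

with engine.begin() as conn:
    conn.execute(text(ddl))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;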

&lt;p&gt;The database structure diagram below shows how the schema is set up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiqqnrfiu3otaepvl8j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiqqnrfiu3otaepvl8j9.png" alt="Database Diagram" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline Overview
&lt;/h2&gt;

&lt;p&gt;This project follows a batch data pipeline that extracts food price and inflation data from APIs, transforms it for consistency, and loads it into a PostgreSQL database for analysis and downstream use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Data Extraction.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monthly food price and inflation data is downloaded in CSV format from the relevant APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only relevant fields are retained, such as geodata (latitude, longitude), commodities, dates, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of a code snippet to extract data from the HDX Kenya data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://data.humdata.org/dataset/e0d3fba6-f9a2-45d7-b949-140c455197ff/resource/517ee1bf-2437-4f8c-aa1b-cb9925b9d437/download/wfp_food_prices_ken.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading data from CSV file: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b. Data Transformation
&lt;/h3&gt;

&lt;p&gt;Below are some of the transformations applied to the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clean inconsistent units and datatypes, e.g. kgs, liters, floats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalize market names and commodity labels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fill in missing dates and reformat for time-series alignment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter out years with no data, e.g. 2006–2018 for the vegetables category.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is a code snippet for transforming food data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usdprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usdprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commodity_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commodity_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;admin1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;province&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;admin2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;county&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading dataframe from previous task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  c. Loading data into a PostgreSQL database
&lt;/h3&gt;

&lt;p&gt;After transformation, our data is ready for loading into the database. The dimension tables are loaded first and the fact table last, so that every foreign key in the fact table references an existing dimension row.&lt;/p&gt;

&lt;p&gt;Below is a code snippet for loading market data into the market dimension table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loading_market_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;market_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;province&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;county&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;market_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PSQL_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;market_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dim_market&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;foodprices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into PostgreSQL successfully&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into dim_market table: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading clean dataframe from previous task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
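
&lt;p&gt;For completeness, here is a minimal sketch of how the fact-table load could follow the dimension loads. The table and column names (&lt;code&gt;fact_prices&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;) are illustrative rather than the exact schema; the repository holds the definitive version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def loading_fact_data(clean_df):
    # Illustrative column list; adjust to the actual fact-table schema
    fact_df = clean_df[['market_id', 'commodity_id', 'date', 'price']]
    engine = create_engine(os.getenv("PSQL_URI"))
    try:
        # Dimensions are loaded first, so these foreign keys resolve
        fact_df.to_sql('fact_prices', con=engine, index=False, schema='foodprices', if_exists='append')
        print("Data loaded into fact_prices successfully")
    except Exception as e:
        print(f"Error loading into fact_prices table: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;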



&lt;h3&gt;
  
  
  d. Workflow Automation using Airflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Airflow DAGs are used to schedule the pipeline to run monthly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each task is logged, and failures are handled through retries and error alerts (see the &lt;code&gt;default_args&lt;/code&gt; sketch below).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logs are also tracked to preserve data lineage across the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Airflow is then deployed to production on an Azure Virtual Machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
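
&lt;p&gt;The DAG below references a &lt;code&gt;default_args&lt;/code&gt; dictionary. A minimal sketch of what it can contain is shown here; the retry counts and addresses are illustrative, not the project's exact values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import timedelta

# Illustrative defaults: retry failed tasks and alert by email
default_args = {
    'owner': 'airflow',
    'retries': 2,                         # retry each failed task twice
    'retry_delay': timedelta(minutes=5),  # wait five minutes between retries
    'email_on_failure': True,             # needs SMTP configured for Airflow
    'email': ['you@example.com'],         # hypothetical recipient address
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;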

&lt;p&gt;Below is an example of the DAG used to automate the inflation data pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@monthly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.worldbank.org/v2/country/ke/indicator/FP.CPI.TOTL?format=json&amp;amp;date=2006:2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;inflation_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cpi_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cpi_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;inflation_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error extracting CPI data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inflation_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_inflation_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PSQL_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into inflation_data successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into inflation_data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                Inflation pipeline DAG has run successfully, please check it out
                &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inflation Pipeline DAG Notification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;sender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SENDER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;receipients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RECEIPIENT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;pwd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MIMEText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;receipients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SMTP_SSL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smtp.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;465&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pwd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receipients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email sent successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sending email to receipient error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_inflation_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;email_sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;email_sent&lt;/span&gt;

&lt;span class="n"&gt;inflation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  e. Visualization and analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformed data is analyzed in Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trends such as monthly price inflation, food price differences per county, and the average price change of different food categories over time are tracked (a sample query is sketched below).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
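
&lt;p&gt;Grafana runs SQL against the Postgres database for each panel. As an illustration, here is the kind of aggregation behind a county-level price panel, executed from Python; the &lt;code&gt;fact_prices&lt;/code&gt; table name is the hypothetical one from the load sketch above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.getenv("PSQL_URI"))

# Average commodity price per county, the kind of series a panel plots
query = """
    SELECT m.county, AVG(f.price) AS avg_price
    FROM foodprices.fact_prices f
    JOIN foodprices.dim_market m ON m.market_id = f.market_id
    GROUP BY m.county
"""
df = pd.read_sql(query, con=engine)
print(df.sort_values('avg_price', ascending=False).head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;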

&lt;p&gt;To explore the Grafana dashboard, follow this &lt;a href="https://deecodes.grafana.net/goto/5YR4POQHR?orgId=1" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;p&gt;For most food categories, prices rose sharply from the start of 2022 to the end of 2023. Several factors likely contributed to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Currency depreciation&lt;/strong&gt;: As the inflation chart shows, inflation in Kenya quadrupled between 2020 and 2023, which likely contributed to the spike in food prices around 2022.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drought and other climatic conditions&lt;/strong&gt;: In 2022, Kenya experienced below-average rainfall, particularly in the northeastern regions, which suffered prolonged drought. The year was also marked by significant temperature increases, making it one of the warmest on record. This likely reduced food production and pushed prices up, especially in arid and semi-arid areas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High input costs&lt;/strong&gt;: The Russian-Ukrainian war disrupted key inputs such as fuel, which the agricultural sector depends on to transport food from farms to markets, driving food prices higher.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges faced during building this pipeline
&lt;/h2&gt;

&lt;h4&gt;
  
  
  a. Missing data
&lt;/h4&gt;

&lt;p&gt;Some food categories and commodities were missing data for certain time periods; for example, the vegetables and fruits category had no data from 2006 to 2018.&lt;/p&gt;

&lt;h4&gt;
  
  
  b. Data Normalization
&lt;/h4&gt;

&lt;p&gt;Inconsistent data had to be normalized into consistent units of mass, price, etc.&lt;br&gt;
Some commodities were priced in kilograms and others in 50-kilogram bags, so the data had to be normalized for easier analysis.&lt;/p&gt;
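
&lt;p&gt;A minimal sketch of that normalization, assuming hypothetical &lt;code&gt;unit&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt; columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Convert every commodity to a price per kilogram (names are illustrative)
unit_to_kg = {'KG': 1, '50 KG': 50, '90 KG': 90}

df['kg'] = df['unit'].map(unit_to_kg)
df = df.dropna(subset=['kg'])               # drop units we cannot convert
df['price_per_kg'] = df['price'] / df['kg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;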

&lt;h4&gt;
  
  
  c. Lack of data sources
&lt;/h4&gt;

&lt;p&gt;Data sources were lacking for some Kenyan counties and commodities, so the analysis was limited to the commodities with available data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project was a deep dive into how data engineering can be used for social impact. If you're working on something similar or want to collaborate, feel free to connect!&lt;/p&gt;

&lt;p&gt;Also, like, comment and follow me for more data engineering content.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>A Real-Time Earthquake Monitoring Pipeline with Kafka, MySQL, PostgreSQL, and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 14 Jul 2025 13:37:08 +0000</pubDate>
      <link>https://forem.com/dkkinyua/a-real-time-earthquake-monitoring-pipeline-with-kafka-mysql-postgresql-and-grafana-22cd</link>
      <guid>https://forem.com/dkkinyua/a-real-time-earthquake-monitoring-pipeline-with-kafka-mysql-postgresql-and-grafana-22cd</guid>
      <description>&lt;p&gt;In this project, I designed and built an end-to-end real-time data pipeline that monitors earthquakes from the USGS Earthquake API. The pipeline extracts and loads data into a MySQL database, captures changes via a MySQL Debezium CDC connector, streams it through into a Kafka topic, sinks it into PostgreSQL, and visualizes it in Grafana Cloud. The workflow is automated using Apache Airflow to run hourly.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/EarthquakeMonitoringPipeline" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Workflow Chart
&lt;/h2&gt;

&lt;p&gt;This pipeline ensures that earthquake data is extracted, streamed, stored, and visualized in near real-time, fully automated and orchestrated with Apache Airflow to run hourly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut1iciuhy9hk89g8zwk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut1iciuhy9hk89g8zwk2.png" alt="Project Workflow" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Earthquake Data Extraction from the USGS Earthquake API
&lt;/h3&gt;

&lt;p&gt;The pipeline begins by calling the USGS Earthquake GeoJSON API, which provides up-to-date global earthquake data. The script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulls earthquake events for the past 24 hours&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extracts key fields: magnitude, place, time, alert, coordinates, and more&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Triggers an email alert to Gmail if the quake magnitude ≥ 4.5&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This extraction script is written in Python using &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;smtplib&lt;/code&gt; (for sending emails via SMTP), &lt;code&gt;sqlalchemy&lt;/code&gt;, and &lt;code&gt;pymysql&lt;/code&gt;.&lt;/p&gt;
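
&lt;p&gt;A minimal sketch of the extraction and alert check, assuming the public USGS daily GeoJSON feed; the SMTP call itself is omitted for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"

response = requests.get(URL)
response.raise_for_status()
events = response.json()["features"]

for event in events:
    props = event["properties"]
    mag = props.get("mag")
    if mag is not None and mag &amp;gt;= 4.5:
        # An email alert would be sent here via smtplib
        print(f"ALERT: M{mag} at {props['place']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;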

&lt;h3&gt;
  
  
  2. Loading into MySQL Database
&lt;/h3&gt;

&lt;p&gt;The extracted data is then cleaned and loaded into a MySQL database table called &lt;code&gt;earthquake_data&lt;/code&gt;. This serves as the staging area where new and changed data is logged before flowing downstream; a minimal load sketch follows the schema list below.&lt;/p&gt;

&lt;p&gt;The schema includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event ID&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Place and location coordinates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Magnitude and depth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alert level&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timestamps (time, updated)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;USGS metadata like URL, IDs, and status&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
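
&lt;p&gt;A minimal sketch of the load step, assuming a pandas dataframe &lt;code&gt;df&lt;/code&gt; holding the fields above and a &lt;code&gt;MYSQL_URI&lt;/code&gt; environment variable (both names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from sqlalchemy import create_engine

# pymysql is the driver behind a mysql+pymysql:// connection URI
engine = create_engine(os.getenv("MYSQL_URI"))

try:
    df.to_sql('earthquake_data', con=engine, index=False, if_exists='append')
    print("Earthquake data loaded into MySQL successfully")
except Exception as e:
    print(f"Error loading into earthquake_data: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;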

&lt;p&gt;Below is a snapshot of the data in a MySQL database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo5f8kn4wueuyytg8b4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo5f8kn4wueuyytg8b4r.png" alt="MySQL database" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Change Data Capture with MySQL Debezium Connector
&lt;/h3&gt;

&lt;p&gt;To stream changes in the database in real time, the Debezium MySQL CDC Connector on Confluent Cloud listens to the binlog (binary log) of the MySQL database. It captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;New earthquake events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updates (e.g., revised magnitudes or alert levels)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deletions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes are published to a Kafka topic in &lt;code&gt;JSON_SR&lt;/code&gt; (JSON Schema Registry) format. This is very important as our Postgres Sink connector only accepts input in &lt;code&gt;JSON_SR&lt;/code&gt; and not in &lt;code&gt;JSON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the MySQL CDC connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74xpu62w3d9jdk7f3ble.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74xpu62w3d9jdk7f3ble.png" alt="MySQL connector" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Kafka Streams Earthquake Events to a PostgreSQL Sink
&lt;/h3&gt;

&lt;p&gt;Next, the PostgreSQL Sink Connector subscribes to the same Kafka topic. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reads the &lt;code&gt;after&lt;/code&gt; field from the change event&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deserializes the JSON_SR format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inserts or updates the data into a PostgreSQL table for analytics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;TIP: You must configure the connector to match table naming formats and use primary key fields to support upserts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below is a snapshot of the Postgres Sink Connector and the data in the Postgres sink:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi3ty7ik9ae8zo87upzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi3ty7ik9ae8zo87upzj.png" alt="Postgres Connector" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe21mplshsvogxg4e2xp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe21mplshsvogxg4e2xp2.png" alt="Postgres Sink" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Real-Time Analytics &amp;amp; Visualization in Grafana
&lt;/h3&gt;

&lt;p&gt;The final PostgreSQL database now holds real-time earthquake records, ready to be queried.&lt;/p&gt;

&lt;p&gt;Using Grafana Cloud, several dashboards were created, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Quakes Per Minute – a time series of seismic activity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alerts by Magnitude – categorize quakes into green/yellow/orange/red&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive World Map – plot quakes by longitude/latitude using heat maps or markers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top 5 Quake Locations – based on frequency&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to interact with the dashboard, follow this &lt;a href="https://deecodes.grafana.net/public-dashboards/d71fa80f57fb4d68b98a04197f2c44b6" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the dashboard on Grafana Cloud:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9gyn3b3oihbkl47smy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9gyn3b3oihbkl47smy4.png" alt="Grafana Dash" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Automated Hourly Execution with Apache Airflow
&lt;/h3&gt;

&lt;p&gt;The entire process is scheduled and monitored using Apache Airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A DAG (Directed Acyclic Graph) defines the workflow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Runs every hour to ensure near real-time data ingestion, and sends an email whether a DAG run succeeds or fails&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow provides visibility into task failures, retries, and pipeline health via its web UI.&lt;/p&gt;

&lt;p&gt;Below is a successful DAG run for the pipeline and email alert task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08my62jts06o7m2m8wfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08my62jts06o7m2m8wfa.png" alt="DAG run" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why are the connectors running on Confluent Cloud?
&lt;/h2&gt;

&lt;p&gt;Running our Debezium CDC connector and sink connector on Confluent Cloud offers many advantages over managing connectors manually on self-managed Kafka. The table below summarizes them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Confluent Cloud&lt;/th&gt;
&lt;th&gt;Self-Managed Kafka&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed by Confluent, no need to manage servers&lt;/td&gt;
&lt;td&gt;Requires manual setup, scaling, and maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connector Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy UI for deploying and managing connectors&lt;/td&gt;
&lt;td&gt;Manual deployment and monitoring required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seamless automatic scaling based on workloads&lt;/td&gt;
&lt;td&gt;Must manually configure and monitor scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in enterprise-grade security (encryption, ACLs)&lt;/td&gt;
&lt;td&gt;Responsible for securing infrastructure yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Schema Registry with simple integration&lt;/td&gt;
&lt;td&gt;Needs separate setup and maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rich monitoring dashboards and alerting included&lt;/td&gt;
&lt;td&gt;Need to set up your own monitoring stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Upgrades &amp;amp; Patching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handled automatically by Confluent&lt;/td&gt;
&lt;td&gt;You are responsible for patching/upgrading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Availability &amp;amp; Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SLA-backed uptime with high availability&lt;/td&gt;
&lt;td&gt;Need to architect HA setups manually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faster Time to Market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quick setup, lets you focus on building data pipelines&lt;/td&gt;
&lt;td&gt;Slower due to infrastructure and connector setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access to Confluent support and managed services&lt;/td&gt;
&lt;td&gt;Community support or paid enterprise support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project enhanced my hands-on skills in API data extraction, real-time streaming with Kafka, and Change Data Capture (CDC) using Debezium. I gained practical experience building end-to-end data pipelines, automating workflows with Airflow, and creating real-time dashboards in Grafana. Overall, it was a great opportunity to apply data engineering concepts to a real-world monitoring use case.&lt;/p&gt;

&lt;p&gt;If you have any questions, don't hesitate to reach out. Please leave a like, comment and follow for more informative data engineering content!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>data</category>
    </item>
    <item>
      <title>Building a Real-Time Crypto Pipeline with Binance APIs, PostgreSQL, Debezium, Kafka, Spark &amp; Cassandra</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 04 Jul 2025 18:57:11 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl</link>
      <guid>https://forem.com/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog post, I’ll walk you through a &lt;strong&gt;real-time crypto data pipeline&lt;/strong&gt; I built using a powerful stack:&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;PostgreSQL&lt;/strong&gt; for structured storage, used as the data source&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Debezium + Confluent Kafka&lt;/strong&gt; for Change Data Capture (CDC)&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Apache Spark&lt;/strong&gt; for stream processing&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Apache Cassandra&lt;/strong&gt; for fast NoSQL storage, hosted on an Azure Ubuntu virtual machine and used as the data sink.&lt;/p&gt;

&lt;p&gt;This architecture enables near real-time data extraction and transformation, ideal for live dashboards, analytics, distributed systems, and downstream uses such as building machine learning models.&lt;/p&gt;

&lt;p&gt;This project is divided into four phases:&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 1.
&lt;/h3&gt;

&lt;p&gt;Extracting data from the Binance API endpoints, transforming and loading the data into a PostgreSQL database using Python scripts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 2.
&lt;/h3&gt;

&lt;p&gt;Using a Debezium source connector to capture changes from the PostgreSQL database into a Kafka topic that stores the change logs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 3.
&lt;/h3&gt;

&lt;p&gt;Using a Spark Streaming job as a sink to send the replicated data from the Kafka topics into the Cassandra keyspace &lt;code&gt;binance_test&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 4.
&lt;/h3&gt;

&lt;p&gt;Visualizing the data in Apache Cassandra on a Grafana dashboard using a PostgreSQL bridge to pull data from Cassandra.&lt;/p&gt;

&lt;p&gt;For more technical details on the project, check the linked &lt;a href="https://github.com/dkkinyua/BinanceCryptoPipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Change Data Capture (CDC)?
&lt;/h2&gt;

&lt;p&gt;Change Data Capture (CDC) is a technique used in data pipelines to identify and track changes made to data in a database. It captures these changes, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt;, and makes them available for use in other systems, often in real-time or near real-time. CDC helps improve data efficiency and consistency across systems by only replicating the changed data, rather than the entire dataset. &lt;/p&gt;

&lt;p&gt;There are three types of CDC mechanisms namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log-based CDC&lt;/strong&gt;: This captures changes by reading the database’s transaction logs, which are used internally to ensure durability and consistency. For example, PostgreSQL uses Write-Ahead Logs (WALs) to record all changes before they're applied to the actual database. Tools like Debezium tap into these logs using logical decoding and convert them into structured JSON messages. These include:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;before&lt;/code&gt;: The state of the record before the change.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;after&lt;/code&gt;: The new state after the change.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;source&lt;/code&gt;: Metadata about the origin of the change e.g., database, table, LSN, timestamp.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of a log captured from Postgres into a Kafka Topic via a Debezium connector in JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BTCUSDT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61550.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.5.0.Final"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgresql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ts_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"snapshot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crypto_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sequence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[null,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;240300112&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latest_prices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"txId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;78321&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lsn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;240300112&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xmin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ts_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831210&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query-based CDC&lt;/strong&gt;: This method involves periodically querying the source database to identify changes, usually by comparing the last updated timestamps or using &lt;code&gt;ROWVERSION&lt;/code&gt; fields.
Below is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2025-07-02 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger-based CDC&lt;/strong&gt;: This method uses database triggers to detect changes and write them to an audit/log table. An example includes:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;audit_changes&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;log_user_changes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;log_user_changes()&lt;/code&gt; would write the old and new row values to a separate &lt;code&gt;user_audit_log&lt;/code&gt; table which will store the changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Requirements
&lt;/h2&gt;

&lt;p&gt;This project requires the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;python 3.x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pyspark 3.5.x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cassandra-driver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;confluent-kafka&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sqlalchemy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;psycopg2-binary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Confluent Cloud cluster&lt;/li&gt;
&lt;li&gt;Debezium's PostgresCDCConnectorV2 connector, available on Confluent Cloud&lt;/li&gt;
&lt;li&gt;A Microsoft Azure Virtual Machine to host Apache Cassandra (you can also host it locally)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7b24y8leracspsrb9xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7b24y8leracspsrb9xx.png" alt="Project Workflow" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1.
&lt;/h3&gt;

&lt;p&gt;The Python script extracts and transforms five types of data from the Binance REST API endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Recent Trades&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Order Book Depth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;30-minute Klines / Candlesticks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;24h Daily Ticker prices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latest Prices&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of the extraction, transformation and loading of the &lt;code&gt;latest_prices&lt;/code&gt; data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;coin_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/v3/ticker/price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;coin_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching recent trades: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coin_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latest_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;binance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest price data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading latest prices to db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After transformation, it loads the data into the PostgreSQL database's &lt;code&gt;binance&lt;/code&gt; schema. The PostgreSQL tables created are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.recent_trades&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.order_book&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.klines&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.daily_ticker&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.latest_prices&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Change Data Capture with Confluent Cloud (Debezium)
&lt;/h3&gt;

&lt;p&gt;Head over to Confluent Cloud; if you do not have an account, create a free one to claim $400 in free credits, then create an environment and a cluster. You may be prompted to add a payment method during setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y8yofe0sfm1nduj2dxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y8yofe0sfm1nduj2dxx.png" alt="Cluster Setup" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating a cluster, select &lt;strong&gt;Connectors&lt;/strong&gt; on the left-hand side, search for the &lt;strong&gt;Postgres CDC Source V2 (Debezium)&lt;/strong&gt; connector, and start the setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylzg0mrhgedfpddholtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylzg0mrhgedfpddholtn.png" alt="CDC Setup" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwovfzlhgs7cs97fj917.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwovfzlhgs7cs97fj917.png" alt="CDC Setup" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the connector receiving data from the Postgres database and storing it in Kafka topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqep0pcdabt0r9iuud0j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqep0pcdabt0r9iuud0j5.png" alt="Connecting to Kafka" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Streaming from Kafka into Cassandra with Spark
&lt;/h3&gt;

&lt;p&gt;This phase is a Python script containing a Spark Structured Streaming job that acts as a data sink from Kafka into Apache Cassandra.&lt;/p&gt;

&lt;p&gt;It consumes the change logs from the various Kafka topics as JSON, flattens the Debezium envelope fields (&lt;code&gt;after&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;op&lt;/code&gt;, &lt;code&gt;ts_ms&lt;/code&gt;, and &lt;code&gt;transaction&lt;/code&gt;), transforms the records, and loads them into Cassandra hosted on an Azure Ubuntu VM.&lt;/p&gt;
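
&lt;p&gt;Below is a minimal sketch of such a sink for a single table, assuming a hypothetical topic name, an illustrative subset of the schema, and placeholder connection details; the real job covers all five topics and needs the Kafka and Cassandra connector packages on the Spark classpath:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("binance-cdc-sink").getOrCreate()

# Illustrative subset of the Debezium envelope for one table
schema = StructType([
    StructField("after", StructType([
        StructField("symbol", StringType()),
        StructField("price", StringType()),
    ])),
    StructField("op", StringType()),
    StructField("ts_ms", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "BOOTSTRAP_SERVER:9092")  # placeholder
       .option("subscribe", "cdc.binance.latest_prices")            # placeholder topic
       .load())

# Parse the JSON value and flatten the envelope down to the row payload
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("d"))
          .select("d.after.*", "d.op", "d.ts_ms"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the matching Cassandra table
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="binance", table="latest_prices")            # placeholder
     .save())

parsed.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;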

&lt;p&gt;The data received in Cassandra is as follows for the &lt;code&gt;klines&lt;/code&gt; table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuq4ju0wujm3uvwuuecy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuq4ju0wujm3uvwuuecy.png" alt="Klines example" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Visualizing the Data in Grafana
&lt;/h3&gt;

&lt;p&gt;After streaming data into Cassandra, we need to visualize this data on a Grafana dashboard to derive insights from the data.&lt;/p&gt;

&lt;p&gt;For this project, I used Grafana Cloud. There is a Cassandra plugin that can be used as a data source, but it only supports Cassandra 3.0.0 and older, and I am running 4.1.9, so I ran into issues connecting Grafana directly to Cassandra.&lt;/p&gt;

&lt;p&gt;Instead, I built a simple Cassandra-to-Postgres bridge that copies the data into a PostgreSQL database, which Grafana supports natively, to draw insights from the data.&lt;/p&gt;
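
&lt;p&gt;As a rough sketch, the bridge simply reads rows out of Cassandra with the Python driver and appends them to Postgres with pandas; the keyspace, table, host, and credentials below are placeholders (see the authentication note that follows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from sqlalchemy import create_engine

# Placeholder host and credentials
auth = PlainTextAuthProvider(username="cassandra", password="CASSANDRA_PWD")
cluster = Cluster(["VM_PUBLIC_IP"], auth_provider=auth)
session = cluster.connect("binance")  # keyspace

rows = session.execute("SELECT symbol, price, time FROM latest_prices")
df = pd.DataFrame(list(rows))  # driver rows behave like namedtuples

engine = create_engine("DB_URI")  # placeholder Postgres connection string
df.to_sql("latest_prices", con=engine, schema="binance", index=False, if_exists="append")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;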

&lt;blockquote&gt;
&lt;p&gt;NOTE: By default, Cassandra ships with the authenticator set to &lt;code&gt;AllowAllAuthenticator&lt;/code&gt; and the authorizer set to &lt;code&gt;AllowAllAuthorizer&lt;/code&gt;, which disables authentication and allows any user to access Cassandra's shell, &lt;code&gt;cqlsh&lt;/code&gt;. Open the config file, replace the defaults with the values shown below, and restart the Cassandra service:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/cassandra/cassandra.yaml &lt;span class="c"&gt;# Cassandra's config file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;authenticator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PasswordAuthenticator&lt;/span&gt;

&lt;span class="na"&gt;authorizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CassandraAuthorizer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then exit the editor and run the following command to restart the Cassandra service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8boahgh72d0s14f7vpdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8boahgh72d0s14f7vpdf.png" alt="Dashboard image 1" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrla74vzxgx4573tuxqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrla74vzxgx4573tuxqu.png" alt="Dashboard image 2" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Insights gained from the data
&lt;/h3&gt;

&lt;p&gt;The dashboards reveal several valuable insights into crypto trading patterns. The first chart shows that XRPUSDT, AVAXUSDT, BTCUSDT, and DOGEUSDT are among the most frequently traded symbols in the trading day, indicating high liquidity or market interest. &lt;/p&gt;

&lt;p&gt;The ETHUSDT volume trend indicates significant fluctuations, with notable spikes between 13:00 and 14:00, suggesting peak trading hours possibly influenced by market news or institutional actions. &lt;/p&gt;

&lt;p&gt;The candlestick chart for BTCUSDT displays hourly open, high, low, and close prices, offering a useful tool for technical analysis and identifying market trends or reversal patterns.&lt;/p&gt;

&lt;p&gt;Additionally, the quantity of coins ordered throughout the trading day highlights DOGEUSDT and ADAUSDT as dominant in trade volume, with DOGE exceeding 270,000 units, likely due to its lower price and retail trader appeal. Overall, the visualizations suggest strong market activity around a few key coins, with volume and price data providing critical insights for real-time monitoring, trading strategy, or machine learning applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project was a rewarding journey through the world of real-time data engineering. It demonstrates how Change Data Capture can save time and resources while updating our databases for streaming applications and data pipelines. This stack can easily be used for downstream use cases e.g. data analytics or even real-time ML predictions for crypto coins.&lt;/p&gt;

&lt;p&gt;Do you have any questions? Don't forget to like, comment and follow for more data engineering tutorials and content!&lt;/p&gt;

</description>
      <category>python</category>
      <category>cryptocurrency</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Tracking Kenya’s External Debt Using Python, PostgreSQL, and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 20 Jun 2025 21:35:21 +0000</pubDate>
      <link>https://forem.com/dkkinyua/tracking-kenyas-external-debt-using-python-postgresql-and-grafana-1h61</link>
      <guid>https://forem.com/dkkinyua/tracking-kenyas-external-debt-using-python-postgresql-and-grafana-1h61</guid>
      <description>&lt;p&gt;We always hear about Kenya’s external rising debt in headlines, but how fast is it growing? And what are the trends year over year?&lt;/p&gt;

&lt;p&gt;As a data engineer, I wanted to answer these questions with data, not just opinions. So I built a pipeline that connects the dots: from pulling debt data via the World Bank API, transforming it using pandas, storing it in a PostgreSQL database, and visualizing the story through Grafana.&lt;/p&gt;

&lt;p&gt;You can get the project code on &lt;a href="https://github.com/dkkinyua/KenyanExternalDebtAnalysis" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python for scripting and data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas for data transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL for data storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana for interactive dashboards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;World Bank API as the data source&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Extracting Data from the World Bank API
&lt;/h3&gt;

&lt;p&gt;I used the World Bank API to fetch Kenya’s external debt stock from 2010 to 2023 using the &lt;code&gt;requests&lt;/code&gt; library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.worldbank.org/v2/country/KE/indicator/DT.DOD.DECT.CD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2010:2024&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rel_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;rel_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countryName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debtValue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rel_data&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Transforming Data with Pandas
&lt;/h3&gt;

&lt;p&gt;I cleaned and processed the data by dropping NaN/missing values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Loading into a PostgreSQL database
&lt;/h3&gt;

&lt;p&gt;Load the data into a PostgreSQL database for easier integration with the Grafana dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;external_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Building Grafana dashboards
&lt;/h3&gt;

&lt;h4&gt;
  
  
  A. Creating and configuring Grafana
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;Grafana Cloud&lt;/a&gt; and create a new account.&lt;/li&gt;
&lt;li&gt;Head over to Dashboards and click on Data Sources&lt;/li&gt;
&lt;li&gt;On your right-hand side, click on Add Data Source and connect your PostgreSQL database&lt;/li&gt;
&lt;li&gt;Click on &lt;em&gt;Create a dashboard&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  B. Dashboards
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kfqehvyavwr0akt2aqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kfqehvyavwr0akt2aqn.png" alt="Grafana Dashboard" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights Derived
&lt;/h2&gt;

&lt;p&gt;External debt grew steadily between 2010 and 2023, a 383% increase over 13 years, pointing to an overreliance on external borrowing to finance development.&lt;/p&gt;
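
&lt;p&gt;As a quick sanity check, a 383% total increase over 13 years implies an average annual growth rate of roughly 13%:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Implied compound annual growth rate from the 383% total increase
total_growth = 4.83  # final debt / initial debt, i.e. a 383% increase
years = 13
cagr = total_growth ** (1 / years) - 1
print(f"{cagr:.1%}")  # ~12.9% per year
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;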

&lt;p&gt;From the bar chart in the dashboard, 2014-2015 and 2017-2018 stand out as periods of high borrowing. The 2017 spike was the highest of them all, which coincided with the 2017 election period.&lt;/p&gt;

&lt;p&gt;Debt growth slowed between 2021 and 2023, which suggests either deliberate efforts to rein in external borrowing or pressure from debt servicing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical or logical challenges encountered
&lt;/h2&gt;

&lt;p&gt;Some of the technical or logical challenges encountered include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent time formats&lt;/li&gt;
&lt;li&gt;Dealing with missing/NaN values&lt;/li&gt;
&lt;li&gt;The API returns nested JSON or XML, which is not always straightforward for pandas ingestion (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;The data covers only external debt and does not include other indicators for analyzing broader debt patterns&lt;/li&gt;
&lt;/ul&gt;
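
&lt;p&gt;For the nested JSON in particular, &lt;code&gt;pandas.json_normalize&lt;/code&gt; can flatten the response in one call; a minimal sketch, assuming &lt;code&gt;data&lt;/code&gt; is the parsed response from the &lt;code&gt;extract()&lt;/code&gt; function above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Element 0 of the response is pagination metadata; element 1 holds the records
records = data[1]
df = pd.json_normalize(records)

# Nested keys become dotted columns, e.g. "country.id" and "indicator.value"
df = df.rename(columns={"country.id": "countryName", "date": "year", "value": "debtValue"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;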

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project gives hands-on experience with end-to-end ETL pipeline design, from data extraction from the World Bank API using &lt;code&gt;requests&lt;/code&gt; to transformation using pandas, loading, and visualization using Grafana. It's a solid foundation for building more robust data pipelines.&lt;/p&gt;

&lt;p&gt;For more blogs like this, please like, comment and follow me to stay updated in the data engineering world!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Extracting Data from the Premier League YouTube Channel Using `googleapiclient`, PySpark, Airflow, PostgreSQL and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Thu, 12 Jun 2025 14:34:56 +0000</pubDate>
      <link>https://forem.com/dkkinyua/extracting-data-from-the-premier-league-youtube-channel-using-googleapiclient-pyspark-airflow-29ie</link>
      <guid>https://forem.com/dkkinyua/extracting-data-from-the-premier-league-youtube-channel-using-googleapiclient-pyspark-airflow-29ie</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, I’ll walk you through how I built a scalable ETL data pipeline to extract and analyze video content from the &lt;a href="https://www.youtube.com/@PremierLeague" rel="noopener noreferrer"&gt;Premier League YouTube Channel&lt;/a&gt;, using the YouTube Data API, PySpark, Airflow, and PostgreSQL, and visualized the data on a Grafana dashboard.&lt;/p&gt;

&lt;p&gt;Here is the GitHub &lt;a href="https://github.com/dkkinyua/pl-youtube-analytics" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1: Extract Data from YouTube API
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;googleapiclient&lt;/code&gt;, I connected to the YouTube Data API and extracted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The channel's uploads playlist ID&lt;/li&gt;
&lt;li&gt;All video IDs, processed in batches of 50&lt;/li&gt;
&lt;li&gt;Metadata for each video: title, view count, like count, comment count, and published date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data was written to a JSON file for further processing and transformation.&lt;/p&gt;
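
&lt;p&gt;Here is a condensed sketch of that extraction, with a placeholder API key and channel ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUTUBE_API_KEY")  # placeholder key

# 1. Find the channel's uploads playlist ID
channel = youtube.channels().list(part="contentDetails", id="CHANNEL_ID").execute()
uploads_id = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# 2. Page through the playlist collecting video IDs
video_ids, token = [], None
while True:
    page = youtube.playlistItems().list(
        part="contentDetails", playlistId=uploads_id, maxResults=50, pageToken=token
    ).execute()
    video_ids += [item["contentDetails"]["videoId"] for item in page["items"]]
    token = page.get("nextPageToken")
    if not token:
        break

# 3. Fetch metadata in batches of 50 (the API's per-request limit)
videos = []
for i in range(0, len(video_ids), 50):
    batch = youtube.videos().list(
        part="snippet,statistics", id=",".join(video_ids[i:i + 50])
    ).execute()
    videos += batch["items"]

with open("videos.json", "w") as f:
    json.dump(videos, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;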

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911z4wtyu915xqkl6wv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911z4wtyu915xqkl6wv6.png" alt="JSON File Output" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2: Transform Data with PySpark and Load into Postgres
&lt;/h2&gt;

&lt;p&gt;Next, I used PySpark to clean and process the data by creating a Spark DataFrame and transforming it through a temporary view.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiLine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;videos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT 
            title,
            videoId,
            CAST(viewCount as INT) as viewCount,
            CASE
                WHEN CAST(viewCount AS INT) &amp;gt;= 100000 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;very viral&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
                WHEN CAST(viewCount AS INT) BETWEEN 10000 AND 99999 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;viral&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
                ELSE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;normal viewing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
            END AS video_virality,
            CAST(commentCount AS INT) AS commentCount,
            CAST(likeCount AS INT) AS likeCount,
            CAST(publishedAt AS DATE) AS date
        FROM videos
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transformations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cast viewCount, likeCount, and commentCount to integer values for easier analysis.&lt;/li&gt;
&lt;li&gt;Bucketed videos into virality tiers based on viewCount, via the &lt;code&gt;CASE&lt;/code&gt; expression above.&lt;/li&gt;
&lt;li&gt;Calculated an engagement rate with the formula &lt;code&gt;(likes + comments) / views&lt;/code&gt; (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
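
&lt;p&gt;A minimal sketch of that engagement-rate column, assuming the &lt;code&gt;new_df&lt;/code&gt; produced by the query above and guarding against zero views:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import functions as F

new_df = new_df.withColumn(
    "engagement_rate",
    F.when(F.col("viewCount") &gt; 0,
           (F.col("likeCount") + F.col("commentCount")) / F.col("viewCount"))
)  # rows with zero views are left null by the unmatched when()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;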

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0xlc6lwgbcezoe2egf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0xlc6lwgbcezoe2egf.png" alt="PySpark Instance" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading into Postgres
&lt;/h3&gt;

&lt;p&gt;To load the data into a Postgres database, I used the &lt;code&gt;df.write.jdbc()&lt;/code&gt; method from PySpark.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: Make sure the PostgreSQL JDBC driver is on the Spark classpath (via the &lt;code&gt;spark.jars&lt;/code&gt; or &lt;code&gt;spark.jars.packages&lt;/code&gt; config) to avoid errors while loading data into the database. Also, use the JDBC URI to connect to the database; it doesn't include the user and password, as those are set in the &lt;code&gt;properties&lt;/code&gt; variable.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
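
&lt;p&gt;One way to pull the driver in is via the &lt;code&gt;spark.jars.packages&lt;/code&gt; config when building the session; the driver version below is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pl-youtube-analytics")
         # Fetches the PostgreSQL JDBC driver from Maven at startup
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
         .getOrCreate())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;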

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avnadmin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DB_PWD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# code
&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="n"&gt;new_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pl_analytics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into Postgres DB successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded Spark Dataframe into db error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3: Orchestrate with Airflow
&lt;/h2&gt;

&lt;p&gt;To automate the workflow weekly, I created two Python modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;extract_data.py: Extracts data from YouTube and writes to JSON&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;transform_and_load.py: Reads the JSON, processes with PySpark, and loads into PostgreSQL&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I defined a DAG in extract_transform.py with the extract and transform-and-load tasks, using the PythonOperator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pl_analytics_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This DAG contains the pipeline for extracting, transforming and loading data from the Youtube API&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_task&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform_and_load&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform_task&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21zjg7acuoo03g04ojro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21zjg7acuoo03g04ojro.png" alt="Airflow Graph view" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Visualizing with Grafana
&lt;/h2&gt;

&lt;p&gt;I connected the PostgreSQL database to Grafana Cloud and created a dashboard to draw insights from the collected data. Some of the panels include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Line chart of views over time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top 10 most viewed videos (bar chart)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Histogram of engagement rates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scatter plot of video length vs. views/likes/comments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpyld7gas9yte353ydsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpyld7gas9yte353ydsb.png" alt="Grafana dashboard" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project gives hands-on experience with end-to-end ETL pipeline design, from data extraction from Google APIs using &lt;code&gt;googleapiclient&lt;/code&gt; to transformation using PySpark, loading, and visualization. It's a solid foundation for building more robust data pipelines.&lt;/p&gt;

&lt;p&gt;For more blogs like this, please like, comment and follow me to stay updated in the data engineering world!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testing Airflow DAGs 101: A Practical Guide for Modern Data Teams.</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 06 Jun 2025 15:58:00 +0000</pubDate>
      <link>https://forem.com/dkkinyua/testing-airflow-dags-101-a-practical-guide-for-modern-data-teams-315j</link>
      <guid>https://forem.com/dkkinyua/testing-airflow-dags-101-a-practical-guide-for-modern-data-teams-315j</guid>
      <description>&lt;p&gt;Apache Airflow is an essential tool for data engineers. Data engineers use Airflow to automate, schedule and manage tasks within Directed Acyclic Graphs (DAGs) to ensure that data pipelines are being run effectively.&lt;/p&gt;

&lt;p&gt;When pushing our DAGs to production, we, as data engineers, run the risk of pushing broken DAGs to production, breaking data pipelines and costing the data team and the whole organization time and money in the process. This is the reason why testing our DAGs before pushing them to production is key in maintaining robust and scalable data pipelines, while saving time and money for the organization.&lt;/p&gt;

&lt;p&gt;Testing Airflow DAGs used to be hard: engineers had to create many DAGs on the production server just to test their code, which took up a lot of space and wasted a lot of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Test Airflow DAGs?
&lt;/h2&gt;

&lt;p&gt;Testing Airflow DAGs is not just about catching bugs. It is about ensuring data integrity, reducing operational risk, and speeding up development. Without tests, a small change in your pipeline could have unintended consequences downstream, ranging from data loss to duplicated reports and dashboards.&lt;/p&gt;

&lt;p&gt;The following are reasons why testing DAGs is important in the data lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevent failures before they happen, by identifying breaking points in our code even before pushing the DAG to production&lt;/li&gt;
&lt;li&gt;Test logic and dependencies without waiting for a full run&lt;/li&gt;
&lt;li&gt;Catch regressions and breaking changes early&lt;/li&gt;
&lt;li&gt;Save time and resources for the data team and the whole organization in the long run&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of tests that can be done on Airflow DAGs.
&lt;/h2&gt;

&lt;p&gt;There are various tests that can be conducted on Airflow DAGs which include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests using the &lt;code&gt;pytest&lt;/code&gt; or &lt;code&gt;unittest&lt;/code&gt; modules from Python&lt;/li&gt;
&lt;li&gt;Integration tests by setting up test DAGs and mock DAGs&lt;/li&gt;
&lt;li&gt;End-to-end tests by setting up test environments and conducting full DAG runs using the Airflow CLI or &lt;code&gt;breeze&lt;/code&gt; (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
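
&lt;p&gt;For end-to-end runs, Airflow 2.5+ also lets you trigger a full in-process run straight from Python with &lt;code&gt;dag.test()&lt;/code&gt;, which is handy in CI; a minimal sketch, assuming a hypothetical module path for the DAG:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical import path; point it at wherever your DAG factory lives
from dags.user_data import user_data_dag

dag = user_data_dag()  # the @dag-decorated function returns a DAG object
dag.test()             # runs every task in-process, in dependency order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;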

&lt;h3&gt;
  
  
  1. Unit tests.
&lt;/h3&gt;

&lt;p&gt;Unit tests are all about separating your logic from the DAG and testing each task or piece of logic independently.&lt;br&gt;
This allows us to identify as many breaking points as possible in our code and fix them before pushing to production.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example of a unit test in Airflow.
&lt;/h4&gt;

&lt;p&gt;Assume we have a task in our DAG that calculates the average age of users, taking in a list of user data like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_data_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;user_data_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;# code and other tasks
&lt;/span&gt;  &lt;span class="nd"&gt;@task&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_avg_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;total_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_age&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's write unit tests for this task's logic, covering several cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dags&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_avg_age&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_valid_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_empty_list&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_one_user&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catches bugs early&lt;/li&gt;
&lt;li&gt;Broad angle of logic testing&lt;/li&gt;
&lt;li&gt;Faster feedback&lt;/li&gt;
&lt;li&gt;Encourages writing modular code&lt;/li&gt;
&lt;li&gt;Easier to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't cover the entire scope of the project/DAG&lt;/li&gt;
&lt;li&gt;Time-consuming to write and maintain&lt;/li&gt;
&lt;li&gt;Doesn't uncover DAG-related bugs and errors&lt;/li&gt;
&lt;li&gt;Provides a false sense of security, as it doesn't guarantee that the DAG will run successfully in production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Integration tests.
&lt;/h3&gt;

&lt;p&gt;While unit tests focus on isolated logic, integration tests verify how multiple components (e.g. tasks) interact with each other, such as tasks passing data via Airflow XCom, making API calls, or performing database operations.&lt;/p&gt;

&lt;p&gt;These tests simulate a portion (or all) of your DAG logic to ensure everything works together as expected.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example of an integration test in Airflow
&lt;/h4&gt;

&lt;p&gt;Let's define two tasks: &lt;code&gt;extract()&lt;/code&gt;, which returns a list of dictionaries containing user data, and &lt;code&gt;transform()&lt;/code&gt;, which calculates the average age of the users in the data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;integration_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;avg_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average age is: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_age&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;avg_age&lt;/span&gt;

    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;integration_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's an integration test for this DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_integration_dag.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DagBag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskInstance&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_taskflow_xcom_passes_correctly&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;dag_bag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DagBag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dags/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag_bag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integration_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;exec_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;extract_ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TaskInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exec_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extract_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignore_ti_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# XCom value from extract
&lt;/span&gt;    &lt;span class="n"&gt;users_from_xcom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users_from_xcom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;users_from_xcom&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;transform_ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TaskInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exec_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignore_ti_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;transform_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates real behaviour during DAG runs&lt;/li&gt;
&lt;li&gt;More realistic scope than unit tests&lt;/li&gt;
&lt;li&gt;Useful in staging or pre-production test environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harder to maintain&lt;/li&gt;
&lt;li&gt;Complex setup&lt;/li&gt;
&lt;li&gt;Slower to execute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. End-to-end tests.
&lt;/h3&gt;

&lt;p&gt;End-to-end (E2E) testing is the practice of executing an entire DAG, or a significant portion of it, in an environment that mimics production. It tests not just your task logic, but also task dependencies, retries, data movement, and overall orchestration.&lt;/p&gt;

&lt;p&gt;It simulates a DAG run using real data or test data and ensures the entire flow behaves as expected from task execution to XComs, file movement, database writes, or API calls.&lt;/p&gt;
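
&lt;p&gt;Because E2E tests exercise real orchestration, they usually need a dedicated test environment. As a minimal sketch (assuming Airflow 2.5+, where the &lt;code&gt;dag.test()&lt;/code&gt; method covered in method 5 below is available, and a disposable metadata database), you could run the whole &lt;code&gt;integration_dag&lt;/code&gt; from pytest and assert on the final run state:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tests/test_e2e_dag.py
# A sketch, not a prescription: assumes Airflow &amp;gt;= 2.5, where dag.test()
# runs the DAG in-process and returns the finished DagRun, plus a
# throwaway metadata DB and the integration_dag defined above in dags/
from airflow.models import DagBag
from airflow.utils.state import State

def test_integration_dag_end_to_end():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    dag = dag_bag.get_dag("integration_dag")
    assert dag is not None

    # Runs every task in dependency order in one process, XComs included
    dagrun = dag.test()
    assert dagrun.state == State.SUCCESS
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;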

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests the full workflow&lt;/li&gt;
&lt;li&gt;Runs in a production-like environment, so the DAG behaves as it would in production&lt;/li&gt;
&lt;li&gt;Uncovers integration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow to run&lt;/li&gt;
&lt;li&gt;Harder to set up and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Testing using &lt;code&gt;airflow dags test&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;airflow dags test&lt;/code&gt; command allows you to manually run a DAG from the command line for a given execution date, task by task, without waiting for the scheduler or triggering a full DAG run in the database.&lt;/p&gt;

&lt;p&gt;It simulates a DAG run without scheduling, backfilling, or logging metadata to the database.&lt;/p&gt;

&lt;p&gt;To use this command in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow dags &lt;span class="nb"&gt;test &lt;/span&gt;sample_etl_dag 2023-01-01
&lt;span class="c"&gt;# Syntax: airflow dags test &amp;lt;dag_id&amp;gt; &amp;lt;execution_date&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;NOTE: This runs locally and writes no run metadata to the database. Keep in mind, though, that your task code still executes for real, so calls to external sources, e.g. APIs, will actually be made.&lt;/p&gt;
&lt;/blockquote&gt;
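
&lt;p&gt;If you also want this check to run in CI, one option is to wrap the command in a pytest test. A small sketch, assuming the &lt;code&gt;airflow&lt;/code&gt; CLI is on the test runner's PATH:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tests/test_dags_cli.py
# Sketch: shells out to the airflow CLI with the same DAG id and date as above
import subprocess

def test_sample_etl_dag_runs_via_cli():
    result = subprocess.run(
        ["airflow", "dags", "test", "sample_etl_dag", "2023-01-01"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means a task failed or the DAG did not parse
    assert result.returncode == 0, result.stderr
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;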

&lt;h3&gt;
  
  
  5. Using &lt;code&gt;dag.test()&lt;/code&gt; method.
&lt;/h3&gt;

&lt;p&gt;We can call the &lt;code&gt;dag.test()&lt;/code&gt; method at the end of our DAG file to test the DAG whenever the file itself is run.&lt;/p&gt;

&lt;p&gt;It is handy for local development testing, as the DAG runs in-process without needing a scheduler or webserver.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deecodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@hourly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello there, Airflow is running!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dag_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# let's test this DAG
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dag_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the file in the terminal to test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 test_dag.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.238+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4435&lt;span class="o"&gt;}&lt;/span&gt; INFO - dagrun &lt;span class="nb"&gt;id&lt;/span&gt;: test_dag
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.278+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4451&lt;span class="o"&gt;}&lt;/span&gt; INFO - created dagrun &amp;lt;DagRun test_dag @ 2025-06-06 15:51:51.044158+00:00: manual__2025-06-06T15:51:51.044158+00:00, state:running, queued_at: None. externally triggered: False&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.349+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4396&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] starting &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello &lt;span class="nv"&gt;map_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.350+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4399&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] running task &amp;lt;TaskInstance: test_dag.say_hello manual__2025-06-06T15:51:51.044158+00:00 &lt;span class="o"&gt;[&lt;/span&gt;scheduled]&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06 18:51:53,606] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:3134&lt;span class="o"&gt;}&lt;/span&gt; INFO - Exporting &lt;span class="nb"&gt;env &lt;/span&gt;vars: &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_OWNER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'deecodes'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'test_dag'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TASK_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'say_hello'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_EXECUTION_DATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-06T15:51:51.044158+00:00'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TRY_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_RUN_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'manual__2025-06-06T15:51:51.044158+00:00'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.606+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:3134&lt;span class="o"&gt;}&lt;/span&gt; INFO - Exporting &lt;span class="nb"&gt;env &lt;/span&gt;vars: &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_OWNER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'deecodes'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'test_dag'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TASK_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'say_hello'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_EXECUTION_DATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-06T15:51:51.044158+00:00'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TRY_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_RUN_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'manual__2025-06-06T15:51:51.044158+00:00'&lt;/span&gt;
Task instance is &lt;span class="k"&gt;in &lt;/span&gt;running state
 Previous state of the Task instance: queued
Current task name:say_hello state:scheduled start_date:None
Dag name:test_dag and current dag run status:running
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.611+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:732&lt;span class="o"&gt;}&lt;/span&gt; INFO - ::endgroup::
Hello there, Airflow is running!
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06 18:51:53,614] &lt;span class="o"&gt;{&lt;/span&gt;python.py:240&lt;span class="o"&gt;}&lt;/span&gt; INFO - Done. Returned value was: None
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.614+0300] &lt;span class="o"&gt;{&lt;/span&gt;python.py:240&lt;span class="o"&gt;}&lt;/span&gt; INFO - Done. Returned value was: None
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.624+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:341&lt;span class="o"&gt;}&lt;/span&gt; INFO - ::group::Post task execution logs
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.625+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:353&lt;span class="o"&gt;}&lt;/span&gt; INFO - Marking task as SUCCESS. &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_dag, &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello, &lt;span class="nv"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual__2025-06-06T15:51:51.044158+00:00, &lt;span class="nv"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20250606T155151, &lt;span class="nv"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;, &lt;span class="nv"&gt;end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20250606T155153
Task instance &lt;span class="k"&gt;in &lt;/span&gt;success state
 Previous state of the Task instance: running
Dag name:test_dag queued_at:None
Task &lt;span class="nb"&gt;hostname&lt;/span&gt;:Denzo.localdomain operator:_PythonDecoratedOperator
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.696+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4410&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] end task &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello &lt;span class="nv"&gt;map_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.707+0300] &lt;span class="o"&gt;{&lt;/span&gt;dagrun.py:854&lt;span class="o"&gt;}&lt;/span&gt; INFO - Marking run &amp;lt;DagRun test_dag @ 2025-06-06 15:51:51.044158+00:00: manual__2025-06-06T15:51:51.044158+00:00, state:running, queued_at: None. externally triggered: False&amp;gt; successful
Dag run &lt;span class="k"&gt;in &lt;/span&gt;success state
Dag run start:2025-06-06 15:51:51.044158+00:00 end:2025-06-06 15:51:53.708913+00:00
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.710+0300] &lt;span class="o"&gt;{&lt;/span&gt;dagrun.py:905&lt;span class="o"&gt;}&lt;/span&gt; INFO - DagRun Finished: &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_dag, &lt;span class="nv"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:51.044158+00:00, &lt;span class="nv"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual__2025-06-06T15:51:51.044158+00:00, &lt;span class="nv"&gt;run_start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:51.044158+00:00, &lt;span class="nv"&gt;run_end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:53.708913+00:00, &lt;span class="nv"&gt;run_duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.664755, &lt;span class="nv"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;success, &lt;span class="nv"&gt;external_trigger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False, &lt;span class="nv"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual, &lt;span class="nv"&gt;data_interval_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 14:00:00+00:00, &lt;span class="nv"&gt;data_interval_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:00:00+00:00, &lt;span class="nv"&gt;dag_hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows for faster local testing&lt;/li&gt;
&lt;li&gt;Simple debugging&lt;/li&gt;
&lt;li&gt;No running Airflow deployment (scheduler or webserver) required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This method is not scalable across many DAGs&lt;/li&gt;
&lt;li&gt;It doesn't test the DAG's integrity&lt;/li&gt;
&lt;li&gt;It doesn't simulate a true scheduled DAG run&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Testing your Airflow DAGs may feel like overkill when you are moving fast, but the long-term stability and confidence it brings are well worth the effort. Treat your data pipelines like production code, because they are.&lt;/p&gt;

&lt;p&gt;By following the practical steps in this guide, you and your team will be better equipped to deliver robust, scalable pipelines that do not break when it matters most: in production.&lt;/p&gt;

&lt;p&gt;To learn more about testing in Airflow, visit the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/2.6.0/core-concepts/executor/debug.html" rel="noopener noreferrer"&gt;Airflow's testing documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.astronomer.io/docs/learn/testing-airflow/" rel="noopener noreferrer"&gt;Astronomer's Testing documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html" rel="noopener noreferrer"&gt;Airflow's best practices in building DAGs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=wkAb8mDDHMs" rel="noopener noreferrer"&gt;etsy-dagtest, on Apache Airflow's YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this post has enriched your knowledge of testing Airflow DAGs. Please leave a like, comment, and follow for more data engineering posts just like this one.&lt;/p&gt;

&lt;p&gt;Do you have any other methods for testing your Airflow DAGs before pushing them to production? Let me know down in the comments!&lt;br&gt;
For anything else, please email me at &lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataeng</category>
      <category>python</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building a Stock Data Pipeline with requests, Apache Airflow and PostgreSQL</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 26 May 2025 07:30:21 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-stock-data-pipeline-with-requests-apache-airflow-and-postgresql-42nf</link>
      <guid>https://forem.com/dkkinyua/building-a-stock-data-pipeline-with-requests-apache-airflow-and-postgresql-42nf</guid>
      <description>&lt;p&gt;In this tutorial, I will walk you through how I built a fully functional ETL pipeline using Apache Airflow to extract stock data from the AlphaVantage API, transform it using pandas, and load it into a PostgreSQL database. The pipeline is modular and scalable and is a great introduction to Airflow DAGs, task scheduling and orchestration.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/stock-data-pipeline" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline pulls IBM's daily, weekly and monthly stock time series from AlphaVantage and performs three key steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt;: Retrieve data from the API using the &lt;code&gt;requests&lt;/code&gt; library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Clean and convert the data to a structured &lt;code&gt;pandas&lt;/code&gt; DataFrame.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Store the cleaned data in a PostgreSQL database using SQLAlchemy's &lt;code&gt;create_engine&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;The following packages are required to run this project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- apache-airflow==2.10.5
- python
- pandas
- requests
- sqlalchemy
- psycopg2-binary
- python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stock-data-pipeline/
├── dags/
│   ├── daily_pipeline.py
│   ├── weekly_pipeline.py
│   ├── monthly_pipeline.py
├── airflow.cfg
├── airflow.db
├── webserver_pid.py
├── .gitignore
├── .env
├── requirements.txt
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE for Windows users:&lt;/strong&gt; Airflow does not support Windows natively. Use WSL (Windows Subsystem for Linux) to run this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Clone the repository
&lt;/h3&gt;

&lt;p&gt;To clone this repository, open your terminal and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dkkinyua/stock-data-pipeline.git
&lt;span class="nb"&gt;cd &lt;/span&gt;stock-data-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Set up a virtual environment
&lt;/h3&gt;

&lt;p&gt;Install and activate your virtual environment using &lt;code&gt;venv&lt;/code&gt; or any other virtual environment provider, such as &lt;code&gt;virtualenv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Creating a virtual environment for each project is good practice: it prevents package mix-ups and lets each project work with compatible package versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv your_env
&lt;span class="nb"&gt;source &lt;/span&gt;your_env/bin/activate   &lt;span class="c"&gt;# Because Windows users will be &lt;/span&gt;
running Airflow on WSL, use this &lt;span class="nb"&gt;command &lt;/span&gt;to activate your environment.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your virtual environment is active, the environment name will show next to your directory in the terminal as shown in the snapshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxag881qggfojfd50kft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxag881qggfojfd50kft.png" alt="Environment Variable Snapshot" width="709" height="22"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Install dependencies
&lt;/h3&gt;

&lt;p&gt;The project structure has a &lt;code&gt;requirements.txt&lt;/code&gt; file containing all the packages required for this project.&lt;/p&gt;

&lt;p&gt;To install these packages, run the following command in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Configure Apache Airflow
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a. Set your Airflow home path and initialize Airflow's SQLite database:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/airflow
airflow db init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;airflow db init&lt;/code&gt; command creates a folder &lt;code&gt;airflow/&lt;/code&gt; in the project's root directory.&lt;/p&gt;

&lt;h4&gt;
  
  
  b. Create an admin user:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow &lt;span class="nb"&gt;users &lt;/span&gt;create &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--firstname&lt;/span&gt; YourName &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--lastname&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt; admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  c. Start the webserver and scheduler:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow webserver &amp;amp; airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Airflow UI will run on &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm3s90692lxvbfytdgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm3s90692lxvbfytdgj.png" alt="Airflow UI Login page" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the DAG: &lt;code&gt;daily_pipeline.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This DAG extracts IBM’s daily stock data using AlphaVantage's API and runs at midnight daily.&lt;/p&gt;

&lt;h4&gt;
  
  
  DAG setup
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deecodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily_data_pipeline_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  a. &lt;code&gt;extract_daily_data&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_daily_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&amp;amp;symbol=IBM&amp;amp;apikey=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt;: Store your API keys and credentials in a &lt;code&gt;.env&lt;/code&gt; file and add it to &lt;code&gt;.gitignore&lt;/code&gt;. A &lt;code&gt;.gitignore&lt;/code&gt; file tells Git not to track certain files and folders, keeping sensitive information out of your repository.&lt;/p&gt;
&lt;/blockquote&gt;
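
&lt;p&gt;As an illustration, here is a minimal sketch of that pattern with &lt;code&gt;python-dotenv&lt;/code&gt; (the variable names &lt;code&gt;API_KEY&lt;/code&gt; and &lt;code&gt;DB_URL&lt;/code&gt; are assumptions matching the snippets in this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# top of dags/daily_pipeline.py (illustrative)
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from the .env file

API_KEY = os.getenv("API_KEY")  # assumed .env key for AlphaVantage
DB_URL = os.getenv("DB_URL")    # assumed .env key for the Postgres URL
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;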

&lt;h4&gt;
  
  
  b. &lt;code&gt;transform_daily_data&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_daily_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
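
&lt;p&gt;The body above is abbreviated. For context, here is a hedged sketch of what the transformation could look like, assuming AlphaVantage's usual &lt;code&gt;"Time Series (Daily)"&lt;/code&gt; payload shape (date-keyed records with string fields like &lt;code&gt;"1. open"&lt;/code&gt; and &lt;code&gt;"4. close"&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def transform_daily_data(data):
    # Assumed payload key; adjust if the API response differs
    series = data["Time Series (Daily)"]
    df = pd.DataFrame.from_dict(series, orient="index").astype(float)
    df = df.rename(columns={
        "1. open": "Open",
        "2. high": "High",
        "3. low": "Low",
        "4. close": "Close",
        "5. volume": "Volume",
    })
    # Move the date index into a column, then index by parsed dates
    df = df.reset_index().rename(columns={"index": "Date"})
    df["Date"] = pd.to_datetime(df["Date"])
    df.set_index("Date", inplace=True)
    return df
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;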



&lt;h4&gt;
  
  
  c. &lt;code&gt;load_to_db&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily_stock_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting Task Sequence with Airflow's XCom
&lt;/h3&gt;

&lt;p&gt;Each task returns a value when it completes. We use Airflow's XCom feature to share this data between tasks, which also establishes the task sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_daily_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_daily_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
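
&lt;p&gt;For the return values to travel through XCom like this, each function is decorated as a task inside the DAG function. A condensed sketch of the wiring (assuming the TaskFlow API from the DAG setup above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@dag(dag_id='daily_data_pipeline_dag', default_args=default_args,
     start_date=datetime(2025, 5, 13), schedule='@daily')
def extract_data():

    @task()
    def extract_daily_data():
        ...  # requests.get(...) as shown above

    @task()
    def transform_daily_data(data):
        ...  # build the DataFrame as shown above

    @task()
    def load_to_db(df):
        ...  # df.to_sql(...) as shown above

    data = extract_daily_data()
    df = transform_daily_data(data)
    load_to_db(df)

daily_dag = extract_data()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;One caveat: Airflow serializes XCom values as JSON by default, so passing a raw DataFrame between tasks generally requires enabling XCom pickling or a custom XCom backend.&lt;/p&gt;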



&lt;p&gt;Trigger the DAG from the Airflow UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff644bwbt5q3tox1xonpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff644bwbt5q3tox1xonpb.png" alt="DAG overview" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check success status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczk4gepi6s80kwjip2i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczk4gepi6s80kwjip2i8.png" alt="Run DAG" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inspect task flow in the graph view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0hhy2hamrkzbaq1diu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0hhy2hamrkzbaq1diu.png" alt="Task Sequence" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Confirm the data loaded in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkb6k4apz5oqcetko4w3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkb6k4apz5oqcetko4w3.png" alt="Postgres check" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve walked through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to extract and clean stock data from AlphaVantage.&lt;/li&gt;
&lt;li&gt;How to configure and run DAGs with Apache Airflow.&lt;/li&gt;
&lt;li&gt;How to load structured data into a PostgreSQL database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have questions or suggestions, feel free to open a pull request or email me at &lt;strong&gt;&lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, please like, share, comment and follow me for more data engineering blogs and projects, and to stay in the know about the data engineering world.&lt;/p&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>dataengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>dbt for Normies - A Complete Guide.</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Sun, 11 May 2025 15:43:35 +0000</pubDate>
      <link>https://forem.com/dkkinyua/dbt-for-normies-a-complete-guide-4bj2</link>
      <guid>https://forem.com/dkkinyua/dbt-for-normies-a-complete-guide-4bj2</guid>
      <description>&lt;p&gt;Whether you are new to data engineering or you are quite conversant with the dataverse, you should have heard or used &lt;strong&gt;Data Build Tool (dbt)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So, what is &lt;code&gt;dbt&lt;/code&gt;? &lt;code&gt;dbt&lt;/code&gt; is a transformation tool that uses three languages to help data engineers transform data efficiently in ETL and ELT pipeline development. These three languages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured Query Language (SQL)&lt;/strong&gt;: This is the main language used in &lt;code&gt;dbt&lt;/code&gt;. It is used to transform data efficiently inside databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML&lt;/strong&gt;: This is the language used for &lt;code&gt;dbt&lt;/code&gt;'s configuration files, e.g. the &lt;code&gt;dbt_project.yml&lt;/code&gt; file, which contains configuration settings for our models, tests and other &lt;code&gt;dbt&lt;/code&gt; essentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jinja&lt;/strong&gt;: This is a powerful templating engine used in &lt;code&gt;dbt&lt;/code&gt;. It is also used elsewhere, e.g. in Python web frameworks (Django, Flask), where it is combined with HTML templates to create robust, functional web applications. Using Jinja in &lt;code&gt;dbt&lt;/code&gt; lets us reuse code across the files in a project, which makes it very valuable. Read more about Jinja &lt;a href="https://jinja.palletsprojects.com/en/stable/intro/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of writing massive SQL scripts, dbt helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write &lt;strong&gt;modular SQL&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;version-controlled&lt;/strong&gt; transformations&lt;/li&gt;
&lt;li&gt;Test and &lt;strong&gt;document&lt;/strong&gt; your models with an easy-to-navigate UI&lt;/li&gt;
&lt;li&gt;Embrace software engineering best practices for data analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;dbt (data build tool) works exceptionally well with CTEs (Common Table Expressions) because they align closely with dbt’s core principles: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clarity&lt;/li&gt;
&lt;li&gt;modularity&lt;/li&gt;
&lt;li&gt;reusability&lt;/li&gt;
&lt;li&gt;testability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a brief introduction to what &lt;code&gt;dbt&lt;/code&gt; is and what it entails. Let's dive deeper, get to know the basics, install &lt;code&gt;dbt-core&lt;/code&gt;, write our first &lt;code&gt;dbt&lt;/code&gt; model and learn more about this powerful tool for data engineers.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Installing and initializing &lt;code&gt;dbt&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;To install dbt, we can use either &lt;code&gt;dbt-core&lt;/code&gt; or &lt;code&gt;dbt-cloud&lt;/code&gt;. For this blog, we will install &lt;code&gt;dbt-core&lt;/code&gt;; &lt;code&gt;dbt-cloud&lt;/code&gt; will be covered in another post.&lt;/p&gt;

&lt;p&gt;To work with &lt;code&gt;dbt&lt;/code&gt;, you need to have the following prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experience installing Python packages in the terminal&lt;/li&gt;
&lt;li&gt;Beginner to advanced SQL knowledge&lt;/li&gt;
&lt;li&gt;Eagerness to learn!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before installing &lt;code&gt;dbt&lt;/code&gt;, make sure you have created and activated a virtual environment. This helps prevent package version mix-ups and errors.&lt;br&gt;
To create and activate a virtual environment for your operating system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv your_env

&lt;span class="nb"&gt;source &lt;/span&gt;your_env/bin/activate &lt;span class="c"&gt;# Linux / MacOS&lt;/span&gt;
your_env&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate &lt;span class="c"&gt;# Windows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install &lt;code&gt;dbt&lt;/code&gt;, run the command that matches the database/data source/data warehouse you would like to connect to, e.g. &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;redshift&lt;/code&gt;, &lt;code&gt;snowflake&lt;/code&gt; or &lt;code&gt;bigquery&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-postgres &lt;span class="c"&gt;# If you choose postgres&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-redshift &lt;span class="c"&gt;# If you choose redshift&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-snowflake &lt;span class="c"&gt;# If you choose snowflake&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-bigquery &lt;span class="c"&gt;# If you choose bigquery&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following commands to initialize your dbt project and move into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
dbt init your_project_name
&lt;span class="nb"&gt;cd &lt;/span&gt;your_project_name

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will prompt you to enter your data source credentials, e.g. port, host, user and password. Running this command for the first time also sets up a &lt;code&gt;.dbt/&lt;/code&gt; folder in your &lt;code&gt;HOME&lt;/code&gt; directory for global configuration files, e.g. the &lt;code&gt;profiles.yml&lt;/code&gt; file that dbt uses to connect to your data warehouse.&lt;/p&gt;

&lt;p&gt;It also creates important folders and files for us, like &lt;code&gt;models/&lt;/code&gt;, which holds our models, &lt;code&gt;dbt_project.yml&lt;/code&gt;, which contains our project's configuration settings, and &lt;code&gt;tests/&lt;/code&gt; for our tests, among others.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.  Creating our first &lt;code&gt;dbt&lt;/code&gt; model.
&lt;/h2&gt;

&lt;p&gt;To create our first model, head over to the &lt;code&gt;models/&lt;/code&gt; folder and create a file; name it as you like. Let's call it &lt;code&gt;high_cost_houses.sql&lt;/code&gt;. My data warehouse contains Kenyan house data ranging from low-cost to high-cost houses, and I need to pull out all the high-cost ones.&lt;/p&gt;

&lt;p&gt;By default, when dbt runs a model it materializes the result as a view in our schema, where we can go and query the results in our database. We can change that in the &lt;code&gt;dbt_project.yml&lt;/code&gt; file; I'll show you how to do that later on, don't worry!&lt;/p&gt;

&lt;p&gt;For our first model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;high_cost_houses&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;house_data&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;
    &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;high_cost_houses&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query fetches the houses priced at 50 million and above. To run this model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a view in our data warehouse, and you can go check it out under your schema's views.&lt;/p&gt;
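
&lt;p&gt;As a quick sanity check, you can query the new view straight from your SQL client. The schema below is an assumption; dbt builds into whatever target schema you configured in &lt;code&gt;profiles.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- Assumes dbt built the model into the "public" schema
SELECT titles, locations, prices
FROM public.high_cost_houses
ORDER BY prices DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;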

&lt;h3&gt;
  
  
  a. Modularity in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dbt&lt;/code&gt; combines Jinja templates and SQL so that we can reuse our models inside other models without repeating ourselves. This keeps our models reusable and maintainable. For example, to use our &lt;code&gt;high_cost_houses.sql&lt;/code&gt; in another model, we can use &lt;code&gt;ref()&lt;/code&gt; to reference that model as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Let's get houses that are in Runda from our high_cost_houses&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'high_cost_houses'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Runda'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have reused our model in our new model using &lt;code&gt;{{ ref('high_cost_houses') }}&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Materialization in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Remember when I said I would show you how to override dbt's default materialization from a view to a table? Let's do it.&lt;br&gt;
Head over to your &lt;code&gt;dbt_project.yml&lt;/code&gt; and change the materialization from view to table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dbt_practice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# change this from view to table&lt;/span&gt;
    &lt;span class="na"&gt;high_cost_houses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt; &lt;span class="c1"&gt;# previously: view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables are faster to query but take up more storage; views are cheaper to store but re-run their underlying query every time you select from them.&lt;/p&gt;
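
&lt;p&gt;You can also override the materialization for a single model right inside the model file with a &lt;code&gt;config()&lt;/code&gt; block. A minimal sketch, added at the top of &lt;code&gt;high_cost_houses.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- models/high_cost_houses.sql
-- This config applies to this model only and takes
-- precedence over the settings in dbt_project.yml
{{ config(materialized='table') }}

select titles, locations, prices
from public.house_data
where prices &amp;gt;= 50000000
order by prices desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;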

&lt;h2&gt;
  
  
  3. dbt Sources, Seeds and Analyses.
&lt;/h2&gt;

&lt;p&gt;Sources are the raw tables already sitting in our data warehouse; seeds are CSV files that we load into the warehouse (with &lt;code&gt;dbt seed&lt;/code&gt;) for analysis; and analyses are one-off SQL queries that are compiled but never materialized into our data warehouse as tables or views. If we want to materialize them, we should move them to the &lt;code&gt;models/&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;To reference a source in dbt, we use the &lt;code&gt;source('source_name', 'table_name')&lt;/code&gt; function. Make sure you have declared the source in a schema file under &lt;code&gt;models/&lt;/code&gt;, e.g. &lt;code&gt;models/schema.yml&lt;/code&gt;, as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database_loading&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;house_data_loading&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reference it as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'database_loading'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'house_data_loading'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
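

&lt;p&gt;For completeness, here is what an analysis might look like. The file name and query are hypothetical; save it under the &lt;code&gt;analyses/&lt;/code&gt; folder and &lt;code&gt;dbt compile&lt;/code&gt; will render it without materializing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- analyses/price_summary.sql (hypothetical one-off query)
SELECT locations, COUNT(*) AS n_houses, AVG(prices) AS avg_price
FROM {{ ref('high_cost_houses') }}
GROUP BY locations
ORDER BY avg_price DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;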



&lt;h2&gt;
  
  
  4. Testing in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In dbt, there are two types of tests:&lt;/p&gt;

&lt;p&gt;a.  &lt;strong&gt;Singular tests&lt;/strong&gt; - These are tests written against a single model that don't need to be repeated in other models. They validate the outcome of that model. For example, in our &lt;code&gt;high_cost_houses.sql&lt;/code&gt; model, no house should have a price less than 50 million, so let's write a test to validate this.&lt;/p&gt;

&lt;p&gt;Head over to the &lt;code&gt;tests/&lt;/code&gt; folder and create a new test file called &lt;code&gt;check_prices.sql&lt;/code&gt; and write the following test query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="n"&gt;high_cost_houses&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After writing our test, run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the test fails, it means some rows have a price below 50 million, and we should go and inspect them.&lt;/p&gt;
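
&lt;p&gt;By default a failing test errors out the run. If you would rather be warned than have the run fail, dbt lets you set a severity inside the test file with a &lt;code&gt;config()&lt;/code&gt; block; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- tests/check_prices.sql: report failing rows as a warning
{{ config(severity='warn') }}

SELECT *
FROM {{ ref('high_cost_houses') }}
WHERE prices &amp;lt; 50000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;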

&lt;p&gt;b. &lt;strong&gt;Generic tests&lt;/strong&gt; - These are tests that can be applied to many tables, following the Don't Repeat Yourself (DRY) principle. &lt;br&gt;
dbt ships with built-in generic tests, and you can also write your own. Examples of the built-in ones include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;not_null&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;unique&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;accepted_values&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;relationships&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, we would like to check whether any locations are empty in the high cost and low cost tables in our data warehouse. Let's write a generic test that checks for empty values using SQL and Jinja; save it under &lt;code&gt;tests/generic/&lt;/code&gt; (or &lt;code&gt;macros/&lt;/code&gt;) so dbt can discover it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;empty_locations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endtest&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To attach the test to these tables, create a &lt;code&gt;test.yml&lt;/code&gt; schema file under &lt;code&gt;models/&lt;/code&gt; and register the test as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;empty_locations&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;empty_locations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw4i8yxw6sg0hqw6pht8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw4i8yxw6sg0hqw6pht8.png" alt="dbt Test Result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will run our tests in our models and validate our output.&lt;/p&gt;

&lt;p&gt;We can also use dbt's built-in generic tests in our &lt;code&gt;test.yml&lt;/code&gt; file without writing any custom SQL. To do so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will work the same way!&lt;/p&gt;

&lt;h2&gt;
  
  
  5. &lt;code&gt;dbt&lt;/code&gt; Macros
&lt;/h2&gt;

&lt;p&gt;Macros are dbt's equivalent of functions. They make code reusable, modular, scalable and easier to maintain.&lt;/p&gt;

&lt;p&gt;They are written with Jinja templates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;macro_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;--code&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a macro that calculates the average house price per location in Kenya.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;macros/&lt;/code&gt; folder, create a file &lt;code&gt;avg_house_price.sql&lt;/code&gt; and write the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Macro to get avg prices of houses per location&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;get_avg_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="p"&gt;}},&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="p"&gt;}})&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use our macro in our model, create a new model &lt;code&gt;avg_houses_per_location.sql&lt;/code&gt; in our &lt;code&gt;models&lt;/code&gt; folder and write the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;get_avg_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'locations'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'prices'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'house_data'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See how easy it is to create models with macros! It makes work easier, as I can reuse that macro in other models, as shown below.&lt;/p&gt;
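
&lt;p&gt;For instance, reusing the same macro against a hypothetical &lt;code&gt;low_cost_house_data&lt;/code&gt; table takes a single line in a new model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- models/avg_low_cost_per_location.sql (table name is hypothetical)
{{ get_avg_prices('locations', 'prices', 'low_cost_house_data') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;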

&lt;h2&gt;
  
  
  6. &lt;code&gt;dbt&lt;/code&gt; packages.
&lt;/h2&gt;

&lt;p&gt;Imagine what languages would be like without packages, e.g. Python without &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;pandas&lt;/code&gt;. Calling APIs or cleaning data would be tedious for any engineer or developer.&lt;/p&gt;

&lt;p&gt;dbt has packages that make running models, testing and building data pipelines easier. Two of the most widely used packages are &lt;code&gt;dbt_expectations&lt;/code&gt; and &lt;code&gt;dbt_utils&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Go through the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; for more dbt packages.&lt;/p&gt;

&lt;p&gt;Let's import &lt;code&gt;dbt_utils&lt;/code&gt; from dbt Hub into our project.&lt;br&gt;
In the &lt;code&gt;packages.yml&lt;/code&gt; file at the project root, configure the package as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.3.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following command in your terminal to install the package so we can use it in our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt deps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxcuf3loml04dxkpnkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxcuf3loml04dxkpnkc.png" alt="Install dbt packages"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can use &lt;code&gt;dbt_utils&lt;/code&gt; in our code! For example, the &lt;code&gt;star&lt;/code&gt; macro below expands into a comma-separated list of every column in the referenced model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;dbt_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;star&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'users'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
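

&lt;p&gt;&lt;code&gt;star&lt;/code&gt; also takes an &lt;code&gt;except&lt;/code&gt; argument to leave columns out of the expansion. A sketch, assuming a hypothetical &lt;code&gt;users&lt;/code&gt; model with a &lt;code&gt;password&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- Select every column from users except password
SELECT {{ dbt_utils.star(from=ref('users'), except=['password']) }}
FROM {{ ref('users') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;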



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;dbt is quite diverse, and we've covered only the fundamentals in this blog. For more information on dbt, check the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=""&gt;Youtube, Introduction to dbt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=""&gt;dbt's Youtubve channel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for reading this post; please like, comment and share. It will be much appreciated!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>programming</category>
      <category>dbt</category>
    </item>
  </channel>
</rss>
