<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Amos Augo</title>
    <description>The latest articles on Forem by Amos Augo (@augo_amos).</description>
    <link>https://forem.com/augo_amos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3397768%2Feb7e4d5e-2b9a-480c-a5b2-866a8ee5d8e8.png</url>
      <title>Forem: Amos Augo</title>
      <link>https://forem.com/augo_amos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/augo_amos"/>
    <language>en</language>
    <item>
      <title>Uncovering Global Content Trends on Netflix: A Full Tableau Analytics Breakdown</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Fri, 28 Nov 2025 01:42:26 +0000</pubDate>
      <link>https://forem.com/augo_amos/uncovering-global-content-trends-on-netflix-a-full-tableau-analytics-breakdown-23c9</link>
      <guid>https://forem.com/augo_amos/uncovering-global-content-trends-on-netflix-a-full-tableau-analytics-breakdown-23c9</guid>
      <description>&lt;p&gt;The rapid growth of streaming platforms has generated an unprecedented volume of content, making data visualization essential for understanding catalog composition, audience targeting, and regional content strategies.&lt;/p&gt;

&lt;p&gt;In this technical article, I explore a Netflix Movies &amp;amp; TV Shows dataset and build a fully interactive Tableau dashboard to examine global distribution, ratings, genres, and release patterns.&lt;/p&gt;

&lt;p&gt;This breakdown walks through the insights discovered from the dashboard and the analytics techniques used to derive them.&lt;/p&gt;




&lt;h1&gt;
  
  
  Dataset Overview
&lt;/h1&gt;

&lt;p&gt;The dataset used contains &lt;strong&gt;6,234 titles&lt;/strong&gt; listed on Netflix, with the following key attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type&lt;/strong&gt; – Movie or TV Show&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Country&lt;/strong&gt; – where the title was produced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Date Added&lt;/strong&gt; – date the title was added to Netflix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release Year&lt;/strong&gt; – original release year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rating&lt;/strong&gt; – TV and movie rating classifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt; – movie length or number of seasons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Listed In&lt;/strong&gt; – genre tags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt; – summary of the content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to answer the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What patterns define Netflix’s global content catalog across genres, countries, ratings, and years?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using Tableau, I transformed this raw dataset into actionable insights.&lt;/p&gt;
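
&lt;p&gt;Because the &lt;strong&gt;Country&lt;/strong&gt; and &lt;strong&gt;Listed In&lt;/strong&gt; fields hold comma-separated values, a small preparation step that explodes them into one value per row makes the country and genre aggregations straightforward. A minimal pandas sketch of that step (the file and column names are assumptions based on the common public export):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# prep_netflix.py -- illustrative sketch; column names assume the common
# netflix_titles.csv export (type, country, listed_in, ...)
import pandas as pd

df = pd.read_csv("netflix_titles.csv")

# Explode multi-valued fields so each row carries one country / one genre,
# which lets Tableau aggregate counts per country and per genre directly.
countries = df.assign(country=df["country"].str.split(", ")).explode("country")
genres = df.assign(genre=df["listed_in"].str.split(", ")).explode("genre")

print(countries["country"].value_counts().head(10))  # titles per country
print(genres["genre"].value_counts().head(10))       # titles per genre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;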




&lt;h1&gt;
  
  
  Dashboard Design &amp;amp; Architecture
&lt;/h1&gt;

&lt;p&gt;The dashboard consists of several analytical components arranged to deliver a 360° view of Netflix’s library:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Global Geographic Visualization&lt;/strong&gt; – content distribution by country&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ratings Distribution Chart&lt;/strong&gt; – audience targeting through content ratings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Movies vs. TV Shows Composition&lt;/strong&gt; – catalog composition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top 10 Genres Bar Chart&lt;/strong&gt; – leading content categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Release Timeline&lt;/strong&gt; – evolution of Netflix’s content strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Filters&lt;/strong&gt; – Type, Title, Date Added, Release Year, Rating, Duration, and Genre&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The design emphasizes contrast using a dark theme with bright red highlights to reflect the Netflix brand identity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyetgh2gmdcdumqxsnpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyetgh2gmdcdumqxsnpy.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  1. Global Content Distribution
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Observation: Netflix’s production is heavily centralized&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The world map reveals major production hubs with the highest volume of titles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;United States&lt;/strong&gt; leads significantly with over &lt;strong&gt;2,000 titles&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;India, the United Kingdom, and Canada follow as secondary hubs&lt;/li&gt;
&lt;li&gt;African and Middle-Eastern regions show comparatively lower production volume&lt;/li&gt;
&lt;li&gt;Latin America shows moderate output, especially Mexico and Brazil&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical Insight
&lt;/h3&gt;

&lt;p&gt;This visualization uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Country field&lt;/strong&gt; aggregated by count of titles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filled map&lt;/strong&gt; with a continuous color scale&lt;/li&gt;
&lt;li&gt;Normalized tooltip showing titles per country&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps stakeholders quickly identify content creation strength and regional business opportunities.&lt;/p&gt;




&lt;h1&gt;
  
  
  2. Movies vs. TV Shows Distribution
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Observation: Movies dominate the content library&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The packed bubble chart shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Movies: 4,265 titles (≈ 68%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TV Shows: 1,969 titles (≈ 32%)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This indicates Netflix historically focused more on movies, but the surge of TV originals in recent years—especially from 2016 onward—suggests a strategic shift toward episodic content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Insight
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;packed bubble&lt;/strong&gt; visualization was chosen to represent proportional differences without clutter, enhancing readability for executives and analysts.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. Ratings Distribution
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Observation: Adult-oriented content leads the platform&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Breaking down content by official ratings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rating&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TV-MA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,027&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TV-14&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,698&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TV-PG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;701&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;508&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PG-13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A clear pattern emerges:&lt;br&gt;
Netflix’s catalog predominantly targets &lt;strong&gt;mature audiences&lt;/strong&gt;, with very limited content produced for younger children.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Insight
&lt;/h3&gt;

&lt;p&gt;A vertical bar chart was used to expose frequency differences.&lt;br&gt;
Tooltips include rating definitions for clarity.&lt;/p&gt;




&lt;h1&gt;
  
  
  4. Top 10 Genres
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Observation: Documentaries and Comedy dominate&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The top genres by frequency (first five of the ten shown):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Documentaries – 299 titles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stand-Up Comedy – 282 titles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dramas, International Movies – 247 titles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comedies, Dramas, International Movies – 212 titles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kids’ TV – 159 titles&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Documentaries lead the catalog, likely due to their relatively low production cost and high global consumption.&lt;br&gt;
Comedy is also prominent, driven by Netflix’s heavy investment in stand-up specials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Insight
&lt;/h3&gt;

&lt;p&gt;The horizontal bars allow long genre labels to fit cleanly without truncation.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Content Growth Over the Years
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Observation: Massive expansion from 2016 to 2019&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The area chart shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal releases before 2014&lt;/li&gt;
&lt;li&gt;Sharp rise in both Movies &amp;amp; TV Shows from &lt;strong&gt;2016 to 2019&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A noticeable peak around 2018, reflecting Netflix’s investment in original content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This timeline corresponds with Netflix’s global expansion into 190+ countries in early 2016.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Insight
&lt;/h3&gt;

&lt;p&gt;Two area plots (Movies and TV Shows) layered on a shared axis highlight growth trends while maintaining category separation.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. Filter Functionality for Deeper Insights
&lt;/h1&gt;

&lt;p&gt;The dashboard includes interactive filters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rating&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Genre&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Release Year&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date Added&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These controls allow users to drill down into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;content by audience categories&lt;/li&gt;
&lt;li&gt;regional content availability&lt;/li&gt;
&lt;li&gt;genre evolution over the years&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Overall Insights &amp;amp; Business Implications
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Netflix’s content is heavily adult-oriented&lt;/strong&gt;, suggesting a strategic focus on mature viewers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Movies still outnumber TV shows&lt;/strong&gt;, but episodic content is growing faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The US dominates production&lt;/strong&gt;, followed by India and the UK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentaries and comedy are the most produced genres&lt;/strong&gt;, due to cost efficiency and universal appeal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid growth occurred between 2016 and 2019&lt;/strong&gt;, reflecting Netflix’s global expansion and original content push&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These insights help benchmark Netflix's strategy, identify content gaps, and understand global viewer targeting.&lt;/p&gt;




&lt;h1&gt;
  
  
  Technical Tools &amp;amp; Workflow Summary
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Tools/Techniques&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data preparation&lt;/td&gt;
&lt;td&gt;Excel / Python / Tableau Prep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Tableau Desktop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard design&lt;/td&gt;
&lt;td&gt;Containers, filters, custom colors, Mapbox maps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Publish to Tableau Public; share the workbook on GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This Tableau dashboard provides a comprehensive look at Netflix’s vast content library, revealing trends in genre distribution, content maturity, global production, and release patterns. The analytics approach demonstrates how powerful visualization can transform raw streaming data into actionable insights for content strategy and business decision-making.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/augo-amos/netflix-analytics-tableau-dashboard" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tableau</category>
      <category>dataanalytics</category>
    </item>
    <item>
      <title>Building a Scalable Community Health Worker Analytics Platform: My Journey with dbt and Snowflake</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Wed, 26 Nov 2025 16:21:25 +0000</pubDate>
      <link>https://forem.com/augo_amos/building-a-scalable-community-health-worker-analytics-platform-my-journey-with-dbt-and-snowflake-7gc</link>
      <guid>https://forem.com/augo_amos/building-a-scalable-community-health-worker-analytics-platform-my-journey-with-dbt-and-snowflake-7gc</guid>
      <description>&lt;h2&gt;
  
  
  The Challenge: From Data Chaos to Clear Metrics
&lt;/h2&gt;

&lt;p&gt;Data engineers working in the health sector face a familiar but critical challenge: Community Health Workers (CHWs) generate thousands of activity records daily, but turning this raw data into actionable performance metrics is a manual, error-prone process. Field coordinators need to answer simple but vital questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;How many households did each CHW visit last month?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Which communities have coverage gaps?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Are our health workers meeting their activity targets?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I worked on one such project where the existing process involved Excel exports, manual date calculations, and fragile SQL queries that broke whenever source data changed. Something had to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: A dbt-Driven Analytics Pipeline
&lt;/h2&gt;

&lt;p&gt;I designed and built a scalable analytics platform using dbt (data build tool) and Snowflake that could grow with our organization's needs. Here's how I approached it:&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Aha!" Moment: Month Assignment Logic
&lt;/h3&gt;

&lt;p&gt;The first major insight came from understanding our operational reality. CHWs work in remote areas, and activities often happen late in the month but get recorded days later. Our old system used simple calendar months, which unfairly penalized workers for reporting delays.&lt;/p&gt;

&lt;p&gt;The breakthrough was implementing a business-specific month assignment rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; 
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MONTH'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MONTH'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;report_month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple change meant activities from the 26th onward counted toward next month's metrics. It sounds small, but it transformed how we measured performance and boosted CHW morale overnight.&lt;/p&gt;
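
&lt;p&gt;For teams that validate warehouse logic outside SQL, the rule is easy to mirror and unit-test in Python. A minimal sketch (not part of the dbt project itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# month_rule.py -- mirrors the SQL month-assignment rule for quick testing
from datetime import date

def report_month(activity_date: date) -&amp;gt; date:
    """Activities on or after the 26th roll into the next month's report."""
    year, month = activity_date.year, activity_date.month
    if activity_date.day &amp;gt;= 26:
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return date(year, month, 1)

assert report_month(date(2025, 1, 25)) == date(2025, 1, 1)
assert report_month(date(2025, 1, 26)) == date(2025, 2, 1)
assert report_month(date(2025, 12, 28)) == date(2026, 1, 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;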

&lt;h3&gt;
  
  
  Building the Foundation: Staging Layer
&lt;/h3&gt;

&lt;p&gt;I started with a clean staging layer that handled the messy reality of field data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/staging/stg_chw_activity.sql&lt;/span&gt;
&lt;span class="p"&gt;{{&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ANALYTICS'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;chv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;household_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;activity_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'fct_chv_activity'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;chv_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty of this approach? Data quality issues were handled once, at the source. Downstream models could focus on business logic rather than data cleaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Power of Incremental Processing
&lt;/h3&gt;

&lt;p&gt;With thousands of new activities daily, full refreshes became impractical. I implemented incremental processing with a delete+insert strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'(chv_id, report_month)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'delete+insert'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This handled late-arriving data gracefully while maintaining performance. Processing time dropped from hours to minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Game Changer: Custom Data Quality Tests
&lt;/h2&gt;

&lt;p&gt;Early on, I discovered that data quality issues could undermine even the best analytics. Missing dates, negative IDs, and stale data created silent errors. So I built a comprehensive testing framework:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Freshness Monitoring
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- tests/data_freshness.sql&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;stale_records&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'fct_chv_activity'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test ensured we knew within days if field data collection stopped working.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Date Boundary Validation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- tests/date_boundaries.sql&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invalid_dates&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_chw_activity'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt; 
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This caught system date errors before they polluted our metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Negative Value Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- tests/negative_values.sql&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;negative_values&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_chw_activity'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;chv_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;household_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;patient_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This identified application bugs that created invalid identifiers.&lt;/p&gt;

&lt;p&gt;The tests became our early warning system, catching issues before they reached decision-makers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Architecture Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why dbt?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation as Code&lt;/strong&gt;: Every model includes descriptions and tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular SQL&lt;/strong&gt;: Reusable components that team members could understand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Management&lt;/strong&gt;: Automatic handling of model dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control&lt;/strong&gt;: All changes tracked in Git&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Snowflake?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Database Processing&lt;/strong&gt;: Seamlessly query across RAW and analytics databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Scaling&lt;/strong&gt;: Handle large processing jobs without infrastructure changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt;: Safe testing and development environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Implementation Journey
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Week 1: Foundation
&lt;/h3&gt;

&lt;p&gt;I started small with the staging model and basic monthly aggregation. The first version processed just core activities but proved the concept.&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 2: Data Quality
&lt;/h3&gt;

&lt;p&gt;After discovering some data issues in production, I implemented the custom test suite. This built confidence in our metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 3: Incremental Processing
&lt;/h3&gt;

&lt;p&gt;As data volumes grew, I refactored to incremental models. The performance improvement was immediate and dramatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 4: Documentation &amp;amp; Deployment
&lt;/h3&gt;

&lt;p&gt;I used dbt's built-in documentation to create a data dictionary that non-technical stakeholders could understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start with Business Logic
&lt;/h3&gt;

&lt;p&gt;The most valuable part wasn't the technology—it was codifying our month assignment rule. Technology should serve business needs, not dictate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Build Quality In Early
&lt;/h3&gt;

&lt;p&gt;Adding tests from day one prevented data quality debt from accumulating. It's easier to maintain quality than to fix it later.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Documentation is a Feature
&lt;/h3&gt;

&lt;p&gt;The time I spent on dbt docs paid for itself when new team members could understand the data pipeline without hand-holding.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Simple is Scalable
&lt;/h3&gt;

&lt;p&gt;The clean separation between staging and metrics layers meant we could easily add new activity types without refactoring everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Impact
&lt;/h2&gt;

&lt;p&gt;Within a month of deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reporting time&lt;/strong&gt; dropped from 3 days to 15 minutes monthly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality issues&lt;/strong&gt; were caught and fixed before affecting metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Field coordinators&lt;/strong&gt; could access current metrics anytime via the dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CHW performance conversations&lt;/strong&gt; became data-driven rather than anecdotal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One field manager told me: "For the first time, I can see which communities need support and recognize workers who are going above and beyond."&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;The platform continues to evolve. We're now adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictive analytics for CHW performance&lt;/li&gt;
&lt;li&gt;Automated alerting for coverage gaps&lt;/li&gt;
&lt;li&gt;Integration with supply chain data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the foundation remains the same: clean, tested, documented data transformations that serve real business needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Advice for Other Data Practitioners
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand the business context&lt;/strong&gt; before writing a single line of SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest in testing&lt;/strong&gt;: it's cheaper than fixing bad decisions made with bad data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document as you build&lt;/strong&gt;: your future self will thank you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start simple&lt;/strong&gt; and iterate based on user feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Building this platform taught me that the most sophisticated data engineering serves the simplest human needs: helping people understand their work and make better decisions. And that's a technical and professional achievement I'm proud to share.&lt;/p&gt;




&lt;p&gt;The complete codebase is available on &lt;a href="https://github.com/augo-amos/chw-performance-analytics-dbt-project" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Understanding Kafka Lag: Causes and How to Reduce It</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Mon, 10 Nov 2025 11:48:18 +0000</pubDate>
      <link>https://forem.com/augo_amos/understanding-kafka-lag-causes-and-how-to-reduce-it-26cc</link>
      <guid>https://forem.com/augo_amos/understanding-kafka-lag-causes-and-how-to-reduce-it-26cc</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;What is Kafka Lag?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kafka lag, also called &lt;em&gt;consumer lag&lt;/em&gt;, is the delay between the messages produced to a Kafka topic and the messages consumed from it.&lt;br&gt;
More precisely, it is the difference between the latest offset in a partition (the producer side) and the consumer’s committed offset (the last message the consumer has read and acknowledged).&lt;/p&gt;

&lt;p&gt;In simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lag = Log End Offset – Consumer Offset&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5taduldpwdqiidxi0dga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5taduldpwdqiidxi0dga.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Visualizing Kafka Lag in a single partition. The consumer has processed up to offset 3, but producers have already written up to offset 7. The lag, in this case, is 4 messages.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A healthy Kafka system may experience small and temporary lag. However, when lag keeps increasing or remains consistently high, it indicates that consumers are not keeping up with producers. If left unresolved, it can cause delays in analytics, timeouts, or even potential data loss in extreme cases.&lt;/p&gt;
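
&lt;p&gt;To make the definition concrete, here is a minimal sketch that computes per-partition lag with the kafka-python client (the client choice, topic name, and group id are assumptions; any client that exposes end offsets and committed offsets works):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# lag_check.py -- per-partition lag = log end offset - committed offset
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id="analytics-group",
    enable_auto_commit=False,
)

topic = "events"  # hypothetical topic name
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)   # producer side: log end offsets

for tp in partitions:
    committed = consumer.committed(tp) or 0      # consumer side: committed offset
    print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;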


&lt;h3&gt;
  
  
  &lt;strong&gt;Why Kafka Lag Happens&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Lag usually occurs when there is an imbalance between the rate at which messages are produced and the rate at which they are consumed. Several common factors can cause this issue:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Traffic spikes&lt;/strong&gt;&lt;br&gt;
Sudden increases in message volume can overwhelm consumers, especially when they are configured for steady workloads. Consumers will eventually catch up once the load stabilizes, but repeated bursts can lead to persistent lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data skew across partitions&lt;/strong&gt;&lt;br&gt;
If data is unevenly distributed across partitions, certain partitions may receive much more traffic than others. When that happens, some consumers have to process significantly more data, resulting in uneven lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Slow consumer logic&lt;/strong&gt;&lt;br&gt;
Consumer applications may perform heavy processing, database operations, or external API calls. Blocking I/O and long-running tasks can delay how quickly messages are processed and committed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Inefficient configurations&lt;/strong&gt;&lt;br&gt;
Improperly tuned Kafka settings such as small fetch sizes, long polling intervals, or low batch sizes can limit throughput. This is often one of the most overlooked causes of lag in production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Resource limitations&lt;/strong&gt;&lt;br&gt;
When hardware resources such as CPU, memory, or network bandwidth are insufficient, both brokers and consumers experience performance degradation that contributes to lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Frequent rebalances&lt;/strong&gt;&lt;br&gt;
Consumer groups may experience frequent rebalances due to unstable connections, configuration mismatches, or aggressive timeouts. During a rebalance, consumption temporarily stops, which can accumulate lag.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;Detecting and Monitoring Lag&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Monitoring consumer lag is a fundamental part of Kafka operations. Without active monitoring, lag issues can remain hidden until they impact performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using Kafka CLI tools&lt;/strong&gt;&lt;br&gt;
Kafka provides a command-line tool to monitor lag at the consumer group and partition level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-consumer-groups.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; kafka:9092 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group&lt;/span&gt; analytics-group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--describe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command displays information such as the current offset, log end offset, and lag per partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Monitoring platforms&lt;/strong&gt;&lt;br&gt;
Third-party tools like &lt;strong&gt;Sematext&lt;/strong&gt;, &lt;strong&gt;Burrow&lt;/strong&gt;, or open-source exporters for &lt;strong&gt;Prometheus&lt;/strong&gt; can provide real-time lag dashboards and alerts. These platforms help visualize lag trends, identify bottlenecks, and trigger notifications when lag exceeds acceptable thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Key metrics to track&lt;/strong&gt;&lt;br&gt;
The most important metrics to monitor include the consumer offset, log end offset, lag per partition, consumer throughput, and rebalance frequency. Continuous monitoring of these values helps detect performance regressions early.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;How to Reduce or Eliminate Kafka Lag&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have identified that lag is growing, the next step is to diagnose the cause and apply the appropriate fix. The following methods are effective for reducing or eliminating Kafka lag:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Optimize consumer processing logic&lt;/strong&gt;&lt;br&gt;
Analyze your consumer application for performance bottlenecks. Avoid blocking operations such as synchronous I/O and external service calls inside the main consumption loop. Where possible, process messages asynchronously or in batches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tune consumer configurations&lt;/strong&gt;&lt;br&gt;
Kafka performance depends heavily on consumer configuration. Adjust parameters such as &lt;code&gt;fetch.max.bytes&lt;/code&gt;, &lt;code&gt;fetch.min.bytes&lt;/code&gt;, and &lt;code&gt;max.poll.interval.ms&lt;/code&gt; to improve throughput. Larger fetch sizes and batch processing often improve efficiency when dealing with large message volumes.&lt;/p&gt;
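
&lt;p&gt;As an illustration, these broker property names map directly onto kafka-python constructor arguments; the values below are starting points to benchmark, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tuned_consumer.py -- larger fetches and bounded polls; benchmark before adopting
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                       # hypothetical topic
    bootstrap_servers="kafka:9092",
    group_id="analytics-group",
    fetch_min_bytes=1_048_576,      # wait for ~1 MB before answering a fetch
    fetch_max_bytes=52_428_800,     # allow up to ~50 MB per fetch response
    max_poll_records=1000,          # hand the app larger batches per poll()
    max_poll_interval_ms=300_000,   # allow 5 minutes of processing between polls
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;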

&lt;p&gt;&lt;strong&gt;3. Increase partitions to improve parallelism&lt;/strong&gt;&lt;br&gt;
If a topic has too few partitions, it limits how much the workload can be parallelized. Increasing the number of partitions allows more consumers to process data concurrently. Review your partitioning strategy to ensure that messages are evenly distributed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scale consumers&lt;/strong&gt;&lt;br&gt;
Adding more consumer instances in a consumer group can help balance the workload. Each consumer handles one or more partitions, so increasing the number of consumers (up to the number of partitions) helps catch up faster when lag builds up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Manage consumer group rebalances&lt;/strong&gt;&lt;br&gt;
Reduce the frequency and impact of rebalances by using cooperative assignors such as &lt;code&gt;CooperativeStickyAssignor&lt;/code&gt; and by tuning timeout parameters like &lt;code&gt;session.timeout.ms&lt;/code&gt; and &lt;code&gt;heartbeat.interval.ms&lt;/code&gt;. Stable group membership helps maintain consistent consumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Ensure adequate resources&lt;/strong&gt;&lt;br&gt;
Verify that both brokers and consumers have sufficient hardware resources. Check CPU utilization, memory usage, disk throughput, and network latency. Insufficient resources directly slow down data processing and can cause persistent lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Implement buffering or flow control&lt;/strong&gt;&lt;br&gt;
If your consumer depends on slower downstream systems (for example, writing to a database), implement buffering using internal queues or backpressure mechanisms. This prevents the consumer from stalling when external systems are temporarily slow.&lt;/p&gt;
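
&lt;p&gt;One common pattern, sketched below with kafka-python, is to pause fetching while a bounded in-memory queue drains (thresholds and names are illustrative; a separate worker thread is assumed to drain the queue into the downstream system):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# backpressure.py -- pause partitions while a slow downstream catches up
import queue
from kafka import KafkaConsumer

HIGH, LOW = 9_000, 1_000
buffer = queue.Queue(maxsize=10_000)   # drained by a separate worker thread

consumer = KafkaConsumer("events", bootstrap_servers="kafka:9092",
                         group_id="analytics-group")

while True:
    for batch in consumer.poll(timeout_ms=500).values():
        for message in batch:
            buffer.put(message)
    if buffer.qsize() &amp;gt;= HIGH and not consumer.paused():
        consumer.pause(*consumer.assignment())   # stop fetching, stay in the group
    elif buffer.qsize() &amp;lt;= LOW and consumer.paused():
        consumer.resume(*consumer.paused())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;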

&lt;p&gt;&lt;strong&gt;8. Set up alerts and automation&lt;/strong&gt;&lt;br&gt;
Always configure alerts for lag thresholds. Use tools like Prometheus or Sematext to send notifications when lag crosses predefined limits. Automated scaling or throttling strategies can also be implemented to maintain consistent throughput.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Practical Steps for Troubleshooting Lag&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When diagnosing Kafka lag, follow this general process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check lag using &lt;code&gt;kafka-consumer-groups.sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inspect partition distribution to identify skew.&lt;/li&gt;
&lt;li&gt;Review consumer logs for timeouts, rebalances, or processing delays.&lt;/li&gt;
&lt;li&gt;Benchmark consumer throughput and identify bottlenecks.&lt;/li&gt;
&lt;li&gt;Tune consumer configurations and test the impact.&lt;/li&gt;
&lt;li&gt;Add partitions or scale the consumer group as needed.&lt;/li&gt;
&lt;li&gt;Continuously monitor lag metrics to confirm improvement.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kafka lag is a key performance indicator in any real-time data streaming system. Small fluctuations are normal, but persistent lag signals inefficiency in processing or scaling. By combining continuous monitoring, configuration tuning, and scaling strategies, organizations can ensure reliable, low-latency data pipelines capable of supporting analytics, monitoring, and machine learning workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Redpanda Data. &lt;em&gt;Kafka Performance Tuning and Consumer Lag&lt;/em&gt;. Retrieved from &lt;a href="https://www.redpanda.com/guides/kafka-performance-kafka-lag" rel="noopener noreferrer"&gt;https://www.redpanda.com/guides/kafka-performance-kafka-lag&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Sematext. &lt;em&gt;Kafka Consumer Lag, Offsets, and Monitoring&lt;/em&gt;. Retrieved from &lt;a href="https://sematext.com/blog/kafka-consumer-lag-offsets-monitoring/" rel="noopener noreferrer"&gt;https://sematext.com/blog/kafka-consumer-lag-offsets-monitoring/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Groundcover. &lt;em&gt;Kafka Slow Consumer: Causes and Solutions&lt;/em&gt;. Retrieved from &lt;a href="https://www.groundcover.com/blog/kafka-slow-consumer" rel="noopener noreferrer"&gt;https://www.groundcover.com/blog/kafka-slow-consumer&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kafka</category>
      <category>kafkalag</category>
    </item>
    <item>
      <title>Real-Time Crypto Data Pipeline with Change Data Capture (CDC) Using PostgreSQL, Kafka, Cassandra, and Grafana</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Mon, 03 Nov 2025 11:05:56 +0000</pubDate>
      <link>https://forem.com/augo_amos/real-time-crypto-data-pipeline-with-change-data-capture-cdc-using-postgresql-kafka-cassandra-3ip7</link>
      <guid>https://forem.com/augo_amos/real-time-crypto-data-pipeline-with-change-data-capture-cdc-using-postgresql-kafka-cassandra-3ip7</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this project, I built a complete real-time cryptocurrency analytics system from the ground up, capable of ingesting, storing, and visualizing live crypto market data.&lt;/p&gt;

&lt;p&gt;The system collects price and volume data from the Binance Exchange, streams it through Kafka (with Debezium CDC), stores it in Cassandra, and visualizes it live in Grafana.&lt;/p&gt;

&lt;p&gt;This setup simulates a lightweight version of the kind of real-time infrastructure used by trading platforms, financial dashboards, and risk monitoring systems, emphasizing scalability, fault tolerance, and live data analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;System Architecture Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the high-level flow of data through the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Binance API → PostgreSQL → Debezium (CDC) → Kafka → Cassandra → Grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Components Breakdown&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Binance API&lt;/td&gt;
&lt;td&gt;Provides live cryptocurrency data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDC Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Debezium&lt;/td&gt;
&lt;td&gt;Captures real-time changes in PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Message Broker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Kafka&lt;/td&gt;
&lt;td&gt;Streams change events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Cassandra&lt;/td&gt;
&lt;td&gt;Stores processed market data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;Real-time dashboard and analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Design Choice: No Sink Connector&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In a typical Kafka-based architecture, a sink connector is used to automatically write streamed data into a destination like Cassandra.&lt;/p&gt;

&lt;p&gt;However, in this project, I did not use a sink connector. Instead, I wrote a custom consumer script that listens to Kafka topics and inserts data manually into Cassandra; a condensed sketch follows the list of reasons below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why I Chose This Approach&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full Control Over Data Transformation:&lt;/strong&gt;
I could clean, transform, and enrich the messages before writing them into Cassandra.
(For example, filtering noise, flattening JSON payloads, converting timestamps, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Error Handling and Debugging:&lt;/strong&gt;
I could log failed inserts, handle schema mismatches gracefully, and replay specific batches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler to Debug During Development:&lt;/strong&gt;
For educational and experimental purposes, it was easier to see what was flowing through each stage of the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future Extensibility:&lt;/strong&gt;
This approach lets me extend the consumer to perform &lt;strong&gt;real-time computations or alerts&lt;/strong&gt; before persisting the data.&lt;/li&gt;
&lt;/ol&gt;
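
&lt;p&gt;As referenced above, here is a condensed sketch of such a consumer. It assumes kafka-python, the DataStax cassandra-driver, and a Debezium JSON envelope whose new row state sits under &lt;code&gt;payload.after&lt;/code&gt;; the topic, table, and column names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# cdc_consumer.py -- reads Debezium change events and writes rows to Cassandra
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "crypto.public.latest_prices",   # illustrative Debezium topic name
    bootstrap_servers="kafka:9092",
    group_id="cassandra-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

session = Cluster(["cassandra"]).connect("binance_keyspace")
insert = session.prepare(
    "INSERT INTO latest_prices (symbol, price, updated_at) VALUES (?, ?, ?)"
)

for message in consumer:
    event = message.value or {}
    row = (event.get("payload") or {}).get("after")  # None for deletes/tombstones
    if row:
        # timestamp conversion and validation omitted for brevity
        session.execute(insert, (row["symbol"], row["price"], row["updated_at"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;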




&lt;h2&gt;
  
  
  &lt;strong&gt;How Data Flows Through the System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Binance WebSocket Feeds&lt;/strong&gt; continuously push ticker, order book, and kline (candlestick) updates.&lt;br&gt;
These updates are first written into &lt;strong&gt;PostgreSQL&lt;/strong&gt; tables for structured storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60erldk8s2xlw44345xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60erldk8s2xlw44345xx.png" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;em&gt;Producer inserting data into Postgres&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debezium&lt;/strong&gt; monitors PostgreSQL using Change Data Capture (CDC) and streams every row change into &lt;strong&gt;Kafka topics&lt;/strong&gt;.&lt;/p&gt;
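
&lt;p&gt;Registering the connector is a single call to the Kafka Connect REST API. A trimmed, illustrative configuration (host names, credentials, and the table list are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# register_connector.py -- registers an illustrative Debezium Postgres connector
import requests

connector = {
    "name": "crypto-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "crypto",
        "topic.prefix": "crypto",
        "table.include.list": "public.latest_prices,public.klines",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;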

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqc0tng2i8413ehyd4xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqc0tng2i8413ehyd4xi.png" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;em&gt;Kafka topics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;Kafka consumer&lt;/strong&gt; (a Python script) reads from those topics and writes structured data into &lt;strong&gt;Cassandra&lt;/strong&gt; tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffq32uqrffz5j56l7p5b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffq32uqrffz5j56l7p5b8.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;em&gt;Consumer streaming changes into Cassandra&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0c6in6sefm108sax2pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0c6in6sefm108sax2pu.png" alt=" " width="800" height="267"&gt;&lt;/a&gt;&lt;em&gt;Data in a Cassandra table&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;Grafana&lt;/strong&gt; queries either a lightweight Flask API (via the Infinity plugin) or Cassandra directly to visualize the live data.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Database Design&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Cassandra Keyspace: &lt;code&gt;binance_keyspace&lt;/code&gt;&lt;/strong&gt;
&lt;/h3&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;1. &lt;code&gt;crypto_24h_stats&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Stores 24-hour performance metrics for each asset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;crypto_24h_stats&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price_change_percent&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;high_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;low_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quote_volume&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;2. &lt;code&gt;latest_prices&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Holds the latest streaming prices for every cryptocurrency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;latest_prices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;CLUSTERING&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;3. &lt;code&gt;klines&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Stores candlestick (OHLCV) data for financial charting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;klines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_time&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;high_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;low_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;close_price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;number_of_trades&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;open_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;4. &lt;code&gt;order_book&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Captures market depth snapshots, showing the distribution of buy (bids) and sell (asks) orders for each asset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;order_book&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bids&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;asks&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;CLUSTERING&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  &lt;strong&gt;5. &lt;code&gt;recent_trades&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Logs the most recent individual trades for each symbol, including their price, quantity, and timestamps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;recent_trades&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trade_id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trade_time&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_buyer_maker&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;trade_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Grafana Integration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Grafana was connected in &lt;strong&gt;two ways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cassandra Data Source Plugin&lt;/strong&gt; – for direct database queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinity Plugin (HTTP JSON)&lt;/strong&gt; – for HTTP queries against Flask API endpoints I experimented with (a sketch follows below).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This dual approach allowed me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve pre-aggregated data through the API as a backup connection.&lt;/li&gt;
&lt;li&gt;Query Cassandra directly for time-series or historical analysis.&lt;/li&gt;
&lt;/ul&gt;
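
&lt;p&gt;As a rough illustration of the second option, here is a minimal Flask endpoint that the Infinity plugin could scrape. It is a sketch, not the project's exact code: the route, keyspace, and contact point are assumptions, and it sorts in Python because Cassandra cannot &lt;code&gt;ORDER BY&lt;/code&gt; a non-clustering column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of a Flask endpoint serving pre-aggregated JSON to Grafana's
# Infinity plugin. Route, keyspace, and host names are illustrative assumptions.
from flask import Flask, jsonify
from cassandra.cluster import Cluster

app = Flask(__name__)
session = Cluster(["127.0.0.1"]).connect("crypto")  # keyspace name assumed

@app.route("/api/top-volume")
def top_volume():
    rows = session.execute("SELECT symbol, volume FROM crypto_24h_stats")
    # Sort and trim in Python, since Cassandra cannot ORDER BY volume.
    top = sorted(rows, key=lambda r: r.volume, reverse=True)[:10]
    return jsonify([{"symbol": r.symbol, "volume": r.volume} for r in top])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;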




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Visualizations&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Top 10 Most Traded Cryptocurrencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Displays the ten assets with the highest trading volume over the last 24 hours; the sorting and top-10 limit are applied in the Grafana panel, since Cassandra cannot sort by a non-clustering column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;crypto_24h_stats&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Visualization:&lt;/strong&gt; Bar Chart&lt;br&gt;
&lt;strong&gt;Insight:&lt;/strong&gt; Quickly identifies the most active markets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example Output:&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbns76lbr0fhu4x2xbr8.png" alt=" " width="800" height="253"&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Trade Count Overview (ETHUSDT)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Shows the number of trades executed for ETHUSDT over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;open_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;number_of_trades&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;klines&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ETHUSDT'&lt;/span&gt; 
&lt;span class="n"&gt;ALLOW&lt;/span&gt; &lt;span class="n"&gt;FILTERING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Visualization:&lt;/strong&gt; Time Series / Line Chart&lt;br&gt;
&lt;strong&gt;Insight:&lt;/strong&gt; Useful for spotting spikes in market activity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymo6gsgytwein93046n6.png" alt=" " width="800" height="252"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Advantages of This Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Modularity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each component can scale independently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every transformation stage (PostgreSQL → Kafka → Cassandra) is observable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The manual consumer allows you to filter or compute before persisting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports both direct and API-based data sources.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resilience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cassandra ensures high availability and fast queries for time-series data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges Faced &amp;amp; How I Solved Them&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;ORDER BY&lt;/code&gt; not supported for non-clustering columns in Cassandra&lt;/td&gt;
&lt;td&gt;Redesigned tables with clustering keys (&lt;code&gt;updated_at&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON array fields (order book) hard to query&lt;/td&gt;
&lt;td&gt;Flattened the data during Kafka consumer insertion (see the sketch after this table)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana timeouts using localhost&lt;/td&gt;
&lt;td&gt;Switched to &lt;code&gt;host.docker.internal&lt;/code&gt; for API access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
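
&lt;p&gt;To make the flattening fix concrete, here is a minimal sketch of the manual consumer's insert path, assuming &lt;code&gt;kafka-python&lt;/code&gt; and the DataStax &lt;code&gt;cassandra-driver&lt;/code&gt;. The topic, broker, and keyspace names are placeholders rather than the project's actual configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the manual consumer's insert path. Topic, broker,
# and keyspace names are placeholders, not the project's actual config.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer          # kafka-python
from cassandra.cluster import Cluster    # DataStax driver

consumer = KafkaConsumer(
    "order_book_topic",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

session = Cluster(["127.0.0.1"]).connect("crypto")  # keyspace name assumed
insert = session.prepare(
    "INSERT INTO order_book (symbol, bids, asks, updated_at) VALUES (?, ?, ?, ?)"
)

for message in consumer:
    # Unwrap the Debezium CDC envelope if present; fall back to the raw value.
    event = message.value.get("payload", {}).get("after") or message.value
    # Serialize the nested [[price, qty], ...] arrays into flat JSON strings
    # that fit the table's text columns.
    bids = json.dumps(event.get("bids", []))
    asks = json.dumps(event.get("asks", []))
    session.execute(insert, (event["symbol"], bids, asks,
                             datetime.now(timezone.utc)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;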




&lt;h2&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time data ingestion from Binance API&lt;/li&gt;
&lt;li&gt;Fully automated CDC from PostgreSQL to Cassandra&lt;/li&gt;
&lt;li&gt;Dynamic dashboards for price, volume, and trade count analytics&lt;/li&gt;
&lt;li&gt;Live financial-style visualizations similar to exchange UIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final Grafana dashboard effectively shows market volatility, trading activity, and price trends, all in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Future Enhancements&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use Kafka Streams for real-time computations before Cassandra.&lt;/li&gt;
&lt;li&gt;Add alerting rules in Grafana (e.g., trigger if ETHUSDT price drops 5%).&lt;/li&gt;
&lt;li&gt;Introduce machine learning models for predictive analytics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how open-source tools can be combined to build a real-time financial data analytics system with high performance and flexibility.&lt;/p&gt;

&lt;p&gt;By avoiding a sink connector, I gained greater transparency, data control, and adaptability, all while maintaining a clean and observable data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/augo-amos/real-time-crypto-data-pipeline-with-change-data-capture-using-postgresql-kafka-cassandra-and-grafana" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>changedatacapture</category>
      <category>kafka</category>
      <category>cassandra</category>
    </item>
    <item>
      <title>Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Mon, 13 Oct 2025 12:38:23 +0000</pubDate>
      <link>https://forem.com/augo_amos/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-241n</link>
      <guid>https://forem.com/augo_amos/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-241n</guid>
      <description>&lt;h3&gt;
  
  
  1. Introduction
&lt;/h3&gt;

&lt;p&gt;Data engineers today face numerous challenges: environment inconsistencies between development and production, dependency conflicts when different projects require different library versions, and scaling issues as data volumes grow. Containerization solves these problems by packaging applications and their dependencies into isolated, portable units. In this guide, you'll learn how to use Docker and Docker Compose to build reproducible data engineering environments that run consistently anywhere. This practical guide is designed for data engineers, analysts, and developers who want to automate and scale their data pipelines efficiently.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Understanding Containerization in Data Engineering
&lt;/h3&gt;

&lt;p&gt;Containerization is a lightweight virtualization technology that packages an application with all its dependencies (libraries, system tools, code, and runtime) into a single, self-contained unit called a container. Unlike virtual machines, which require a full operating system for each instance, containers share the host system's kernel, making them faster to start and more resource-efficient.&lt;/p&gt;

&lt;p&gt;For data engineering, this technology offers significant benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt;: Ensure pipelines run identically across different environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt;: Move containers seamlessly between local machines, cloud platforms, and on-premises servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Quickly scale services up or down based on workload demands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Share standardized environments across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical components in a data engineering stack that benefit from containerization include ETL scripts, databases (PostgreSQL, MySQL), schedulers (Airflow), processing engines (Spark), and visualization tools (Grafana).&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Key Docker Concepts You Need to Know
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljnp57r6wdbw7lgcbvv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljnp57r6wdbw7lgcbvv9.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker Images &amp;amp; Containers&lt;/strong&gt;: An image is a read-only template with instructions for creating a container, while a container is a runnable instance of an image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dockerfile&lt;/strong&gt;: A text document containing all commands to build a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Hub&lt;/strong&gt;: A registry of Docker images where you can pull base images for common applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volumes &amp;amp; Networks&lt;/strong&gt;: Volumes persist data beyond container lifecycles, while networks enable secure communication between containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker CLI Basics&lt;/strong&gt;: Essential commands include &lt;code&gt;docker build&lt;/code&gt; (create images), &lt;code&gt;docker run&lt;/code&gt; (start containers), &lt;code&gt;docker ps&lt;/code&gt; (list running containers), and &lt;code&gt;docker stop&lt;/code&gt; (halt containers).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Setting Up a Data Engineering Environment
&lt;/h3&gt;

&lt;p&gt;Let's build a simple pipeline environment with Airflow for orchestration, PostgreSQL for data storage, and Python scripts for data transformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Folder Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── docker-compose.yml
├── airflow/
│   ├── Dockerfile
│   └── dags/
├── postgres/
│   └── init.sql
├── scripts/
│   └── data_transformation.py
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-Step Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Containerize Airflow&lt;/strong&gt;: Create a custom Dockerfile to extend the official Airflow image and install additional Python packages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure PostgreSQL&lt;/strong&gt;: Use the official PostgreSQL image with initialization scripts for database schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package Transformation Scripts&lt;/strong&gt;: Build a custom image for data processing tasks with required dependencies (a minimal script sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define Dependencies&lt;/strong&gt;: List Python packages in requirements.txt for consistent installation.&lt;/li&gt;
&lt;/ol&gt;
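
&lt;p&gt;As a starting point for step 3, a minimal &lt;code&gt;scripts/data_transformation.py&lt;/code&gt; might look like the sketch below. The source and target table names are illustrative assumptions; the connection string reuses the credentials from the compose example that follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# scripts/data_transformation.py - minimal transformation sketch.
# Table names (raw_events, clean_events) are illustrative assumptions;
# the connection string matches the compose file's postgres service.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://airflow:airflow@postgres:5432/airflow")

def transform():
    """Read raw rows, derive a simple feature, and write the result back."""
    raw = pd.read_sql("SELECT * FROM raw_events", engine)
    raw["processed_at"] = pd.Timestamp.now(tz="UTC")
    raw.to_sql("clean_events", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    transform()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;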




&lt;h3&gt;
  
  
  5. Using Docker Compose to Orchestrate Services
&lt;/h3&gt;

&lt;p&gt;Docker Compose simplifies multi-container setups by allowing you to define and manage them in a single YAML file. Instead of starting each container manually with complex docker run commands, you can start all services with one command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example docker-compose.yml:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;

  &lt;span class="na"&gt;airflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./airflow&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./airflow/dags:/opt/airflow/dags&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Essential Commands:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker compose up -d&lt;/code&gt;: Start all services in detached mode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker compose ps&lt;/code&gt;: Check status of running services&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker compose down&lt;/code&gt;: Stop and remove all containers&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  6. Practical Example: Containerizing a Mini Data Pipeline
&lt;/h3&gt;

&lt;p&gt;Let's implement a complete pipeline that ingests, transforms, stores, and visualizes data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2fupk18gjzar1gu6el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2fupk18gjzar1gu6el.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;: Ingest → Transform → Store → Visualize&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Extract&lt;/strong&gt; - A Python script pulls mock data from an API or generates synthetic data (a sketch combining this with Step 3 follows below).&lt;br&gt;
&lt;strong&gt;Step 2: Transform&lt;/strong&gt; - Use PySpark or pandas within a container to clean and process the data.&lt;br&gt;
&lt;strong&gt;Step 3: Load&lt;/strong&gt; - Load transformed data into PostgreSQL using appropriate connectors.&lt;br&gt;
&lt;strong&gt;Step 4: Orchestrate&lt;/strong&gt; - Schedule the pipeline tasks using Airflow DAGs.&lt;br&gt;
&lt;strong&gt;Step 5: Monitor&lt;/strong&gt; - Connect Grafana to PostgreSQL to create dashboards for data visualization.&lt;/p&gt;
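
&lt;p&gt;Here is one way Steps 1 and 3 could look in a single script. This is a sketch only: synthetic data stands in for the API, the table name is illustrative, and the connection string borrows the credentials from the compose file in the appendix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of Steps 1 and 3: generate synthetic readings, then load them
# into PostgreSQL. Table name and connection string are illustrative.
import random
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine

def extract(n=100):
    """Step 1: stand-in for an API call; returns n synthetic readings."""
    return pd.DataFrame({
        "reading": [random.gauss(20.0, 2.0) for _ in range(n)],
        "created_at": [datetime.now(timezone.utc)] * n,
    })

def load(df):
    """Step 3: write the (already transformed) frame into PostgreSQL."""
    # Host "postgres" resolves inside the compose network.
    engine = create_engine("postgresql://user:password@postgres:5432/mydatabase")
    df.to_sql("sensor_readings", engine, if_exists="append", index=False)

if __name__ == "__main__":
    load(extract())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;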

&lt;p&gt;&lt;strong&gt;Docker Compose Integration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... previous services&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GF_SECURITY_ADMIN_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  7. Common Pitfalls &amp;amp; Best Practices
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfalls to Avoid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storing credentials directly in Dockerfiles or compose files (use .env files instead)&lt;/li&gt;
&lt;li&gt;Forgetting volume mounts, leading to data loss when containers restart&lt;/li&gt;
&lt;li&gt;Ignoring container logs and resource usage, making debugging difficult&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use lightweight base images (python:3.9-slim instead of python:3.9) to reduce image size&lt;/li&gt;
&lt;li&gt;Version your images (e.g., myapp:v1.2) for better traceability&lt;/li&gt;
&lt;li&gt;Document your Compose services with comments in the YAML file&lt;/li&gt;
&lt;li&gt;Follow the single-responsibility principle: each container should have one specific purpose&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  8. Scaling &amp;amp; Extending the Setup
&lt;/h3&gt;

&lt;p&gt;As your data needs grow, Docker Compose makes scaling straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale services: &lt;code&gt;docker compose up --scale spark-worker=3&lt;/code&gt; to add multiple Spark workers&lt;/li&gt;
&lt;li&gt;Production deployment: Consider Kubernetes or AWS ECS for orchestration at scale&lt;/li&gt;
&lt;li&gt;CI/CD integration: Automate image building and testing in your deployment pipeline&lt;/li&gt;
&lt;li&gt;Enhanced monitoring: Integrate Prometheus for metrics collection and Loki for log aggregation&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  9. Conclusion
&lt;/h3&gt;

&lt;p&gt;Containerization fundamentally simplifies data engineering by providing consistent, reproducible environments that eliminate "it works on my machine" problems. Docker and Docker Compose offer practical tools to build, share, and scale data pipelines efficiently. &lt;/p&gt;




&lt;h3&gt;
  
  
  10. Appendix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/" rel="noopener noreferrer"&gt;Official Docker Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/" rel="noopener noreferrer"&gt;Airflow Docker Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/example/data-engineering-docker" rel="noopener noreferrer"&gt;Example Data Engineering Project with Docker&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Full Working Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bitnami/spark:3.5.0&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SPARK_MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7077:7077"&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mydatabase&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db_data:/var/lib/postgresql/data&lt;/span&gt;

  &lt;span class="na"&gt;data_ingestion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./data_ingestion_service&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://user:password@postgres:5432/mydatabase&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>docker</category>
    </item>
    <item>
      <title>I Built a Real-Time Analytics Platform to Track MrBeast’s YouTube Channel</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Fri, 10 Oct 2025 13:10:35 +0000</pubDate>
      <link>https://forem.com/augo_amos/building-a-real-time-youtube-analytics-platform-with-airflow-postgresql-grafana-1k12</link>
      <guid>https://forem.com/augo_amos/building-a-real-time-youtube-analytics-platform-with-airflow-postgresql-grafana-1k12</guid>
      <description>&lt;h2&gt;
  
  
  How I Automated MrBeast's Channel Performance Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50pzr12yzfczzmbs4sy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50pzr12yzfczzmbs4sy8.png" alt=" " width="431" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the competitive world of YouTube content creation, data-driven decisions separate successful channels from the rest. As a data engineer and YouTube enthusiast, I built an automated analytics platform that transforms raw YouTube API data into actionable business intelligence. Here's how I created a real-time monitoring system for one of YouTube's largest channels - MrBeast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: From Raw Data to Actionable Insights
&lt;/h2&gt;

&lt;p&gt;YouTube Studio provides basic analytics, but content creators face several limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Historical data limitations&lt;/strong&gt; - Only 90 days of detailed analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual reporting&lt;/strong&gt; - No automated daily snapshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited correlation analysis&lt;/strong&gt; - Hard to connect publishing patterns with performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No custom alerts&lt;/strong&gt; - Can't set thresholds for engagement drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My solution: An automated pipeline that captures channel metrics daily, transforms them into analytical features, and presents them in an interactive Grafana dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;YouTube API → Airflow → PySpark → PostgreSQL → Grafana&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline runs entirely on Docker containers, making it portable and easy to deploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Containerized Services:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Time-series data storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt;: Pipeline orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt;: Visualization and dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PySpark&lt;/strong&gt;: Data transformation engine&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Extraction: Taming the YouTube API
&lt;/h2&gt;

&lt;p&gt;The extraction process handles YouTube's API limitations while capturing comprehensive channel data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetching channel info...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_channel_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;video_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_all_video_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CHANNEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;videos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_videos_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;save_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;videos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAW_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videos_raw.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key challenges solved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt;: Implemented strategic delays between API calls (sketched after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination&lt;/strong&gt;: Handles channels with thousands of videos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data freshness&lt;/strong&gt;: Daily snapshots capture metric evolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling&lt;/strong&gt;: Continues processing even if individual videos fail&lt;/li&gt;
&lt;/ul&gt;
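
&lt;p&gt;For the rate-limiting point, the delay logic can be as simple as exponential backoff around each request. A minimal sketch, assuming &lt;code&gt;google-api-python-client&lt;/code&gt;; the helper name and delay values are illustrative, not the project's exact code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal backoff sketch around a YouTube API request object; the helper
# name and retry/delay values are illustrative assumptions.
import time
from googleapiclient.errors import HttpError

def fetch_with_backoff(request, max_retries=5):
    """Execute a YouTube API request, backing off on quota/rate errors."""
    for attempt in range(max_retries):
        try:
            return request.execute()
        except HttpError as err:
            if err.resp.status in (403, 429, 500, 503):
                time.sleep(2 ** attempt)  # exponential delay between calls
            else:
                raise
    raise RuntimeError("Request failed after retries")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;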

&lt;h2&gt;
  
  
  Data Transformation: From Raw JSON to Analytical Features
&lt;/h2&gt;

&lt;p&gt;The PySpark transformation script enriches raw data with business-critical features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_feat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_cast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;views&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;views&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;publish_hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generated features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;engagement_rate&lt;/code&gt;: (Likes + Comments) / Views&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;publish_hour&lt;/code&gt;: Best times to publish&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;publish_day&lt;/code&gt;: Optimal days of week&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;published_ts&lt;/code&gt;: Standardized timestamps for time-series analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Orchestration: Airflow for Reliable Automation
&lt;/h2&gt;

&lt;p&gt;The DAG ensures daily execution with proper dependency management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;youtube_channel_pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;youtube&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_youtube&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform_pyspark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flboyn77du7qqm4ekham0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flboyn77du7qqm4ekham0.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualization: Grafana Dashboards for Instant Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Channel Health Gauge Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f15dqyfyb9nruabnru1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f15dqyfyb9nruabnru1.png" alt=" " width="518" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gauge dashboard provides an at-a-glance view of channel vitals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average Engagement Rate&lt;/strong&gt;: 2.5% (industry benchmark comparison)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Consistency&lt;/strong&gt;: 20 videos/month tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth Metrics&lt;/strong&gt;: Real-time subscriber and view counts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Top Performing Videos Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv679rj3tsz2t210udbkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv679rj3tsz2t210udbkp.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The horizontal bar chart reveals content performance patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Would You Fly to Paris for a Baguette?" - 1.6B views&lt;/li&gt;
&lt;li&gt;"50 YouTubers Fight For $1,000,000" - High engagement&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Channel Statistics Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ki42vwy8cvcy4wfxnjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ki42vwy8cvcy4wfxnjf.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-time business intelligence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;444 Million&lt;/strong&gt; subscribers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;97.3 Billion&lt;/strong&gt; total views
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;907&lt;/strong&gt; videos in library&lt;/li&gt;
&lt;li&gt;Daily growth tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Docker Compose Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_USER}&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:9.0.0&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./docker&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Database Schema Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;videos_processed table:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;video_id&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;published_ts&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;views&lt;/code&gt;, &lt;code&gt;likes&lt;/code&gt;, &lt;code&gt;comments&lt;/code&gt;, &lt;code&gt;engagement_rate&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;publish_hour&lt;/code&gt;, &lt;code&gt;publish_day&lt;/code&gt; (analytical dimensions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;channel_stats table:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-series snapshot of channel growth&lt;/li&gt;
&lt;li&gt;Daily subscriber, view, and video counts&lt;/li&gt;
&lt;/ul&gt;
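
&lt;p&gt;For reference, the two tables above could be created with DDL along these lines. This is a sketch issued from Python: the column types and the connection string are assumptions inferred from the bullet lists, not the project's exact schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the schema DDL issued from Python via psycopg2; column types
# and the connection string are assumptions inferred from the lists above.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS videos_processed (
    video_id        TEXT PRIMARY KEY,
    title           TEXT,
    published_ts    TIMESTAMP,
    views           BIGINT,
    likes           BIGINT,
    comments        BIGINT,
    engagement_rate DOUBLE PRECISION,
    publish_hour    INT,
    publish_day     TEXT
);
CREATE TABLE IF NOT EXISTS channel_stats (
    snapshot_date DATE PRIMARY KEY,
    subscribers   BIGINT,
    total_views   BIGINT,
    video_count   INT
);
"""

with psycopg2.connect("postgresql://user:password@localhost:5432/youtube") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;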

&lt;h3&gt;
  
  
  Business Value Delivered
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30% faster&lt;/strong&gt; content strategy decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated&lt;/strong&gt; daily performance reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive insights&lt;/strong&gt; for video performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time&lt;/strong&gt; alerting for metric anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Insights Uncovered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Publishing Strategy Optimization
&lt;/h3&gt;

&lt;p&gt;The data reveals MrBeast's winning formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prime Time&lt;/strong&gt;: 4:00 PM releases consistently outperform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekend Advantage&lt;/strong&gt;: Friday and Saturday releases gain 25% more initial engagement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: 20+ videos monthly maintains audience retention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Engagement Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ideal Engagement Rate&lt;/strong&gt;: 2.5-3.5% for viral content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment-to-Like Ratio&lt;/strong&gt;: High-value discussions indicate strong community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Lifespan&lt;/strong&gt;: Videos continue gaining views for 45+ days&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This YouTube analytics platform demonstrates how modern data engineering tools can transform raw API data into strategic business intelligence. By combining Airflow for orchestration, PySpark for transformation, PostgreSQL for storage, and Grafana for visualization, we've created a scalable system that provides real-time insights for content strategy optimization.&lt;/p&gt;

&lt;p&gt;The pipeline currently processes MrBeast's channel data, but the architecture can be extended to monitor multiple channels, compare performance benchmarks, and provide content creators with the data-driven insights needed to thrive in the competitive YouTube ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/augo-amos/real-time-youtube-analytics-platform-with-airflow-postgresql-grafana" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>grafana</category>
      <category>docker</category>
      <category>postgres</category>
      <category>airflow</category>
    </item>
    <item>
      <title>A Beginner’s Guide to Big Data Analytics with Apache Spark and PySpark</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Sun, 28 Sep 2025 23:17:01 +0000</pubDate>
      <link>https://forem.com/augo_amos/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-3889</link>
      <guid>https://forem.com/augo_amos/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-3889</guid>
      <description>&lt;h2&gt;
  
  
  Why Big Data Matters
&lt;/h2&gt;

&lt;p&gt;An enormous amount of data is generated every second, from online and social media interactions to IoT devices and business operations. This volume is often too much for traditional data processing tools to handle, which is where Apache Spark comes in. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; is a powerful open-source engine for large-scale data processing, capable of handling datasets that are too large for a single computer. When combined with &lt;strong&gt;PySpark&lt;/strong&gt; (Spark's Python API), it becomes a powerful tool for data analysts and scientists to work with large datasets using familiar Python syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Spark?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Big Data Problem
&lt;/h3&gt;

&lt;p&gt;Imagine trying to analyze:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 years of sales data from a multinational corporation&lt;/li&gt;
&lt;li&gt;Real-time sensor data from thousands of IoT devices&lt;/li&gt;
&lt;li&gt;Social media feeds with millions of posts daily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional tools like Excel or basic Python scripts would either crash or take forever to crunch this data. Spark saves the day by distributing the work across multiple computers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Spark Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Up to 100x faster than Hadoop for certain operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of Use&lt;/strong&gt;: Simple APIs in Python, Scala, Java, and R&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versatility&lt;/strong&gt;: Supports SQL, streaming, machine learning, and graph processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Tolerance&lt;/strong&gt;: Automatically recovers from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting Up Your Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download Spark&lt;/span&gt;
wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;# Extract and set up&lt;/span&gt;
&lt;span class="nb"&gt;tar &lt;/span&gt;xvf spark-4.0.1-bin-hadoop3.tgz
&lt;span class="nb"&gt;mv &lt;/span&gt;spark-4.0.1-bin-hadoop3 spark

&lt;span class="c"&gt;# Install PySpark&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyspark findspark jupyter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start a Spark Session
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MyFirstSparkApp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Test with simple data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Core Concepts of Spark's Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Distributed Computing
&lt;/h3&gt;

&lt;p&gt;Spark works by splitting data into partitions and processing them in parallel across multiple nodes. This is like assigning a task to a team of workers instead of having one person do all the hard work.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Resilient Distributed Datasets (RDDs)
&lt;/h3&gt;

&lt;p&gt;An RDD is the fundamental data structure in Spark. RDDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable&lt;/strong&gt;: Cannot be changed, only transformed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed&lt;/strong&gt;: Spread across multiple nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault-tolerant&lt;/strong&gt;: Can recover from node failures&lt;/li&gt;
&lt;/ul&gt;
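
&lt;p&gt;A quick, minimal example of these properties, using the &lt;code&gt;spark&lt;/code&gt; session created earlier (transformations are lazy; only actions trigger computation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A tiny RDD demo: map() is a lazy transformation; collect() is the
# action that actually triggers distributed computation.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)  # nothing runs yet
print(squares.collect())            # triggers the job: [1, 4, 9, 16, 25]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;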

&lt;h3&gt;
  
  
  3. DataFrames
&lt;/h3&gt;

&lt;p&gt;DataFrames are a higher-level abstraction built on RDDs that provides a structured interface similar to pandas DataFrames or SQL tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating a DataFrame from a CSV
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Similar to pandas, but distributed!
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hands-On Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example 1: Basic Data Analysis
&lt;/h3&gt;

&lt;p&gt;Let's analyze restaurant order data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurant_orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Explore data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Simple aggregations
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;# Total revenue by food category
&lt;/span&gt;&lt;span class="n"&gt;revenue_by_category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Food Category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;revenue_by_category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 2: SQL-like Operations
&lt;/h3&gt;

&lt;p&gt;Spark SQL lets you run standard SQL queries against distributed data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Register DataFrame as SQL table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run SQL queries
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT `Food Category`, 
           AVG(Amount) as avg_order_value,
           COUNT(*) as order_count
    FROM orders 
    GROUP BY `Food Category`
    HAVING order_count &amp;gt; 100
    ORDER BY avg_order_value DESC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 3: Data Cleaning and Transformation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Handle missing values
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Customer ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Create new features
&lt;/span&gt;&lt;span class="n"&gt;df_cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Size_Category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter and transform
&lt;/span&gt;&lt;span class="n"&gt;large_orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Size_Category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount_With_Tax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Customer Analytics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Customer segmentation
&lt;/span&gt;&lt;span class="n"&gt;customer_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_spent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_since_last_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Identify VIP customers
&lt;/span&gt;&lt;span class="n"&gt;vip_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customer_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_spent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; 
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_since_last_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Time Series Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Daily revenue trends
&lt;/span&gt;&lt;span class="n"&gt;daily_revenue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="nf"&gt;dayofmonth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Machine Learning Preparation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.feature&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAssembler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringIndexer&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare features for ML
&lt;/span&gt;&lt;span class="n"&gt;indexer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StringIndexer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Food Category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Category_Index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_indexed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;assembler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorAssembler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputCols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Category_Index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ml_ready_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assembler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_indexed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
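
&lt;p&gt;From here the assembled vectors can feed any Spark ML estimator. As a purely illustrative next step (the column names follow the example above; &lt;code&gt;k=3&lt;/code&gt; is an arbitrary choice), here is a minimal sketch that fits a k-means model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.ml.clustering import KMeans

# Cluster orders into 3 groups based on the assembled feature vector
kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(ml_ready_data)

# Attach a cluster label ("prediction" column) to every order
clustered = model.transform(ml_ready_data)
clustered.select("features", "prediction").show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;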



&lt;h2&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Data Type Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: String columns used in numeric operations&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Always check and convert data types&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Check types first
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;double&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Column Name Confusion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Spaces in column names cause errors&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Use backticks or rename columns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Correct way to handle spaces
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# or
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;`Order Date`&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;`Total Amount`&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Topics to Explore Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Data&lt;/strong&gt; – process real-time data from Kafka, Kinesis, or TCP sockets (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning&lt;/strong&gt; – build distributed ML pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Processing&lt;/strong&gt; – analyze relationships with GraphFrames&lt;/li&gt;
&lt;/ol&gt;
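
&lt;p&gt;To give a flavor of the first item, below is a minimal Structured Streaming sketch. It reuses the &lt;code&gt;spark&lt;/code&gt; session and the schema of &lt;code&gt;df&lt;/code&gt; from earlier; the &lt;code&gt;incoming_orders/&lt;/code&gt; directory is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Watch a directory for new CSV files and treat them as a stream
stream_df = spark.readStream \
    .option("header", True) \
    .schema(df.schema) \
    .csv("incoming_orders/")

# Running count of orders per food category, updated as files arrive
category_counts = stream_df.groupBy("Food Category").count()

# Write the full updated result table to the console on each trigger
query = category_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;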

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Spark with PySpark makes big data analytics more accessible, allowing Python developers to harness powerful distributed computing. Although there is a learning curve, the capability to efficiently process massive datasets creates remarkable opportunities for insight and innovation.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>pyspark</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building a Real-Time Binance Data Pipeline with Kafka and PostgreSQL</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Sun, 28 Sep 2025 23:00:04 +0000</pubDate>
      <link>https://forem.com/augo_amos/building-a-real-time-binance-data-pipeline-with-kafka-and-postgresql-5bj8</link>
      <guid>https://forem.com/augo_amos/building-a-real-time-binance-data-pipeline-with-kafka-and-postgresql-5bj8</guid>
      <description>&lt;p&gt;This project demonstrates a simple &lt;strong&gt;real-time data pipeline&lt;/strong&gt; that streams live cryptocurrency prices from the &lt;strong&gt;Binance API&lt;/strong&gt;, publishes them to a &lt;strong&gt;Kafka topic&lt;/strong&gt; (hosted on Confluent), consumes them with a &lt;strong&gt;Kafka consumer&lt;/strong&gt;, and stores the results into a &lt;strong&gt;PostgreSQL database&lt;/strong&gt; (hosted on Aiven).&lt;/p&gt;

&lt;p&gt;It’s a hands-on learning project for integrating streaming platforms with databases, ideal for practicing &lt;strong&gt;Data Engineering fundamentals&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Producer (&lt;code&gt;kafka-producer.py&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connects to the Binance API.&lt;/li&gt;
&lt;li&gt;Publishes live price data to a Kafka topic (&lt;code&gt;binance&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
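
&lt;p&gt;For orientation, here is a simplified sketch of what the producer loop could look like. It assumes the public Binance 24h ticker REST endpoint and the environment variables defined in the setup section below; the actual &lt;code&gt;kafka-producer.py&lt;/code&gt; in the repo may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import time

import requests
from confluent_kafka import Producer
from dotenv import load_dotenv

load_dotenv()

producer = Producer({
    "bootstrap.servers": os.getenv("BOOTSTRAP_SERVERS"),
    "security.protocol": os.getenv("SECURITY_PROTOCOL"),
    "sasl.mechanisms": os.getenv("SASL_MECHANISM"),
    "sasl.username": os.getenv("SASL_USERNAME"),
    "sasl.password": os.getenv("SASL_PASSWORD"),
})

# Public Binance 24h ticker endpoint for a single symbol
URL = "https://api.binance.com/api/v3/ticker/24hr?symbol=BTCUSDT"

while True:
    ticker = requests.get(URL, timeout=10).json()
    producer.produce(os.getenv("TOPIC_NAME"),
                     value=json.dumps(ticker).encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(5)      # simple fixed polling interval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;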

&lt;p&gt;&lt;strong&gt;Consumer (&lt;code&gt;kafka-consumer.py&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscribes to the Kafka topic.&lt;/li&gt;
&lt;li&gt;Parses each message.&lt;/li&gt;
&lt;li&gt;Inserts records into PostgreSQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosted on Aiven.&lt;/li&gt;
&lt;li&gt;Stores parsed records for querying and analysis.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Environment Variables
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your Kafka and PostgreSQL credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kafka&lt;/span&gt;
&lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pkc-xxxxx.confluent.cloud:9092
&lt;span class="nv"&gt;SECURITY_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;SASL_SSL
&lt;span class="nv"&gt;SASL_MECHANISM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;PLAIN
&lt;span class="nv"&gt;SASL_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;Confluent_API_Key&amp;gt;
&lt;span class="nv"&gt;SASL_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;Confluent_API_Secret&amp;gt;
&lt;span class="nv"&gt;TOPIC_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;binance

&lt;span class="c"&gt;# Postgres (Aiven)&lt;/span&gt;
&lt;span class="nv"&gt;DBHOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pg-xxxxxx.aivencloud.com
&lt;span class="nv"&gt;DBPORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;17154
&lt;span class="nv"&gt;DBNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;defaultdb
&lt;span class="nv"&gt;DBUSER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;avnadmin       &lt;span class="c"&gt;# IMPORTANT: must use the exact user from Aiven credentials&lt;/span&gt;
&lt;span class="nv"&gt;DBPASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your_password&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to add &lt;code&gt;.env&lt;/code&gt; to &lt;code&gt;.gitignore&lt;/code&gt; so credentials aren’t pushed to GitHub. Example &lt;code&gt;.gitignore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.env
__pycache__/
*.pyc
.vscode/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Installing Dependencies
&lt;/h3&gt;

&lt;p&gt;The package-management workflow was as follows.&lt;/p&gt;

&lt;p&gt;First, install the dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install &lt;/span&gt;confluent-kafka psycopg2-binary python-dotenv requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save them into &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   confluent-kafka==2.5.3
   psycopg2-binary==2.9.9
   python-dotenv==1.0.1
   requests==2.32.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-install them anytime with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures anyone cloning the project can recreate the same environment easily.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Run the Pipeline
&lt;/h3&gt;

&lt;p&gt;Start the producer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 kafka-producer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 kafka-consumer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  PostgreSQL Table Schema
&lt;/h2&gt;

&lt;p&gt;The consumer script ensures the table exists. The schema is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;binance_24h&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt;                  &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pricechange&lt;/span&gt;             &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pricechangepercentage&lt;/span&gt;   &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;openprice&lt;/span&gt;               &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;closeprice&lt;/span&gt;              &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;highprice&lt;/span&gt;               &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lowprice&lt;/span&gt;                &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume&lt;/span&gt;                  &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
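
&lt;p&gt;The consumer’s insert step then just maps JSON keys onto these columns. Here is a minimal sketch of that mapping; the key names follow the public Binance 24h ticker response, and treating &lt;code&gt;lastPrice&lt;/code&gt; as &lt;code&gt;closeprice&lt;/code&gt; is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

INSERT_SQL = """
    INSERT INTO binance_24h
        (symbol, pricechange, pricechangepercentage,
         openprice, closeprice, highprice, lowprice, volume)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
"""

def insert_record(cursor, message_value):
    """Parse one Kafka message and insert it into Postgres.

    cursor: an open psycopg2 cursor from the consumer's connection.
    """
    data = json.loads(message_value)
    cursor.execute(INSERT_SQL, (
        data["symbol"],
        data["priceChange"],
        data["priceChangePercent"],
        data["openPrice"],
        data["lastPrice"],   # assumed to map to closeprice
        data["highPrice"],
        data["lowPrice"],
        data["volume"],
    ))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;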






&lt;h2&gt;
  
  
  Issues Encountered &amp;amp; Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Authentication Failure (Password &amp;amp; User Mismatch)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  FATAL: password authentication failed for user "dev_user"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cause&lt;/strong&gt;: Postgres on Aiven requires the exact generated username (&lt;code&gt;avnadmin&lt;/code&gt;, etc.). Using &lt;code&gt;USER&lt;/code&gt; conflicted with the system &lt;code&gt;$USER&lt;/code&gt; variable.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded the correct username in &lt;code&gt;.env&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alternatively, renamed the variable to &lt;code&gt;DBUSER&lt;/code&gt; to avoid conflicts with reserved names.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. &lt;strong&gt;Postgres Column Mismatch Error&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  column "price" of relation "binance_24h" does not exist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cause&lt;/strong&gt;: Table schema did not match Binance API JSON keys.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created a schema that exactly matched the API response fields (&lt;code&gt;priceChange&lt;/code&gt;, &lt;code&gt;priceChangePercent&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;Updated the SQL &lt;code&gt;INSERT&lt;/code&gt; query to align column names with JSON keys.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Avoid using environment variable names like &lt;code&gt;USER&lt;/code&gt; that clash with system defaults.&lt;/li&gt;
&lt;li&gt;Schema consistency between producer → consumer → database is crucial.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt; + &lt;code&gt;pip install -r requirements.txt&lt;/code&gt; for reproducible environments.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extend the pipeline to include more symbols from Binance.&lt;/li&gt;
&lt;li&gt;Add error handling and retries.&lt;/li&gt;
&lt;li&gt;Visualize the stored data in a dashboard (e.g., Power BI, Grafana).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/augo-amos/kafka-binance" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Mon, 22 Sep 2025 16:11:39 +0000</pubDate>
      <link>https://forem.com/augo_amos/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-23op</link>
      <guid>https://forem.com/augo_amos/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-23op</guid>
      <description>&lt;p&gt;The need to handle large streams of data reliably and at scale has become a necessity in this age of big data and real-time applications. At the core of this revolution is Apache Kafka, a distributed, durable, highly scalable event streaming system used for building streaming applications and real-time pipelines. In this article, we will explore Kafka’s core architectural concepts, show how modern data engineering teams use it, examine practical production practices and configurations, and highlight concrete use cases from Netflix, LinkedIn, and Uber.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What is Kafka?
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a distributed event streaming platform that exposes a durable, partitioned, append-only log. Producers write events to named &lt;strong&gt;topics&lt;/strong&gt;, which are split into &lt;strong&gt;partitions&lt;/strong&gt; for scale; consumers read from partitions independently and maintain offsets to track progress. Kafka was designed for high throughput, horizontal scalability, and fault tolerance, and it’s widely used for log aggregation, stream processing, event sourcing, and building real-time applications. (&lt;a href="https://kafka.apache.org/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuegha8l9q3n7wgquo1au.png" alt=" " width="800" height="533"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  2. Core concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Topics &amp;amp; partitions
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;topic&lt;/strong&gt; is a named stream of records. Each topic is split into &lt;strong&gt;partitions&lt;/strong&gt;, which are the units of parallelism. Partitions are ordered, and each record in a partition has an offset (a monotonically increasing sequence number).&lt;/p&gt;

&lt;h3&gt;
  
  
  Brokers, clusters, and leaders
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;broker&lt;/strong&gt; is a Kafka server. Brokers form a &lt;strong&gt;cluster&lt;/strong&gt;; each partition has one &lt;strong&gt;leader&lt;/strong&gt; (handles reads/writes) and zero or more &lt;strong&gt;followers&lt;/strong&gt; (replicas) that copy the leader’s data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication &amp;amp; fault tolerance
&lt;/h3&gt;

&lt;p&gt;Replication factor (RF) controls how many copies of each partition exist. If you set RF=3 and one broker fails, followers can be promoted to leader to maintain availability.&lt;/p&gt;
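
&lt;p&gt;As a concrete illustration, creating a topic with explicit partition and replication settings could look like this with kafka-python’s admin client (the broker address and topic name are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])

# 6 partitions for consumer parallelism; RF=3 keeps two extra copies
admin.create_topics([
    NewTopic(name="payments", num_partitions=6, replication_factor=3)
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;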

&lt;h3&gt;
  
  
  Producers &amp;amp; consumers
&lt;/h3&gt;

&lt;p&gt;Producers publish messages to topics. Consumers join &lt;strong&gt;consumer groups&lt;/strong&gt;; Kafka ensures each partition is consumed by at most one consumer in the group (parallelism + load balancing). Offsets let consumers resume from a known position.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exactly-once and delivery guarantees
&lt;/h3&gt;

&lt;p&gt;Kafka supports at-least-once delivery by default. Using idempotent producers, transactions, and Streams’ exactly-once semantics, you can get end-to-end &lt;strong&gt;exactly-once&lt;/strong&gt; guarantees for many topologies. The Kafka documentation explains these guarantees and client configs in depth. (&lt;a href="https://kafka.apache.org/40/documentation/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;)&lt;/p&gt;
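
&lt;p&gt;For instance, enabling idempotence and wrapping sends in a transaction might look like the sketch below, using the confluent-kafka client (which exposes the transactional API; the broker address and IDs are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "enable.idempotence": True,        # retries cannot create duplicates
    "acks": "all",
    "transactional.id": "payments-writer-1",
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("payments", key=b"user-123", value=b'{"amount": 49.99}')
producer.commit_transaction()          # or abort_transaction() on error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;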




&lt;h2&gt;
  
  
  3. Storage model &amp;amp; delivery semantics
&lt;/h2&gt;

&lt;p&gt;Kafka’s storage model is an &lt;em&gt;append-only&lt;/em&gt; log persisted to local disk. Each partition is stored as a sequence of segment files. Kafka leverages OS page cache and sequential disk writes to achieve very high throughput. Retention policies (time or size) and log-compaction (keep last value per key) let you tune storage semantics: use time-based retention for metrics/history and compaction for changelog or stateful topics. Core docs provide details on retention, compaction, and log segments. (&lt;a href="https://kafka.apache.org/documentation/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;)&lt;/p&gt;
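
&lt;p&gt;Per-topic retention and compaction settings can be applied through the admin client. Here is a sketch assuming kafka-python (topic names and values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])

admin.alter_configs([
    # Time-based retention: keep 7 days of metrics, then delete
    ConfigResource(ConfigResourceType.TOPIC, "metrics",
                   configs={"retention.ms": "604800000"}),
    # Log compaction: keep only the latest value per key
    ConfigResource(ConfigResourceType.TOPIC, "user-profiles",
                   configs={"cleanup.policy": "compact"}),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;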




&lt;h2&gt;
  
  
  4. Kafka ecosystem
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Connect&lt;/strong&gt; includes ready-made connectors (JDBC, S3, HDFS, etc.) for ingestion and export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Streams&lt;/strong&gt; is a client library for writing stream processing apps without a separate cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ksqlDB&lt;/strong&gt; is a SQL interface for stream processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these components let teams build an end-to-end streaming platform without stitching together many disparate tools. Confluent and the Apache Kafka project provide extensive guides for platform design and enterprise patterns. (&lt;a href="https://developer.confluent.io/courses/architecture/get-started/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data engineering patterns &amp;amp; applications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ingestion (high throughput)
&lt;/h3&gt;

&lt;p&gt;Common pattern: front-end services → Kafka producers → ingest topic. Use batching, compression (snappy/lz4), and asynchronous sends to maximize producer throughput.&lt;/p&gt;
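
&lt;p&gt;That advice could translate into producer settings like the following sketch with kafka-python (broker, topic, and payloads are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka import KafkaProducer

# Throughput-oriented settings: batch, linger, and compress
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    compression_type="lz4",
    batch_size=64 * 1024,   # accumulate up to 64 KB per partition batch
    linger_ms=20,           # wait up to 20 ms to fill a batch
)

events = [b'{"page": "/home"}', b'{"page": "/cart"}']  # stand-in payloads

for event in events:
    producer.send("ingest", value=event)   # asynchronous send

producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;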

&lt;h3&gt;
  
  
  Stream processing &amp;amp; enrichment
&lt;/h3&gt;

&lt;p&gt;Stream processing frameworks (Kafka Streams, Flink, Spark Structured Streaming) subscribe to topics, enrich events (join with lookups), and write results to downstream topics or data stores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Data Capture (CDC)
&lt;/h3&gt;

&lt;p&gt;CDC tools (Debezium, Maxwell) publish database changes to Kafka topics. This enables low-latency replication, audit logs, and event sourcing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Event sourcing &amp;amp; materialized views
&lt;/h3&gt;

&lt;p&gt;Use Kafka as the canonical event store; build materialized views using stream processors. For example: user actions → events → aggregated metrics stored in a database for queries.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Production practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cluster sizing and partitioning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Partitioning determines parallelism; plan partitions per topic so consumers can scale.&lt;/li&gt;
&lt;li&gt;Replication factor: production clusters commonly use RF=3 for durability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Confluent’s enterprise guidance helps decide cluster strategies and operational patterns. (&lt;a href="https://www.confluent.io/blog/enterprise-kafka-cluster-strategies-and-best-practices/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Retention &amp;amp; tiered storage
&lt;/h3&gt;

&lt;p&gt;For very large clusters, use &lt;strong&gt;tiered storage&lt;/strong&gt; to offload older segments to cheaper object stores (S3, GCS). Uber and others have implemented tiered storage to manage petabytes affordably. (&lt;a href="https://www.uber.com/blog/kafka-tiered-storage/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Uber&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring &amp;amp; observability
&lt;/h3&gt;

&lt;p&gt;Track broker CPU/disk, network, under-replicated partitions, consumer lag, JVM GC, and request latencies. Expose metrics via JMX and ship to Prometheus/Grafana or your metrics platform.&lt;/p&gt;
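
&lt;p&gt;Consumer lag in particular is worth checking programmatically. Here is a minimal sketch with kafka-python (group and topic names are placeholders; in production this usually comes from JMX metrics or a dedicated lag monitor):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=["broker1:9092"],
    group_id="payments-processor",
    enable_auto_commit=False,
)

# Lag per partition = log-end offset minus the group's committed offset
partitions = [TopicPartition("payments", p)
              for p in consumer.partitions_for_topic("payments")]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"{tp}: lag = {end_offsets[tp] - committed}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;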

&lt;h3&gt;
  
  
  Security: encryption &amp;amp; ACLs
&lt;/h3&gt;

&lt;p&gt;Enable TLS for in-transit encryption, SASL/Kerberos or OAuth for authentication, and Kafka ACLs for authorization. Uber published work on securing Kafka at scale which covers authz/authn practices. (&lt;a href="https://www.uber.com/blog/securing-kafka-infrastructure-at-uber/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Uber&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrades &amp;amp; rolling restarts
&lt;/h3&gt;

&lt;p&gt;Use rolling upgrades (one broker at a time), and ensure there is no single point of failure for ZooKeeper (if used), or run KRaft-mode Kafka (no ZooKeeper) on recent versions. Clients can continue producing and consuming during a rolling restart, which keeps downtime to a minimum.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Use Cases: Netflix, LinkedIn, Uber
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LinkedIn
&lt;/h3&gt;

&lt;p&gt;LinkedIn originally developed Kafka to power activity streams and log ingestion. Over time, LinkedIn scaled Kafka to handle hundreds of billions to trillions of messages per day, building a broad ecosystem and contributing many improvements back to the project. Their engineering posts describe Kafka’s history and operational lessons. (&lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LinkedIn Engineering&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Netflix
&lt;/h3&gt;

&lt;p&gt;Netflix uses Kafka as the central event bus (Keystone pipeline), powering telemetry, real-time analytics, and change propagation between microservices. Kafka enabled unified event collection and multiple consumers (analytics, monitoring, personalization) reading the same events. Confluent’s case summaries and Netflix tech blog detail how Kafka supports both batch and stream needs. (&lt;a href="https://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Netflix Tech Blog&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Uber
&lt;/h3&gt;

&lt;p&gt;Uber runs Kafka at enormous scale to support more than 300 microservices, dynamic pricing, and real-time analytics. They’ve engineered tiered storage and consumer proxies to make Kafka both scalable and operable across the organization. Uber engineering posts describe security hardening and tiered storage adoption for cost and capacity management. (&lt;a href="https://www.uber.com/blog/kafka-tiered-storage/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Uber&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Example code &amp;amp; configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Minimal Python producer (kafka-python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;broker1:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;broker2:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;acks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# wait for all replicas
&lt;/span&gt;    &lt;span class="n"&gt;compression_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lz4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:49.99,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Minimal Python consumer (kafka-python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;
&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;broker1:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments-processor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auto_offset_reset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_auto_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

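&lt;span class="c1"&gt;# process() below is a placeholder for your application-specific handler&lt;/span&gt;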
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# or use transactional processing
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  server.properties snippets (broker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# basic broker settings
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
log.dirs=/var/lib/kafka/logs
num.partitions=6
default.replication.factor=3
log.retention.hours=168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. Simple architecture diagrams
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Topic partitioning &amp;amp; replication (ASCII)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic: payments
Partitions: p0, p1, p2
Brokers: 1,2,3

p0: leader@1  replicas: [1 (L),2,3]
p1: leader@2  replicas: [2 (L),3,1]
p2: leader@3  replicas: [3 (L),1,2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Typical pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producers --&amp;gt; Kafka (ingest topics) --&amp;gt; Stream Processing (Kafka Streams/Flink) --&amp;gt; Output topics --&amp;gt; Data sinks (DBs, S3, dashboards)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion &amp;amp; further reading
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a flexible, high-performance foundation for event streaming and real-time data engineering. Applied correctly, with thought given to partitioning, replication, retention, observability, and security, it enables organizations to build resilient streaming platforms of the kind run by LinkedIn, Netflix, Uber, and many others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Kafka Documentation — architecture, clients, and config. (&lt;a href="https://kafka.apache.org/40/documentation/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Confluent blog &amp;amp; platform guides — practical design and enterprise patterns. (&lt;a href="https://www.confluent.io/blog/enterprise-kafka-cluster-strategies-and-best-practices/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;LinkedIn engineering — Kafka origin and scale case studies. (&lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LinkedIn Engineering&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Netflix TechBlog — Kafka in their Keystone pipeline and pipeline evolution. (&lt;a href="https://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Netflix Tech Blog&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Uber engineering posts — tiered storage and securing Kafka at scale. (&lt;a href="https://www.uber.com/blog/kafka-tiered-storage/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Uber&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>kafka</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why Apache Airflow is the Cornerstone of Modern Data Engineering</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Sun, 07 Sep 2025 21:55:28 +0000</pubDate>
      <link>https://forem.com/augo_amos/why-apache-airflow-is-the-cornerstone-of-modern-data-engineering-bhh</link>
      <guid>https://forem.com/augo_amos/why-apache-airflow-is-the-cornerstone-of-modern-data-engineering-bhh</guid>
      <description>&lt;p&gt;In the world of data engineering, the journey from raw, dispersed data to clean, actionable insights is governed by data pipelines. These pipelines are the central nervous system of any data-driven organization, and their reliability, scalability, and maintainability are paramount. For years, engineers relied on a patchwork of cron jobs, shell scripts, and custom monitoring to keep these pipelines alive. This approach was fragile, opaque, and difficult to scale.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;Apache Airflow&lt;/strong&gt;, an open-source platform designed specifically to programmatically author, schedule, and monitor workflows. It has rapidly become the de facto standard for workflow orchestration because it doesn't just run tasks; it provides a robust, scalable, and highly visible framework for managing the entire lifecycle of data pipelines. This article will explore the theoretical strengths of Airflow and provide a visual tour of the interface that brings these concepts to life.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Workflows as Code: The Power of the DAG&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The most fundamental and powerful concept in Airflow is the &lt;strong&gt;&lt;em&gt;Directed Acyclic Graph (DAG)&lt;/em&gt;&lt;/strong&gt;. A DAG is a collection of tasks with defined dependencies, representing the entire workflow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Python Native:&lt;/strong&gt; You define your DAGs in Python. This means you can use all the power of a full programming language: variables, loops, dynamic pipeline generation, and imports from any Python library. Your pipeline is no longer a static configuration file but dynamic, version-controlled code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Version Control &amp;amp; Collaboration:&lt;/strong&gt; DAG files can be stored in Git, enabling code reviews, versioning, CI/CD integration, and seamless collaboration across teams. Every change to your data pipeline is tracked, documented, and testable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Maintainability:&lt;/strong&gt; Complex dependencies that are nightmarish to manage in cron become simple, readable code. The explicit structure of a DAG makes it easy for new engineers to understand the flow of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This code-centric approach is what enables the powerful visualizations seen in the UI, as shown in Figure 3.&lt;/p&gt;
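
&lt;p&gt;As a hypothetical illustration (Airflow 2.4+ imports and parameters; the task bodies are stubs), a complete DAG with three dependent tasks fits in a few lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def transform():
    print("cleaning and joining")

def load():
    print("writing to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",   # cron strings work here too
    catchup=False,
):
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 &gt;&gt; t2 &gt;&gt; t3   # dependencies are explicit and readable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;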

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Sophisticated Scheduling, Dependency Management, and Robust Operational Control&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Airflow moves far beyond the simple time-based scheduling of cron and is built for the reality that things fail in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Intelligent Dependency Handling:&lt;/strong&gt; Tasks only run when their dependencies have been met. If a task fails, downstream tasks won't execute, preventing a cascade of errors and wasted resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automatic Retries &amp;amp; Alerting:&lt;/strong&gt; Tasks can be configured to automatically retry upon failure and send alerts via Slack or email, so transient issues are handled without manual intervention (see the configuration sketch after this list).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backfilling and Catch-Up:&lt;/strong&gt; Need to reprocess data from last week because of a code fix? Airflow’s backfill feature allows you to easily rerun a pipeline for a historical period. This is an invaluable feature for maintenance and debugging that is incredibly cumbersome with traditional scripts.&lt;/li&gt;
&lt;/ul&gt;
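
&lt;p&gt;A sketch of how retries and alerting are typically wired up, via a &lt;code&gt;default_args&lt;/code&gt; dictionary applied to every task in a DAG (the email address is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import timedelta

# Applied to every task in the DAG: retry twice, wait 5 minutes
# between attempts, and email only if the final attempt fails
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["oncall@example.com"],
    "email_on_failure": True,
}

# Passed to the constructor: DAG(..., default_args=default_args)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;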

&lt;p&gt;The UI provides the window into this operational control, offering the at-a-glance status view shown in Figure 2 and the detailed logs crucial for debugging in Figure 4.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Visibility and Debugging via the Web UI&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Airflow UI is a game-changer for operational awareness. It provides a single pane of glass to monitor, visualize, and manage workflows. This is where the theoretical benefits become tangible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The engine powering the UI is Airflow's decoupled architecture.&lt;/strong&gt; Before any UI is available, Airflow's core processes must be running. The &lt;code&gt;scheduler&lt;/code&gt; is the brain that orchestrates tasks, while the &lt;code&gt;web server&lt;/code&gt; hosts the interface. This separation is a key design pattern that allows each component to be scaled independently in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Terminal 1:&lt;/strong&gt; Shows the command &lt;code&gt;airflow webserver&lt;/code&gt; and its output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsb5uph9je21j6fwe46c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsb5uph9je21j6fwe46c.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Terminal 2:&lt;/strong&gt; Shows the command &lt;code&gt;airflow scheduler&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51o44cos7veq1ehty3o6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51o44cos7veq1ehty3o6.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: The core Airflow processes running locally. The scheduler (bottom) orchestrates task execution, while the web server (top) hosts the UI. This separation is foundational to Airflow's scalable design.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once running, the UI serves as mission control. The homepage provides an immediate overview of all data pipelines, with color-coded status indicators offering an instant health check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Browser Address Bar:&lt;/strong&gt; Shows &lt;code&gt;http://localhost:8080/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faou03afb0bzfdxtxggny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faou03afb0bzfdxtxggny.png" alt=" " width="800" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Navigation Menu:&lt;/strong&gt; Tabs like &lt;strong&gt;DAGs&lt;/strong&gt;, &lt;strong&gt;Browse&lt;/strong&gt;, and &lt;strong&gt;Admin&lt;/strong&gt; are visible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0kcvdkzidpgjuqcq3cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0kcvdkzidpgjuqcq3cc.png" alt=" " width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DAGs List:&lt;/strong&gt; Shows a list of pipelines with colored status circles (green, red, blue).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9dei8iukhe9ptenmwzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9dei8iukhe9ptenmwzw.png" alt=" " width="668" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: The Airflow homepage. The navigation menu and list of DAGs with status indicators provide a central hub for monitoring pipeline health.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The true power of the UI is revealed in the &lt;strong&gt;Graph View&lt;/strong&gt;, which renders the code-defined dependencies into an intuitive visual map. This makes complex workflows understandable and debuggable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Graph View:&lt;/strong&gt; Boxes representing tasks are connected by arrows, visually mapping the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcju6fkzc8moqrattphmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcju6fkzc8moqrattphmw.png" alt=" " width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Task State Colors:&lt;/strong&gt; Each task is colored based on its state (e.g., green for success).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctvw4os2c8swzvhu0di0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctvw4os2c8swzvhu0di0.png" alt=" " width="800" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Run Controls:&lt;/strong&gt; Buttons like &lt;strong&gt;Trigger DAG&lt;/strong&gt; are visible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjm5zq2hqznv0foglj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjm5zq2hqznv0foglj8.png" alt=" " width="224" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: The Graph View of a DAG. This visualization makes complex dependencies and data flow immediately understandable, directly reflecting the "workflows as code" principle.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When failures occur, the UI becomes a powerful debugging tool. Engineers can inspect detailed logs for any task directly in their browser, drastically reducing downtime and eliminating the need to SSH into remote servers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Task Instance Pop-up:&lt;/strong&gt; Focused on a single task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw0jhcqf02fm4hp183fs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw0jhcqf02fm4hp183fs.png" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Log Tab Selected:&lt;/strong&gt; Shows the execution logs for the task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43wrx0ok0wzx5i7ad0ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43wrx0ok0wzx5i7ad0ij.png" alt=" " width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Readable Log Content:&lt;/strong&gt; Displays standard output/error from the task's execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhdbjdm8uxcvefk34eab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhdbjdm8uxcvefk34eab.png" alt=" " width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Inspecting task logs directly from the web UI. This feature is critical for rapid debugging and is a direct result of the centralized logging that Airflow's platform provides.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Extensibility, Scalability, and a Rich Ecosystem&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Airflow is a platform, not just a scheduler. Its "provider" system allows it to interact with virtually any tool in the modern data stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Hundreds of Integrations:&lt;/strong&gt; Official providers exist for AWS, GCP, Azure, Snowflake, Databricks, PostgreSQL, and countless other services.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scalability:&lt;/strong&gt; The separation of the scheduler, web server, and workers allows the system to scale horizontally. Executors like the &lt;code&gt;KubernetesExecutor&lt;/code&gt; can dynamically launch resources for each task, making it a perfect fit for cloud-native deployments (see the configuration sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
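
&lt;p&gt;As a minimal illustration, switching executors is a single setting in &lt;code&gt;airflow.cfg&lt;/code&gt; (a sketch; a production Kubernetes deployment needs further configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[core]
# Each task runs in its own dynamically provisioned Kubernetes pod
executor = KubernetesExecutor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;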

&lt;h4&gt;
  
  
  &lt;strong&gt;Conclusion: More Than a Scheduler&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Apache Airflow is more than a replacement for cron; it is a comprehensive orchestration platform. It brings engineering rigor, reliability, and, as its UI demonstrates, deep visibility to the critical work of data pipeline management. By treating workflows as code, providing robust operational control, and offering a window into every aspect of pipeline execution, Airflow empowers data teams to build, monitor, and maintain the reliable data infrastructure that every successful, data-driven organization depends on.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>A Guide to Database Normalization &amp; Denormalization (With Visual Examples and Practical Use Cases)</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Wed, 03 Sep 2025 21:17:31 +0000</pubDate>
      <link>https://forem.com/augo_amos/a-guide-to-database-normalization-denormalization-with-visual-examples-and-practical-use-cases-npd</link>
      <guid>https://forem.com/augo_amos/a-guide-to-database-normalization-denormalization-with-visual-examples-and-practical-use-cases-npd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuaua8s8qne3tcsr3dl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuaua8s8qne3tcsr3dl7.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Table of Contents&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Introduction to Normalization&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First Normal Form (1NF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second Normal Form (2NF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third Normal Form (3NF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boyce-Codd Normal Form (BCNF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fourth Normal Form (4NF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fifth Normal Form (5NF)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denormalization: When and Why to Use It&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary &amp;amp; Best Practices&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction to Normalization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Normalization is the process of organizing data to minimize redundancy and improve integrity. It involves splitting tables and defining relationships.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Goals:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminate duplicate data.
&lt;/li&gt;
&lt;li&gt;Ensure data dependencies make sense.
&lt;/li&gt;
&lt;li&gt;Optimize storage and maintainability.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Levels of Normalization:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1NF → 2NF → 3NF → BCNF → 4NF → 5NF  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;1. First Normal Form (1NF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every table column must contain atomic (single) values with no nested lists, arrays, or repeating groups. This ensures each field stores exactly one fact and lays the groundwork for the higher normal forms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All columns contain atomic (indivisible) values.
&lt;/li&gt;
&lt;li&gt;No repeating groups.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before 1NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Products&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Laptop, Mouse, Keyboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After 1NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Product&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Keyboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
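
&lt;p&gt;One practical payoff, sketched below with hypothetical table names (&lt;code&gt;Orders_raw&lt;/code&gt; for the pre-1NF layout, &lt;code&gt;Orders&lt;/code&gt; for the atomic one): filtering on a product becomes an exact, indexable match instead of a fragile substring search.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Before 1NF: substring matching against a comma-separated list (error-prone)
SELECT OrderID FROM Orders_raw WHERE Products LIKE '%Mouse%';

-- After 1NF: a plain equality predicate that an index can serve
SELECT OrderID FROM Orders WHERE Product = 'Mouse';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;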




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Second Normal Form (2NF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A table is in 2NF if it is already in 1NF and every non-key attribute depends on the entire primary key, not just part of it. This prevents partial dependencies and reduces redundancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be in 1NF.
&lt;/li&gt;
&lt;li&gt;No partial dependencies (all non-key columns depend on the full primary key).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before 2NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID (PK)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ProductID (PK)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ProductName&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;Laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After 2NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Orders Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID (PK)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Products Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;ProductID (PK)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ProductName&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;Laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;OrderDetails Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID (PK, FK)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ProductID (PK, FK)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
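
&lt;p&gt;A minimal SQL sketch of this decomposition (table and column names follow the example above; the types are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE Orders (
    OrderID INT PRIMARY KEY
);

CREATE TABLE Products (
    ProductID   VARCHAR(10)  PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL  -- depends on ProductID alone, not on (OrderID, ProductID)
);

CREATE TABLE OrderDetails (
    OrderID   INT         REFERENCES Orders(OrderID),
    ProductID VARCHAR(10) REFERENCES Products(ProductID),
    PRIMARY KEY (OrderID, ProductID)   -- the full composite key
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;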




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Third Normal Form (3NF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A relation is in 3NF if it is in 2NF and no non-prime attribute depends transitively on a candidate key. Essentially, each non-key attribute must directly depend on the key, the whole key, and nothing but the key.&lt;/p&gt;

&lt;p&gt;This eliminates transitive dependencies and further enforces data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be in 2NF.
&lt;/li&gt;
&lt;li&gt;No transitive dependencies (non-key columns depend only on the primary key).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before 3NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;StudentID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Department&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;DepartmentHead&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After 3NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Students Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;StudentID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Department&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Departments Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Department&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;DepartmentHead&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;
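
&lt;p&gt;If a report still needs each student's department head, a simple join over the decomposed tables recovers it (a sketch using the example table names above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT s.StudentID, s.Department, d.DepartmentHead
FROM Students s
JOIN Departments d ON d.Department = s.Department;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;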




&lt;h3&gt;
  
  
  &lt;strong&gt;3b. Boyce-Codd Normal Form (BCNF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is a refinement of 3NF in which every determinant must be a superkey. It resolves corner cases where a table satisfies 3NF but still has undesirable dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be in 3NF.
&lt;/li&gt;
&lt;li&gt;Every determinant must be a superkey.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before BCNF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;StudentID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Course&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Professor&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After BCNF&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;StudentCourses Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;StudentID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Course&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;ProfessorCourses Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Professor&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Course&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Fourth Normal Form (4NF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A table is in 4NF if it is already in Boyce–Codd Normal Form (BCNF) and every non-trivial multivalued dependency has a superkey as its determinant.&lt;/p&gt;

&lt;p&gt;Having a table in the fourth normal form ensures that multiple independent relationships don’t cause data duplication across rows. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be in BCNF.
&lt;/li&gt;
&lt;li&gt;No multi-valued dependencies.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before 4NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;EmployeeID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Skill&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After 4NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;EmployeeSkills Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;EmployeeID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Skill&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;EmployeeLanguages Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;EmployeeID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Fifth Normal Form (5NF)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Also called Project–Join Normal Form, 5NF ensures every join dependency in the table is a consequence of candidate keys.&lt;/p&gt;

&lt;p&gt;This form addresses complex join constraints: the table cannot be split further without losing information, and redundancy caused by join dependencies is eliminated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be in 4NF.
&lt;/li&gt;
&lt;li&gt;No non-trivial join dependencies beyond those implied by the candidate keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example: Before 5NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Supplier&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Part&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;After 5NF&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;SupplierParts Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Supplier&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Part&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;SupplierProjects Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Supplier&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;PartProjects Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Part&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;
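
&lt;p&gt;Because the join dependency holds, the original table can be rebuilt losslessly by joining the three projections; a sketch using the table names above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT sp.Supplier, sp.Part, spr.Project
FROM SupplierParts sp
JOIN SupplierProjects spr ON spr.Supplier = sp.Supplier
JOIN PartProjects pp ON pp.Part = sp.Part
                    AND pp.Project = spr.Project;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;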




&lt;h2&gt;
  
  
  &lt;strong&gt;Denormalization: When and Why to Use It&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Denormalization refers to intentionally adding redundancy to improve read performance.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When to Use:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Read-heavy workloads (e.g., analytics).
&lt;/li&gt;
&lt;li&gt;Reducing complex joins.
&lt;/li&gt;
&lt;li&gt;Real-time applications.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example 1: E-Commerce Order History&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Denormalized Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;OrderID&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CustomerName&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ProductName&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;TotalPrice&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Example 2: Social Media Like Count&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Denormalized Column:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;Posts (PostID, Content, LikeCount)&lt;/code&gt;  &lt;/p&gt;
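
&lt;p&gt;A sketch of how such a counter is typically kept in step: increment it in the same transaction that records the like (the &lt;code&gt;Likes&lt;/code&gt; table and the values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

INSERT INTO Likes (PostID, UserID) VALUES (42, 7);

-- Redundant, but turns "count the likes" into a single-row read
UPDATE Posts
SET LikeCount = LikeCount + 1
WHERE PostID = 42;

COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;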




&lt;h2&gt;
  
  
  &lt;strong&gt;Summary &amp;amp; Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Normal Form&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Denormalization Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1NF&lt;/td&gt;
&lt;td&gt;Atomic values&lt;/td&gt;
&lt;td&gt;Rarely needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2NF&lt;/td&gt;
&lt;td&gt;Eliminate partial dependencies&lt;/td&gt;
&lt;td&gt;Reporting systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3NF&lt;/td&gt;
&lt;td&gt;Remove transitive dependencies&lt;/td&gt;
&lt;td&gt;Data warehouses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BCNF&lt;/td&gt;
&lt;td&gt;Superkey dependencies&lt;/td&gt;
&lt;td&gt;High-traffic web apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4NF/5NF&lt;/td&gt;
&lt;td&gt;Handle multi-valued/join dependencies&lt;/td&gt;
&lt;td&gt;Complex enterprise systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Recommendation:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalize first&lt;/strong&gt; for integrity.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denormalize selectively&lt;/strong&gt; for performance.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Visual Workflow&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Data → 1NF → 2NF → 3NF → BCNF → 4NF → 5NF  
          ↓  
Denormalize (for reads)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>normalization</category>
      <category>database</category>
      <category>denormalization</category>
    </item>
    <item>
      <title>Visualizing Recursive SQL Queries: A Step-by-Step Walkthrough</title>
      <dc:creator>Amos Augo</dc:creator>
      <pubDate>Fri, 15 Aug 2025 11:44:53 +0000</pubDate>
      <link>https://forem.com/augo_amos/visualizing-recursive-sql-queries-a-step-by-step-walkthrough-488g</link>
      <guid>https://forem.com/augo_amos/visualizing-recursive-sql-queries-a-step-by-step-walkthrough-488g</guid>
      <description>&lt;p&gt;For a beginner, learning SQL queries without a clear mental picture of what they do can be confusing and may make it hard to grasp the concepts. Recursive queries can be particularly challenging for a starter to wrap their heads around without a visual aid. In this article, I will explain how recursive queries work using a management chain example, with visualizations that make the process crystal clear for beginners.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recursive CTEs
&lt;/h2&gt;

&lt;p&gt;A recursive CTE is a CTE that references itself. It's extremely useful for working with hierarchical or tree-structured data, such as organizational charts, file systems, or network graphs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive CTE Syntax
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;cte_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Base query (non-recursive part)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;conditions&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;ALL&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;-- Recursive query (references the CTE itself)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;cte_name&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;join_condition&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;conditions&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cte_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Components of a Recursive CTE
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Case&lt;/strong&gt;: The initial query that provides the starting point(s) for recursion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Case&lt;/strong&gt;: The part that references the CTE itself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination Condition&lt;/strong&gt;: The condition that stops the recursion (usually in the WHERE clause)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sample Hierarchy Structure Visualization
&lt;/h2&gt;

&lt;p&gt;First, let's visualize some sample data as an organizational chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Alice&lt;/span&gt; &lt;span class="nc"&gt;Johnson &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CEO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Bob&lt;/span&gt; &lt;span class="nc"&gt;Smith &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VP&lt;/span&gt; &lt;span class="n"&gt;Engineering&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dave&lt;/span&gt; &lt;span class="nc"&gt;Brown &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Engineering&lt;/span&gt; &lt;span class="n"&gt;Manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;               &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Frank&lt;/span&gt; &lt;span class="nc"&gt;Miller &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Senior&lt;/span&gt; &lt;span class="n"&gt;Developer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;               &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Grace&lt;/span&gt; &lt;span class="nc"&gt;Wilson &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Developer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Carol&lt;/span&gt; &lt;span class="nc"&gt;Williams &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VP&lt;/span&gt; &lt;span class="n"&gt;Marketing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                   &lt;span class="err"&gt;│&lt;/span&gt;
                   &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Eve&lt;/span&gt; &lt;span class="nc"&gt;Davis &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Marketing&lt;/span&gt; &lt;span class="n"&gt;Manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                           &lt;span class="err"&gt;│&lt;/span&gt;
                           &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Henry&lt;/span&gt; &lt;span class="nc"&gt;Moore &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Marketing&lt;/span&gt; &lt;span class="n"&gt;Specialist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Recursive Query Components
&lt;/h2&gt;

&lt;p&gt;We can choose to start from any &lt;code&gt;id&lt;/code&gt; and generate a hierarchy from there. In this case, we will start from &lt;code&gt;id = 7&lt;/code&gt;. Our query will have two essential parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base Case&lt;/strong&gt;: The starting point (&lt;code&gt;WHERE id = 7&lt;/code&gt; - Grace Wilson)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Case&lt;/strong&gt;: The part that joins the CTE to itself to find managers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full recursive query will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Base case: start with Grace (level 0)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;  &lt;span class="c1"&gt;-- Starting level&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt;

    &lt;span class="c1"&gt;-- Recursive case: increment level by 1&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;-- Increment level&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;level&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- Show hierarchy from top to bottom&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step-by-Step Execution Visualization
&lt;/h2&gt;

&lt;p&gt;Let's visualize how the database processes this query:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization Phase (Base Case)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- First iteration: base case&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;  &lt;span class="c1"&gt;-- Grace Wilson&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result Set After Base Case:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position   | manager_id |
|----|--------------|------------|------------|
| 7  | Grace Wilson | Developer  | 4          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recursive Phase - Iteration 1
&lt;/h3&gt;

&lt;p&gt;Now the recursive part joins the initial result (Grace) with the employees table to find Grace's manager (manager_id = 4):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="c1"&gt;-- Recursive part joins Grace (id=7) with her manager&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;  &lt;span class="c1"&gt;-- Finds where e.id = 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Rows Added:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name       | position           | manager_id |
|----|------------|--------------------|------------|
| 4  | Dave Brown | Engineering Manager| 2          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Current Result Set:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position           | manager_id |
|----|--------------|--------------------|------------|
| 7  | Grace Wilson | Developer          | 4          |
| 4  | Dave Brown   | Engineering Manager| 2          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recursive Phase - Iteration 2
&lt;/h3&gt;

&lt;p&gt;Now we look for Dave Brown's manager (manager_id = 2):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="c1"&gt;-- Recursive part joins Dave (id=4) with his manager&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;  &lt;span class="c1"&gt;-- Finds where e.id = 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Rows Added:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name      | position        | manager_id |
|----|-----------|-----------------|------------|
| 2  | Bob Smith | VP Engineering  | 1          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Current Result Set:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position           | manager_id |
|----|--------------|--------------------|------------|
| 7  | Grace Wilson | Developer          | 4          |
| 4  | Dave Brown   | Engineering Manager| 2          |
| 2  | Bob Smith    | VP Engineering     | 1          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recursive Phase - Iteration 3
&lt;/h3&gt;

&lt;p&gt;Now we look for Bob Smith's manager (manager_id = 1):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="c1"&gt;-- Recursive part joins Bob (id=2) with his manager&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees_hierarchy&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;management_chain&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;  &lt;span class="c1"&gt;-- Finds where e.id = 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Rows Added:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position | manager_id |
|----|--------------|----------|------------|
| 1  | Alice Johnson| CEO      | NULL       |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Current Result Set:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position           | manager_id |
|----|--------------|--------------------|------------|
| 7  | Grace Wilson | Developer          | 4          |
| 4  | Dave Brown   | Engineering Manager| 2          |
| 2  | Bob Smith    | VP Engineering     | 1          |
| 1  | Alice Johnson| CEO                | NULL       |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Termination Phase
&lt;/h3&gt;

&lt;p&gt;In the next iteration, we'd look for Alice's manager (manager_id = NULL), which returns no rows, so the recursion stops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Output
&lt;/h2&gt;

&lt;p&gt;After the recursion completes, the outer SELECT returns the id, name, position, and level columns, ordered by level in descending order so the chain reads from the CEO down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| id | name         | position           |
|----|--------------|--------------------|
| 1  | Alice Johnson| CEO                |
| 2  | Bob Smith    | VP Engineering     |
| 4  | Dave Brown   | Engineering Manager|
| 7  | Grace Wilson | Developer          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Visualizing the Recursion Process
&lt;/h2&gt;

&lt;p&gt;Here's how to imagine the recursion working:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start at the leaf node (Grace)&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Grace (7) → Dave (4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First recursion finds Dave's manager&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Grace (7) → Dave (4) → Bob (2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Second recursion finds Bob's manager&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Grace (7) → Dave (4) → Bob (2) → Alice (1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Third recursion stops (Alice has no manager)&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Grace (7) → Dave (4) → Bob (2) → Alice (1) → STOP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Concepts Illustrated
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base Case&lt;/strong&gt;: Defines where to start (Grace Wilson)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Join&lt;/strong&gt;: Connects each employee to their manager&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt;: Stops when no more managers are found&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UNION vs UNION ALL&lt;/strong&gt;: UNION eliminates duplicate rows between iterations; UNION ALL skips that check and is the usual choice when the hierarchy contains no cycles (see the variant after this list)&lt;/li&gt;
&lt;/ol&gt;
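
&lt;p&gt;As a variant sketch, the same query can use UNION ALL together with an explicit depth guard, so that a cyclic &lt;code&gt;manager_id&lt;/code&gt; reference can never recurse forever (the cap of 10 is an arbitrary illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE management_chain AS (
    SELECT id, name, position, manager_id, 0 AS level
    FROM employees_hierarchy
    WHERE id = 7

    UNION ALL  -- cheaper: skips the duplicate check that UNION performs

    SELECT e.id, e.name, e.position, e.manager_id, m.level + 1
    FROM employees_hierarchy e
    JOIN management_chain m ON e.id = m.manager_id
    WHERE m.level &lt; 10  -- depth guard against accidental cycles
)
SELECT id, name, position, level
FROM management_chain
ORDER BY level DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;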

&lt;h2&gt;
  
  
  Why This Visualization Helps Beginners
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shows the step-by-step expansion&lt;/strong&gt; of the result set&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Illustrates how each recursive call builds&lt;/strong&gt; on previous results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes the termination condition clear&lt;/strong&gt; (NULL manager_id)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demonstrates the hierarchical nature&lt;/strong&gt; of the query&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This example perfectly shows how recursive queries "walk up" a hierarchy by repeatedly joining the intermediate results to the original table.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>cte</category>
      <category>recursive</category>
    </item>
  </channel>
</rss>
