<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Wakaba</title>
    <description>The latest articles on Forem by John Wakaba (@john_analytics).</description>
    <link>https://forem.com/john_analytics</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1752671%2F8a1f42a0-3708-4440-a5e4-40fa766ff763.png</url>
      <title>Forem: John Wakaba</title>
      <link>https://forem.com/john_analytics</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/john_analytics"/>
    <language>en</language>
    <item>
      <title>Building an AI-Powered Personalized Learning Platform with FastAPI, PostgreSQL, and Mistral AI</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:55:50 +0000</pubDate>
      <link>https://forem.com/john_analytics/building-an-ai-powered-personalized-learning-platform-with-fastapi-postgresql-and-mistral-ai-3834</link>
      <guid>https://forem.com/john_analytics/building-an-ai-powered-personalized-learning-platform-with-fastapi-postgresql-and-mistral-ai-3834</guid>
      <description>&lt;p&gt;Artificial Intelligence is transforming education by enabling systems&lt;br&gt;
that adapt to individual learning needs. In this article, I'll walk&lt;br&gt;
through how I built an &lt;strong&gt;AI-powered personalized learning platform&lt;/strong&gt;&lt;br&gt;
that generates quizzes, tracks student progress, and provides real-time&lt;br&gt;
insights for teachers.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Traditional learning platforms often deliver the same content to every&lt;br&gt;
student, regardless of their performance. However, students learn at&lt;br&gt;
different speeds and struggle with different topics.&lt;/p&gt;

&lt;p&gt;The goal of this project was to build a system that:&lt;/p&gt;

&lt;p&gt;• Generates quizzes automatically using AI&lt;br&gt;
• Tracks student learning behavior&lt;br&gt;
• Detects struggling students&lt;br&gt;
• Provides teachers with data-driven insights&lt;/p&gt;


&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The system consists of four main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Student Interaction Layer&lt;/li&gt;
&lt;li&gt;FastAPI Backend&lt;/li&gt;
&lt;li&gt;PostgreSQL Database&lt;/li&gt;
&lt;li&gt;AI Engine (Mistral)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Architecture overview:
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Students
   ↓
FastAPI API
   ↓
PostgreSQL Database
   ↓
Mistral AI
   ↓
Analytics Dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  AI Quiz Generation
&lt;/h2&gt;

&lt;p&gt;Instead of manually creating quizzes, the platform uses &lt;strong&gt;Mistral AI&lt;/strong&gt;&lt;br&gt;
to generate questions dynamically.&lt;/p&gt;

&lt;p&gt;Example API endpoint:&lt;/p&gt;

&lt;p&gt;GET /generate-quiz/algebra&lt;/p&gt;

&lt;p&gt;The AI returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple choice answers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Correct answer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explanation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the platform to generate unlimited quizzes for any topic.&lt;/p&gt;
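&lt;p&gt;As a rough illustration, the endpoint might look something like the sketch below. The prompt wording, the &lt;code&gt;mistral-small-latest&lt;/code&gt; model name, and the JSON parsing are assumptions for this example rather than the exact implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the quiz-generation endpoint (simplified).
import json
import os

from fastapi import FastAPI
from mistralai import Mistral  # assumes the official mistralai Python client (v1.x)

app = FastAPI()
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

@app.get("/generate-quiz/{topic}")
def generate_quiz(topic: str):
    # Ask the model for one multiple-choice question as JSON.
    prompt = (
        f"Create one multiple-choice question about {topic}. "
        "Return JSON with keys: question, choices, correct_answer, explanation."
    )
    response = client.chat.complete(
        model="mistral-small-latest",
        messages=[{"role": "user", "content": prompt}],
    )
    # Real code should validate the model output before returning it.
    return json.loads(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;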


&lt;h2&gt;
  
  
  Real-Time Feedback
&lt;/h2&gt;

&lt;p&gt;When students submit answers, the backend evaluates correctness and generates explanations.&lt;/p&gt;

&lt;p&gt;POST /submit-answer&lt;/p&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;correct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="s"&gt;\&lt;/span&gt;
&lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100\&lt;/span&gt;
&lt;span class="na"&gt;feedback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Explanation of the answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Students receive immediate feedback, improving engagement and learning efficiency.&lt;/p&gt;
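&lt;p&gt;A minimal sketch of that evaluation step might look like this. The &lt;code&gt;Submission&lt;/code&gt; model and the in-memory answer key are illustrative assumptions; the real system reads questions from PostgreSQL and can ask the AI for richer explanations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of answer evaluation (no persistence or auth shown).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Illustrative answer key; in the real system this comes from PostgreSQL.
ANSWER_KEY = {"q1": {"correct": "B", "explanation": "Both sides were divided by 2."}}

class Submission(BaseModel):
    question_id: str
    answer: str

@app.post("/submit-answer")
def submit_answer(sub: Submission):
    key = ANSWER_KEY.get(sub.question_id, {})
    is_correct = sub.answer == key.get("correct")
    return {
        "correct": is_correct,
        "score": 100 if is_correct else 0,
        "feedback": key.get("explanation", "No explanation available."),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;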




&lt;h2&gt;
  
  
  Adaptive Learning
&lt;/h2&gt;

&lt;p&gt;One of the most important features is &lt;strong&gt;adaptive difficulty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a student performs well, the system generates harder questions.&lt;/p&gt;

&lt;p&gt;If a student struggles, the system provides simpler explanations and&lt;br&gt;
easier quizzes.&lt;/p&gt;

&lt;p&gt;This creates a personalized learning experience.&lt;/p&gt;
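&lt;p&gt;The rule itself can be very simple. The sketch below picks the next difficulty from a student's recent average score; the thresholds are illustrative assumptions, not the production values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative difficulty-selection rule based on recent average score (0-100).
def next_difficulty(recent_scores: list[float]) -&gt; str:
    """Pick the next quiz difficulty from a student's recent scores."""
    if not recent_scores:
        return "medium"
    average = sum(recent_scores) / len(recent_scores)
    if average &gt;= 80:
        return "hard"    # performing well: harder questions
    if average &lt; 50:
        return "easy"    # struggling: simpler explanations and easier quizzes
    return "medium"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;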


&lt;h2&gt;
  
  
  Data Analytics with SQL
&lt;/h2&gt;

&lt;p&gt;Every interaction is stored in PostgreSQL, allowing powerful analytics.&lt;/p&gt;

&lt;p&gt;Example insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Average student performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Topic difficulty analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learning trends over time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Detection of struggling students&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;quiz_results&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Teacher Dashboard
&lt;/h2&gt;

&lt;p&gt;To visualize insights, I built a &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teachers can view:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Student performance &lt;/li&gt;
&lt;li&gt;Difficult topics&lt;/li&gt;
&lt;li&gt;Performance trends&lt;/li&gt;
&lt;li&gt;At-risk students&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows educators to identify learning gaps early.&lt;/p&gt;
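&lt;p&gt;A stripped-down sketch of such a dashboard is shown below. The connection string is a placeholder, and it reuses the &lt;code&gt;quiz_results&lt;/code&gt; table from the SQL example earlier; the real dashboard includes more views.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal Streamlit dashboard sketch (placeholder connection string).
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/learning")

st.title("Teacher Dashboard")

scores = pd.read_sql(
    "SELECT student_id, AVG(score) AS avg_score FROM quiz_results GROUP BY student_id",
    engine,
)

st.subheader("Average score per student")
st.bar_chart(scores.set_index("student_id")["avg_score"])

st.subheader("At-risk students (average below 50)")
st.dataframe(scores[scores["avg_score"] &lt; 50])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;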




&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;FastAPI&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Mistral AI&lt;/li&gt;
&lt;li&gt;SQL Analytics&lt;/li&gt;
&lt;li&gt;Streamlit&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI-powered learning platforms have the potential to transform education by making learning &lt;strong&gt;personalized, adaptive, and data-driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project is a simplified prototype of what modern EdTech platforms can achieve using open-source tools and AI models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fastapi</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Joins and Window Functions in SQL</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:31:49 +0000</pubDate>
      <link>https://forem.com/john_analytics/-understanding-joins-and-window-functions-in-sql-293f</link>
      <guid>https://forem.com/john_analytics/-understanding-joins-and-window-functions-in-sql-293f</guid>
      <description>&lt;p&gt;When working with relational databases, data is rarely stored in one single table. Instead, it is organized into multiple related tables to reduce redundancy and improve structure.&lt;/p&gt;

&lt;p&gt;In a typical transactional database, you might have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customers&lt;/code&gt; → customer information
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders&lt;/code&gt; → order transactions
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;books&lt;/code&gt; → product details
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze meaningful business insights, we must first combine this data. That’s where &lt;strong&gt;SQL Joins&lt;/strong&gt; come in.&lt;/p&gt;

&lt;p&gt;Once the data is combined, we can apply &lt;strong&gt;Window Functions&lt;/strong&gt; to perform advanced analysis such as ranking, running totals, and trend comparisons.&lt;/p&gt;

&lt;p&gt;This article walks you through both concepts in a logical flow — starting with joins and finishing with window functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  SQL JOINS
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why Joins Matter
&lt;/h2&gt;

&lt;p&gt;Relational databases are built on relationships.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A customer places many orders.&lt;/li&gt;
&lt;li&gt;An order references one book.&lt;/li&gt;
&lt;li&gt;A book can appear in many orders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze this properly, we must join the tables together using a shared key — usually &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;book_id&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  INNER JOIN
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;INNER JOIN&lt;/strong&gt; returns only rows that exist in both tables.&lt;/p&gt;

&lt;p&gt;If we want to see which customers placed orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Returns only customers who have placed orders&lt;br&gt;&lt;br&gt;
❌ Excludes customers with no orders  &lt;/p&gt;

&lt;p&gt;This is the most commonly used join in data analysis.&lt;/p&gt;


&lt;h2&gt;
  
  
  LEFT JOIN (LEFT OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;LEFT JOIN&lt;/strong&gt; returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All rows from the left table
&lt;/li&gt;
&lt;li&gt;Matching rows from the right table
&lt;/li&gt;
&lt;li&gt;NULL where no match exists
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Show all customers, even those who haven’t ordered anything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Every customer appears&lt;br&gt;&lt;br&gt;
✔ Customers without orders will show NULL in order columns  &lt;/p&gt;

&lt;p&gt;This is extremely useful for identifying inactive customers.&lt;/p&gt;


&lt;h2&gt;
  
  
  RIGHT JOIN
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;RIGHT JOIN&lt;/strong&gt; does the opposite of a LEFT JOIN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures all orders appear, even if customer data is missing.&lt;/p&gt;

&lt;p&gt;In practice, RIGHT JOIN is less common because we can usually rewrite it using LEFT JOIN by switching table order.&lt;/p&gt;




&lt;h2&gt;
  
  
  FULL JOIN (FULL OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;FULL JOIN&lt;/strong&gt; returns all rows from both tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Shows matches&lt;br&gt;&lt;br&gt;
✔ Shows unmatched customers&lt;br&gt;&lt;br&gt;
✔ Shows unmatched orders  &lt;/p&gt;

&lt;p&gt;This is useful for data auditing and reconciliation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Joining Multiple Tables
&lt;/h2&gt;

&lt;p&gt;Real-world analysis often requires more than two tables.&lt;/p&gt;

&lt;p&gt;Example: See which customer ordered which book.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer name
&lt;/li&gt;
&lt;li&gt;Book title
&lt;/li&gt;
&lt;li&gt;Quantity ordered
&lt;/li&gt;
&lt;li&gt;Order date
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This joined dataset becomes the foundation for deeper analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  CROSS JOIN
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;CROSS JOIN&lt;/strong&gt; produces all possible combinations between two tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 customers
&lt;/li&gt;
&lt;li&gt;5 books
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get 50 rows.&lt;/p&gt;

&lt;p&gt;This is useful when generating combinations for simulations or recommendation systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anti-Join (Finding Missing Records)
&lt;/h2&gt;

&lt;p&gt;SQL doesn’t have a direct &lt;code&gt;ANTI JOIN&lt;/code&gt;, but we simulate it using LEFT JOIN + NULL filtering.&lt;/p&gt;

&lt;p&gt;Example: Find customers who have never placed an order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is powerful for churn analysis and business reporting.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Joins Prepare Data for Window Functions
&lt;/h2&gt;

&lt;p&gt;Notice something important:&lt;/p&gt;

&lt;p&gt;Most advanced SQL analysis begins like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Joins combine related data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window functions analyze that combined dataset&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Joins prepare the structure.&lt;br&gt;&lt;br&gt;
Window functions perform the analytics.&lt;/p&gt;

&lt;p&gt;Now that we understand joins, let’s move into advanced analytics.&lt;/p&gt;


&lt;h2&gt;
  
  
  WINDOW FUNCTIONS
&lt;/h2&gt;

&lt;p&gt;Window functions allow you to perform calculations across a set of table rows that are related to the current row.&lt;/p&gt;

&lt;p&gt;Window functions can be used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank rows.
&lt;/li&gt;
&lt;li&gt;Calculate cumulative totals.
&lt;/li&gt;
&lt;li&gt;Find the difference between consecutive rows in a dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Window functions return a value for each row while still providing information from the related rows.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  ROW_NUMBER()
&lt;/h2&gt;

&lt;p&gt;ROW_NUMBER() assigns a unique row number to each row in the result set.&lt;/p&gt;

&lt;p&gt;In a real-world scenario, it can help us track which order each customer placed first, second, and so on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Assigns a unique number to each row, starting from 1 based on the order specified by the ORDER BY clause.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The numbering resets for each partition if PARTITION BY is used.&lt;/p&gt;

&lt;p&gt;Example: assign a unique row number to each order based on the order date, resetting the numbering for each customer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ROW_NUMBER(): assigns a unique number to each order.&lt;br&gt;&lt;br&gt;
PARTITION BY o.customer_id: ensures that the row numbering starts fresh for each customer.&lt;br&gt;&lt;br&gt;
The query lists each customer's orders with their row number (1, 2, 3, ...) in order sequence.&lt;/p&gt;

&lt;p&gt;To rank the orders globally by order date, without resetting the numbering for each customer, simply remove the PARTITION BY clause.&lt;/p&gt;

&lt;p&gt;SQL Query Without Resetting Row Number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Without PARTITION BY, the numbering is continuous for all orders across customers, based on their order_date.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  RANK() AND DENSE_RANK()
&lt;/h2&gt;

&lt;p&gt;RANK() assigns a rank to each row, with ties getting the same rank but leaving gaps in subsequent ranks.&lt;br&gt;&lt;br&gt;
DENSE_RANK() works similarly but without leaving gaps in the ranking.&lt;/p&gt;
&lt;h3&gt;
  
  
  RANK() SQL QUERY
&lt;/h3&gt;

&lt;p&gt;Rank customers based on the total quantity of books they have ordered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RANK() assigns a rank based on the SUM(o.quantity) in descending order.&lt;br&gt;&lt;br&gt;
If two customers have the same total quantity ordered, they receive the same rank and the next rank has a gap: two customers tied at rank 1 result in the next customer being ranked 3rd.&lt;/p&gt;
&lt;h3&gt;
  
  
  USING DENSE_RANK()
&lt;/h3&gt;

&lt;p&gt;DENSE_RANK() assigns a rank to each row but does not leave gaps in the ranking when there are ties.&lt;/p&gt;

&lt;p&gt;Example: calculate the dense rank of customers based on the total quantity of books they ordered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dense_rank&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dense_rank&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If two customers are tied, they receive the same rank, and the next customer receives the next consecutive rank.&lt;br&gt;&lt;br&gt;
Ranks 1, 1 are followed by 2.&lt;br&gt;&lt;br&gt;
This difference between RANK() and DENSE_RANK() is crucial for how you want to treat tied values in your analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  LEAD() AND LAG()
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LEAD()
&lt;/h3&gt;

&lt;p&gt;LEAD() accesses the next row's value.&lt;br&gt;&lt;br&gt;
It is used to access a row that follows the current row at a specified physical offset.&lt;br&gt;&lt;br&gt;
It is generally used to compare the value in the current row with the value in the row that follows it.&lt;/p&gt;

&lt;p&gt;Example: compare the quantity ordered in the current row with the quantity ordered in the next row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;LEAD(o.quantity) lets you access the quantity in the next row, ordered by order_id.&lt;br&gt;&lt;br&gt;
The query returns the quantity ordered in the current row alongside the quantity ordered in the following row; add PARTITION BY o.customer_id inside OVER() if you want the comparison per customer.&lt;br&gt;&lt;br&gt;
For the last row in the window, next_quantity is NULL because there is no next row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  LAG()
&lt;/h3&gt;

&lt;p&gt;LAG() accesses the previous row's value.&lt;br&gt;&lt;br&gt;
It is crucial for analyzing trends or behavior changes over time.&lt;br&gt;&lt;br&gt;
It lets you access data from a previous row within the same result set, which is useful for comparing the current row with the preceding row.&lt;br&gt;&lt;br&gt;
When a PARTITION BY clause is present, it operates independently within each partition.&lt;/p&gt;

&lt;p&gt;Example: compare the quantity ordered in the current row with the quantity ordered in the previous row using LAG():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;LAG(o.quantity) lets you access the quantity ordered in the previous row, ordered by order_id.&lt;br&gt;&lt;br&gt;
The query shows the quantity ordered in the current row next to the quantity ordered in the previous row; add PARTITION BY o.customer_id to make the comparison per customer.&lt;br&gt;&lt;br&gt;
For the first row, prev_quantity is NULL because there is no previous row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  NTILE() FUNCTION
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Partitions data into a specified number of buckets.&lt;br&gt;&lt;br&gt;
Crucial for data analysis and reporting as it allows users to efficiently distribute rows and analyze data in a structured manner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We want to divide customers into two equal groups (halves) based on their total order quantity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;NTILE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;quantity_tile&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;quantity_tile&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NTILE(2) divides customers into two equal groups based on their total quantity ordered; NTILE(4) would produce quartiles.&lt;/p&gt;

&lt;h2&gt;
  
  
  PARTITION BY
&lt;/h2&gt;

&lt;p&gt;PARTITION BY divides the result set into partitions, and the window function operates independently within each partition.&lt;/p&gt;

&lt;p&gt;Example: calculate the total quantity of orders for each customer and the average price of the books ordered by each customer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_order_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;SUM(o.quantity) gives total quantity ordered by each customer.&lt;br&gt;&lt;br&gt;
PARTITION BY o.customer_id ensures the total order quantity is calculated for each individual customer.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>data</category>
      <category>database</category>
    </item>
    <item>
      <title>I Built an AI Tourism Assistant for Kenya Using RAG, pgvector, and Streamlit</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:21:41 +0000</pubDate>
      <link>https://forem.com/john_analytics/-i-built-an-ai-tourism-assistant-for-kenya-using-rag-pgvector-and-streamlit-c71</link>
      <guid>https://forem.com/john_analytics/-i-built-an-ai-tourism-assistant-for-kenya-using-rag-pgvector-and-streamlit-c71</guid>
      <description>&lt;p&gt;Imagine asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's the best luxury safari in Maasai Mara?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and instantly getting &lt;strong&gt;personalized travel recommendations powered by&lt;br&gt;
AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's exactly what I built: an &lt;strong&gt;AI Tourism Intelligence Assistant&lt;/strong&gt;&lt;br&gt;
that helps travelers discover the best travel packages in Kenya based on&lt;br&gt;
their &lt;strong&gt;budget, travel style, duration, and preferred destination.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through:&lt;/p&gt;

&lt;p&gt;• The idea behind the project&lt;br&gt;
• How I built the AI recommendation system&lt;br&gt;
• The RAG architecture powering it&lt;br&gt;
• How vector search makes travel discovery smarter&lt;br&gt;
• Deployment with Streamlit&lt;/p&gt;




&lt;h2&gt;
  
  
  ✨ The Idea
&lt;/h2&gt;

&lt;p&gt;Kenya is one of the world's most beautiful tourism destinations,&lt;br&gt;
offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Wildlife safaris 🦁&lt;/li&gt;
&lt;li&gt;  Tropical beaches 🏝&lt;/li&gt;
&lt;li&gt;  Mountain adventures ⛰&lt;/li&gt;
&lt;li&gt;  Cultural experiences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But planning trips can be frustrating because:&lt;/p&gt;

&lt;p&gt;• Travel packages are scattered across multiple websites&lt;br&gt;
• Platforms rarely provide &lt;strong&gt;personalized recommendations&lt;/strong&gt;&lt;br&gt;
• Comparing destinations based on budget or style is difficult&lt;/p&gt;

&lt;p&gt;So I decided to build an &lt;strong&gt;AI-powered tourism assistant&lt;/strong&gt; that could:&lt;/p&gt;

&lt;p&gt;✔ Understand traveler preferences&lt;br&gt;
✔ Retrieve relevant travel packages&lt;br&gt;
✔ Generate intelligent recommendations&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What the AI Assistant Does
&lt;/h2&gt;

&lt;p&gt;Users simply input their preferences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Budget&lt;/li&gt;
&lt;li&gt;  Travel duration&lt;/li&gt;
&lt;li&gt;  Travel style&lt;/li&gt;
&lt;li&gt;  Preferred destination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system then returns &lt;strong&gt;relevant travel packages from a tourism&lt;br&gt;
database.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example query:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Budget: $2000&lt;br&gt;
Days: 5&lt;br&gt;
Style: Relaxing&lt;br&gt;
Destination: Diani&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The assistant responds with &lt;strong&gt;recommended travel packages matching those&lt;br&gt;
criteria.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Tech Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Programming
&lt;/h3&gt;

&lt;p&gt;Python&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Engineering
&lt;/h3&gt;

&lt;p&gt;PostgreSQL&lt;br&gt;
pgvector&lt;/p&gt;

&lt;h3&gt;
  
  
  AI
&lt;/h3&gt;

&lt;p&gt;Mistral AI embeddings&lt;br&gt;
Retrieval-Augmented Generation (RAG)&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Collection
&lt;/h3&gt;

&lt;p&gt;Playwright&lt;br&gt;
BeautifulSoup&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend
&lt;/h3&gt;

&lt;p&gt;SQLAlchemy&lt;/p&gt;

&lt;h3&gt;
  
  
  Frontend
&lt;/h3&gt;

&lt;p&gt;Streamlit&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Streamlit Cloud&lt;br&gt;
Neon PostgreSQL&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 System Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tourism Websites
      │
      ▼
Web Scraping (Playwright)
      │
      ▼
PostgreSQL Database
      │
      ▼
Embedding Generation (Mistral AI)
      │
      ▼
Vector Database (pgvector)
      │
      ▼
Recommendation Engine
      │
      ▼
Streamlit Web Application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  🔎 How the RAG System Works
&lt;/h2&gt;

&lt;p&gt;The project uses &lt;strong&gt;Retrieval‑Augmented Generation (RAG)&lt;/strong&gt; to deliver&lt;br&gt;
intelligent responses.&lt;/p&gt;

&lt;p&gt;Instead of the AI guessing answers, it &lt;strong&gt;retrieves real travel packages&lt;br&gt;
from the database first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
     │
     ▼
Convert Query → Embedding
     │
     ▼
Vector Similarity Search
     │
     ▼
Retrieve Relevant Travel Packages
     │
     ▼
Generate Personalized Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This ensures the AI responds with &lt;strong&gt;real tourism data rather than&lt;br&gt;
hallucinations.&lt;/strong&gt;&lt;/p&gt;
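&lt;p&gt;As a rough sketch, the retrieval step could look like the code below. It assumes the &lt;code&gt;mistralai&lt;/code&gt; client, a &lt;code&gt;travel_packages&lt;/code&gt; table with an &lt;code&gt;embedding&lt;/code&gt; column, and a &lt;code&gt;DATABASE_URL&lt;/code&gt; environment variable; the actual schema and query differ in detail.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative retrieval step of the RAG pipeline (names and SQL are assumptions).
import os

import psycopg2
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def retrieve_packages(query: str, limit: int = 5):
    # 1. Convert the user query into an embedding with Mistral.
    emb = client.embeddings.create(model="mistral-embed", inputs=[query])
    vector = "[" + ",".join(str(x) for x in emb.data[0].embedding) + "]"

    # 2. Vector similarity search with pgvector's cosine-distance operator.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT name, destination, price, description
            FROM travel_packages
            ORDER BY embedding &lt;=&gt; %s::vector
            LIMIT %s
            """,
            (vector, limit),
        )
        return cur.fetchall()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The retrieved packages are then passed to the model as context, so the generated recommendation stays grounded in the database.&lt;/p&gt;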




&lt;h2&gt;
  
  
  🗄 Database Design
&lt;/h2&gt;

&lt;p&gt;The database stores travel information in structured tables such as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;travel_packages
destinations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each travel package contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Package name&lt;/li&gt;
&lt;li&gt;  Destination&lt;/li&gt;
&lt;li&gt;  Duration&lt;/li&gt;
&lt;li&gt;  Price&lt;/li&gt;
&lt;li&gt;  Description&lt;/li&gt;
&lt;li&gt;  Vector embedding&lt;/li&gt;
&lt;/ul&gt;
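&lt;p&gt;Since the backend uses SQLAlchemy, the package table can be modeled roughly as shown below. The column names and the 1024-dimension vector (the size of mistral-embed vectors) are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical SQLAlchemy model for travel_packages (column names assumed).
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column, Integer, Numeric, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TravelPackage(Base):
    __tablename__ = "travel_packages"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    destination = Column(String)
    duration_days = Column(Integer)
    price = Column(Numeric)
    description = Column(Text)
    embedding = Column(Vector(1024))  # mistral-embed produces 1024-dimensional vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;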




&lt;h2&gt;
  
  
  🔍 Why Vector Search Matters
&lt;/h2&gt;

&lt;p&gt;Traditional search relies on keywords.&lt;/p&gt;

&lt;p&gt;Vector search understands &lt;strong&gt;meaning and context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, if a user searches:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Affordable safari in Kenya"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system can still return:&lt;/p&gt;

&lt;p&gt;• Budget Maasai Mara packages&lt;br&gt;
• Lake Nakuru safari deals&lt;br&gt;
• Amboseli wildlife tours&lt;/p&gt;

&lt;p&gt;Even if those exact words were not used.&lt;/p&gt;




&lt;h2&gt;
  
  
  💻 Building the Interface
&lt;/h2&gt;

&lt;p&gt;The frontend is built using &lt;strong&gt;Streamlit&lt;/strong&gt;, which makes it easy to create&lt;br&gt;
interactive data apps.&lt;/p&gt;

&lt;p&gt;Users can:&lt;/p&gt;

&lt;p&gt;✔ Enter travel preferences&lt;br&gt;
✔ Browse travel packages&lt;br&gt;
✔ Receive AI‑powered recommendations&lt;/p&gt;
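&lt;p&gt;The preference form is only a few lines of Streamlit. The sketch below uses hypothetical field names and simply echoes the query; in the real app the query feeds the retrieval step described earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal Streamlit form sketch for collecting traveler preferences.
import streamlit as st

st.title("Kenya AI Tourism Assistant")

budget = st.number_input("Budget (USD)", min_value=100, value=2000, step=100)
days = st.slider("Travel duration (days)", 1, 21, 5)
style = st.selectbox("Travel style", ["Relaxing", "Adventure", "Wildlife", "Cultural"])
destination = st.text_input("Preferred destination", "Diani")

if st.button("Recommend packages"):
    query = f"{style} trip to {destination} for {days} days under ${budget}"
    st.write(f"Searching packages for: {query}")
    # The real app passes this query to the RAG retrieval step shown earlier.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;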




&lt;h2&gt;
  
  
  🚀 Deployment
&lt;/h2&gt;

&lt;p&gt;The application is deployed using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streamlit Cloud&lt;/strong&gt; for hosting the web app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neon PostgreSQL&lt;/strong&gt; for the managed database.&lt;/p&gt;

&lt;p&gt;This allows the project to run fully online.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Key Results
&lt;/h2&gt;

&lt;p&gt;The project successfully delivers:&lt;/p&gt;

&lt;p&gt;✔ AI-powered tourism recommendations&lt;br&gt;
✔ Semantic search using vector embeddings&lt;br&gt;
✔ A fully deployed web application&lt;br&gt;
✔ Personalized travel package discovery&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Web Scraping Complexity
&lt;/h3&gt;

&lt;p&gt;Many travel websites load content dynamically, which required&lt;br&gt;
&lt;strong&gt;Playwright&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Quality Issues
&lt;/h3&gt;

&lt;p&gt;Scraped data often contained:&lt;/p&gt;

&lt;p&gt;• Missing prices&lt;br&gt;
• Duplicate packages&lt;br&gt;
• Inconsistent destination names&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding Rate Limits
&lt;/h3&gt;

&lt;p&gt;Embedding generation triggered &lt;strong&gt;API rate limits&lt;/strong&gt;, requiring retry&lt;br&gt;
logic.&lt;/p&gt;
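&lt;p&gt;One common approach is a small retry helper with exponential backoff. The sketch below is a simplified, hypothetical version; the exact exception type the SDK raises on rate limits is an assumption here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative retry helper with exponential backoff for rate-limited embedding calls.
import time

def embed_with_retry(client, texts, retries: int = 5):
    delay = 1.0
    for attempt in range(retries):
        try:
            return client.embeddings.create(model="mistral-embed", inputs=texts)
        except Exception:  # in practice, catch the SDK's rate-limit error specifically
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # wait before retrying
            delay *= 2         # back off exponentially
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;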

&lt;h3&gt;
  
  
  Deployment Configuration
&lt;/h3&gt;

&lt;p&gt;Deployment required careful setup of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Environment variables&lt;/li&gt;
&lt;li&gt;  Streamlit secrets&lt;/li&gt;
&lt;li&gt;  Database connection strings&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔮 Future Improvements
&lt;/h2&gt;

&lt;p&gt;Future versions of the system could include:&lt;/p&gt;

&lt;p&gt;• AI itinerary generation&lt;br&gt;
• Social media tourism trend analysis&lt;br&gt;
• Integration with booking APIs&lt;br&gt;
• User accounts and saved trips&lt;/p&gt;




&lt;h2&gt;
  
  
  🌍 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Combining &lt;strong&gt;vector databases, AI retrieval systems, and interactive web&lt;br&gt;
apps&lt;/strong&gt; opens powerful opportunities for building intelligent data&lt;br&gt;
products.&lt;/p&gt;

&lt;p&gt;This project demonstrates how AI can improve &lt;strong&gt;tourism discovery and&lt;br&gt;
travel planning&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>rag</category>
      <category>postgres</category>
      <category>etl</category>
      <category>data</category>
    </item>
    <item>
      <title>🏠 Building a Machine Learning Property Price Predictor (From Web Scraping to Deployment)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Mon, 23 Feb 2026 07:53:09 +0000</pubDate>
      <link>https://forem.com/john_analytics/building-a-machine-learning-property-price-predictor-from-web-scraping-to-deployment-2ma5</link>
      <guid>https://forem.com/john_analytics/building-a-machine-learning-property-price-predictor-from-web-scraping-to-deployment-2ma5</guid>
      <description>&lt;p&gt;In this project, I built a complete end-to-end machine learning system&lt;br&gt;
that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Scrapes property listings&lt;/li&gt;
&lt;li&gt;  Cleans and engineers features&lt;/li&gt;
&lt;li&gt;  Trains multiple ML models&lt;/li&gt;
&lt;li&gt;  Deploys a pricing app&lt;/li&gt;
&lt;li&gt;  Builds a business-ready dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article walks through the entire pipeline, from raw web data to a deployed ML product.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1 --- Web Scraping
&lt;/h2&gt;

&lt;p&gt;I built a Selenium scraper to extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Location&lt;/li&gt;
&lt;li&gt;  Property Type&lt;/li&gt;
&lt;li&gt;  Bedrooms&lt;/li&gt;
&lt;li&gt;  Bathrooms&lt;/li&gt;
&lt;li&gt;  Size (sqm)&lt;/li&gt;
&lt;li&gt;  Amenities&lt;/li&gt;
&lt;li&gt;  Price (KES)&lt;/li&gt;
&lt;li&gt;  Listing Date&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Sample Scraping Logic
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;listings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;card&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;listings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.//a[contains(@href,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/listings/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;property_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3 --- Exploratory Analysis
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Most Expensive Locations
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;location_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price (KES)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
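&lt;p&gt;The models in the next step rely on an &lt;code&gt;amenity_score&lt;/code&gt; feature. One simple way to derive it, assuming amenities were scraped as comma-separated text, is to weight and sum the listed amenities (the weights below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: turn the free-text "Amenities" column into a numeric amenity_score.
WEIGHTS = {"swimming pool": 3, "gym": 2, "security": 2, "parking": 1}

def amenity_score(amenities):
    if not isinstance(amenities, str):
        return 0
    items = [a.strip().lower() for a in amenities.split(",")]
    # Unknown amenities still count for 1 so longer lists score higher.
    return sum(WEIGHTS.get(item, 1) for item in items)

df["amenity_score"] = df["Amenities"].apply(amenity_score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;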



&lt;h2&gt;
  
  
  Step 4 --- Modeling
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Train/Test Split
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bedrooms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bathrooms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Size (sqm)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amenity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price (KES)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linear Regression (Baseline)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XGBRegressor&lt;/span&gt;

&lt;span class="n"&gt;xgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;XGBRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
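&lt;p&gt;To choose between the three models, the same metrics can be computed for each set of predictions (this continues from the variables and imports above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare the three models on identical hold-out metrics.
results = {
    "Linear Regression": pred,
    "Random Forest": rf_pred,
    "XGBoost": xgb_pred,
}

for name, p in results.items():
    print(
        f"{name}: "
        f"MAE={mean_absolute_error(y_test, p):,.0f} "
        f"RMSE={np.sqrt(mean_squared_error(y_test, p)):,.0f} "
        f"R2={r2_score(y_test, p):.3f}"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;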



&lt;h2&gt;
  
  
  Step 5 --- Deployment (Streamlit App)
&lt;/h2&gt;

&lt;p&gt;The pricing app allows users to input:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Location&lt;/li&gt;
&lt;li&gt;  Bedrooms&lt;/li&gt;
&lt;li&gt;  Bathrooms&lt;/li&gt;
&lt;li&gt;  Size&lt;/li&gt;
&lt;li&gt;  Amenities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Predicted price&lt;/li&gt;
&lt;li&gt;  Estimated range (± MAE)&lt;/li&gt;
&lt;li&gt;  Explanation of price drivers&lt;/li&gt;
&lt;/ul&gt;
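&lt;p&gt;Behind the form, the estimated range is simply the point prediction bracketed by the test-set MAE; a minimal sketch (the feature order must match the training columns above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: point prediction plus a +/- MAE band for the app's output.
import pandas as pd

def predict_price(model, mae, bedrooms, bathrooms, size_sqm, amenity_score):
    features = pd.DataFrame(
        [[bedrooms, bathrooms, size_sqm, amenity_score]],
        columns=["Bedrooms", "Bathrooms", "Size (sqm)", "amenity_score"],
    )
    estimate = float(model.predict(features)[0])
    return {"estimate": estimate, "low": estimate - mae, "high": estimate + mae}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;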

&lt;p&gt;Run locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run Streamlit_app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6 --- Executive Dashboard
&lt;/h2&gt;

&lt;p&gt;Built using Streamlit with interactive filters.&lt;/p&gt;

&lt;p&gt;Includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Median price by location&lt;/li&gt;
&lt;li&gt;  Monthly price trends&lt;/li&gt;
&lt;li&gt;  Price per sqm comparison&lt;/li&gt;
&lt;li&gt;  Amenity impact analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run Dashboard.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Size is the strongest determinant of price.&lt;/li&gt;
&lt;li&gt;  Premium neighborhoods significantly increase valuation.&lt;/li&gt;
&lt;li&gt;  Amenities increase value but are secondary drivers.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>webscraping</category>
      <category>streamlit</category>
      <category>python</category>
      <category>data</category>
    </item>
    <item>
      <title>From Messy Data to Confident Decisions: How Analysts Use Power BI, DAX, and Dashboards in the Real World</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 10 Feb 2026 08:48:51 +0000</pubDate>
      <link>https://forem.com/john_analytics/from-messy-data-to-confident-decisions-how-analysts-use-power-bi-dax-and-dashboards-in-the-real-4860</link>
      <guid>https://forem.com/john_analytics/from-messy-data-to-confident-decisions-how-analysts-use-power-bi-dax-and-dashboards-in-the-real-4860</guid>
      <description>&lt;p&gt;Power BI skills are often misunderstood as “just reporting.” In reality, professional analysts use Power BI as a &lt;strong&gt;decision-support system&lt;/strong&gt; — one that transforms messy, unreliable data into insights leaders trust to allocate budgets, adjust strategy, and measure performance.&lt;/p&gt;

&lt;p&gt;This article demonstrates how &lt;strong&gt;technical Power BI skills translate directly into real-world business decisions and measurable impact&lt;/strong&gt;, following the same workflow used in real organizations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Messy Data Is a Business Risk, Not a Technical Issue
&lt;/h2&gt;

&lt;p&gt;In real organizations, data arrives incomplete and inconsistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regions spelled differently across systems&lt;/li&gt;
&lt;li&gt;Missing transaction dates&lt;/li&gt;
&lt;li&gt;Revenue stored as text&lt;/li&gt;
&lt;li&gt;Duplicate customer records&lt;/li&gt;
&lt;li&gt;Placeholder values like &lt;code&gt;N/A&lt;/code&gt; and &lt;code&gt;Error&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these issues are ignored, dashboards show incorrect KPIs and misleading trends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;p&gt;When analysts clean data correctly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial metrics become trustworthy&lt;/li&gt;
&lt;li&gt;Performance comparisons are accurate&lt;/li&gt;
&lt;li&gt;Leaders focus on decisions instead of debating numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why analysts begin in &lt;strong&gt;Power Query&lt;/strong&gt;, not visuals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Power Query: Turning Raw Inputs into Reliable Data
&lt;/h2&gt;

&lt;p&gt;Power Query is where analysts reduce business risk.&lt;/p&gt;

&lt;p&gt;Using repeatable transformation steps, analysts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardize categories for consistent grouping&lt;/li&gt;
&lt;li&gt;Remove invalid or duplicate records&lt;/li&gt;
&lt;li&gt;Apply correct data types for calculations and time analysis&lt;/li&gt;
&lt;li&gt;Replace pseudo-blanks with true null values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Outcome
&lt;/h3&gt;

&lt;p&gt;After proper Power Query transformations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly revenue no longer fluctuates unexpectedly&lt;/li&gt;
&lt;li&gt;Forecasts align with finance systems&lt;/li&gt;
&lt;li&gt;Data refreshes produce consistent results automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliable data is the foundation of every decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Modeling: Structuring Data for Decision-Making
&lt;/h2&gt;

&lt;p&gt;Power BI does not analyze spreadsheets — it analyzes &lt;strong&gt;data models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Professional analysts design &lt;strong&gt;star schemas&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable business events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension tables&lt;/strong&gt; provide descriptive context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Well-designed models ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable filter behavior&lt;/li&gt;
&lt;li&gt;Accurate KPIs across dashboards&lt;/li&gt;
&lt;li&gt;Strong performance as data volumes grow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor modeling leads to conflicting answers and erodes stakeholder trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  DAX: Translating Business Questions into Logic
&lt;/h2&gt;

&lt;p&gt;DAX allows analysts to express business logic directly in calculations.&lt;/p&gt;

&lt;p&gt;Executives ask questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are we improving compared to last year?&lt;/li&gt;
&lt;li&gt;Which regions are underperforming?&lt;/li&gt;
&lt;li&gt;How close are we to our targets?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using DAX, analysts move beyond raw totals to meaningful metrics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profit margins&lt;/li&gt;
&lt;li&gt;Year-over-year growth&lt;/li&gt;
&lt;li&gt;Year-to-date performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;p&gt;DAX enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fair comparisons across time&lt;/li&gt;
&lt;li&gt;KPI tracking against targets&lt;/li&gt;
&lt;li&gt;Scenario-based decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Without DAX, dashboards show numbers.  With DAX, they show &lt;strong&gt;meaning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Time Intelligence: Supporting Strategic Decisions
&lt;/h2&gt;

&lt;p&gt;Time intelligence allows organizations to understand performance trends.&lt;/p&gt;

&lt;p&gt;Using time-based analysis, analysts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare current results to prior periods&lt;/li&gt;
&lt;li&gt;Detect early signs of growth or decline&lt;/li&gt;
&lt;li&gt;Measure progress toward annual goals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decisions Enabled
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expanding high-growth regions&lt;/li&gt;
&lt;li&gt;Addressing seasonal declines proactively&lt;/li&gt;
&lt;li&gt;Adjusting forecasts based on YTD performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time intelligence transforms historical data into &lt;strong&gt;forward-looking insight&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dashboards: From Information to Action
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Dashboards are not collections of charts. They are &lt;strong&gt;decision interfaces&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Effective dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highlight critical KPIs&lt;/li&gt;
&lt;li&gt;Show trends requiring attention&lt;/li&gt;
&lt;li&gt;Surface underperformance and exceptions&lt;/li&gt;
&lt;li&gt;Enable fast filtering without technical effort&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Measurable Outcomes
&lt;/h3&gt;

&lt;p&gt;Well-designed dashboards help organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce time spent validating numbers&lt;/li&gt;
&lt;li&gt;Detect issues earlier&lt;/li&gt;
&lt;li&gt;Align teams around shared metrics&lt;/li&gt;
&lt;li&gt;Act faster with confidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dashboards succeed when users know &lt;strong&gt;what to do next&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring Success: Business Impact of Power BI
&lt;/h2&gt;

&lt;p&gt;When Power BI is used effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leaders trust the data without manual validation&lt;/li&gt;
&lt;li&gt;Decisions are backed by consistent metrics&lt;/li&gt;
&lt;li&gt;Reporting effort decreases&lt;/li&gt;
&lt;li&gt;Performance improvements are measurable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real value of Power BI is not the report —&lt;br&gt;&lt;br&gt;
it is the &lt;strong&gt;decisions enabled by the report&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Power BI as a Strategic Asset
&lt;/h2&gt;

&lt;p&gt;Analysts create measurable impact by combining:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Power Query for data reliability
&lt;/li&gt;
&lt;li&gt;Data modeling for meaningful analysis
&lt;/li&gt;
&lt;li&gt;DAX for business logic
&lt;/li&gt;
&lt;li&gt;Dashboards designed for action
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is how messy data becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trusted insights&lt;/li&gt;
&lt;li&gt;Confident decisions&lt;/li&gt;
&lt;li&gt;Real business outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Power BI, when used professionally, is not a reporting tool — it is a &lt;strong&gt;strategic decision-making engine&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>powerbi</category>
      <category>data</category>
      <category>analytics</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>📊 Understanding Schemas and Data Modelling in Power BI</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 05 Feb 2026 11:13:12 +0000</pubDate>
      <link>https://forem.com/john_analytics/understanding-schemas-and-data-modelling-in-power-bi-2epd</link>
      <guid>https://forem.com/john_analytics/understanding-schemas-and-data-modelling-in-power-bi-2epd</guid>
      <description>&lt;p&gt;Data modelling is the foundation of building scalable and&lt;br&gt;
high-performance dashboards in Power BI. While many developers focus&lt;br&gt;
heavily on visuals and DAX calculations, the true performance and&lt;br&gt;
accuracy of a report depend heavily on how data is structured.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 What Is Data Modelling in Power BI?
&lt;/h2&gt;

&lt;p&gt;Data modelling refers to structuring data into logical formats that&lt;br&gt;
support analysis and reporting. Dimensional modelling organizes data&lt;br&gt;
into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Facts (measurable metrics)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dimensions (descriptive attributes)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-designed Power BI model determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How tables relate&lt;/li&gt;
&lt;li&gt;  How filters propagate&lt;/li&gt;
&lt;li&gt;  How fast reports load&lt;/li&gt;
&lt;li&gt;  How accurate calculations are&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⭐ Star Schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ What Is Star Schema?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Star Schema&lt;/strong&gt; organizes data using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  One &lt;strong&gt;central fact table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Multiple &lt;strong&gt;dimension tables&lt;/strong&gt; connected directly to the fact table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The structure resembles a star, where dimension tables surround the&lt;br&gt;
central fact table.&lt;/p&gt;




&lt;h3&gt;
  
  
  📊 Components of Star Schema
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Fact Table
&lt;/h4&gt;

&lt;p&gt;Contains measurable and quantitative data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Sales revenue&lt;/li&gt;
&lt;li&gt;  Quantity sold&lt;/li&gt;
&lt;li&gt;  Profit&lt;/li&gt;
&lt;li&gt;  Discount&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each row represents a business event like a transaction.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dimension Tables
&lt;/h4&gt;

&lt;p&gt;Provide descriptive context to fact data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Customer details&lt;/li&gt;
&lt;li&gt;  Product attributes&lt;/li&gt;
&lt;li&gt;  Date/Time&lt;/li&gt;
&lt;li&gt;  Store location&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚀 Benefits of Star Schema
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  ✔ High query performance&lt;/li&gt;
&lt;li&gt;  ✔ Simple design&lt;/li&gt;
&lt;li&gt;  ✔ Easier DAX calculations&lt;/li&gt;
&lt;li&gt;  ✔ Optimized for reporting and dashboards&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Data redundancy&lt;/li&gt;
&lt;li&gt;  Higher storage usage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ❄️ Snowflake Schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ What Is Snowflake Schema?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Snowflake Schema&lt;/strong&gt; extends star schema by normalizing dimension&lt;br&gt;
tables into multiple related tables.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Customer → City → Country&lt;/p&gt;




&lt;h3&gt;
  
  
  📌 Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Normalized dimension tables&lt;/li&gt;
&lt;li&gt;  Supports hierarchical drill-down analysis&lt;/li&gt;
&lt;li&gt;  Improves data integrity&lt;/li&gt;
&lt;li&gt;  Requires additional joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚖️ Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  ✔ Reduced redundancy&lt;/li&gt;
&lt;li&gt;  ✔ Better data consistency&lt;/li&gt;
&lt;li&gt;  ✔ Supports complex hierarchies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚠️ Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  More complex design&lt;/li&gt;
&lt;li&gt;  Slower performance due to joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📊 Fact Tables vs Dimension Tables
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Fact Tables
&lt;/h3&gt;

&lt;p&gt;Store numeric metrics and foreign keys linking to dimensions.&lt;br&gt;
Usually contain transactional data and large volumes of records.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧾 Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Store descriptive attributes that provide context to fact tables.&lt;br&gt;
Used for filtering, grouping, and categorizing data.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 Relationships in Power BI
&lt;/h2&gt;

&lt;p&gt;Relationships connect tables and enable filtering across datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  📌 Types of Relationships
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;One-to-Many&lt;/strong&gt; -- One dimension record links to many fact records&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Many-to-One&lt;/strong&gt; -- Reverse of one-to-many&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Many-to-Many&lt;/strong&gt; -- Multiple matching records on both sides&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🚨 Why Good Data Modelling Is Critical
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ⚡ Performance Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Improves query speed&lt;/li&gt;
&lt;li&gt;  Reduces memory usage&lt;/li&gt;
&lt;li&gt;  Enables faster dashboard loading&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎯 Accurate Reporting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Ensures correct aggregations&lt;/li&gt;
&lt;li&gt;  Maintains reliable filter behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧠 Easier DAX Calculations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Simplifies analytical queries&lt;/li&gt;
&lt;li&gt;  Improves calculation accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔧 Scalability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Supports future data expansion&lt;/li&gt;
&lt;li&gt;  Easier troubleshooting and maintenance&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⭐ Star Schema vs Snowflake Schema
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;Denormalized&lt;/td&gt;
&lt;td&gt;Normalized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Faster queries&lt;/td&gt;
&lt;td&gt;Slower queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Higher storage&lt;/td&gt;
&lt;td&gt;Lower storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Reporting dashboards&lt;/td&gt;
&lt;td&gt;Large complex warehouses&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;




&lt;h2&gt;
  
  
  🏆 Power BI Data Modelling Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Use Star Schema whenever possible&lt;/li&gt;
&lt;li&gt;  Separate fact and dimension tables&lt;/li&gt;
&lt;li&gt;  Maintain clear relationships&lt;/li&gt;
&lt;li&gt;  Optimize data types&lt;/li&gt;
&lt;li&gt;  Reduce unnecessary joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📌 Real-World Analogy
&lt;/h2&gt;

&lt;p&gt;Think of a library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Fact tables&lt;/strong&gt; = Books (contain measurable information)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dimension tables&lt;/strong&gt; = Catalogue system (organizes and locates
books)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;Data modelling is one of the most important skills for Power BI&lt;br&gt;
developers. Understanding schema design ensures dashboards are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Fast&lt;/li&gt;
&lt;li&gt;  Accurate&lt;/li&gt;
&lt;li&gt;  Scalable&lt;/li&gt;
&lt;li&gt;  Maintainable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before building visuals or writing DAX formulas, always ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is my data model structured correctly?&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>datamodelling</category>
      <category>powerbi</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Getting Started With Linux for Data Engineers (With Vi and Nano Examples)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 05 Feb 2026 10:55:10 +0000</pubDate>
      <link>https://forem.com/john_analytics/getting-started-with-linux-for-data-engineers-with-vi-and-nano-examples-4m99</link>
      <guid>https://forem.com/john_analytics/getting-started-with-linux-for-data-engineers-with-vi-and-nano-examples-4m99</guid>
      <description>&lt;p&gt;If you're getting into data engineering, there's one skill that keeps&lt;br&gt;
showing up everywhere: &lt;strong&gt;Linux&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether you're working with cloud servers, big data tools, or pipeline&lt;br&gt;
automation, Linux is almost always running behind the scenes. The good&lt;br&gt;
news? You don't need to be a Linux wizard to get started.&lt;/p&gt;

&lt;p&gt;In this guide, we'll break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Why data engineers need Linux&lt;/li&gt;
&lt;li&gt;  Basic commands you'll actually use&lt;/li&gt;
&lt;li&gt;  How to edit files using Vi&lt;/li&gt;
&lt;li&gt;  How to edit files using Nano&lt;/li&gt;
&lt;li&gt;  Real-world examples&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why Should Data Engineers Learn Linux?
&lt;/h2&gt;

&lt;p&gt;Here's the honest truth --- most production data systems run on Linux&lt;br&gt;
servers.&lt;/p&gt;

&lt;p&gt;When you deploy Spark jobs, schedule Airflow pipelines, or manage&lt;br&gt;
databases, you'll likely connect to a Linux machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  It's Built for Performance
&lt;/h3&gt;

&lt;p&gt;Linux handles heavy workloads really well, which is perfect for big data&lt;br&gt;
processing.&lt;/p&gt;
&lt;h3&gt;
  
  
  It's Highly Customizable
&lt;/h3&gt;

&lt;p&gt;Since Linux is open source, companies tailor it for their&lt;br&gt;
infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  It Runs the Cloud
&lt;/h3&gt;

&lt;p&gt;Most AWS, Azure, and Google Cloud servers run Linux.&lt;/p&gt;
&lt;h3&gt;
  
  
  It Supports Automation
&lt;/h3&gt;

&lt;p&gt;Data engineers constantly automate workflows using shell scripts.&lt;/p&gt;


&lt;h2&gt;
  
  
  Linux Commands Every Beginner Should Know
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Check Where You Are
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  List Files
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Move Between Folders
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;folder_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create a Folder
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;data_project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Read a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Text Editors Matter in Linux
&lt;/h2&gt;

&lt;p&gt;When you log into a server, there's usually no graphical editor like VS&lt;br&gt;
Code or Notepad.&lt;/p&gt;

&lt;p&gt;Instead, you use terminal editors like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Vi (powerful but tricky)&lt;/li&gt;
&lt;li&gt;  Nano (simple and beginner-friendly)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Using Vi (The Power Tool)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Open or Create a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi sample.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Enter Insert Mode
&lt;/h3&gt;

&lt;p&gt;Press &lt;code&gt;i&lt;/code&gt; and start typing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Save and Exit
&lt;/h3&gt;

&lt;p&gt;Press &lt;code&gt;ESC&lt;/code&gt;, then type:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:wq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Exit Without Saving
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:q!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Script
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pipeline started"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Using Nano (The Friendly Editor)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open a File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Save Your Work
&lt;/h3&gt;

&lt;p&gt;Press:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CTRL + O
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Exit Nano
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CTRL + X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Config
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano config.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database=postgres
username=admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Real-Life Data Engineering Scenario
&lt;/h2&gt;

&lt;p&gt;You may need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Update Airflow configuration&lt;/li&gt;
&lt;li&gt;  Fix pipeline scripts&lt;/li&gt;
&lt;li&gt;  Modify database credentials&lt;/li&gt;
&lt;li&gt;  Check logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commands might include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano airflow.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pro Tips for Beginners
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Always back up files before editing&lt;/li&gt;
&lt;li&gt;  Practice Vi commands slowly&lt;/li&gt;
&lt;li&gt;  Use Nano when learning&lt;/li&gt;
&lt;li&gt;  Learn basic shell commands daily&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Linux is part of the foundation of modern data infrastructure.&lt;/p&gt;

&lt;p&gt;Learning Linux commands and text editors gives you confidence when&lt;br&gt;
working with production servers and cloud platforms.&lt;/p&gt;

&lt;p&gt;Start with Nano.&lt;br&gt;
Grow into Vi.&lt;br&gt;
Practice consistently.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>data</category>
    </item>
    <item>
      <title>Real-World ETL Pipeline from a Public Google Sheet</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:52:02 +0000</pubDate>
      <link>https://forem.com/john_analytics/real-world-etl-pipeline-from-a-public-google-sheet-7cn</link>
      <guid>https://forem.com/john_analytics/real-world-etl-pipeline-from-a-public-google-sheet-7cn</guid>
      <description>&lt;p&gt;Spreadsheets are everywhere.&lt;/p&gt;

&lt;p&gt;They’re easy to use, easy to share, and often become the &lt;strong&gt;first home of business data&lt;/strong&gt;. But they’re terrible for analytics, automation, and scale.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk through how I built a &lt;strong&gt;production-style ETL pipeline&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts data from a &lt;strong&gt;public Google Sheet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cleans and validates the data using Python&lt;/li&gt;
&lt;li&gt;Loads the data into &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;MongoDB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handles real-world issues like UUIDs, connection strings, and performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A supermarket dataset was stored in a Google Sheet. While this works for manual inspection, it introduces several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No schema enforcement&lt;/li&gt;
&lt;li&gt;No support for analytics or BI tools&lt;/li&gt;
&lt;li&gt;Poor performance for large queries&lt;/li&gt;
&lt;li&gt;No safe way to integrate with applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to move this data into proper databases while following &lt;strong&gt;real ETL best practices&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Public Google Sheet (CSV Export)
        ↓
     Python ETL
        ↓
  Transform &amp;amp; Validate
        ↓
 PostgreSQL (Analytics)   MongoDB (Documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why two databases?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; acts as the system of record for analytics and reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; provides flexible, document-based storage for application access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tools Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UV&lt;/strong&gt; for dependency and environment management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; for data transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests&lt;/strong&gt; for HTTP-based extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; (via SQLAlchemy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; (via PyMongo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loguru&lt;/strong&gt; for structured logging&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Extracting Data from a Public Google Sheet
&lt;/h2&gt;

&lt;p&gt;Instead of dealing with Google Cloud authentication, I used a simpler (and very realistic) approach.&lt;/p&gt;

&lt;p&gt;Google Sheets exposes a &lt;strong&gt;CSV export endpoint&lt;/strong&gt; for public sheets.&lt;/p&gt;

&lt;p&gt;A human-friendly link like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://docs.google.com/spreadsheets/d/&amp;lt;sheet-id&amp;gt;/edit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be converted to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://docs.google.com/spreadsheets/d/&amp;lt;sheet-id&amp;gt;/export?format=csv&amp;amp;gid=&amp;lt;gid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python can then fetch the data directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys. No OAuth. Fully automated.&lt;/p&gt;
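&lt;p&gt;Putting those pieces together, a self-contained extract step looks roughly like this (the sheet id and gid are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Complete extract step: build the CSV export URL and read it into pandas.
from io import StringIO

import pandas as pd
import requests

SHEET_ID = "your-sheet-id"  # placeholder
GID = "0"                   # placeholder tab id

csv_url = (
    f"https://docs.google.com/spreadsheets/d/{SHEET_ID}"
    f"/export?format=csv&amp;amp;gid={GID}"
)

response = requests.get(csv_url, timeout=30)
response.raise_for_status()
df = pd.read_csv(StringIO(response.text))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;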




&lt;h2&gt;
  
  
  Step 2: Transforming the Data (Where Things Get Real)
&lt;/h2&gt;

&lt;p&gt;This is where assumptions break.&lt;/p&gt;

&lt;p&gt;I initially assumed the &lt;code&gt;id&lt;/code&gt; column was numeric. It wasn’t.&lt;/p&gt;

&lt;p&gt;It contained &lt;strong&gt;UUIDs&lt;/strong&gt; like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;47d54138-a950-4ec0-9d4a-e637e8dfb290
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trying to cast this to an integer caused the pipeline to fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson #1: The data always wins
&lt;/h3&gt;

&lt;p&gt;The fix was simple but important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat &lt;code&gt;id&lt;/code&gt; as a &lt;strong&gt;string&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Update both transformation logic and database schemas&lt;/li&gt;
&lt;/ul&gt;
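&lt;p&gt;In pandas terms the change is a one-liner, with the corresponding &lt;code&gt;id&lt;/code&gt; columns kept as text/string types in both databases:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Keep UUID identifiers as strings end to end; never cast them to int.
df["id"] = df["id"].astype(str).str.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;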




&lt;h2&gt;
  
  
  Step 3: Loading into PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is the backbone of the pipeline.&lt;/p&gt;

&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong schema enforcement&lt;/li&gt;
&lt;li&gt;Idempotent inserts&lt;/li&gt;
&lt;li&gt;Safe re-runs of the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table is created if it doesn’t exist, and inserts use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No duplicate records&lt;/li&gt;
&lt;li&gt;No need to truncate tables&lt;/li&gt;
&lt;li&gt;Safe incremental runs&lt;/li&gt;
&lt;/ul&gt;
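&lt;p&gt;A simplified sketch of that insert step with SQLAlchemy (the table and column names are trimmed here to illustrate the pattern, not the full schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: idempotent row inserts into PostgreSQL via SQLAlchemy.
from sqlalchemy import create_engine, text

def load_postgres(df, conn_str):
    engine = create_engine(conn_str)
    insert_sql = text(
        """
        INSERT INTO supermarket_sales (id, branch, total)
        VALUES (:id, :branch, :total)
        ON CONFLICT (id) DO NOTHING
        """
    )
    with engine.begin() as conn:
        # executemany-style insert: one round trip for all rows.
        conn.execute(insert_sql, df[["id", "branch", "total"]].to_dict("records"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;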




&lt;h2&gt;
  
  
  Step 4: Loading into MongoDB (and Fixing Performance)
&lt;/h2&gt;

&lt;p&gt;My first MongoDB implementation used &lt;code&gt;update_one()&lt;/code&gt; in a loop.&lt;/p&gt;

&lt;p&gt;It worked — but it was &lt;strong&gt;painfully slow&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson #2: Bulk operations matter for performance
&lt;/h3&gt;

&lt;p&gt;The fix was switching to &lt;strong&gt;bulk operations&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduced load time substantially compared to issuing one update per document.&lt;/p&gt;
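&lt;p&gt;For completeness, a hedged sketch of how that &lt;code&gt;operations&lt;/code&gt; list can be assembled as upserts keyed on &lt;code&gt;id&lt;/code&gt; (the database and collection names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: build bulk upsert operations for MongoDB from the cleaned DataFrame.
from pymongo import MongoClient, UpdateOne

def load_mongo(df, mongo_uri, db_name="etl_demo", coll_name="sales"):
    collection = MongoClient(mongo_uri)[db_name][coll_name]
    operations = [
        UpdateOne({"id": rec["id"]}, {"$set": rec}, upsert=True)
        for rec in df.to_dict("records")
    ]
    if operations:
        collection.bulk_write(operations, ordered=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;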




&lt;h2&gt;
  
  
  Step 5: Debugging a Nasty PostgreSQL Error
&lt;/h2&gt;

&lt;p&gt;One of the most confusing errors I hit was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;could not translate host name "4401@localhost"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turned out the PostgreSQL password contained an &lt;code&gt;@&lt;/code&gt; symbol.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson #3: Database passwords must be URL-safe
&lt;/h3&gt;

&lt;p&gt;The solution was &lt;strong&gt;URL encoding&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KIM@4401 → KIM%404401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
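&lt;p&gt;Python's standard library can do the encoding, so the credential never has to be escaped by hand (the user, host, and database names below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Build a connection string with a URL-safe password using the standard library.
from urllib.parse import quote_plus

password = quote_plus("KIM@4401")  # "@" becomes "%40"
conn_str = f"postgresql://etl_user:{password}@localhost:5432/etl_db"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;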






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After fixing these issues, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs end-to-end.&lt;/li&gt;
&lt;li&gt;Can be safely re-run without duplicates&lt;/li&gt;
&lt;li&gt;Loads clean data into both databases&lt;/li&gt;
&lt;li&gt;Handles real-world data quirks correctly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Spreadsheets are common sources — but not suitable destinations&lt;/li&gt;
&lt;li&gt;Never assume data types without inspecting real data&lt;/li&gt;
&lt;li&gt;UUIDs are extremely common in production systems&lt;/li&gt;
&lt;li&gt;Bulk operations matter for performance&lt;/li&gt;
&lt;li&gt;Environment variables and connection strings are frequent failure points&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project wasn’t about flashy tools.&lt;/p&gt;

&lt;p&gt;It was about building something &lt;strong&gt;real&lt;/strong&gt;, breaking it, and fixing it.&lt;/p&gt;




</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a MedAdvantage RAF Engine with dbt &amp; PostgreSQL (Step-by-Step Guide)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 13 Jan 2026 10:09:54 +0000</pubDate>
      <link>https://forem.com/john_analytics/building-a-medadvantage-raf-engine-with-dbt-postgresql-step-by-step-guide-5fed</link>
      <guid>https://forem.com/john_analytics/building-a-medadvantage-raf-engine-with-dbt-postgresql-step-by-step-guide-5fed</guid>
      <description>&lt;p&gt;In this project, I built a mini &lt;strong&gt;Medicare Advantage Risk Adjustment Factor (RAF) engine&lt;/strong&gt; using &lt;strong&gt;PostgreSQL&lt;/strong&gt;, &lt;strong&gt;dbt Core&lt;/strong&gt;, and &lt;strong&gt;synthetic healthcare data&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The goal was to simulate a real-world healthcare analytics pipeline that transforms &lt;strong&gt;raw claims data into member-level risk scores&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article walks through the &lt;strong&gt;entire process step by step&lt;/strong&gt;, from raw CSV files to a final analytics mart.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Project Overview
&lt;/h2&gt;

&lt;p&gt;The project models how healthcare organizations calculate &lt;strong&gt;risk scores&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member demographic data
&lt;/li&gt;
&lt;li&gt;Medical claims with &lt;strong&gt;ICD-10 diagnosis codes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Pharmacy claims with &lt;strong&gt;NDC drug codes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reference mapping tables that convert codes into &lt;strong&gt;HCC and RxHCC categories&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final output is a table that shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member ID
&lt;/li&gt;
&lt;li&gt;Service year
&lt;/li&gt;
&lt;li&gt;Gender and plan
&lt;/li&gt;
&lt;li&gt;Total HCC weight
&lt;/li&gt;
&lt;li&gt;Total RxHCC weight
&lt;/li&gt;
&lt;li&gt;Final risk score
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This type of table is commonly used for &lt;strong&gt;actuarial analysis, reimbursement modeling, and population health analytics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tools &amp;amp; Environment Setup
&lt;/h2&gt;

&lt;p&gt;Before starting the project, I installed and configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; as the data warehouse
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBeaver&lt;/strong&gt; as the database client
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python + virtual environment (venv)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt Core with the Postgres adapter&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup allows a modern &lt;strong&gt;ELT (Extract → Load → Transform)&lt;/strong&gt; workflow where data is first loaded into the warehouse and then transformed using dbt.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Database Design
&lt;/h2&gt;

&lt;p&gt;I used one PostgreSQL database with &lt;strong&gt;two schemas&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;med_project&lt;/code&gt;&lt;/strong&gt; → holds raw data and reference tables
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;analytics&lt;/code&gt;&lt;/strong&gt; → holds all dbt models (staging, core, and marts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Raw tables stored in &lt;code&gt;med_project&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Members
&lt;/li&gt;
&lt;li&gt;Medical claims
&lt;/li&gt;
&lt;li&gt;Pharmacy claims
&lt;/li&gt;
&lt;li&gt;ICD-to-HCC mapping table
&lt;/li&gt;
&lt;li&gt;NDC-to-RxHCC mapping table
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  dbt models stored in &lt;code&gt;analytics&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Staging models
&lt;/li&gt;
&lt;li&gt;Core models
&lt;/li&gt;
&lt;li&gt;Mart models (final analytics tables)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation keeps &lt;strong&gt;raw data immutable&lt;/strong&gt; and &lt;strong&gt;transformed data fully governed by dbt&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Loading the Raw Data
&lt;/h2&gt;

&lt;p&gt;Synthetic CSV files were generated and loaded into PostgreSQL using &lt;strong&gt;DBeaver’s Import Tool&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Each CSV corresponded to one raw table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Members
&lt;/li&gt;
&lt;li&gt;Medical claims
&lt;/li&gt;
&lt;li&gt;Pharmacy claims
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All raw table columns were stored as &lt;strong&gt;TEXT&lt;/strong&gt; initially. This prevents ingestion failures during loading and allows all data type enforcement to be handled inside dbt.&lt;/p&gt;
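
&lt;p&gt;For illustration, a raw claims table in this pattern might look like the sketch below; the column names are assumptions rather than the exact project schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- every column lands as text; data typing is deferred to the dbt staging layer
create table med_project.raw_medical_claims (
    claim_id      text,
    member_id     text,
    service_date  text,   -- U.S.-formatted date string, parsed later in staging
    icd10_code    text,
    billed_amount text
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;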




&lt;h2&gt;
  
  
  5. Initializing the dbt Project
&lt;/h2&gt;

&lt;p&gt;A new dbt project was created inside a Python virtual environment.&lt;br&gt;&lt;br&gt;
During initialization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL was selected as the adapter
&lt;/li&gt;
&lt;li&gt;A profile was created in &lt;code&gt;profiles.yml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The default target schema was set to &lt;code&gt;analytics&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The connection was then validated to confirm that dbt could reach the PostgreSQL database.&lt;/p&gt;





&lt;h2&gt;
  
  
  6. Registering Raw Tables as dbt Sources
&lt;/h2&gt;

&lt;p&gt;To allow dbt to reference raw tables safely, a &lt;strong&gt;source configuration file&lt;/strong&gt; was created.&lt;/p&gt;

&lt;p&gt;This file tells dbt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which schema the raw tables are in
&lt;/li&gt;
&lt;li&gt;Which tables are considered &lt;strong&gt;authoritative raw sources&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Which tables act as &lt;strong&gt;reference data (HCC &amp;amp; RxHCC mappings)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables consistent use of dbt sources and prevents hard-coding schema names inside models.&lt;/p&gt;
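
&lt;p&gt;Once the sources are declared, models reference raw tables through dbt’s &lt;code&gt;source()&lt;/code&gt; function rather than hard-coded schema names. A minimal sketch (the source and table identifiers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from {{ source('med_project', 'raw_medical_claims') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;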




&lt;h2&gt;
  
  
  7. Building the Staging Layer
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;staging layer&lt;/strong&gt; is where all raw data is cleaned and standardized.&lt;/p&gt;

&lt;p&gt;At this stage, I performed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit parsing of U.S.-formatted dates
&lt;/li&gt;
&lt;li&gt;Numeric type conversions for amounts and quantities
&lt;/li&gt;
&lt;li&gt;Trimming and upper-casing of ICD and NDC codes
&lt;/li&gt;
&lt;li&gt;Basic null handling
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three staging models were created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;stg_members&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stg_medical_claims&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stg_pharmacy_claims&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
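
&lt;p&gt;As a condensed sketch of this pattern, a staging model such as &lt;code&gt;stg_medical_claims&lt;/code&gt; might look roughly like the following (column names and the exact date format are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with source as (

    select * from {{ source('med_project', 'raw_medical_claims') }}

)

select
    claim_id,
    member_id,
    to_date(service_date, 'MM/DD/YYYY')      as service_date,   -- explicit U.S. date parsing
    upper(trim(icd10_code))                  as icd10_code,     -- normalize diagnosis codes
    nullif(trim(billed_amount), '')::numeric as billed_amount   -- enforce numeric type
from source
where member_id is not null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;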

&lt;p&gt;These models ensure that all downstream data has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistent data types
&lt;/li&gt;
&lt;li&gt;Clean formats
&lt;/li&gt;
&lt;li&gt;Reliable date values
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Building the Core Layer
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;core layer&lt;/strong&gt; represents &lt;strong&gt;analytics-ready entities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here I focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating one clean row per member
&lt;/li&gt;
&lt;li&gt;Passing through only validated claim and pharmacy records
&lt;/li&gt;
&lt;li&gt;Preparing the data for aggregation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core models included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean member dimension
&lt;/li&gt;
&lt;li&gt;Medical claim fact table
&lt;/li&gt;
&lt;li&gt;Pharmacy claim fact table
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer removes duplication and creates stable structures used by the marts.&lt;/p&gt;
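
&lt;p&gt;A minimal sketch of the member dimension, assuming the staging model carries a load timestamp to deduplicate on (the extra column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- keep exactly one row per member, preferring the most recently loaded record
select distinct on (member_id)
    member_id,
    gender,
    plan,
    date_of_birth
from {{ ref('stg_members') }}
order by member_id, loaded_at desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;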




&lt;h2&gt;
  
  
  9. Mapping Diagnoses to HCCs
&lt;/h2&gt;

&lt;p&gt;Medical claims were mapped to &lt;strong&gt;HCC categories&lt;/strong&gt; using the ICD-to-HCC reference table.&lt;/p&gt;

&lt;p&gt;At this stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each member’s diagnosis codes were expanded
&lt;/li&gt;
&lt;li&gt;Codes were normalized
&lt;/li&gt;
&lt;li&gt;Valid diagnoses were matched to HCC categories
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output produces one record per &lt;strong&gt;member per year per HCC category&lt;/strong&gt;, with an associated &lt;strong&gt;risk weight&lt;/strong&gt;.&lt;/p&gt;
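
&lt;p&gt;In dbt terms, this step is essentially a join between the medical claim facts and the ICD-to-HCC reference table. A simplified sketch (model and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select
    claims.member_id,
    extract(year from claims.service_date)::int as service_year,
    map.hcc_code,
    max(map.hcc_weight)                         as hcc_weight   -- count each HCC once per member-year
from {{ ref('fct_medical_claims') }} as claims
join {{ source('med_project', 'icd_hcc_map') }} as map
  on upper(trim(claims.icd10_code)) = map.icd10_code
group by 1, 2, 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;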




&lt;h2&gt;
  
  
  10. Computing Final Member Risk Scores
&lt;/h2&gt;

&lt;p&gt;The final step combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregated &lt;strong&gt;HCC weights&lt;/strong&gt; from medical claims
&lt;/li&gt;
&lt;li&gt;Aggregated &lt;strong&gt;RxHCC weights&lt;/strong&gt; from pharmacy claims
&lt;/li&gt;
&lt;li&gt;Member demographic attributes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each member and service year:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HCC weights were summed
&lt;/li&gt;
&lt;li&gt;RxHCC weights were summed
&lt;/li&gt;
&lt;li&gt;A base score was added
&lt;/li&gt;
&lt;li&gt;A final &lt;strong&gt;risk score&lt;/strong&gt; was computed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This produced the final analytics mart: &lt;strong&gt;&lt;code&gt;member_risk_scores&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;
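
&lt;p&gt;A condensed sketch of the mart logic, assuming intermediate models with one row per member, year, and (Rx)HCC category, plus an illustrative base score (the model names and the 0.30 constant are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with hcc as (

    select member_id, service_year, sum(hcc_weight) as total_hcc_weight
    from {{ ref('int_member_hcc') }}
    group by 1, 2

),

rxhcc as (

    select member_id, service_year, sum(rxhcc_weight) as total_rxhcc_weight
    from {{ ref('int_member_rxhcc') }}
    group by 1, 2

)

select
    m.member_id,
    coalesce(h.service_year, r.service_year) as service_year,
    m.gender,
    m.plan,
    coalesce(h.total_hcc_weight, 0)          as total_hcc_weight,
    coalesce(r.total_rxhcc_weight, 0)        as total_rxhcc_weight,
    0.30                                     -- illustrative demographic base score
      + coalesce(h.total_hcc_weight, 0)
      + coalesce(r.total_rxhcc_weight, 0)    as risk_score
from hcc as h
full outer join rxhcc as r
  on h.member_id = r.member_id
 and h.service_year = r.service_year
join {{ ref('dim_members') }} as m
  on m.member_id = coalesce(h.member_id, r.member_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;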




&lt;h2&gt;
  
  
  11. Results &amp;amp; Key Outcomes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A fully functional &lt;strong&gt;end-to-end healthcare analytics pipeline&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Clean transformation workflow from raw CSVs to final mart&lt;/li&gt;
&lt;li&gt;A production-style &lt;strong&gt;member-level risk score table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data ready for reporting tools like Power BI or Tableau&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;dbt enforces strong modeling discipline through layered architecture
&lt;/li&gt;
&lt;li&gt;Reference mapping tables are the backbone of healthcare risk analytics
&lt;/li&gt;
&lt;li&gt;Explicit type casting prevents silent data quality issues
&lt;/li&gt;
&lt;li&gt;Separation of raw, staging, core, and marts ensures scalability and auditability
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  12. Key Challenges Faced &amp;amp; Resolved
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL CSV permission issues → Resolved by using client-side imports
&lt;/li&gt;
&lt;li&gt;Cross-database reference errors in dbt → Fixed by aligning the dbt database configuration
&lt;/li&gt;
&lt;li&gt;U.S. date format parsing errors → Solved by explicitly controlling date parsing in staging
&lt;/li&gt;
&lt;li&gt;Semicolon syntax errors in dbt models → Resolved by removing trailing semicolons
&lt;/li&gt;
&lt;li&gt;Source configuration mismatches → Fixed by correcting schema references
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how modern analytics engineering tools can simulate &lt;strong&gt;real Medicare Advantage risk modeling workflows&lt;/strong&gt; using open-source technologies.  &lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>INTRODUCTION TO DBT (Data Build Tool)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Mon, 24 Nov 2025 08:58:17 +0000</pubDate>
      <link>https://forem.com/john_analytics/introduction-to-dbtdata-build-tool-5aeg</link>
      <guid>https://forem.com/john_analytics/introduction-to-dbtdata-build-tool-5aeg</guid>
      <description>&lt;p&gt;A Beginner-Friendly Guide to Modern Analytics Engineering&lt;/p&gt;

&lt;p&gt;This article introduces &lt;strong&gt;dbt (data build tool)&lt;/strong&gt; and explores several foundational concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What dbt is
&lt;/li&gt;
&lt;li&gt;Core principles
&lt;/li&gt;
&lt;li&gt;Why we transform data
&lt;/li&gt;
&lt;li&gt;How dbt structures SQL development
&lt;/li&gt;
&lt;li&gt;How macros, tests, documentation, and &lt;code&gt;ref&lt;/code&gt; work
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 What is dbt?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;dbt is an open-source transformation framework&lt;/strong&gt; that allows anyone comfortable with SQL to build modular, version-controlled, production-grade data pipelines.&lt;/p&gt;

&lt;p&gt;Unlike ingestion tools, dbt &lt;strong&gt;does not move data&lt;/strong&gt;; it &lt;strong&gt;transforms&lt;/strong&gt; the data already in your warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧹 Why Transform Data?
&lt;/h2&gt;

&lt;p&gt;dbt helps reshape and standardize raw data for analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning
&lt;/li&gt;
&lt;li&gt;Deduplication
&lt;/li&gt;
&lt;li&gt;Restructuring
&lt;/li&gt;
&lt;li&gt;Filtering
&lt;/li&gt;
&lt;li&gt;Aggregation
&lt;/li&gt;
&lt;li&gt;Joining
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗 dbt Is Open Core
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DBT CORE&lt;/th&gt;
&lt;th&gt;DBT CLOUD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open-source data transformation&lt;/td&gt;
&lt;td&gt;Fully managed dbt experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License: Apache 2.0&lt;/td&gt;
&lt;td&gt;SaaS platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL compiler, Jinja, adapters&lt;/td&gt;
&lt;td&gt;IDE, scheduling, logging, alerting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI interface&lt;/td&gt;
&lt;td&gt;Authentication &amp;amp; SSO&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  🔑 Core Concepts in dbt
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Write transformations using SELECT statements
&lt;/li&gt;
&lt;li&gt;Build DAGs using &lt;code&gt;ref()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Test models to ensure accuracy
&lt;/li&gt;
&lt;li&gt;Generate documentation
&lt;/li&gt;
&lt;li&gt;Use macros to write reusable SQL
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🧩 CORE 1: Express all transforms with SELECT statements
&lt;/h2&gt;

&lt;p&gt;In dbt, &lt;strong&gt;everything is a SELECT statement&lt;/strong&gt;; dbt takes care of the boilerplate SQL DDL around it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;dbt Model Example&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;is_deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compiled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;is_deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;dbt supports several materialization strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Materialization&lt;/th&gt;
&lt;th&gt;What dbt builds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table&lt;/td&gt;
&lt;td&gt;create table analytics.orders as (...)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;View&lt;/td&gt;
&lt;td&gt;create view analytics.orders as (...)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ephemeral&lt;/td&gt;
&lt;td&gt;Interpolated into downstream models as a CTE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incremental (advanced)&lt;/td&gt;
&lt;td&gt;Selective rebuild for new or changed rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build your own&lt;/td&gt;
&lt;td&gt;Custom materializations defined with macros&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  🔗 CORE 2: Express relationships with the {{ ref() }} statement
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;{{ ref(...) }}&lt;/code&gt; statement automatically handles dependencies between dbt models.&lt;/p&gt;

&lt;p&gt;It does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Interpolates the fully qualified name (schema and table) of the referenced model&lt;/li&gt;
&lt;li&gt;Builds an edge in the DAG between the two models, so dbt understands their dependency order&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Use it like you would any table&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'base_orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧪 CORE 3: Easily build tests to ensure model accuracy
&lt;/h2&gt;

&lt;p&gt;dbt has a built-in framework for testing your models and datasets.&lt;/p&gt;

&lt;p&gt;dbt can test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uniqueness
&lt;/li&gt;
&lt;li&gt;non-null
&lt;/li&gt;
&lt;li&gt;accepted values
&lt;/li&gt;
&lt;li&gt;foreign key relationships
&lt;/li&gt;
&lt;/ul&gt;
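
&lt;p&gt;The generic tests above are declared in YAML. dbt also supports &lt;strong&gt;singular tests&lt;/strong&gt;: plain SQL files in the &lt;code&gt;tests/&lt;/code&gt; directory that should return zero rows. A small illustrative example (the model and column names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- tests/assert_no_negative_amounts.sql (illustrative)
-- dbt marks this test as failing if the query returns any rows
select *
from {{ ref('stg_payments') }}
where amount_usd &lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;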

&lt;h2&gt;
  
  
  📚 CORE 4: Documentation is accessible and easily updated
&lt;/h2&gt;

&lt;p&gt;When dbt generates documentation, it pulls together everything it knows about your project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Description (from .yml file)&lt;/li&gt;
&lt;li&gt;Model dependencies&lt;/li&gt;
&lt;li&gt;Model SQL&lt;/li&gt;
&lt;li&gt;Sources&lt;/li&gt;
&lt;li&gt;Tests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Generate docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt docs generate &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; dbt docs serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧠 CORE 5: Use macros to write reusable, modular SQL
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Using Jinja turns your dbt project into a programming environment for SQL&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use control structures (e.g., if statements and for loops) in SQL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use environment variables in your dbt project for production deployments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operate on the results of one query to generate another query&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Abstract snippets of SQL into reusable macros — these are analogous to functions in most programming languages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Macros in Jinja are pieces of code that can be reused multiple times across a project.&lt;br&gt;
&lt;strong&gt;Macro (macros/cents_to_dollars.sql)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% macro cents_to_dollars(column_name, precision=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ precision }})
{% endmacro %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model (models/stg_payments.sql)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;payment_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;cents_to_dollars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'amount'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;amount_usd&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;dbt helps teams deliver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher-quality datasets
&lt;/li&gt;
&lt;li&gt;Faster development cycles
&lt;/li&gt;
&lt;li&gt;Lower maintenance costs
&lt;/li&gt;
&lt;li&gt;Clear lineage
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>analytics</category>
    </item>
    <item>
      <title>🌍 Building a Live Weather Dashboard for East Africa Using Power BI and OpenWeatherMap API</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 11 Nov 2025 08:02:27 +0000</pubDate>
      <link>https://forem.com/john_analytics/building-a-live-weather-dashboard-for-east-africa-using-power-bi-and-openweathermap-api-4ok1</link>
      <guid>https://forem.com/john_analytics/building-a-live-weather-dashboard-for-east-africa-using-power-bi-and-openweathermap-api-4ok1</guid>
      <description>&lt;h2&gt;
  
  
  💡 Introduction
&lt;/h2&gt;

&lt;p&gt;Have you ever wished you could monitor real-time weather conditions across multiple cities — all in one dashboard?&lt;/p&gt;

&lt;p&gt;This project started with a simple question:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“Can I connect Power BI directly to a live API and visualize up-to-date weather data?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer turned out to be &lt;em&gt;yes&lt;/em&gt;.  &lt;/p&gt;

&lt;p&gt;Using the &lt;strong&gt;OpenWeatherMap API&lt;/strong&gt;, I built an &lt;strong&gt;interactive Power BI dashboard&lt;/strong&gt; that tracks live &lt;strong&gt;temperature, humidity, and pressure&lt;/strong&gt; for four East African cities — &lt;strong&gt;Nairobi, Mombasa, Kampala, and Kigali&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article walks you through the journey, design decisions, and insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 The Problem
&lt;/h2&gt;

&lt;p&gt;Weather data in East Africa is often scattered across various platforms — news sites, mobile apps, and government portals. Comparing weather conditions between cities can be time-consuming and inconsistent.&lt;/p&gt;

&lt;p&gt;As a data analyst, I wanted a &lt;strong&gt;unified, automated, and interactive dashboard&lt;/strong&gt; that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch live weather data automatically
&lt;/li&gt;
&lt;li&gt;Display real-time metrics
&lt;/li&gt;
&lt;li&gt;Let users explore and compare cities with intuitive visuals
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, a single dashboard where &lt;strong&gt;data meets interactivity&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠 Tools &amp;amp; Approach
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power BI Desktop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data visualization and dashboard design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power Query (M language)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connecting and transforming API data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenWeatherMap API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live data source for weather metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DAX (Data Analysis Expressions)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creating dynamic interactivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hosting project files and documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connect Power BI to OpenWeatherMap&lt;/strong&gt; — to fetch weather data in real time.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform JSON responses&lt;/strong&gt; in Power Query — to clean and standardize the data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design visuals&lt;/strong&gt; that show temperature, humidity, and pressure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add interactivity&lt;/strong&gt; with slicers, tooltips, and drillthrough pages.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish &amp;amp; document&lt;/strong&gt; the project for sharing and reproducibility.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🎨 Designing the Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard is minimal, clean, and interactive.&lt;/p&gt;

&lt;p&gt;At the top, three &lt;strong&gt;cards&lt;/strong&gt; display the current &lt;strong&gt;temperature, humidity, and pressure&lt;/strong&gt; for the selected city.&lt;br&gt;&lt;br&gt;
Below, a &lt;strong&gt;bar chart&lt;/strong&gt; compares temperatures across all four cities.  &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;city slicer&lt;/strong&gt; allows filtering, while &lt;strong&gt;hover tooltips&lt;/strong&gt; and &lt;strong&gt;drillthrough pages&lt;/strong&gt; reveal more details about each city.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashboard Highlights
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🌍 &lt;strong&gt;City Filter (Slicer)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Filter visuals by city dynamically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🌡 &lt;strong&gt;Interactive Cards&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Display temperature, humidity, and pressure in real time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📊 &lt;strong&gt;Bar Chart Comparison&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Compare temperatures across multiple cities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🧭 &lt;strong&gt;Drillthrough Page&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Explore detailed city metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;💬 &lt;strong&gt;Tooltip Page&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Get insights by simply hovering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔁 &lt;strong&gt;Automatic Refresh&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Updates data directly from the API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  📈 Insights &amp;amp; Key Takeaways
&lt;/h2&gt;

&lt;p&gt;After visualizing the live data, a few trends became clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mombasa&lt;/strong&gt; consistently shows the &lt;strong&gt;highest temperatures&lt;/strong&gt;, typical of coastal climates.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kigali&lt;/strong&gt; and &lt;strong&gt;Kampala&lt;/strong&gt; are cooler and more humid due to their higher elevation.
&lt;/li&gt;
&lt;li&gt;The ability to &lt;strong&gt;compare cities side by side&lt;/strong&gt; reveals regional climate variations in real time.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dashboard goes beyond static reports — it offers &lt;em&gt;contextual, live weather insights&lt;/em&gt; at a glance.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Building this project was a practical exercise in real-world API integration within Power BI.&lt;br&gt;&lt;br&gt;
Here are the biggest takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Power BI can handle live APIs&lt;/strong&gt;, not just static data sources.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling in Power Query&lt;/strong&gt; is essential when dealing with real-time data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAX measures&lt;/strong&gt; are the secret to responsive dashboards.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Good design is clarity&lt;/strong&gt; — simplicity and focus always win.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterizing API keys&lt;/strong&gt; keeps your work secure and scalable.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🧭 Why This Project Matters
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how &lt;strong&gt;data visualization&lt;/strong&gt; and &lt;strong&gt;API integration&lt;/strong&gt; can solve everyday challenges — from environmental monitoring to urban planning.  &lt;/p&gt;

&lt;p&gt;For organizations, such dashboards can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support &lt;strong&gt;agriculture&lt;/strong&gt; by tracking temperature and humidity trends.
&lt;/li&gt;
&lt;li&gt;Aid &lt;strong&gt;energy planning&lt;/strong&gt; by monitoring pressure and heat variations.
&lt;/li&gt;
&lt;li&gt;Help &lt;strong&gt;logistics and travel&lt;/strong&gt; businesses adjust operations based on weather.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concept extends beyond weather — it’s a blueprint for any &lt;strong&gt;real-time data visualization project&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>datascience</category>
      <category>microsoft</category>
      <category>showdev</category>
      <category>api</category>
    </item>
    <item>
      <title>🌍 Automating Africa’s Energy Data Collection Using Python, Playwright (+Why Playwright?), and MongoDB (2000–2024)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 04 Nov 2025 12:07:16 +0000</pubDate>
      <link>https://forem.com/john_analytics/automating-africas-energy-data-collection-using-python-playwright-and-mongodb-2000-2024-29lo</link>
      <guid>https://forem.com/john_analytics/automating-africas-energy-data-collection-using-python-playwright-and-mongodb-2000-2024-29lo</guid>
      <description>&lt;h2&gt;
  
  
  ⚡ Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s data-driven world, access to reliable and structured energy data is critical for decision-making, research, and policy planning.&lt;br&gt;&lt;br&gt;
However, most open data platforms in Africa — such as the &lt;strong&gt;Africa Energy Portal (AEP)&lt;/strong&gt; — present information in dashboard views, which makes large-scale analysis tedious.&lt;/p&gt;

&lt;p&gt;To address this challenge, I built a fully automated &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapes &lt;strong&gt;energy indicators&lt;/strong&gt; for all African countries (2000–2024),&lt;/li&gt;
&lt;li&gt;Formats and validates the data for consistency,&lt;/li&gt;
&lt;li&gt;And stores it in a &lt;strong&gt;MongoDB database&lt;/strong&gt; for easy access and analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project uses &lt;strong&gt;Python&lt;/strong&gt;, &lt;strong&gt;Playwright&lt;/strong&gt;, and &lt;strong&gt;MongoDB&lt;/strong&gt;, with automation powered by the lightweight dependency manager &lt;strong&gt;uv&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧩 Problem Statement
&lt;/h2&gt;

&lt;p&gt;While the Africa Energy Portal provides valuable country-level datasets, it does not offer a bulk download option.&lt;br&gt;&lt;br&gt;
Researchers, analysts, and energy planners need historical time-series data — such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Electricity generation and consumption
&lt;/li&gt;
&lt;li&gt;Renewable energy contribution
&lt;/li&gt;
&lt;li&gt;Access to clean cooking
&lt;/li&gt;
&lt;li&gt;Population electrification (rural vs urban)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manually downloading data for &lt;strong&gt;50+ African countries&lt;/strong&gt; and 20+ years would take days — not counting inconsistencies in data formats and missing years.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;automate it end-to-end&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧠 Project Goals
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data for all African countries directly from the AEP website.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it into a structured, tabular format for analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store&lt;/strong&gt; it efficiently in MongoDB for scalability and retrieval.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; data completeness and consistency across countries and indicators.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export&lt;/strong&gt; the final cleaned dataset for analysis and sharing.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  ⚙️ Tools &amp;amp; Technologies
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Tool / Library&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Web scraping&lt;/td&gt;
&lt;td&gt;&lt;a href="https://playwright.dev/python/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Automates browser-based data capture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment &amp;amp; Dependency Management&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Manages virtual environment and packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data storage&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stores country-wise metrics and year data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data validation &amp;amp; analysis&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;pydantic&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cleans and structures data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openpyxl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Saves Excel files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scripting&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Glue for the entire ETL process&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  🔄 ETL Pipeline Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline consists of four modular stages:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 1 – Data Extraction&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;strong&gt;Playwright&lt;/strong&gt; to navigate to each country’s profile page.&lt;/li&gt;
&lt;li&gt;Intercepts the &lt;code&gt;/get-country-data&lt;/code&gt; XHR response.&lt;/li&gt;
&lt;li&gt;Extracts JSON payloads containing all available indicators and yearly values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each JSON record includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kenya"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Population with access to electricity - National"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ELECTRICITY ACCESS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"yearly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2015"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2016"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;25.73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2022"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;42.62&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 2 – Data Formatting&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Converts raw JSON into a &lt;strong&gt;tabular schema&lt;/strong&gt;:
["country", "country_serial", "metric", "unit", "sector", "sub_sector", "sub_sub_sector", "source_link", "source", "2000", ..., "2024"]&lt;/li&gt;
&lt;li&gt;Ensures each row represents one metric for one country.&lt;/li&gt;
&lt;li&gt;Fills missing years with &lt;code&gt;null&lt;/code&gt; values to maintain consistency.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 3 – Data Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inserts formatted records into &lt;strong&gt;MongoDB&lt;/strong&gt; using &lt;code&gt;pymongo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adds a unique index &lt;code&gt;(country, metric, source)&lt;/code&gt; to prevent duplicates.&lt;/li&gt;
&lt;li&gt;Upserts records — ensuring updates don’t create duplicates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each MongoDB document looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kenya"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Access to Clean Cooking%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tracking SDG7/WBG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"2000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"2015"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"2020"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;23.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"2024"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 4 – Validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Identifies missing years or inconsistent units.
&lt;/li&gt;
&lt;li&gt;Detects countries with incomplete datasets.
&lt;/li&gt;
&lt;li&gt;Exports a detailed &lt;code&gt;validation_report.csv&lt;/code&gt; that flags issues automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;issue_type&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MISSING_YEARS&lt;/td&gt;
&lt;td&gt;Kenya&lt;/td&gt;
&lt;td&gt;Access to Clean Cooking%&lt;/td&gt;
&lt;td&gt;2000–2014, 2023–2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNIT_INCONSISTENCY&lt;/td&gt;
&lt;td&gt;&lt;em&gt;ALL&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Electricity Access&lt;/td&gt;
&lt;td&gt;%; MW&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  🧾 Data Export
&lt;/h2&gt;

&lt;p&gt;Once the ETL pipeline finishes, data is exported to both &lt;strong&gt;CSV&lt;/strong&gt; and &lt;strong&gt;Excel&lt;/strong&gt; formats for analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python export_to_csv.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;reports/exports/energy_data.csv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reports/exports/energy_data.xlsx&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎭 Why Playwright Was Essential for This Project
&lt;/h2&gt;

&lt;p&gt;Many would wonder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why not just use requests or BeautifulSoup to get the data?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Africa Energy Portal (AEP) website is highly dynamic — it doesn’t serve raw data directly in the page HTML. Instead, when you open a country page like&lt;br&gt;
&lt;a href="https://africa-energy-portal.org/country/kenya" rel="noopener noreferrer"&gt;https://africa-energy-portal.org/country/kenya&lt;/a&gt;,&lt;br&gt;
the browser first loads a basic template, and then JavaScript makes a hidden request to an internal endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST https://africa-energy-portal.org/get-country-data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the actual source of all the energy statistics.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;1. Dynamic JavaScript Rendering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data (electricity access, renewables, etc.) is fetched asynchronously after the page loads.&lt;br&gt;
Traditional libraries like requests only download the HTML shell, missing all those dynamic values.&lt;/p&gt;

&lt;p&gt;Playwright, however, executes JavaScript in a real browser, allowing it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait until the data request is made,&lt;/li&gt;
&lt;li&gt;Intercept the &lt;code&gt;/get-country-data&lt;/code&gt; response,&lt;/li&gt;
&lt;li&gt;Capture the full JSON payload in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives us clean, structured data instead of messy HTML scraping.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;2. Cloudflare and Anti-Bot Protection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AEP website uses Cloudflare security, which blocks automated clients that don’t behave like browsers.&lt;br&gt;
When we tried using requests, we got frequent 403 (Forbidden) and 500 (Server Error) responses.&lt;/p&gt;

&lt;p&gt;Playwright solves this because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs a full Chromium browser (just like Chrome or Edge),&lt;/li&gt;
&lt;li&gt;Sends real headers, cookies, and browsing patterns,&lt;/li&gt;
&lt;li&gt;Looks to the server much like an ordinary browser session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made it the only reliable way to consistently access data without breaking terms of service or scraping hidden content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔹 3. Network Interception&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Playwright allows us to listen for specific network calls using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page.expect_response(lambda r: "get-country-data" in r.url and r.status == 200)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture the JSON the site uses internally,&lt;/li&gt;
&lt;li&gt;Save it instantly,&lt;/li&gt;
&lt;li&gt;Avoid parsing the visual layout at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s essentially like “catching the data packet” as it flies through the browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔹 4. Reliability &amp;amp; Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because we were scraping 50+ countries, we needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeout handling (some pages take 30–60s to load),&lt;/li&gt;
&lt;li&gt;Retries for network issues,&lt;/li&gt;
&lt;li&gt;Throttling to avoid overwhelming the server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Playwright gives us those controls — ensuring we get consistent, ethical, and stable scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔹 5. Clean ETL Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Playwright, our Stage 1 data extraction directly produces structured JSON, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "sector": "Basic data",
  "yearly": {
    "2015": 11.9,
    "2016": 13.8,
    "2022": 30.0
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That JSON is ready for transformation and MongoDB storage, making the pipeline efficient end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In short:&lt;/em&gt;&lt;br&gt;
Playwright wasn’t just a choice — it was a necessity.&lt;br&gt;
It allowed this ETL project to move from fragile scraping to robust, browser-level automation, ensuring accurate and repeatable extraction of Africa’s energy data. ⚡&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠️ Challenges Faced
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloudflare protection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AEP website blocked simple HTTP requests (403, 500). Solved by using Playwright’s browser simulation to mimic human behavior.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slow response times&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Some pages took &amp;gt;30 seconds to return data. Added retry logic and longer timeouts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inconsistent URL naming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Country URLs (like &lt;code&gt;cote-d’ivoire&lt;/code&gt; vs &lt;code&gt;cote-divoire&lt;/code&gt;) required slug normalization logic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Incomplete datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Some countries lacked data for certain years, handled via validation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser resource use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Playwright’s real browser automation was resource-heavy; introduced throttling to manage load.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Successfully extracted data for &lt;strong&gt;50 African countries&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Collected &lt;strong&gt;500+ indicators&lt;/strong&gt; covering &lt;strong&gt;2000–2024&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ All records stored in MongoDB with proper schema
&lt;/li&gt;
&lt;li&gt;✅ Automated validation caught missing and inconsistent data
&lt;/li&gt;
&lt;li&gt;✅ Exportable formats ready for visualization and analysis&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Automating data extraction from protected websites is possible using &lt;strong&gt;browser-level automation (Playwright)&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Designing modular ETL stages makes maintenance and debugging easier.
&lt;/li&gt;
&lt;li&gt;Data validation is just as important as extraction — raw data is rarely clean.
&lt;/li&gt;
&lt;li&gt;Storing data in &lt;strong&gt;MongoDB&lt;/strong&gt; offers flexibility for hierarchical (nested) data structures.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 Future Work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Build an &lt;strong&gt;interactive dashboard&lt;/strong&gt; using Streamlit or Power BI.
&lt;/li&gt;
&lt;li&gt;Automate periodic updates (monthly/quarterly).
&lt;/li&gt;
&lt;li&gt;Add country-level time-series visualization modules.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>dataengineering</category>
      <category>mongodb</category>
      <category>python</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
