<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Blessing Angus</title>
    <description>The latest articles on Forem by Blessing Angus (@ccinaza).</description>
    <link>https://forem.com/ccinaza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1422545%2Fd0c542f2-7eb0-44e9-a4f5-1fe14d4d4e7a.jpeg</url>
      <title>Forem: Blessing Angus</title>
      <link>https://forem.com/ccinaza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ccinaza"/>
    <language>en</language>
    <item>
      <title>Building an Incremental Zoho Desk to BigQuery Pipeline: Lessons from the Trenches</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 28 Feb 2026 22:35:39 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-incremental-zoho-desk-to-bigquery-pipeline-lessons-from-the-trenches-op1</link>
      <guid>https://forem.com/ccinaza/building-an-incremental-zoho-desk-to-bigquery-pipeline-lessons-from-the-trenches-op1</guid>
      <description>&lt;p&gt;&lt;em&gt;What I thought would take a few days ended up taking weeks so here's everything I learned building a production Zoho Desk to BigQuery pipeline from scratch.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When my company decided to centralise customer support analytics, the task landed on my plate: pull data from Zoho Desk, land it in BigQuery, transform with dbt, done. The plan looked clean on paper.&lt;/p&gt;

&lt;p&gt;What followed was a masterclass in why production data engineering is never as simple as the happy path suggests.&lt;/p&gt;

&lt;p&gt;This is the story of building that pipeline: the architecture decisions, the walls I hit, and the lessons I'll carry into every pipeline I build from here.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Starting Point: A Full Load That Took Forever
&lt;/h2&gt;

&lt;p&gt;The first working version of the pipeline was blunt but functional. Every day, pull every ticket ever created from the Zoho Desk API and overwrite the BigQuery table. Simple. Predictable. And for tickets, completely untenable at scale because the table had over a million rows and was growing daily.&lt;/p&gt;

&lt;p&gt;I needed incremental loads. But before I could get there, I had a different problem to solve: how do I get all the historical data into BigQuery in the first place without running an API job for days?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bootstrapping Problem: When the API Is Too Slow for the Initial Load
&lt;/h2&gt;

&lt;p&gt;Loading years of historical data through a paginated API isn't a pipeline problem; it's a waiting problem. For a table with hundreds of thousands of rows, even a well-optimised API pull can take hours or days just for the seed load.&lt;/p&gt;

&lt;p&gt;The solution was to sidestep the API entirely for the initial load. Zoho Desk has a built-in &lt;strong&gt;data backup feature&lt;/strong&gt; that exports your entire account data as CSV files. I used this to export a full snapshot of tickets, threads, contacts, and calls, then loaded each CSV directly into BigQuery via the BQ Console UI.&lt;/p&gt;

&lt;p&gt;The UI load process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Format: CSV&lt;/li&gt;
&lt;li&gt;Schema: defined explicitly, not auto-detected (my first pass used auto-detect, and it came back to bite me — more on why this matters later)&lt;/li&gt;
&lt;li&gt;Skip leading rows: 1 (header row)&lt;/li&gt;
&lt;li&gt;Allow quoted newlines: yes (critical for fields like ticket descriptions that contain line breaks)&lt;/li&gt;
&lt;li&gt;Allow jagged rows: yes (exported rows sometimes omit trailing optional fields)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the historical snapshot was in BigQuery, I set the incremental pipeline's &lt;code&gt;start_date&lt;/code&gt; to the backup date. The first scheduled run picks up any changes from that day forward, no gap, no overlap.&lt;/p&gt;
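&lt;p&gt;For reference, those Console options map directly onto the BigQuery load-job settings. Here is a minimal sketch in plain Python; the dict keys mirror the &lt;code&gt;LoadJobConfig&lt;/code&gt; attribute names from the google-cloud-bigquery client, and the helper itself is illustrative rather than code from the pipeline:&lt;/p&gt;

```python
def backup_load_config(schema):
    """Load settings for the one-time CSV bootstrap (a sketch; keys
    mirror google-cloud-bigquery LoadJobConfig attribute names)."""
    return {
        "source_format": "CSV",
        "skip_leading_rows": 1,         # header row
        "allow_quoted_newlines": True,  # ticket descriptions contain line breaks
        "allow_jagged_rows": True,      # some rows omit trailing optional fields
        "autodetect": False,            # schema is defined explicitly
        "schema": schema,               # list of (name, type) pairs
    }

cfg = backup_load_config([("id", "STRING"), ("createdTime", "TIMESTAMP")])
```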

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For large initial loads, don't fight the API. Use native export features if they exist. The pipeline is for keeping data fresh; getting the history in is a one-time bootstrapping problem that deserves its own solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Code Generation Over Copy-Paste
&lt;/h2&gt;

&lt;p&gt;Here's how the pipeline itself is structured. Rather than writing a separate Airflow DAG for every Zoho Desk endpoint, I built a code-generation system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;custom Airflow operator&lt;/strong&gt; (&lt;code&gt;ZohoDeskToGCSOperator&lt;/code&gt;) handles all the API extraction logic: pagination, OAuth, concurrent detail fetching, incremental search&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Jinja template&lt;/strong&gt; defines the DAG structure once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML config files&lt;/strong&gt;, one per endpoint, each defining that endpoint's parameters: schedule, columns, schema, endpoint type&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;generator script&lt;/strong&gt; renders YAML + template to DAG Python file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adding a new endpoint means writing a YAML file and running the generator, not copying a DAG from scratch. The template handles the branching logic for different endpoint types automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Zoho Desk API
      │
      ▼
ZohoDeskToGCSOperator  ──►  GCS (staging CSV)
                                    │
                                    ▼
                GCSToBigQueryOperator  ──►  BigQuery (_staging table)
                                                              │
                                                              ▼
                                            BigQueryInsertJobOperator
                                              (MERGE into main table)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
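&lt;p&gt;The generation step itself is tiny. A minimal sketch, with &lt;code&gt;string.Template&lt;/code&gt; standing in for Jinja and a plain dict standing in for the parsed YAML (names here are illustrative, not the actual project files):&lt;/p&gt;

```python
from string import Template

# A stand-in for the real Jinja DAG template; in the project the rendered
# output is a full Airflow DAG file, not these three lines.
DAG_TEMPLATE = Template(
    "dag_id = 'zoho_desk_${endpoint}'\n"
    "schedule = '${schedule}'\n"
    "endpoint_type = '${endpoint_type}'\n"
)

def render_dag(config):
    """Render one endpoint's config dict into DAG source text."""
    return DAG_TEMPLATE.substitute(config)

tickets_cfg = {
    "endpoint": "tickets",
    "schedule": "@daily",
    "endpoint_type": "incremental",
}
print(render_dag(tickets_cfg))
```

&lt;p&gt;Adding an endpoint is then just another config dict (or YAML file) fed through the same function.&lt;/p&gt;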



&lt;p&gt;For large transactional tables (tickets, contacts, threads, calls), data lands in a &lt;code&gt;_staging&lt;/code&gt; table first, then gets merged into the main table, updating existing rows and inserting new ones. For small reference tables (agents, teams, departments), a daily &lt;code&gt;WRITE_TRUNCATE&lt;/code&gt; is sufficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Not All APIs Are Created Equal
&lt;/h2&gt;

&lt;p&gt;The ticket and contact endpoints support Zoho's &lt;code&gt;modifiedTimeRange&lt;/code&gt; parameter. This lets you pass a start and end timestamp and get back only records modified in that window. Perfect for incremental loads.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/calls&lt;/code&gt; endpoint does not. Pass &lt;code&gt;modifiedTimeRange&lt;/code&gt; to it and you get a 422 back. &lt;/p&gt;

&lt;p&gt;The workaround: sort by &lt;code&gt;createdTime&lt;/code&gt; descending and stop paginating as soon as the oldest record on the current page predates your window. Since calls are append-only in practice, this is equivalent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;data_interval_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Don't assume API feature parity across endpoints from the same vendor. Test every endpoint independently before writing a single line of pipeline code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: A Reserved Word Hiding in Your Column Names
&lt;/h2&gt;

&lt;p&gt;The threads table has a column called &lt;code&gt;to&lt;/code&gt; — as in the recipient of a thread. Perfectly reasonable name. Except &lt;code&gt;TO&lt;/code&gt; is a reserved keyword in BigQuery SQL.&lt;/p&gt;

&lt;p&gt;The MERGE statement was generating SQL like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`from`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`to`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`subject`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`from`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`to`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`subject`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;to&lt;/code&gt; unquoted, BigQuery's parser sees the keyword &lt;code&gt;TO&lt;/code&gt; in an unexpected position and throws:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Syntax error: Unexpected keyword TO at [40:130]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was to backtick-quote the column name in the generated MERGE SQL.&lt;/p&gt;
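&lt;p&gt;The quote-everything rule is a one-liner to enforce in a generator. A sketch (the helper names here are hypothetical, not the pipeline's actual code):&lt;/p&gt;

```python
def quote_ident(name):
    """Backtick-quote a BigQuery identifier, escaping embedded backticks."""
    return "`" + name.replace("`", "\\`") + "`"

def insert_clause(columns):
    # Quote every column unconditionally; no reserved-word list to maintain.
    cols = ", ".join(quote_ident(c) for c in columns)
    vals = ", ".join("S." + quote_ident(c) for c in columns)
    return f"INSERT ({cols}) VALUES ({vals})"

print(insert_clause(["from", "to", "subject"]))
# INSERT (`from`, `to`, `subject`) VALUES (S.`from`, S.`to`, S.`subject`)
```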

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; When generating SQL programmatically, always quote all identifiers. You won't always know which column names will collide with reserved words, especially when the schema is driven by a third-party API you don't control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: The MERGE That Couldn't Match Rows
&lt;/h2&gt;

&lt;p&gt;After fixing the syntax error, the threads MERGE hit a different wall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE/MERGE must match at most one source row for each target row
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one took some digging. The threads endpoint works like this: search for tickets modified in the time window, then fetch all threads for each of those tickets. The problem is that Zoho's &lt;code&gt;modifiedTimeRange&lt;/code&gt; search is paginated and the same ticket can appear on multiple pages if the result set shifts between requests.&lt;/p&gt;

&lt;p&gt;When that happens, threads get fetched for the same ticket twice. The staging table ends up with duplicate thread IDs. BigQuery's MERGE correctly refuses to update a target row when multiple source rows match it.&lt;/p&gt;

&lt;p&gt;I fixed it at two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the operator (Python):&lt;/strong&gt; deduplicate ticket IDs before fetching threads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ticket_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dict.fromkeys&lt;/code&gt; removes duplicates while preserving insertion order, which makes it cleaner than converting to a set and back (a set round-trip would also scramble the ordering).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the MERGE SQL (template):&lt;/strong&gt; add a deduplication guard in the &lt;code&gt;USING&lt;/code&gt; clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`staging_table`&lt;/span&gt;
    &lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;`id`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python fix prevents the problem from occurring. The SQL fix is a safety net for edge cases I haven't thought of yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For MERGE pipelines, always add a &lt;code&gt;QUALIFY ROW_NUMBER()&lt;/code&gt; dedup guard in the staging select. Even if your source looks clean, it defends against edge cases you haven't anticipated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Auto-Detect Might Tell Lies
&lt;/h2&gt;

&lt;p&gt;Here's where the bootstrapping decision from earlier came back to bite me. When I first loaded the initial backup CSVs into BigQuery, I let auto-detect infer the schema rather than defining it explicitly. It was fast and convenient, but it set the stage for a string of failures.&lt;/p&gt;

&lt;p&gt;Auto-detect made choices that didn't match what the incremental pipeline expected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;onholdTime&lt;/code&gt; — stored by Zoho as a timestamp string, loaded by auto-detect as &lt;code&gt;STRING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tagCount&lt;/code&gt; — a count field, auto-detected as &lt;code&gt;INT64&lt;/code&gt; (actually correct, but the YAML I'd written said &lt;code&gt;STRING&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isEscalated&lt;/code&gt; — loaded as &lt;code&gt;STRING&lt;/code&gt; because the CSV values were &lt;code&gt;"true"&lt;/code&gt;/&lt;code&gt;"false"&lt;/code&gt; strings, not proper booleans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these caused problems until the first MERGE ran and tried to assign a &lt;code&gt;TIMESTAMP&lt;/code&gt; value to a &lt;code&gt;STRING&lt;/code&gt; column. BigQuery's type enforcement at MERGE time is strict — and unforgiving.&lt;/p&gt;

&lt;p&gt;The fix was to query &lt;code&gt;INFORMATION_SCHEMA.COLUMNS&lt;/code&gt; on the main table and reconcile every column type against the YAML schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.INFORMATION_SCHEMA.COLUMNS`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'zoho_desk_tickets'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ordinal_position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
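&lt;p&gt;The reconciliation itself is just a diff between the query result and the YAML schema. A sketch with both sides as plain dicts (the helper is hypothetical; the real check reads the query output and the parsed config):&lt;/p&gt;

```python
def schema_drift(actual, expected):
    """Return (column, actual_type, expected_type) mismatches between the
    table's INFORMATION_SCHEMA types and the YAML schema."""
    drift = []
    for col, want in expected.items():
        got = actual.get(col)
        if got is not None and got != want:
            drift.append((col, got, want))
    return drift

actual = {"onholdTime": "STRING", "tagCount": "INT64", "isEscalated": "STRING"}
expected = {"onholdTime": "TIMESTAMP", "tagCount": "INT64", "isEscalated": "BOOL"}
print(schema_drift(actual, expected))
# [('onholdTime', 'STRING', 'TIMESTAMP'), ('isEscalated', 'STRING', 'BOOL')]
```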



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; The initial load is the source of truth — not your assumptions about what the types &lt;em&gt;should&lt;/em&gt; be. Always verify &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; before writing a schema config for any table that was bootstrapped manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  The dbt Layer: Fixing What the Raw Layer Can't
&lt;/h2&gt;

&lt;p&gt;With the raw pipeline stable, dbt handles the transformation into clean, analyst-ready tables: casting string-typed columns to their proper types, renaming, and modelling for analysis.&lt;/p&gt;

&lt;p&gt;This keeps the raw layer as a faithful copy of the source, while the transformation layer handles type normalisation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use native export for the initial load&lt;/strong&gt;: don't fight the API for historical data. Export a full backup, load via the UI, then let the pipeline handle increments from that point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never use auto-detect for a table that an incremental pipeline will MERGE into&lt;/strong&gt;: define the schema explicitly, and verify with &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; immediately after loading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add the &lt;code&gt;QUALIFY ROW_NUMBER()&lt;/code&gt; dedup guard from day one&lt;/strong&gt;: it costs nothing and saves you from mysterious MERGE failures later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test every API endpoint's query parameter support independently&lt;/strong&gt;: don't inherit assumptions from one endpoint to another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backtick-quote all identifiers in generated SQL&lt;/strong&gt;: reserved word collisions are unpredictable and the fix is trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep the raw and transformation layers separate&lt;/strong&gt;: having raw data land in BigQuery with minimal transformation, and a separate dbt layer for typing and renaming, makes debugging far easier. You can always re-run dbt without re-hitting the API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;None of the challenges described here were exotic. Reserved words, duplicate rows, type mismatches, API inconsistencies: these are bread-and-butter data engineering problems. What made them feel hard was hitting them one at a time in production, under pressure, on a pipeline that analysts were waiting on.&lt;/p&gt;

&lt;p&gt;The pipeline is now running reliably in production: tickets, contacts, threads, agents, teams, departments, and accounts incrementally loaded daily. &lt;/p&gt;

&lt;p&gt;If you're building something similar — whether with Zoho, Salesforce, HubSpot, or any SaaS API — I hope these lessons save you a few hours of head-scratching.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The pipeline is built with Apache Airflow, Google Cloud Storage, BigQuery, and dbt. The custom operator pattern and code-generation approach described here are applicable to any REST API integration.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>zohodeskapi</category>
      <category>bigquery</category>
      <category>elt</category>
    </item>
    <item>
      <title>Docker Explained with a Food Analogy</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Wed, 17 Sep 2025 21:24:09 +0000</pubDate>
      <link>https://forem.com/ccinaza/docker-explained-with-a-food-analogy-4gne</link>
      <guid>https://forem.com/ccinaza/docker-explained-with-a-food-analogy-4gne</guid>
<description>&lt;p&gt;If you're just hearing about Docker, it can sound intimidatingly complex: containers, images, engines, orchestration… it's a lot! But let's break it down using something we all understand: &lt;strong&gt;food&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Cooking Everywhere
&lt;/h3&gt;

&lt;p&gt;Imagine you’ve cooked a delicious meal, let’s say jollof rice and chicken. Your dish is perfect, but when you ask a friend to cook it in their kitchen, things fall apart.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They don’t have the exact spices you used.&lt;/li&gt;
&lt;li&gt;Their stove cooks faster than yours.&lt;/li&gt;
&lt;li&gt;They bought a different type of rice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Their version of your meal doesn’t taste the same.&lt;/p&gt;

&lt;p&gt;This is exactly what happens in software development. An application runs beautifully on a developer's laptop but crashes when deployed to a server because the environment is slightly different: different operating system versions, missing libraries, or incompatible dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Docker: The Universal Lunchbox System
&lt;/h3&gt;

&lt;p&gt;Now imagine a different approach. Instead of sharing just the recipe and hoping for the best, you decide to pack complete, ready-to-eat meals in specialized lunchboxes.&lt;/p&gt;

&lt;p&gt;Here's how the Docker analogy maps out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your Application = The Perfectly Cooked Meal&lt;/strong&gt;&lt;br&gt;
Your software is the delicious jollof rice and chicken, the main attraction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependencies = All the Ingredients + Cooking Method&lt;/strong&gt;&lt;br&gt;
Your app needs specific libraries, frameworks, and configurations to work properly, just like your meal needed exact spices and cooking techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container = The Magic Lunchbox&lt;/strong&gt;&lt;br&gt;
This isn't just any container; it's a special lunchbox that preserves everything perfectly. Once your meal is inside, it doesn't matter if someone opens it in a cold office, a hot car, or a humid cafeteria. The food tastes exactly the same.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Image = The Master Template&lt;/strong&gt;&lt;br&gt;
Think of this as the blueprint of a pre-cooked dish that can be duplicated endlessly. Once the recipe has been followed, you get a finished meal that can be packed into lunchboxes over and over again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dockerfile = The Recipe Card&lt;/strong&gt;&lt;br&gt;
But where did that template come from? The Dockerfile is like the recipe card itself; it lists the exact steps, spices, and cooking instructions needed to prepare the master dish. Without it, you’d have no consistent way to recreate the meal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Engine = The Reliable Delivery Service&lt;/strong&gt;&lt;br&gt;
Docker is like the delivery system that ensures these lunchboxes can be transported anywhere and opened reliably, maintaining the meal's quality and consistency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Changes Everything
&lt;/h3&gt;

&lt;p&gt;Docker solves the classic “It works on my machine” problem. With Docker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers can package their applications with every single dependency into one neat container&lt;/li&gt;
&lt;li&gt;DevOps teams can deploy those containers anywhere: cloud servers, local machines, or data centres, without worrying about environment differences&lt;/li&gt;
&lt;li&gt;Scaling becomes as simple as ordering more lunchboxes instead of setting up entirely new kitchens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Real-World Impact
&lt;/h3&gt;

&lt;p&gt;Just like how meal delivery services revolutionised food by ensuring consistent quality regardless of location, Docker has revolutionised software deployment. Companies can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy applications across different environments with confidence&lt;/li&gt;
&lt;li&gt;Scale services up or down instantly&lt;/li&gt;
&lt;li&gt;Ensure every user gets the exact same experience&lt;/li&gt;
&lt;li&gt;Drastically reduce deployment errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrapping It Up
&lt;/h3&gt;

&lt;p&gt;Next time someone mentions Docker, don't think about complex technical infrastructure. Instead, picture a sophisticated lunchbox delivery system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every meal (application) is perfectly packaged with its ingredients (dependencies).&lt;/li&gt;
&lt;li&gt;Every lunchbox (container) delivers identical flavour no matter where it's opened.&lt;/li&gt;
&lt;li&gt;The delivery service (Docker) guarantees consistency across all locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker has essentially solved the software equivalent of ensuring your grandmother's secret recipe tastes exactly the same whether it's cooked in Lagos, London, or Kafanchan.&lt;/p&gt;

&lt;p&gt;So the next time you hear developers talking about "containerization," just smile and think about perfectly packed lunchboxes traveling the world, delivering consistent, delicious experiences wherever they land.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building an End-to-End Data Pipeline for NYC Citi Bike 2024: A DE Zoomcamp Journey</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 12 Apr 2025 22:09:53 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-end-to-end-data-pipeline-for-nyc-citi-bike-2024-a-de-zoomcamp-journey-47o</link>
      <guid>https://forem.com/ccinaza/building-an-end-to-end-data-pipeline-for-nyc-citi-bike-2024-a-de-zoomcamp-journey-47o</guid>
      <description>&lt;p&gt;As part of my final project for the DE Zoomcamp 2025 cohort by DataTalksClub, I set out to build an end-to-end batch data pipeline to process and analyze over 1.1 million Citi Bike trips from NYC’s 2024 dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal?&lt;/strong&gt; Uncover trends in urban mobility and rider behavior to support bike-sharing operations and inform city planning. &lt;/p&gt;

&lt;p&gt;In this blog post, I’ll walk you through the project’s motivation, architecture, implementation, insights, challenges, and what I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Citi Bike Data?
&lt;/h2&gt;

&lt;p&gt;Citi Bike, New York City’s bike-sharing system, generates millions of trip records each month, capturing detailed data on rides, stations, and user behavior. This data is a goldmine for understanding urban mobility patterns, but without an automated pipeline, processing these large datasets and extracting actionable insights is a daunting task. My project aimed to solve this by building a scalable batch pipeline to ingest, store, transform, and visualize Citi Bike’s 2024 trip data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Questions to Answer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How does Citi Bike usage vary over time, and what seasonal patterns emerge in 2024?&lt;/li&gt;
&lt;li&gt;What’s the breakdown of rides between electric and classic bikes, and how does this impact operations?&lt;/li&gt;
&lt;li&gt;How do ride patterns differ between member and casual users?&lt;/li&gt;
&lt;li&gt;Which stations are the most popular starting points, and what does this suggest about urban mobility in NYC?&lt;/li&gt;
&lt;li&gt;What’s the average trip duration, and how can this inform bike maintenance or rebalancing strategies?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The NYC Bike Rides Pipeline is a batch data pipeline that processes Citi Bike 2024 trip data, stores it in a data lake, transforms it in a data warehouse, and visualizes key metrics through an interactive dashboard. Built entirely on Google Cloud Platform (GCP), the pipeline leverages modern data engineering tools to automate the process and deliver insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud: GCP (Google Cloud Storage for the data lake, BigQuery for the data warehouse)&lt;/li&gt;
&lt;li&gt;Infrastructure as Code (IaC): Terraform to provision infrastructure&lt;/li&gt;
&lt;li&gt;Orchestration: Apache Airflow (running locally via Docker)&lt;/li&gt;
&lt;li&gt;Data Transformations: dbt Cloud for modeling and transformations&lt;/li&gt;
&lt;li&gt;Visualization: Looker Studio for the dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pipeline Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline follows a batch processing workflow, orchestrated end-to-end using Apache Airflow. Here’s a breakdown of each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Ingestion:&lt;/strong&gt;&lt;br&gt;
I sourced the Citi Bike 2024 trip data (CSV files) from a public S3 bucket. The data is downloaded and uploaded to a Google Cloud Storage (GCS) bucket named &lt;code&gt;naza_nyc_bike_rides&lt;/code&gt;, which serves as the data lake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration:&lt;/strong&gt;&lt;br&gt;
Apache Airflow, running locally in a Docker container, orchestrates the pipeline. A custom DAG (&lt;code&gt;etl.py&lt;/code&gt;) manages the monthly ingestion, loading, and transformation steps, ensuring the pipeline runs smoothly for each month of 2024.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse:&lt;/strong&gt;&lt;br&gt;
Raw data is loaded from GCS into Google BigQuery’s staging dataset (&lt;code&gt;nyc_bikes_staging&lt;/code&gt;). To optimize query performance, I partitioned the tables by the &lt;code&gt;started_at&lt;/code&gt; timestamp and clustered them by &lt;code&gt;start_station_id&lt;/code&gt;. After transformation, the data is loaded into the production dataset (&lt;code&gt;nyc_bikes_prod&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformations:&lt;/strong&gt;&lt;br&gt;
Using dbt Cloud, I transformed the raw data into a production-ready dataset. This involved cleaning (e.g., removing records with null values in critical fields like &lt;code&gt;started_at&lt;/code&gt; and &lt;code&gt;rideable_type&lt;/code&gt;), aggregating (e.g., rides by month), and modeling the data for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualization:&lt;/strong&gt;&lt;br&gt;
The final dataset in &lt;code&gt;nyc_bikes_prod&lt;/code&gt; is connected to Looker Studio, where I built an interactive dashboard to visualize key metrics like total rides, rideable type breakdown, and top start stations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure as Code (IaC):&lt;/strong&gt;&lt;br&gt;
I used Terraform to provision the GCS bucket (&lt;code&gt;naza_nyc_bike_rides&lt;/code&gt;) and BigQuery datasets (&lt;code&gt;nyc_bikes_staging&lt;/code&gt; and &lt;code&gt;nyc_bikes_prod&lt;/code&gt;), ensuring reproducibility and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
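&lt;p&gt;The partitioning and clustering from step 3 correspond to DDL along these lines. A sketch in Python for illustration; the column list beyond &lt;code&gt;started_at&lt;/code&gt; and &lt;code&gt;start_station_id&lt;/code&gt; is assumed, not taken from the project:&lt;/p&gt;

```python
def staging_table_ddl(table, partition_col, cluster_col, columns):
    """Render CREATE TABLE DDL for a partitioned, clustered BigQuery table."""
    col_defs = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {col_defs}\n)\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {cluster_col}"
    )

print(staging_table_ddl(
    "nyc_bikes_staging.rides",
    "started_at",
    "start_station_id",
    [("ride_id", "STRING"),
     ("started_at", "TIMESTAMP"),
     ("start_station_id", "STRING")],
))
```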

&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xtuoql60oq86k5j81vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xtuoql60oq86k5j81vd.png" alt="Architecture diagram" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setting Up the Infrastructure with Terraform&lt;/strong&gt;&lt;br&gt;
I started by defining the infrastructure in Terraform, which allowed me to provision the GCS bucket and BigQuery datasets as code. The terraform/ directory contains &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;, where I specified the &lt;code&gt;GCP project ID&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;, and resource names. Running terraform init and terraform apply set up the infrastructure in minutes, ensuring consistency across environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestrating with Apache Airflow&lt;/strong&gt;&lt;br&gt;
Airflow was the backbone of the pipeline, automating the ingestion, loading, and transformation steps. I containerized Airflow using Docker (&lt;code&gt;docker-compose.yml&lt;/code&gt;) and configured it with environment variables in a &lt;code&gt;.env&lt;/code&gt; file (e.g., GCP project ID, bucket name). The DAG (&lt;code&gt;etl.py&lt;/code&gt;) in the &lt;code&gt;dags/&lt;/code&gt; folder handles the monthly batch processing, downloading CSV files, uploading them to GCS, and loading them into BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
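&lt;p&gt;As a rough illustration, the resources described above can be expressed like this in HCL (a sketch only: the variable names are my assumptions here, and the actual &lt;code&gt;main.tf&lt;/code&gt; in the repo may differ):&lt;/p&gt;

```hcl
# Sketch of the provisioned resources; variable names are illustrative.
resource "google_storage_bucket" "data_lake" {
  name          = "naza_nyc_bike_rides"
  location      = var.region
  force_destroy = true
}

resource "google_bigquery_dataset" "staging" {
  dataset_id = "nyc_bikes_staging"
  project    = var.project_id
  location   = var.region
}

resource "google_bigquery_dataset" "prod" {
  dataset_id = "nyc_bikes_prod"
  project    = var.project_id
  location   = var.region
}
```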

&lt;p&gt;One challenge I faced was initially attempting to use Google Cloud Composer for Airflow, but setup issues led me to switch to a local Airflow instance via Docker. This turned out to be a blessing in disguise, as it gave me more control over the environment and simplified debugging.&lt;/p&gt;
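&lt;p&gt;To make the monthly batching concrete, here is a standalone Python sketch of how each month’s source URL, GCS object, and BigQuery table name could be derived. The naming scheme and base URL are illustrative assumptions, not the actual &lt;code&gt;etl.py&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical URL and naming scheme for the monthly batches; the real
# DAG may use different paths and the official Citi Bike data layout.
BASE_URL = "https://example.com/citibike"  # placeholder, not the real source

def monthly_batch(year: int, month: int) -> dict:
    """Build the source URL, GCS object, and BigQuery table for one month."""
    stamp = f"{year}{month:02d}"
    return {
        "source_url": f"{BASE_URL}/{stamp}-citibike-tripdata.csv",
        "gcs_object": f"raw/{year}/{stamp}-citibike-tripdata.csv",
        "bq_table": f"nyc_bikes_staging.rides_{stamp}",
    }

# One batch per month of 2024:
batches = [monthly_batch(2024, m) for m in range(1, 13)]
```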

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transforming Data with dbt Cloud&lt;/strong&gt;&lt;br&gt;
With the raw data in BigQuery’s &lt;code&gt;nyc_bikes_staging&lt;/code&gt; dataset, I used dbt Cloud to transform it into a production-ready dataset (&lt;code&gt;nyc_bikes_prod&lt;/code&gt;). My dbt models handled:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning: Removing records with null values in critical fields.&lt;/li&gt;
&lt;li&gt;Aggregation: Calculating monthly ride counts, average trip durations, and user type breakdowns.&lt;/li&gt;
&lt;li&gt;Modeling: Creating tables optimized for analysis (e.g., rides by station, rideable type).&lt;/li&gt;
&lt;/ul&gt;
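&lt;p&gt;The dbt models themselves are SQL, but the cleaning and aggregation logic they implement can be sketched in plain Python for intuition (field names follow the post; the data here is a toy sample):&lt;/p&gt;

```python
from collections import Counter

# Toy ride records mirroring the fields mentioned in the post.
rides = [
    {"started_at": "2024-01-05T08:00:00", "rideable_type": "classic_bike"},
    {"started_at": "2024-01-09T17:30:00", "rideable_type": "electric_bike"},
    {"started_at": None, "rideable_type": "classic_bike"},   # dropped: null started_at
    {"started_at": "2024-02-14T12:00:00", "rideable_type": None},  # dropped: null type
]

# Cleaning: drop records with nulls in critical fields.
clean = [r for r in rides if r["started_at"] and r["rideable_type"]]

# Aggregation: rides per month (YYYY-MM) and per rideable type.
rides_by_month = Counter(r["started_at"][:7] for r in clean)
rides_by_type = Counter(r["rideable_type"] for r in clean)
```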

&lt;p&gt;Running the dbt models was straightforward, and the resulting dataset was ready for visualization.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualizing Insights in Looker Studio&lt;/strong&gt;&lt;br&gt;
I connected Looker Studio to the &lt;code&gt;nyc_bikes_prod&lt;/code&gt; dataset and built a dashboard titled &lt;em&gt;NYC Bike Usage Insights (2024)&lt;/em&gt;. You can explore the dashboard here: &lt;a href="https://lookerstudio.google.com/s/m9Oe3kIFNYQ" rel="noopener noreferrer"&gt;Looker Studio link&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Airflow Setup with Cloud Composer&lt;/strong&gt;&lt;br&gt;
Initially, I planned to use Google Cloud Composer for Airflow, but I ran into setup issues, including dependency conflicts and longer-than-expected provisioning times. I pivoted to running Airflow locally with Docker, which gave me more control and faster iteration cycles. This taught me the importance of flexibility when working with cloud-managed services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Handling Large Datasets in BigQuery&lt;/strong&gt;&lt;br&gt;
The Citi Bike dataset, with over 1.1 million records, required careful optimization in BigQuery. I fine-tuned performance by partitioning tables by the &lt;code&gt;started_at&lt;/code&gt; timestamp and clustering by &lt;code&gt;start_station_id&lt;/code&gt;, which significantly reduced query costs and improved performance for downstream analyses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Data Quality Issues&lt;/strong&gt;&lt;br&gt;
Some records had null values in critical fields like &lt;code&gt;started_at&lt;/code&gt;, &lt;code&gt;rideable_type&lt;/code&gt;, and &lt;code&gt;start_station_name&lt;/code&gt;. I addressed this in dbt by filtering out these records during transformation, ensuring the dashboard reflected accurate insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;While I’ve worked with tools like BigQuery, Airflow, and Terraform before, this project deepened my understanding of how to apply them in a real-world context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Integration:&lt;/strong&gt; I gained a deeper appreciation for integrating a full suite of tools (GCP, Airflow, dbt, Terraform) into a cohesive, scalable pipeline. Each tool has its strengths, and orchestrating them effectively is key to a successful project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimization:&lt;/strong&gt; Fine-tuning BigQuery for large datasets with partitioning and clustering was a great exercise in balancing cost and performance, especially for a dataset of this scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practices in Orchestration and IaC:&lt;/strong&gt; I refined my approach to orchestrating complex workflows with Airflow and provisioning infrastructure with Terraform, focusing on modularity and reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storytelling:&lt;/strong&gt; The project reinforced the power of data storytelling—turning raw trip data into actionable insights about urban mobility highlighted the importance of a solid pipeline as the foundation for impactful visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Improvements
&lt;/h3&gt;

&lt;p&gt;This project is a strong foundation, but there are several ways to take it further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Pipeline:&lt;/strong&gt; Implement a CI/CD pipeline using GitHub Actions to automate testing and deployment of the DAGs and Terraform configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Data:&lt;/strong&gt; Explore streaming ingestion for real-time Citi Bike data, enabling live dashboards and more timely insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Analytics:&lt;/strong&gt; Add predictive models (e.g., bike demand forecasting) or anomaly detection (e.g., unusual station usage patterns) to provide deeper insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building the NYC Bike Rides Pipeline for DE Zoomcamp 2025 was an incredible learning experience. It allowed me to apply my data engineering skills to a real-world problem, from ingestion to visualization, while uncovering meaningful insights about urban mobility in NYC. The project also highlighted the importance of automation, optimization, and storytelling in data engineering.&lt;/p&gt;

&lt;p&gt;You can explore the full project on &lt;a href="https://github.com/Ccinaza/nyc_citibike_pipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;br&gt;
I’d love to hear your feedback or ideas for collaboration—feel free to reach out on &lt;a href="https://www.linkedin.com/in/blessingangus/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or email me at &lt;a href="mailto:blangus.c@gmail.com"&gt;blangus.c@gmail.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A huge thank you to Alexey Grigorev, the DataTalksClub team, and the DE Zoomcamp community for an amazing program that pushed me to grow as a data engineer. &lt;/p&gt;

&lt;p&gt;Here’s to more data adventures! 🚀&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>elt</category>
      <category>zoomcamp</category>
      <category>dbt</category>
    </item>
    <item>
&lt;title&gt;Building an End-to-End ELT Pipeline: PostgreSQL → BigQuery → Metabase&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Thu, 27 Mar 2025 23:47:32 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-end-to-end-etl-pipeline-postgresql-bigquery-metabase-86e</link>
      <guid>https://forem.com/ccinaza/building-an-end-to-end-etl-pipeline-postgresql-bigquery-metabase-86e</guid>
      <description>&lt;p&gt;ETL/ELT projects are never just about moving data—they’re about designing efficient, scalable, and maintainable pipelines. In this post, I’ll walk through how I built a full ELT process using PostgreSQL, Airflow, BigQuery, and dbt, along with lessons I learned along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data Source: Brazilian e-commerce dataset (CSV files)&lt;/li&gt;
&lt;li&gt;Data Storage: PostgreSQL as the staging database&lt;/li&gt;
&lt;li&gt;Orchestration: Apache Airflow to automate the ETL workflow&lt;/li&gt;
&lt;li&gt;Data Warehousing: Google Cloud Storage, Google BigQuery&lt;/li&gt;
&lt;li&gt;Transformation: dbt (Data Build Tool) for modeling&lt;/li&gt;
&lt;li&gt;Visualization: Metabase for dashboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Data Ingestion
&lt;/h3&gt;

&lt;p&gt;I downloaded the CSV files from Kaggle, then used Airflow to automate loading the dataset into PostgreSQL. The biggest challenge? Handling the geolocation dataset (1M+ rows!) without performance issues.&lt;/p&gt;
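&lt;p&gt;One approach that helps with files this size is loading in fixed-size chunks instead of reading everything into memory at once. A minimal sketch (the CSV here is a toy stand-in, and the actual insert into PostgreSQL is left as a comment):&lt;/p&gt;

```python
import csv
import io

def chunked(rows, size):
    """Yield lists of at most `size` rows so the full file never sits in memory."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# A tiny in-memory CSV standing in for the 1M+ row geolocation file.
data = io.StringIO("zip,lat,lng\n01037,-23.5,-46.6\n01046,-23.5,-46.6\n01041,-23.5,-46.6\n")

loaded = 0
for batch in chunked(csv.DictReader(data), size=2):
    # In the real pipeline this would be a COPY/INSERT into PostgreSQL.
    loaded += len(batch)
```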

&lt;h3&gt;
  
  
  Step 2: Orchestrating with Airflow
&lt;/h3&gt;

&lt;p&gt;I created DAGs to move data from PostgreSQL to BigQuery using Airflow operators (&lt;code&gt;PostgresToGCS&lt;/code&gt; → &lt;code&gt;GCSToBigQuery&lt;/code&gt;). This setup ensured automated and reliable data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Transforming with dbt
&lt;/h3&gt;

&lt;p&gt;My dbt models were structured into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging – Cleaning raw data&lt;/li&gt;
&lt;li&gt;Intermediate – Joining and reshaping&lt;/li&gt;
&lt;li&gt;Marts – Final tables for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I added basic tests to ensure data integrity (e.g., uniqueness checks).&lt;/p&gt;
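&lt;p&gt;Those tests live in a dbt &lt;code&gt;schema.yml&lt;/code&gt;. A minimal sketch, with a hypothetical model and column name:&lt;/p&gt;

```yaml
# schema.yml sketch; model and column names here are invented for illustration.
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```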

&lt;h3&gt;
  
  
  Step 4: Visualization in Metabase
&lt;/h3&gt;

&lt;p&gt;After loading transformed data into BigQuery, I built a Metabase dashboard for insights. This was where the pipeline came to life!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc9cmsgzyvlnr0m5hof7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc9cmsgzyvlnr0m5hof7.png" alt="Ecommerce Sales Insights" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;This project was an eye-opener in many ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration is More Than Just Running Tasks:&lt;/strong&gt; First, working with Airflow tested my patience. Debugging DAG failures—especially silent ones—was frustrating but rewarding when I finally got everything running smoothly. I learned to check logs meticulously and not just rely on UI-based error messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dealing with Large Datasets is Tricky:&lt;/strong&gt; I hit a roadblock when trying to ingest the geolocation dataset into PostgreSQL because it had over a million rows. It made me realize that blindly loading data without thinking about performance can be a nightmare. Next time, I’d explore partitioning, indexing, or even alternative storage formats like Parquet to improve performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt is a Game-Changer for Data Transformations:&lt;/strong&gt; Before this project, I hadn’t fully grasped why dbt is so popular. Now, I see the beauty in modular transformations, clear model dependencies, and automated testing. Writing SQL in a structured, scalable way is a game-changer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization is the Final Mile:&lt;/strong&gt; Building the dashboard in Metabase made me appreciate how important it is to structure data properly for end users. Raw data alone means nothing until it’s turned into insights people can understand. Simplicity is key—too many metrics can overwhelm users instead of helping them make decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation Matters:&lt;/strong&gt; I’ll be honest, writing documentation felt like an afterthought at first. But looking back, having clear, step-by-step documentation makes this project reusable for anyone (including my future self!).&lt;/p&gt;

&lt;p&gt;Overall, this project reinforced my problem-solving skills, patience, and ability to adapt to different tools. Next time, I’d plan error-handling mechanisms upfront instead of treating them as an afterthought. But most importantly, I’ve realized that no data pipeline is truly “done”—there’s always room for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to See the project?&lt;/strong&gt;&lt;br&gt;
I documented the entire process with step-by-step instructions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check it out here: &lt;a href="https://github.com/Ccinaza/ecommerce_elt_pipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to reach out if you have any questions or just want to connect! &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/blessingangus/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Data Ingestion with dlt - Week 3 Bonus</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 16 Feb 2025 22:31:21 +0000</pubDate>
      <link>https://forem.com/ccinaza/data-ingestion-with-dlt-week-3-bonus-25ok</link>
      <guid>https://forem.com/ccinaza/data-ingestion-with-dlt-week-3-bonus-25ok</guid>
      <description>&lt;h2&gt;
  
  
  Data Doesn’t Just Appear—Engineers Make It Happen!
&lt;/h2&gt;

&lt;p&gt;Have you ever opened a dataset and thought, “Wow, this is so clean and structured”? Well, someone worked really hard to make it that way! Welcome to data ingestion—the first step in any powerful data pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Data Pipelines Matter
&lt;/h3&gt;

&lt;p&gt;A data pipeline is more than just moving data from point A to point B. It ensures that raw, unstructured data becomes something usable, reliable, and insightful.&lt;/p&gt;

&lt;p&gt;Here’s what happens under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract: Fetch data from APIs, databases, and files &lt;/li&gt;
&lt;li&gt;Normalize: Clean and structure messy, inconsistent formats &lt;/li&gt;
&lt;li&gt;Load: Store it in data warehouses/lakes for analysis &lt;/li&gt;
&lt;li&gt;Optimize: Use incremental loading to refresh data efficiently&lt;/li&gt;
&lt;/ul&gt;
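&lt;p&gt;The four steps above can be sketched end to end with nothing but the standard library (SQLite stands in for a real warehouse here; a tool like dlt automates this plumbing plus schema inference for you):&lt;/p&gt;

```python
import sqlite3

# Extract: messy records as they might arrive from an API.
raw = [
    {"User Name": "ada", "Signup-Date": "2025-01-03"},
    {"User Name": "grace", "Signup-Date": "2025-01-07"},
]

# Normalize: consistent snake_case keys.
def normalize_key(k):
    return k.lower().replace(" ", "_").replace("-", "_")

records = [{normalize_key(k): v for k, v in row.items()} for row in raw]

# Load: store in a warehouse (an in-memory SQLite database stands in).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_name TEXT, signup_date TEXT)")
con.executemany("INSERT INTO users VALUES (:user_name, :signup_date)", records)

count = con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```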

&lt;h3&gt;
  
  
  Becoming the Data Magician
&lt;/h3&gt;

&lt;p&gt;During our dlt workshop, we explored how to build scalable, self-maintaining pipelines that handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-time and batch ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated schema detection and normalization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Governance and best practices for high-quality data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway?&lt;/strong&gt; If you want to work in data, mastering ingestion pipelines is a game-changer! Whether you’re dealing with messy JSON, SQL databases, or REST APIs, a strong pipeline ensures that data is always ready when you need it.&lt;/p&gt;

&lt;p&gt;What are your favorite tricks for handling messy data? Drop them in the comments! &lt;/p&gt;

&lt;p&gt;#DataEngineering #DLT #ETL #BigData #Python #DataPipelines&lt;/p&gt;

</description>
      <category>dlt</category>
      <category>etl</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
&lt;title&gt;My DE Zoomcamp Journey: Week 3 – Data Warehousing &amp; BigQuery!&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Wed, 12 Feb 2025 09:27:31 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-3-data-warehousing-bigquery-4hih</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-3-data-warehousing-bigquery-4hih</guid>
      <description>&lt;h2&gt;
  
  
  Diving into the World of OLAP, OLTP, and BigQuery
&lt;/h2&gt;

&lt;p&gt;This week was all about &lt;strong&gt;data at scale&lt;/strong&gt;. We explored &lt;strong&gt;data warehousing, OLAP vs. OLTP, and Google BigQuery&lt;/strong&gt;, diving deep into &lt;strong&gt;costs, best practices, and optimization techniques&lt;/strong&gt;. If you're working with &lt;strong&gt;large datasets&lt;/strong&gt;, these concepts are crucial for &lt;strong&gt;efficient querying, cost savings, and performance optimization&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;OLTP vs. OLAP: Understanding the Difference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before jumping into BigQuery, we first differentiated &lt;strong&gt;OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OLTP&lt;/th&gt;
&lt;th&gt;OLAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs essential business operations in real-time&lt;/td&gt;
&lt;td&gt;Supports decision-making, problem-solving, and analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Short, fast user-initiated updates&lt;/td&gt;
&lt;td&gt;Scheduled, long-running batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Normalized for efficiency&lt;/td&gt;
&lt;td&gt;Denormalized for analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Space requirements&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smaller (historical data archived)&lt;/td&gt;
&lt;td&gt;Larger (aggregates vast amounts of data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup &amp;amp; Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Essential for business continuity&lt;/td&gt;
&lt;td&gt;Can reload data from OLTP if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Users&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clerks, customer-facing staff, online shoppers&lt;/td&gt;
&lt;td&gt;Data analysts, executives, business intelligence teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; OLTP handles real-time transactions (e.g., banking apps, e-commerce), while OLAP is for analytics and reporting (e.g., dashboards, trend analysis).&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;BigQuery: A Serverless Data Warehouse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;BigQuery is a &lt;strong&gt;fully managed, serverless data warehouse&lt;/strong&gt; that &lt;strong&gt;scales automatically&lt;/strong&gt;. That means &lt;strong&gt;no servers to manage, no infra headaches&lt;/strong&gt;, and &lt;strong&gt;built-in high availability&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Separation of compute and storage → Scale independently&lt;/li&gt;
&lt;li&gt;On-demand &amp;amp; flat-rate pricing → Pay per query or reserve capacity&lt;/li&gt;
&lt;li&gt;Built-in ML &amp;amp; geospatial analysis → Train models directly in BQ&lt;/li&gt;
&lt;li&gt;Business intelligence (BI) integration → Connect to Looker, Tableau, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;BigQuery Pricing&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pricing Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-demand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5 per TB processed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flat rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$2,000/month for 100 slots (~400TB equivalent)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; &lt;strong&gt;Always estimate query costs&lt;/strong&gt; before running them to avoid unexpected charges.&lt;/p&gt;
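&lt;p&gt;The on-demand arithmetic is simple enough to sanity-check yourself. A small helper (my own illustration; in practice you would take billed bytes from a dry run, and BigQuery rounds bytes up per table):&lt;/p&gt;

```python
TB = 2**40  # on-demand queries are billed per TiB of data processed

def estimate_cost(bytes_processed: int, price_per_tb: float = 5.0) -> float:
    """Rough on-demand cost estimate at the $5/TB rate quoted above."""
    return bytes_processed / TB * price_per_tb

# A query scanning 250 GiB:
cost = estimate_cost(250 * 2**30)  # ≈ $1.22
```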




&lt;h2&gt;
  
  
  &lt;strong&gt;Partitioning &amp;amp; Clustering in BigQuery&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Optimizing queries in BigQuery requires &lt;strong&gt;smart data organization&lt;/strong&gt;. Two key techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Types:&lt;/strong&gt; Time-unit (daily, hourly, monthly), ingestion time, integer range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit:&lt;/strong&gt; Max 4,000 partitions per table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Filtering on a single column (e.g., date-based queries)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Clustering&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses multiple columns to group data together&lt;/li&gt;
&lt;li&gt;Improves filter and aggregate queries&lt;/li&gt;
&lt;li&gt;Can be applied to: &lt;code&gt;DATE&lt;/code&gt;, &lt;code&gt;BOOL&lt;/code&gt;, &lt;code&gt;INT64&lt;/code&gt;, &lt;code&gt;STRING&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;Best for: Multi-column queries and high-cardinality datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Right Strategy:&lt;/strong&gt;  &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;th&gt;Partitioning&lt;/th&gt;
&lt;th&gt;Clustering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querying multiple columns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filter/aggregate on one column&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High cardinality datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automatic Reclustering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;BigQuery automatically re-clusters tables in the background, maintaining efficient query performance without manual intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;BigQuery Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Reduction Strategies:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid &lt;code&gt;SELECT *&lt;/code&gt; – Query only necessary columns
&lt;/li&gt;
&lt;li&gt;Use partitioned/clustered tables – Faster, cheaper queries
&lt;/li&gt;
&lt;li&gt;Stream inserts with caution – Avoid unnecessary real-time data loads
&lt;/li&gt;
&lt;li&gt;Materialize query results – Store intermediate results instead of recomputing
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Performance Optimization:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Filter on partitioned columns – Use partition filters in WHERE clauses
&lt;/li&gt;
&lt;li&gt;Denormalize data – Reduce expensive JOINs when possible
&lt;/li&gt;
&lt;li&gt;Use nested/repeated columns – Optimize for analytics
&lt;/li&gt;
&lt;li&gt;Optimize JOIN order – Start with the largest table, then smaller ones
&lt;/li&gt;
&lt;li&gt;Avoid JavaScript UDFs – They can slow down queries significantly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Golden Rule:&lt;/strong&gt; Just because you can run a query doesn't mean you should. Optimize before you execute! &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts &amp;amp; Gratitude&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;BigQuery is powerful, but with great power comes great responsibility (and costs!). Learning how to structure, query, and optimize data effectively is essential for any data engineer working with large-scale datasets.&lt;/p&gt;

&lt;p&gt;A huge thank you to &lt;strong&gt;Alexey Grigorev&lt;/strong&gt; and the &lt;strong&gt;DataTalks.Club&lt;/strong&gt; team for making this journey possible! &lt;/p&gt;

&lt;p&gt;On to &lt;strong&gt;Week 4!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;#DataEngineering #BigQuery #DataWarehousing #LearningInPublic #DEZoomcamp #GoogleCloud #DataTalksClub&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My DE Zoomcamp Journey: Week 2 - Exploring Workflow Orchestration with Kestra</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Thu, 06 Feb 2025 21:22:07 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-2-exploring-workflow-orchestration-with-kestra-gap</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-2-exploring-workflow-orchestration-with-kestra-gap</guid>
      <description>&lt;p&gt;This week, I took a deep dive into workflow orchestration with Kestra as part of the Data Engineering Zoomcamp by DataTalks.Club. It’s been an insightful journey, even though it started off a bit rough with me battling malaria. Despite that, I pushed through, and I’m really glad I did!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kestra?
&lt;/h2&gt;

&lt;p&gt;Kestra is an open-source, event-driven orchestration platform designed to simplify building scheduled and event-driven workflows. It uses Infrastructure as Code (IaC) practices, which makes creating reliable workflows as easy as writing a few lines of YAML. Think of it like Airflow but with a different flavor. I’m used to Airflow, but I wanted to follow along with Kestra to add another orchestration tool to my repertoire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hands-On with Kestra
&lt;/h3&gt;

&lt;p&gt;Setting up Kestra was straightforward thanks to Docker Compose. With just a few commands, I had a Kestra server and a Postgres database up and running locally.&lt;/p&gt;

&lt;p&gt;Once up, accessing Kestra’s intuitive UI at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt; made managing workflows a breeze. A huge shoutout to Will Russell and his videos — they were instrumental in helping me navigate through Kestra’s features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building ETL Pipelines
&lt;/h2&gt;

&lt;p&gt;One of the main tasks this week was building ETL pipelines for NYC’s Yellow and Green Taxi data. Here’s what we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracted data from CSV files.&lt;/li&gt;
&lt;li&gt;Loaded it into Postgres and later into Google Cloud Storage (GCS) and BigQuery.&lt;/li&gt;
&lt;li&gt;Explored scheduling and backfilling workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was fascinating to see how easily Kestra could schedule tasks and backfill historical data. The YAML configurations were easy to understand, making the flows straightforward to grasp.&lt;/p&gt;
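&lt;p&gt;For a flavor of those YAML configurations, here is a minimal flow with a schedule trigger. Treat the task and trigger type identifiers as illustrative, since they vary between Kestra versions:&lt;/p&gt;

```yaml
# Illustrative only: type identifiers differ between Kestra versions.
id: taxi_etl
namespace: zoomcamp

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log
    message: Starting the monthly taxi load

triggers:
  - id: monthly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"   # 09:00 on the 1st of each month
```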

&lt;h2&gt;
  
  
  Using dbt for Transformation
&lt;/h2&gt;

&lt;p&gt;We also touched on dbt for transforming data in Postgres and BigQuery. Although it was optional, I gave it a shot to see how Kestra handles dbt models. It’s pretty cool that Kestra can sync dbt models from Git and execute them, streamlining the transformation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking it to the Cloud
&lt;/h2&gt;

&lt;p&gt;Moving the ETL pipelines from a local Postgres database to Google Cloud Platform (GCP) was another highlight. Using GCS as a data lake and BigQuery as a data warehouse felt like a natural progression from local development to scalable cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I’m thrilled to add Kestra to my orchestration toolkit. It’s been a rewarding week, and I’m happy to see how versatile this tool is. &lt;/p&gt;

&lt;p&gt;Looking forward to Week 3 and sharing more of my journey with you all. If you’re also taking the DE Zoomcamp, let’s connect and learn together!&lt;/p&gt;

&lt;p&gt;Check out my code for Week 2 on GitHub and feel free to drop your thoughts in the comments! &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Ccinaza" rel="noopener noreferrer"&gt;
        Ccinaza
      &lt;/a&gt; / &lt;a href="https://github.com/Ccinaza/de-zoomcamp-2025" rel="noopener noreferrer"&gt;
        de-zoomcamp-2025
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;
 

</description>
    </item>
    <item>
&lt;title&gt;My DE Zoomcamp Journey: Week 1 - Diving Into Docker, Terraform, and PostgreSQL&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 26 Jan 2025 18:42:49 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-1-diving-into-docker-terraform-and-postgresql-35hl</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-1-diving-into-docker-terraform-and-postgresql-35hl</guid>
      <description>&lt;p&gt;Starting 2025 on a high note, I signed up for the DE Zoomcamp program, and wow, what a week it's been! Here’s what I’ve learned and built during my first week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker and Docker-Compose: Laying the Foundation
&lt;/h3&gt;

&lt;p&gt;I’ve always been curious about Docker, and after this week, I can safely say that I’ve got a solid foundation in it. Docker allows you to create isolated environments for running applications, which is a huge deal when working with different tools in data engineering. &lt;code&gt;docker-compose&lt;/code&gt; was a game-changer, enabling me to spin up multiple services in one go. At first, I was a bit confused about how everything connects, but after some trial and error, it clicked!&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres and pgAdmin
&lt;/h3&gt;

&lt;p&gt;Setting up Postgres locally with Docker was straightforward, and I quickly became familiar with using &lt;code&gt;pgcli&lt;/code&gt; to connect to the database. However, I think my preferred way of interacting with Postgres was through pgAdmin — it just felt more intuitive with its GUI. I could easily navigate through tables and execute SQL queries without having to rely on terminal commands.&lt;/p&gt;

&lt;p&gt;Here’s an example of a simple &lt;code&gt;docker-compose.yml&lt;/code&gt; to run Postgres and pgAdmin:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ekyzh737u6rgqmauzax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ekyzh737u6rgqmauzax.png" alt=" " width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;
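&lt;p&gt;In text form, a minimal compose file along those lines looks roughly like this (credentials are placeholders):&lt;/p&gt;

```yaml
services:
  pgdatabase:
    image: postgres:16
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root        # placeholder; never commit real credentials
      POSTGRES_DB: ny_taxi
    ports:
      - "5432:5432"
    volumes:
      - ./ny_taxi_postgres_data:/var/lib/postgresql/data

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8080:80"
```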

&lt;h3&gt;
  
  
  Ingestion Script Optimization
&lt;/h3&gt;

&lt;p&gt;One of my highlights this week was optimizing my ingestion script. Initially, the script was set up to load one table at a time. I refactored it to load both tables in parallel by breaking out the logic into separate functions. This not only made the process more efficient but also gave me better insight into how to structure reusable code in data pipelines.&lt;/p&gt;
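&lt;p&gt;One way to express that kind of parallel load (not necessarily exactly how my script does it) is with &lt;code&gt;concurrent.futures&lt;/code&gt;, with &lt;code&gt;load_table&lt;/code&gt; standing in for the real per-table ingestion logic:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name: str) -> str:
    """Stand-in for the real ingestion logic (download, chunk, insert)."""
    # ... download CSV, insert into Postgres in chunks ...
    return f"{name}: done"

tables = ["yellow_taxi_trips", "taxi_zone_lookup"]

# Run both table loads concurrently instead of one after the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load_table, tables))
```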

&lt;h3&gt;
  
  
  Terraform and Google Cloud
&lt;/h3&gt;

&lt;p&gt;A major highlight for me was Terraform. Before starting this course, my boss had mentioned Terraform for a project I was working on, but I found the concept pretty overwhelming. When I saw it was part of the curriculum, I was pumped! The tutorials were thorough and easy to follow, and now I’m confident with the basics of setting up Google Cloud Storage and BigQuery using Terraform. It’s definitely something I’ll be using in future projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping Up
&lt;/h3&gt;

&lt;p&gt;Overall, week 1 has been an eye-opener! Docker, PostgreSQL, and Terraform — it’s a lot to digest, but it’s been a rewarding journey so far. The hands-on exercises, especially the Docker-Postgres setup and Terraform configuration, were incredibly valuable. I’m already looking forward to what week 2 has in store so stay tuned for more updates as I continue this journey.&lt;/p&gt;

&lt;p&gt;And hey, you should check out my GitHub Repository to see my project! &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Ccinaza" rel="noopener noreferrer"&gt;
        Ccinaza
      &lt;/a&gt; / &lt;a href="https://github.com/Ccinaza/de-zoomcamp-2025" rel="noopener noreferrer"&gt;
        de-zoomcamp-2025
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
&lt;/div&gt;
 

</description>
    </item>
    <item>
      <title>From Concept to Impact: A Journey Through My Fraud Detection Model</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 21 Dec 2024 10:06:20 +0000</pubDate>
      <link>https://forem.com/ccinaza/from-concept-to-impact-a-journey-through-my-fraud-detection-model-2nc3</link>
      <guid>https://forem.com/ccinaza/from-concept-to-impact-a-journey-through-my-fraud-detection-model-2nc3</guid>
      <description>&lt;p&gt;Fraud detection in financial systems is like finding a needle in a haystack—except the haystack is dynamic, ever-changing, and massive. How do you spot these fraudulent transactions? This was the challenge I set out to tackle: developing a fraud detection model designed not only to identify suspicious activity in a vast ocean of data but to adapt and evolve as new patterns of fraud emerge.&lt;/p&gt;

&lt;p&gt;Here’s the story of how I turned a blank slate into a robust fraud detection system, complete with insights, challenges, and breakthroughs along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Spark: Why This Project?
&lt;/h3&gt;

&lt;p&gt;Imagine millions of transactions flowing every second, and hidden among them are activities that could cost businesses billions. My mission was clear: create a system that detects these anomalies without crying wolf at every shadow. With this in mind, I envisioned a solution powered by synthetic data, innovative feature engineering, and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Playground: Data Generation
&lt;/h3&gt;

&lt;p&gt;Great models require great data, but fraud data is rare. So, I built my own. Using Python’s &lt;code&gt;Faker&lt;/code&gt; and &lt;code&gt;NumPy&lt;/code&gt; libraries, I generated a synthetic dataset of &lt;strong&gt;1,000,000&lt;/strong&gt; transactions designed to mimic real-world patterns. Each transaction carried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transaction IDs&lt;/strong&gt;, unique yet random.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Account IDs and Receiver Account IDs&lt;/strong&gt;, with 20% and 15% uniqueness, respectively, ensuring realistic overlaps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transaction Amounts&lt;/strong&gt;, ranging from micro to mega, distributed to reflect plausible scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timestamps&lt;/strong&gt;, to capture hourly, daily, and seasonal trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Categories like &lt;strong&gt;Account Type&lt;/strong&gt; (Personal or Business), &lt;strong&gt;Payment Type&lt;/strong&gt; (Credit or Debit), and &lt;strong&gt;Transaction Type&lt;/strong&gt; (Bank Transfer, Airtime, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3kmwt2kypkysc4cevv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3kmwt2kypkysc4cevv9.png" alt="data generation 1" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6p6n5ox728cbxbkrfv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6p6n5ox728cbxbkrfv1.png" alt="data generation" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset came alive with personal and business accounts, transactions ranging from tiny purchases to hefty transfers, and diverse transaction types like deposits, airtime purchases, and even sports betting.&lt;/p&gt;
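&lt;p&gt;A condensed sketch of that generation step using NumPy and pandas (the original also used Faker for IDs; the field names, pool sizes, and distributions here are illustrative stand-ins for the real script):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # the real dataset used 1,000,000

# Account pools sized at 20% / 15% of n so IDs repeat realistically.
accounts = [f"ACC{i:05d}" for i in range(int(n * 0.20))]
receivers = [f"RCV{i:05d}" for i in range(int(n * 0.15))]

df = pd.DataFrame({
    "transaction_id": [f"TXN{i:07d}" for i in range(n)],
    "account_id": rng.choice(accounts, size=n),
    "receiver_account_id": rng.choice(receivers, size=n),
    # Log-normal amounts: many small transactions, a few very large ones.
    "amount": np.round(rng.lognormal(mean=9, sigma=2, size=n), 2),
    # Random timestamps spread across a year to capture time-of-day trends.
    "timestamp": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, size=n), unit="s"),
    "account_type": rng.choice(["Personal", "Business"], size=n, p=[0.8, 0.2]),
    "payment_type": rng.choice(["Credit", "Debit"], size=n),
    "transaction_type": rng.choice(
        ["Bank Transfer", "Airtime", "Deposit", "Sports Betting"], size=n),
})
```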

&lt;h3&gt;
  
  
  The Art of Transformation: Feature Engineering
&lt;/h3&gt;

&lt;p&gt;With the data ready, I turned my focus to feature engineering—a detective’s toolkit for uncovering hidden patterns. This is where the real excitement began. I calculated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Account Age&lt;/strong&gt;: How long had each account existed? This helps to spot new accounts behaving oddly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daily Transaction Amount&lt;/strong&gt;: How much money flowed through each account daily?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frequency Metrics&lt;/strong&gt;: Tracking how often an account interacted with specific receivers within short windows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Delta&lt;/strong&gt;: Measuring the gap between consecutive transactions to flag bursts of activity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features would serve as clues, helping the model sniff out suspicious activity. For example, a brand-new account making unusually large transfers was worth investigating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwqq1kfs1cnph5mhz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwqq1kfs1cnph5mhz9.png" alt="Feature Engineering" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;
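&lt;p&gt;In pandas, these features reduce to a few groupby operations. A self-contained sketch on a toy frame (the column names are assumptions standing in for my real schema):&lt;/p&gt;

```python
import pandas as pd

# Tiny illustrative frame; the real pipeline ran this over 1M rows.
df = pd.DataFrame({
    "account_id": ["A1", "A1", "A2", "A1"],
    "receiver_account_id": ["R1", "R1", "R2", "R1"],
    "amount": [100.0, 250.0, 40.0, 5_000.0],
    "timestamp": pd.to_datetime([
        "2024-03-01 10:00", "2024-03-01 10:20",
        "2024-03-02 09:00", "2024-03-01 10:25"]),
    "account_created": pd.to_datetime([
        "2024-02-25", "2024-02-25", "2023-01-01", "2024-02-25"]),
})
df = df.sort_values(["account_id", "timestamp"])

# Account age in days at transaction time.
df["account_age_days"] = (df["timestamp"] - df["account_created"]).dt.days
# Total amount moved by the account that day.
df["daily_amount"] = df.groupby(
    ["account_id", df["timestamp"].dt.date])["amount"].transform("sum")
# Running count of how often this sender hit the same receiver.
df["pair_frequency"] = df.groupby(
    ["account_id", "receiver_account_id"]).cumcount() + 1
# Gap since the account's previous transaction, in seconds.
df["time_delta_s"] = df.groupby("account_id")["timestamp"].diff().dt.total_seconds()
```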

&lt;p&gt;Drawing from domain knowledge, I crafted rules to classify transactions as suspicious. These rules acted as a watchful guardian over the dataset. Here are a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Big Spender Alert&lt;/strong&gt;: Personal accounts transferring over 5 million in a single transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Fire Transactions&lt;/strong&gt;: More than three transactions to the same account in an hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Midnight Madness&lt;/strong&gt;: Large bank transfers during late-night hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I coded these rules into a function that flagged transactions as suspicious or safe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgcqqyv6f6e6kwn8mlws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgcqqyv6f6e6kwn8mlws.png" alt="Rules" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
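&lt;p&gt;A hedged re-creation of that flagging function: the 5 million threshold for the first rule comes from the post, while the &lt;code&gt;hourly_pair_count&lt;/code&gt; column name and the late-night amount threshold are assumptions for illustration.&lt;/p&gt;

```python
import pandas as pd

def flag_suspicious(row):
    # Big Spender Alert: personal account moving over 5M in one transaction.
    if row["account_type"] == "Personal" and row["amount"] > 5_000_000:
        return 1
    # Rapid Fire Transactions: more than three transfers to the same
    # receiver within an hour (count precomputed per row here).
    if row["hourly_pair_count"] > 3:
        return 1
    # Midnight Madness: large bank transfers in the late-night hours
    # (the 1M cutoff is an illustrative assumption).
    if (row["transaction_type"] == "Bank Transfer"
            and row["amount"] > 1_000_000 and row["hour"] < 5):
        return 1
    return 0

df = pd.DataFrame({
    "account_type": ["Personal", "Business", "Personal", "Business"],
    "amount": [6_000_000, 200_000, 1_500_000, 50_000],
    "hourly_pair_count": [1, 5, 1, 1],
    "transaction_type": ["Bank Transfer", "Airtime", "Bank Transfer", "Airtime"],
    "hour": [14, 10, 2, 12],
})
df["is_suspicious"] = df.apply(flag_suspicious, axis=1)
```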

&lt;h3&gt;
  
  
  Preparing the Model’s Vocabulary
&lt;/h3&gt;

&lt;p&gt;Before teaching a machine learning model to detect fraud, I needed to make the data comprehensible. Think of it like teaching a new language—the model needed to understand categorical variables like account types or transaction methods as numerical values.&lt;/p&gt;

&lt;p&gt;I achieved this by encoding these categories. For instance, the transaction type (“Bank Transfer,” “Airtime,” etc.) was converted into numerical columns using one-hot encoding, where each unique value became its own column with binary indicators. This ensured the model could process the data without losing the meaning behind categorical features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ls6mk72ogi6s0xrvncr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ls6mk72ogi6s0xrvncr.png" alt="Encoding" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;
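&lt;p&gt;With pandas, that one-hot step is a single call; a minimal sketch (the &lt;code&gt;txn&lt;/code&gt; prefix is illustrative):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"transaction_type": ["Bank Transfer", "Airtime", "Bank Transfer"]})

# Each unique category becomes its own binary indicator column.
encoded = pd.get_dummies(df, columns=["transaction_type"], prefix="txn")
```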

&lt;h3&gt;
  
  
  The Workhorses: Model Development
&lt;/h3&gt;

&lt;p&gt;With a dataset enriched by rules and features, it was time to bring in the big guns: machine learning. I trained several models, each with its unique strengths:&lt;br&gt;
1. &lt;strong&gt;Logistic Regression&lt;/strong&gt;: Reliable, interpretable, and a great starting point.&lt;br&gt;
2. &lt;strong&gt;XGBoost&lt;/strong&gt;: A powerhouse for detecting complex patterns.&lt;/p&gt;

&lt;p&gt;But first, I tackled the class imbalance—fraudulent transactions were far outnumbered by legitimate ones. Using the SMOTE oversampling technique, I balanced the scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before SMOTE:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozlww5myvy2lr99bwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozlww5myvy2lr99bwb.png" alt="Imbalanced data" width="566" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After SMOTE&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6kh7ec1v14v2nydyv40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6kh7ec1v14v2nydyv40.png" alt="After SMOTE" width="566" height="393"&gt;&lt;/a&gt;&lt;/p&gt;
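&lt;p&gt;SMOTE (commonly provided by the imbalanced-learn library) balances the classes by interpolating synthetic points between a minority sample and one of its nearest minority neighbours. Here is a minimal NumPy/scikit-learn sketch of that idea, not the library implementation:&lt;/p&gt;

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a picked point and a random one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    picks = rng.integers(0, len(X_min), n_new)
    neigh = idx[picks, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))             # how far along the segment
    return X_min[picks] + gap * (X_min[neigh] - X_min[picks])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))             # 20 minority rows, 3 features
X_new = smote_minority(X_min, n_new=80)      # synthesise 80 extra rows
```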

&lt;h4&gt;
  
  
  Training and Results
&lt;/h4&gt;

&lt;p&gt;The models were evaluated using metrics like &lt;strong&gt;Precision&lt;/strong&gt;, &lt;strong&gt;Recall&lt;/strong&gt;, and &lt;strong&gt;AUC (Area Under the Curve)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logistic Regression&lt;/strong&gt;: AUC of 0.97, Recall of 92%.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopii8j6f7wt406y8tdew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopii8j6f7wt406y8tdew.png" alt="Logistic Reg results" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;XGBoost&lt;/strong&gt;: AUC of 0.99, Recall of 94%.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzfggatn6u27ywzpmcrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzfggatn6u27ywzpmcrg.png" alt="XGBoost results" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The clear winner? XGBoost, with its ability to capture intricate fraud patterns.&lt;/p&gt;
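&lt;p&gt;The evaluation itself boils down to a few scikit-learn calls. A sketch on toy data (the AUC and recall figures above came from the real dataset; xgboost's &lt;code&gt;XGBClassifier&lt;/code&gt; slots in where &lt;code&gt;LogisticRegression&lt;/code&gt; appears here):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the engineered transaction features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model = LogisticRegression().fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]     # scores needed for AUC
precision = precision_score(y_te, pred)
recall = recall_score(y_te, pred)
auc = roc_auc_score(y_te, proba)
```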

&lt;h3&gt;
  
  
  Smarter Every Day: Feedback Loop Integration
&lt;/h3&gt;

&lt;p&gt;A standout feature of my system was its adaptability. I designed a feedback loop where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flagged transactions were reviewed by a fraud team.&lt;/li&gt;
&lt;li&gt;Their feedback updated the training data.&lt;/li&gt;
&lt;li&gt;Models were retrained periodically to stay sharp against new fraud tactics.&lt;/li&gt;
&lt;/ul&gt;
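&lt;p&gt;The loop above can be sketched in a few lines: reviewed labels come back as feature/label pairs and are folded into the training set before a refit (the data shapes and the helper name here are illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def retrain_with_feedback(model, X_train, y_train, X_fb, y_fb):
    """Append analyst-reviewed transactions to the training data and refit."""
    X_all = np.vstack([X_train, X_fb])
    y_all = np.concatenate([y_train, y_fb])
    return model.fit(X_all, y_all), X_all, y_all

# Fraud-team feedback arriving for 20 flagged transactions.
X_fb = rng.normal(size=(20, 3))
y_fb = (X_fb[:, 0] > 0.5).astype(int)
model, X_train, y_train = retrain_with_feedback(model, X_train, y_train, X_fb, y_fb)
```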

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;After a journey filled with data wrangling, feature engineering, and machine learning, the model was ready for deployment. The XGBoost model, saved as a .pkl file, is now a reliable tool for fraud detection.&lt;/p&gt;
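&lt;p&gt;Persisting and restoring the artifact is a short pickle round-trip. A sketch with a stand-in scikit-learn model (the real &lt;code&gt;.pkl&lt;/code&gt; file held the trained XGBoost classifier):&lt;/p&gt;

```python
import os
import pickle
import tempfile
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model, then round-trip it through a .pkl file.
X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.gettempdir(), "fraud_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```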

&lt;h2&gt;
  
  
  Epilogue: Reflections and Future Directions
&lt;/h2&gt;

&lt;p&gt;Building this fraud detection model taught me the power of combining business knowledge, data science, and machine learning. But the journey doesn’t end here. Fraud evolves, and so must the defenses against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;p&gt;This project was more than a technical exercise. It was a journey in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Designing systems that handle vast amounts of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptability&lt;/strong&gt;: Building models that evolve with feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Bridging the gap between technical teams and domain experts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  In the future, I plan to:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Explore deep learning for anomaly detection.&lt;/li&gt;
&lt;li&gt;Implement real-time monitoring systems.&lt;/li&gt;
&lt;li&gt;Continuously refine rules based on new fraud patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fraud detection isn’t just about numbers—it’s about safeguarding trust. And this project, I hope, is a small but meaningful step in that direction.&lt;/p&gt;

&lt;p&gt;Thank you for reading. Feel free to share your thoughts or questions in the comments.&lt;/p&gt;

</description>
      <category>frauddetection</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Unravelling Linearity: My Journey in Regression Modeling</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 14 Apr 2024 13:38:25 +0000</pubDate>
      <link>https://forem.com/ccinaza/unravelling-linearity-my-journey-in-regression-modeling-4k0e</link>
      <guid>https://forem.com/ccinaza/unravelling-linearity-my-journey-in-regression-modeling-4k0e</guid>
      <description>&lt;p&gt;Imagine a detective board filled with clues – features or independent variables – that might help solve a case (the dependent variable). Linearity states that the relationship between these clues and the outcome we're trying to predict should be linear, like a straight line on a graph. If this assumption isn't met, our model's predictions could be skewed. A curved line, for instance, might suggest a more complex relationship.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17i1gcsm3iadqgkah8li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17i1gcsm3iadqgkah8li.png" alt="Graphs showing Linearity and Non-Linearity" width="688" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Case: Unveiling Linearity in My Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where my detective work began. I wanted to build a regression model that would predict the price of houses using a housing dataset.  To ensure the validity of my model, I needed to verify the linearity assumption between the independent and dependent variables. This involved employing a combination of diagnostic techniques, including visual inspection (residual plot) and statistical tests (Rainbow Test). Let's dive into the code snippets and diagnostic plots to gain a deeper understanding of this validation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imports&lt;/strong&gt;&lt;br&gt;
I imported the following libraries: &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;scipy&lt;/code&gt;, &lt;code&gt;statsmodels&lt;/code&gt;, and &lt;code&gt;sklearn&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu56xckaip77h51qxos7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu56xckaip77h51qxos7.png" alt="Imports" width="627" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleaning and Feature Engineering&lt;/strong&gt;&lt;br&gt;
During data cleaning, I hit a pivotal challenge with one of my predictors: the "location" column had over 800 unique values! One-hot-encoding it directly would create a nightmare – a massive explosion in features due to the curse of dimensionality, which could cripple my model's ability to learn.&lt;/p&gt;

&lt;p&gt;To tackle this, I implemented a &lt;code&gt;group_location&lt;/code&gt; function that grouped infrequent locations into an &lt;code&gt;"Other"&lt;/code&gt; category. This approach condensed the number of categorical values, mitigating the adverse effects of high dimensionality and facilitating smoother model training.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehbn9xm3riwv4np6v0mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehbn9xm3riwv4np6v0mg.png" alt="group_location function" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;
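&lt;p&gt;A minimal sketch of that grouping logic (the &lt;code&gt;min_count&lt;/code&gt; threshold and the location names are illustrative; the real function ran over the housing dataframe's location column):&lt;/p&gt;

```python
import pandas as pd

def group_location(series, min_count=10):
    """Replace locations seen fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")

locations = pd.Series(["Lekki"] * 15 + ["Yaba"] * 12 + ["Epe"] * 2 + ["Badagry"])
grouped = group_location(locations)
```

&lt;p&gt;Frequent locations survive as their own categories, while the long tail collapses into a single &lt;code&gt;"Other"&lt;/code&gt; level that one-hot encoding can handle cheaply.&lt;/p&gt;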

&lt;p&gt;&lt;strong&gt;Fitting The Model - An OLS Model&lt;/strong&gt;&lt;br&gt;
After defining the dependent and independent variables and adding an intercept term, I instantiated and fitted the model. The data was then split for evaluation, predictions were made, and residuals were calculated.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs2s9v9ga3acey957n2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs2s9v9ga3acey957n2q.png" alt="fitting the model" width="781" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Residual Plot&lt;/strong&gt;&lt;br&gt;
The residual plot shows a slight curvature, and the residuals spread further from zero at higher fitted values. The curvature hints at some non-linearity, while the widening spread suggests a potential violation of the homoscedasticity assumption: the variance of the errors may not be constant across the range of predicted prices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkimoh2rwo9rjt9t259dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkimoh2rwo9rjt9t259dq.png" alt="Residual Plot" width="790" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rainbow Test&lt;/strong&gt;&lt;br&gt;
A high p-value (typically &amp;gt; 0.05) suggests that there is no evidence against linearity, meaning the linear model is an appropriate fit for the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51c5lc3iz4e6dhzdxhki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51c5lc3iz4e6dhzdxhki.png" alt="Rainbow Test" width="643" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test statistic &lt;code&gt;(0.9555)&lt;/code&gt; is an F-ratio close to 1, meaning the model fits the central portion of the data roughly as well as it fits the full sample. Still, the test might not be entirely reliable given the pattern observed in the residuals.&lt;/p&gt;

&lt;p&gt;The high p-value &lt;code&gt;(0.9104)&lt;/code&gt; means we fail to reject the null hypothesis of linearity based on the test alone. However, the visual evidence from the residuals suggests further investigation is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
While the rainbow test didn't statistically reject linearity, the residual plot's non-random pattern hints at a potential issue with homoscedasticity.  This calls for further investigation!&lt;/p&gt;

&lt;p&gt;To address this, I plan to &lt;strong&gt;explore data transformations&lt;/strong&gt;, &lt;strong&gt;investigate alternative models&lt;/strong&gt;, and &lt;strong&gt;perform additional diagnostics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do you approach validating assumptions in your regression models? Share your strategies, insights, or questions in the comments below – I'm curious to hear your perspective!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>data</category>
    </item>
  </channel>
</rss>
