<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Blessing Angus</title>
    <description>The latest articles on Forem by Blessing Angus (@ccinaza).</description>
    <link>https://forem.com/ccinaza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1422545%2Fd0c542f2-7eb0-44e9-a4f5-1fe14d4d4e7a.jpeg</url>
      <title>Forem: Blessing Angus</title>
      <link>https://forem.com/ccinaza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ccinaza"/>
    <language>en</language>
    <item>
      <title>Building an Incremental Zoho Desk to BigQuery Pipeline: Lessons from the Trenches</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 28 Feb 2026 22:35:39 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-incremental-zoho-desk-to-bigquery-pipeline-lessons-from-the-trenches-op1</link>
      <guid>https://forem.com/ccinaza/building-an-incremental-zoho-desk-to-bigquery-pipeline-lessons-from-the-trenches-op1</guid>
      <description>&lt;p&gt;&lt;em&gt;What I thought would take a few days ended up taking weeks so here's everything I learned building a production Zoho Desk to BigQuery pipeline from scratch.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When my company decided to centralise customer support analytics, the task landed on my plate: pull data from Zoho Desk, land it in BigQuery, transform with dbt, done. The plan looked clean on paper.&lt;/p&gt;

&lt;p&gt;What followed was a masterclass in why production data engineering is never as simple as the happy path suggests.&lt;/p&gt;

&lt;p&gt;This is the story of building that pipeline: the architecture decisions, the walls I hit, and the lessons I'll carry into every pipeline I build from here.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Starting Point: A Full Load That Took Forever
&lt;/h2&gt;

&lt;p&gt;The first working version of the pipeline was blunt but functional. Every day, pull every ticket ever created from the Zoho Desk API and overwrite the BigQuery table. Simple. Predictable. And for tickets, completely untenable at scale because the table had over a million rows and was growing daily.&lt;/p&gt;

&lt;p&gt;I needed incremental loads. But before I could get there, I had a different problem to solve: how do I get all the historical data into BigQuery in the first place without running an API job for days?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bootstrapping Problem: When the API Is Too Slow for the Initial Load
&lt;/h2&gt;

&lt;p&gt;Loading years of historical data through a paginated API isn't a pipeline problem; it's a waiting problem. For a table with hundreds of thousands of rows, even a well-optimised API pull can take hours or days just for the seed load.&lt;/p&gt;

&lt;p&gt;The solution was to sidestep the API entirely for the initial load. Zoho Desk has a built-in &lt;strong&gt;data backup feature&lt;/strong&gt; that exports your entire account data as CSV files. I used this to export a full snapshot of tickets, threads, contacts, and calls, then loaded each CSV directly into BigQuery via the BQ Console UI.&lt;/p&gt;

&lt;p&gt;The UI load process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Format: CSV&lt;/li&gt;
&lt;li&gt;Schema: defined explicitly, not auto-detected (my first pass used auto-detect, and it came back to bite me — more on why this matters later)&lt;/li&gt;
&lt;li&gt;Skip leading rows: 1 (header row)&lt;/li&gt;
&lt;li&gt;Allow quoted newlines: yes (critical for fields like ticket descriptions that contain line breaks)&lt;/li&gt;
&lt;li&gt;Allow jagged rows: yes (exported rows sometimes omit trailing optional fields)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the historical snapshot was in BigQuery, I set the incremental pipeline's &lt;code&gt;start_date&lt;/code&gt; to the backup date. The first scheduled run picks up any changes from that day forward, no gap, no overlap.&lt;/p&gt;
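&lt;p&gt;For reference, those Console options map directly onto the BigQuery load-job settings. Here is a minimal sketch in plain Python; the dict keys mirror the &lt;code&gt;LoadJobConfig&lt;/code&gt; attribute names from the google-cloud-bigquery client, and the helper itself is illustrative rather than code from the pipeline:&lt;/p&gt;

```python
def backup_load_config(schema):
    """Load settings for the one-time CSV bootstrap (a sketch; keys
    mirror google-cloud-bigquery LoadJobConfig attribute names)."""
    return {
        "source_format": "CSV",
        "skip_leading_rows": 1,         # header row
        "allow_quoted_newlines": True,  # ticket descriptions contain line breaks
        "allow_jagged_rows": True,      # some rows omit trailing optional fields
        "autodetect": False,            # schema is defined explicitly
        "schema": schema,               # list of (name, type) pairs
    }

cfg = backup_load_config([("id", "STRING"), ("createdTime", "TIMESTAMP")])
```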

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For large initial loads, don't fight the API. Use native export features if they exist. The pipeline is for keeping data fresh; getting the history in is a one-time bootstrapping problem that deserves its own solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Code Generation Over Copy-Paste
&lt;/h2&gt;

&lt;p&gt;Here's how the pipeline itself is structured. Rather than writing a separate Airflow DAG for every Zoho Desk endpoint, I built a code-generation system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;custom Airflow operator&lt;/strong&gt; (&lt;code&gt;ZohoDeskToGCSOperator&lt;/code&gt;) handles all the API extraction logic: pagination, OAuth, concurrent detail fetching, incremental search&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Jinja template&lt;/strong&gt; defines the DAG structure once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML config files&lt;/strong&gt;, one per endpoint, each defining that endpoint's parameters: schedule, columns, schema, endpoint type&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;generator script&lt;/strong&gt; renders YAML + template to DAG Python file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adding a new endpoint means writing a YAML file and running the generator, not copying a DAG from scratch. The template handles the branching logic for different endpoint types automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Zoho Desk API
      │
      ▼
ZohoDeskToGCSOperator  ──►  GCS (staging CSV)
                                    │
                                    ▼
                GCSToBigQueryOperator  ──►  BigQuery (_staging table)
                                                              │
                                                              ▼
                                            BigQueryInsertJobOperator
                                              (MERGE into main table)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
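&lt;p&gt;The generation step itself is tiny. A minimal sketch, with &lt;code&gt;string.Template&lt;/code&gt; standing in for Jinja and a plain dict standing in for the parsed YAML (names here are illustrative, not the actual project files):&lt;/p&gt;

```python
from string import Template

# A stand-in for the real Jinja DAG template; in the project the rendered
# output is a full Airflow DAG file, not these three lines.
DAG_TEMPLATE = Template(
    "dag_id = 'zoho_desk_${endpoint}'\n"
    "schedule = '${schedule}'\n"
    "endpoint_type = '${endpoint_type}'\n"
)

def render_dag(config):
    """Render one endpoint's config dict into DAG source text."""
    return DAG_TEMPLATE.substitute(config)

tickets_cfg = {
    "endpoint": "tickets",
    "schedule": "@daily",
    "endpoint_type": "incremental",
}
print(render_dag(tickets_cfg))
```

&lt;p&gt;Adding an endpoint is then just another config dict (or YAML file) fed through the same function.&lt;/p&gt;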



&lt;p&gt;For large transactional tables (tickets, contacts, threads, calls), data lands in a &lt;code&gt;_staging&lt;/code&gt; table first, then gets merged into the main table, updating existing rows and inserting new ones. For small reference tables (agents, teams, departments), a daily &lt;code&gt;WRITE_TRUNCATE&lt;/code&gt; is sufficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Not All APIs Are Created Equal
&lt;/h2&gt;

&lt;p&gt;The ticket and contact endpoints support Zoho's &lt;code&gt;modifiedTimeRange&lt;/code&gt; parameter. This lets you pass a start and end timestamp and get back only records modified in that window. Perfect for incremental loads.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/calls&lt;/code&gt; endpoint does not. Pass &lt;code&gt;modifiedTimeRange&lt;/code&gt; to it and you get a 422 back. &lt;/p&gt;

&lt;p&gt;The workaround: sort by &lt;code&gt;createdTime&lt;/code&gt; descending and stop paginating as soon as the oldest record on the current page predates your window. Since calls are append-only in practice, this is equivalent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;data_interval_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Don't assume API feature parity across endpoints from the same vendor. Test every endpoint independently before writing a single line of pipeline code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: A Reserved Word Hiding in Your Column Names
&lt;/h2&gt;

&lt;p&gt;The threads table has a column called &lt;code&gt;to&lt;/code&gt; — as in the recipient of a thread. Perfectly reasonable name. Except &lt;code&gt;TO&lt;/code&gt; is a reserved keyword in BigQuery SQL.&lt;/p&gt;

&lt;p&gt;The MERGE statement was generating SQL like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`from`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`to`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`subject`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`from`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`to`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`subject`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;to&lt;/code&gt; unquoted, BigQuery's parser sees the keyword &lt;code&gt;TO&lt;/code&gt; in an unexpected position and throws:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Syntax error: Unexpected keyword TO at [40:130]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was to backtick-quote the column name in the generated MERGE SQL.&lt;/p&gt;
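&lt;p&gt;The quote-everything rule is a one-liner to enforce in a generator. A sketch (the helper names here are hypothetical, not the pipeline's actual code):&lt;/p&gt;

```python
def quote_ident(name):
    """Backtick-quote a BigQuery identifier, escaping embedded backticks."""
    return "`" + name.replace("`", "\\`") + "`"

def insert_clause(columns):
    # Quote every column unconditionally; no reserved-word list to maintain.
    cols = ", ".join(quote_ident(c) for c in columns)
    vals = ", ".join("S." + quote_ident(c) for c in columns)
    return f"INSERT ({cols}) VALUES ({vals})"

print(insert_clause(["from", "to", "subject"]))
# INSERT (`from`, `to`, `subject`) VALUES (S.`from`, S.`to`, S.`subject`)
```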

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; When generating SQL programmatically, always quote all identifiers. You won't always know which column names will collide with reserved words, especially when the schema is driven by a third-party API you don't control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: The MERGE That Couldn't Match Rows
&lt;/h2&gt;

&lt;p&gt;After fixing the syntax error, the threads MERGE hit a different wall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE/MERGE must match at most one source row for each target row
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one took some digging. The threads endpoint works like this: search for tickets modified in the time window, then fetch all threads for each of those tickets. The problem is that Zoho's &lt;code&gt;modifiedTimeRange&lt;/code&gt; search is paginated and the same ticket can appear on multiple pages if the result set shifts between requests.&lt;/p&gt;

&lt;p&gt;When that happens, threads get fetched for the same ticket twice. The staging table ends up with duplicate thread IDs. BigQuery's MERGE correctly refuses to update a target row when multiple source rows match it.&lt;/p&gt;

&lt;p&gt;I fixed it at two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the operator (Python):&lt;/strong&gt; deduplicate ticket IDs before fetching threads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ticket_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dict.fromkeys&lt;/code&gt; removes duplicates while preserving insertion order, which makes it cleaner than converting to a set and back (a set round-trip would also scramble the ordering).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the MERGE SQL (template):&lt;/strong&gt; add a deduplication guard in the &lt;code&gt;USING&lt;/code&gt; clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`staging_table`&lt;/span&gt;
    &lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;`id`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python fix prevents the problem from occurring. The SQL fix is a safety net for edge cases I haven't thought of yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For MERGE pipelines, always add a &lt;code&gt;QUALIFY ROW_NUMBER()&lt;/code&gt; dedup guard in the staging select. Even if your source looks clean, it defends against edge cases you haven't anticipated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Auto-Detect Might Tell Lies
&lt;/h2&gt;

&lt;p&gt;Here's where the bootstrapping decision from earlier came back to bite me. When I first loaded the initial backup CSVs into BigQuery, I let auto-detect infer the schema rather than defining it explicitly. It was fast and convenient, but it set the stage for a string of failures.&lt;/p&gt;

&lt;p&gt;Auto-detect made choices that didn't match what the incremental pipeline expected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;onholdTime&lt;/code&gt; — stored by Zoho as a timestamp string, loaded by auto-detect as &lt;code&gt;STRING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tagCount&lt;/code&gt; — a count field, auto-detected as &lt;code&gt;INT64&lt;/code&gt; (actually correct, but the YAML I'd written said &lt;code&gt;STRING&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isEscalated&lt;/code&gt; — loaded as &lt;code&gt;STRING&lt;/code&gt; because the CSV values were &lt;code&gt;"true"&lt;/code&gt;/&lt;code&gt;"false"&lt;/code&gt; strings, not proper booleans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these caused problems until the first MERGE ran and tried to assign a &lt;code&gt;TIMESTAMP&lt;/code&gt; value to a &lt;code&gt;STRING&lt;/code&gt; column. BigQuery's type enforcement at MERGE time is strict — and unforgiving.&lt;/p&gt;

&lt;p&gt;The fix was to query &lt;code&gt;INFORMATION_SCHEMA.COLUMNS&lt;/code&gt; on the main table and reconcile every column type against the YAML schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.INFORMATION_SCHEMA.COLUMNS`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'zoho_desk_tickets'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ordinal_position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
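&lt;p&gt;The reconciliation itself is just a diff between the query result and the YAML schema. A sketch with both sides as plain dicts (the helper is hypothetical; the real check reads the query output and the parsed config):&lt;/p&gt;

```python
def schema_drift(actual, expected):
    """Return (column, actual_type, expected_type) mismatches between the
    table's INFORMATION_SCHEMA types and the YAML schema."""
    drift = []
    for col, want in expected.items():
        got = actual.get(col)
        if got is not None and got != want:
            drift.append((col, got, want))
    return drift

actual = {"onholdTime": "STRING", "tagCount": "INT64", "isEscalated": "STRING"}
expected = {"onholdTime": "TIMESTAMP", "tagCount": "INT64", "isEscalated": "BOOL"}
print(schema_drift(actual, expected))
# [('onholdTime', 'STRING', 'TIMESTAMP'), ('isEscalated', 'STRING', 'BOOL')]
```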



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; The initial load is the source of truth — not your assumptions about what the types &lt;em&gt;should&lt;/em&gt; be. Always verify &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; before writing a schema config for any table that was bootstrapped manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  The dbt Layer: Fixing What the Raw Layer Can't
&lt;/h2&gt;

&lt;p&gt;With the raw pipeline stable, dbt handles the transformation into clean, analyst-ready tables: casting string-typed columns to their proper types, renaming, and modelling for analysis.&lt;/p&gt;

&lt;p&gt;This keeps the raw layer as a faithful copy of the source, while the transformation layer handles type normalisation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use native export for the initial load&lt;/strong&gt;: don't fight the API for historical data. Export a full backup, load via the UI, then let the pipeline handle increments from that point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never use auto-detect for a table that an incremental pipeline will MERGE into&lt;/strong&gt;: define the schema explicitly, and verify with &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; immediately after loading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add the &lt;code&gt;QUALIFY ROW_NUMBER()&lt;/code&gt; dedup guard from day one&lt;/strong&gt;: it costs nothing and saves you from mysterious MERGE failures later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test every API endpoint's query parameter support independently&lt;/strong&gt;: don't inherit assumptions from one endpoint to another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backtick-quote all identifiers in generated SQL&lt;/strong&gt;: reserved word collisions are unpredictable and the fix is trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep the raw and transformation layers separate&lt;/strong&gt;: having raw data land in BigQuery with minimal transformation, and a separate dbt layer for typing and renaming, makes debugging far easier. You can always re-run dbt without re-hitting the API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;None of the challenges described here were exotic. Reserved words, duplicate rows, type mismatches, API inconsistencies: these are bread-and-butter data engineering problems. What made them feel hard was hitting them one at a time in production, under pressure, on a pipeline that analysts were waiting on.&lt;/p&gt;

&lt;p&gt;The pipeline is now running reliably in production: tickets, contacts, threads, agents, teams, departments, and accounts incrementally loaded daily. &lt;/p&gt;

&lt;p&gt;If you're building something similar — whether with Zoho, Salesforce, HubSpot, or any SaaS API — I hope these lessons save you a few hours of head-scratching.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The pipeline is built with Apache Airflow, Google Cloud Storage, BigQuery, and dbt. The custom operator pattern and code-generation approach described here are applicable to any REST API integration.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>zohodeskapi</category>
      <category>bigquery</category>
      <category>elt</category>
    </item>
    <item>
      <title>Docker Explained with a Food Analogy</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Wed, 17 Sep 2025 21:24:09 +0000</pubDate>
      <link>https://forem.com/ccinaza/docker-explained-with-a-food-analogy-4gne</link>
      <guid>https://forem.com/ccinaza/docker-explained-with-a-food-analogy-4gne</guid>
<description>&lt;p&gt;If you're just hearing about Docker, it can sound intimidatingly complex: containers, images, engines, orchestration… it's a lot! But let's break it down using something we all understand: &lt;strong&gt;food&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Cooking Everywhere
&lt;/h3&gt;

&lt;p&gt;Imagine you’ve cooked a delicious meal, let’s say jollof rice and chicken. Your dish is perfect, but when you ask a friend to cook it in their kitchen, things fall apart.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They don’t have the exact spices you used.&lt;/li&gt;
&lt;li&gt;Their stove cooks faster than yours.&lt;/li&gt;
&lt;li&gt;They bought a different type of rice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Their version of your meal doesn’t taste the same.&lt;/p&gt;

&lt;p&gt;This is exactly what happens in software development. An application runs beautifully on a developer's laptop but crashes when deployed to a server because the environment is slightly different: different operating system versions, missing libraries, or incompatible dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Docker: The Universal Lunchbox System
&lt;/h3&gt;

&lt;p&gt;Now imagine a different approach. Instead of sharing just the recipe and hoping for the best, you decide to pack complete, ready-to-eat meals in specialized lunchboxes.&lt;/p&gt;

&lt;p&gt;Here's how the Docker analogy maps out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your Application = The Perfectly Cooked Meal&lt;/strong&gt;&lt;br&gt;
Your software is the delicious jollof rice and chicken, the main attraction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependencies = All the Ingredients + Cooking Method&lt;/strong&gt;&lt;br&gt;
Your app needs specific libraries, frameworks, and configurations to work properly, just like your meal needed exact spices and cooking techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container = The Magic Lunchbox&lt;/strong&gt;&lt;br&gt;
This isn't just any container; it's a special lunchbox that preserves everything perfectly. Once your meal is inside, it doesn't matter if someone opens it in a cold office, a hot car, or a humid cafeteria. The food tastes exactly the same.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Image = The Master Template&lt;/strong&gt;&lt;br&gt;
Think of this as the blueprint of a pre-cooked dish that can be duplicated endlessly. Once the recipe has been followed, you get a finished meal that can be packed into lunchboxes over and over again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dockerfile = The Recipe Card&lt;/strong&gt;&lt;br&gt;
But where did that template come from? The Dockerfile is like the recipe card itself; it lists the exact steps, spices, and cooking instructions needed to prepare the master dish. Without it, you’d have no consistent way to recreate the meal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Engine = The Reliable Delivery Service&lt;/strong&gt;&lt;br&gt;
Docker is like the delivery system that ensures these lunchboxes can be transported anywhere and opened reliably, maintaining the meal's quality and consistency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Changes Everything
&lt;/h3&gt;

&lt;p&gt;Docker solves the classic “It works on my machine” problem. With Docker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers can package their applications with every single dependency into one neat container&lt;/li&gt;
&lt;li&gt;DevOps teams can deploy those containers anywhere: cloud servers, local machines, or data centres, without worrying about environment differences&lt;/li&gt;
&lt;li&gt;Scaling becomes as simple as ordering more lunchboxes instead of setting up entirely new kitchens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Real-World Impact
&lt;/h3&gt;

&lt;p&gt;Just like how meal delivery services revolutionised food by ensuring consistent quality regardless of location, Docker has revolutionised software deployment. Companies can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy applications across different environments with confidence&lt;/li&gt;
&lt;li&gt;Scale services up or down instantly&lt;/li&gt;
&lt;li&gt;Ensure every user gets the exact same experience&lt;/li&gt;
&lt;li&gt;Drastically reduce deployment errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrapping It Up
&lt;/h3&gt;

&lt;p&gt;Next time someone mentions Docker, don't think about complex technical infrastructure. Instead, picture a sophisticated lunchbox delivery system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every meal (application) is perfectly packaged with its ingredients (dependencies).&lt;/li&gt;
&lt;li&gt;Every lunchbox (container) delivers identical flavour no matter where it's opened.&lt;/li&gt;
&lt;li&gt;The delivery service (Docker) guarantees consistency across all locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker has essentially solved the software equivalent of ensuring your grandmother's secret recipe tastes exactly the same whether it's cooked in Lagos, London, or Kafanchan.&lt;/p&gt;

&lt;p&gt;So the next time you hear developers talking about "containerization," just smile and think about perfectly packed lunchboxes traveling the world, delivering consistent, delicious experiences wherever they land.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building an End-to-End Data Pipeline for NYC Citi Bike 2024: A DE Zoomcamp Journey</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 12 Apr 2025 22:09:53 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-end-to-end-data-pipeline-for-nyc-citi-bike-2024-a-de-zoomcamp-journey-47o</link>
      <guid>https://forem.com/ccinaza/building-an-end-to-end-data-pipeline-for-nyc-citi-bike-2024-a-de-zoomcamp-journey-47o</guid>
      <description>&lt;p&gt;As part of my final project for the DE Zoomcamp 2025 cohort by DataTalksClub, I set out to build an end-to-end batch data pipeline to process and analyze over 1.1 million Citi Bike trips from NYC’s 2024 dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal?&lt;/strong&gt; Uncover trends in urban mobility and rider behavior to support bike-sharing operations and inform city planning. &lt;/p&gt;

&lt;p&gt;In this blog post, I’ll walk you through the project’s motivation, architecture, implementation, insights, challenges, and what I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Citi Bike Data?
&lt;/h2&gt;

&lt;p&gt;Citi Bike, New York City’s bike-sharing system, generates millions of trip records each month, capturing detailed data on rides, stations, and user behavior. This data is a goldmine for understanding urban mobility patterns, but without an automated pipeline, processing these large datasets and extracting actionable insights is a daunting task. My project aimed to solve this by building a scalable batch pipeline to ingest, store, transform, and visualize Citi Bike’s 2024 trip data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Questions to Answer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How does Citi Bike usage vary over time, and what seasonal patterns emerge in 2024?&lt;/li&gt;
&lt;li&gt;What’s the breakdown of rides between electric and classic bikes, and how does this impact operations?&lt;/li&gt;
&lt;li&gt;How do ride patterns differ between member and casual users?&lt;/li&gt;
&lt;li&gt;Which stations are the most popular starting points, and what does this suggest about urban mobility in NYC?&lt;/li&gt;
&lt;li&gt;What’s the average trip duration, and how can this inform bike maintenance or rebalancing strategies?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The NYC Bike Rides Pipeline is a batch data pipeline that processes Citi Bike 2024 trip data, stores it in a data lake, transforms it in a data warehouse, and visualizes key metrics through an interactive dashboard. Built entirely on Google Cloud Platform (GCP), the pipeline leverages modern data engineering tools to automate the process and deliver insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud: GCP (Google Cloud Storage for the data lake, BigQuery for the data warehouse)&lt;/li&gt;
&lt;li&gt;Infrastructure as Code (IaC): Terraform to provision infrastructure&lt;/li&gt;
&lt;li&gt;Orchestration: Apache Airflow (running locally via Docker)&lt;/li&gt;
&lt;li&gt;Data Transformations: dbt Cloud for modeling and transformations&lt;/li&gt;
&lt;li&gt;Visualization: Looker Studio for the dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pipeline Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline follows a batch processing workflow, orchestrated end-to-end using Apache Airflow. Here’s a breakdown of each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Ingestion:&lt;/strong&gt;&lt;br&gt;
I sourced the Citi Bike 2024 trip data (CSV files) from a public S3 bucket. The data is downloaded and uploaded to a Google Cloud Storage (GCS) bucket named &lt;code&gt;naza_nyc_bike_rides&lt;/code&gt;, which serves as the data lake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration:&lt;/strong&gt;&lt;br&gt;
Apache Airflow, running locally in a Docker container, orchestrates the pipeline. A custom DAG (&lt;code&gt;etl.py&lt;/code&gt;) manages the monthly ingestion, loading, and transformation steps, ensuring the pipeline runs smoothly for each month of 2024.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse:&lt;/strong&gt;&lt;br&gt;
Raw data is loaded from GCS into Google BigQuery’s staging dataset (&lt;code&gt;nyc_bikes_staging&lt;/code&gt;). To optimize query performance, I partitioned the tables by the &lt;code&gt;started_at&lt;/code&gt; timestamp and clustered them by &lt;code&gt;start_station_id&lt;/code&gt;. After transformation, the data is loaded into the production dataset (&lt;code&gt;nyc_bikes_prod&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformations:&lt;/strong&gt;&lt;br&gt;
Using dbt Cloud, I transformed the raw data into a production-ready dataset. This involved cleaning (e.g., removing records with null values in critical fields like &lt;code&gt;started_at&lt;/code&gt; and &lt;code&gt;rideable_type&lt;/code&gt;), aggregating (e.g., rides by month), and modeling the data for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualization:&lt;/strong&gt;&lt;br&gt;
The final dataset in &lt;code&gt;nyc_bikes_prod&lt;/code&gt; is connected to Looker Studio, where I built an interactive dashboard to visualize key metrics like total rides, rideable type breakdown, and top start stations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure as Code (IaC):&lt;/strong&gt;&lt;br&gt;
I used Terraform to provision the GCS bucket (&lt;code&gt;naza_nyc_bike_rides&lt;/code&gt;) and BigQuery datasets (&lt;code&gt;nyc_bikes_staging&lt;/code&gt; and &lt;code&gt;nyc_bikes_prod&lt;/code&gt;), ensuring reproducibility and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
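&lt;p&gt;The partitioning and clustering from step 3 correspond to DDL along these lines. A sketch in Python for illustration; the column list beyond &lt;code&gt;started_at&lt;/code&gt; and &lt;code&gt;start_station_id&lt;/code&gt; is assumed, not taken from the project:&lt;/p&gt;

```python
def staging_table_ddl(table, partition_col, cluster_col, columns):
    """Render CREATE TABLE DDL for a partitioned, clustered BigQuery table."""
    col_defs = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {col_defs}\n)\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {cluster_col}"
    )

print(staging_table_ddl(
    "nyc_bikes_staging.rides",
    "started_at",
    "start_station_id",
    [("ride_id", "STRING"),
     ("started_at", "TIMESTAMP"),
     ("start_station_id", "STRING")],
))
```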

&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xtuoql60oq86k5j81vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xtuoql60oq86k5j81vd.png" alt="Architecture diagram" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setting Up the Infrastructure with Terraform&lt;/strong&gt;&lt;br&gt;
I started by defining the infrastructure in Terraform, which allowed me to provision the GCS bucket and BigQuery datasets as code. The terraform/ directory contains &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;, where I specified the &lt;code&gt;GCP project ID&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;, and resource names. Running terraform init and terraform apply set up the infrastructure in minutes, ensuring consistency across environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestrating with Apache Airflow&lt;/strong&gt;&lt;br&gt;
Airflow was the backbone of the pipeline, automating the ingestion, loading, and transformation steps. I containerized Airflow using Docker (&lt;code&gt;docker-compose.yml&lt;/code&gt;) and configured it with environment variables in a &lt;code&gt;.env&lt;/code&gt; file (e.g., GCP project ID, bucket name). The DAG (&lt;code&gt;etl.py&lt;/code&gt;) in the &lt;code&gt;dags/&lt;/code&gt; folder handles the monthly batch processing, downloading CSV files, uploading them to GCS, and loading them into BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
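&lt;p&gt;As a rough illustration, the resources described above can be expressed like this in HCL (a sketch only: the variable names are my assumptions here, and the actual &lt;code&gt;main.tf&lt;/code&gt; in the repo may differ):&lt;/p&gt;

```hcl
# Sketch of the provisioned resources; variable names are illustrative.
resource "google_storage_bucket" "data_lake" {
  name          = "naza_nyc_bike_rides"
  location      = var.region
  force_destroy = true
}

resource "google_bigquery_dataset" "staging" {
  dataset_id = "nyc_bikes_staging"
  project    = var.project_id
  location   = var.region
}

resource "google_bigquery_dataset" "prod" {
  dataset_id = "nyc_bikes_prod"
  project    = var.project_id
  location   = var.region
}
```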

&lt;p&gt;One challenge I faced was initially attempting to use Google Cloud Composer for Airflow, but setup issues led me to switch to a local Airflow instance via Docker. This turned out to be a blessing in disguise, as it gave me more control over the environment and simplified debugging.&lt;/p&gt;
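&lt;p&gt;To make the monthly batching concrete, here is a standalone Python sketch of how each month’s source URL, GCS object, and BigQuery table name could be derived. The naming scheme and base URL are illustrative assumptions, not the actual &lt;code&gt;etl.py&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical URL and naming scheme for the monthly batches; the real
# DAG may use different paths and the official Citi Bike data layout.
BASE_URL = "https://example.com/citibike"  # placeholder, not the real source

def monthly_batch(year: int, month: int) -> dict:
    """Build the source URL, GCS object, and BigQuery table for one month."""
    stamp = f"{year}{month:02d}"
    return {
        "source_url": f"{BASE_URL}/{stamp}-citibike-tripdata.csv",
        "gcs_object": f"raw/{year}/{stamp}-citibike-tripdata.csv",
        "bq_table": f"nyc_bikes_staging.rides_{stamp}",
    }

# One batch per month of 2024:
batches = [monthly_batch(2024, m) for m in range(1, 13)]
```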

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transforming Data with dbt Cloud&lt;/strong&gt;&lt;br&gt;
With the raw data in BigQuery’s &lt;code&gt;nyc_bikes_staging&lt;/code&gt; dataset, I used dbt Cloud to transform it into a production-ready dataset (&lt;code&gt;nyc_bikes_prod&lt;/code&gt;). My dbt models handled:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning: Removing records with null values in critical fields.&lt;/li&gt;
&lt;li&gt;Aggregation: Calculating monthly ride counts, average trip durations, and user type breakdowns.&lt;/li&gt;
&lt;li&gt;Modeling: Creating tables optimized for analysis (e.g., rides by station, rideable type).&lt;/li&gt;
&lt;/ul&gt;
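&lt;p&gt;The dbt models themselves are SQL, but the cleaning and aggregation logic they implement can be sketched in plain Python for intuition (field names follow the post; the data here is a toy sample):&lt;/p&gt;

```python
from collections import Counter

# Toy ride records mirroring the fields mentioned in the post.
rides = [
    {"started_at": "2024-01-05T08:00:00", "rideable_type": "classic_bike"},
    {"started_at": "2024-01-09T17:30:00", "rideable_type": "electric_bike"},
    {"started_at": None, "rideable_type": "classic_bike"},   # dropped: null started_at
    {"started_at": "2024-02-14T12:00:00", "rideable_type": None},  # dropped: null type
]

# Cleaning: drop records with nulls in critical fields.
clean = [r for r in rides if r["started_at"] and r["rideable_type"]]

# Aggregation: rides per month (YYYY-MM) and per rideable type.
rides_by_month = Counter(r["started_at"][:7] for r in clean)
rides_by_type = Counter(r["rideable_type"] for r in clean)
```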

&lt;p&gt;Running the dbt models was straightforward, and the resulting dataset was ready for visualization.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualizing Insights in Looker Studio&lt;/strong&gt;&lt;br&gt;
I connected Looker Studio to the &lt;code&gt;nyc_bikes_prod&lt;/code&gt; dataset and built a dashboard titled &lt;em&gt;NYC Bike Usage Insights (2024)&lt;/em&gt;. You can explore the dashboard here: &lt;a href="https://lookerstudio.google.com/s/m9Oe3kIFNYQ" rel="noopener noreferrer"&gt;Looker Studio link&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Airflow Setup with Cloud Composer&lt;/strong&gt;&lt;br&gt;
Initially, I planned to use Google Cloud Composer for Airflow, but I ran into setup issues, including dependency conflicts and longer-than-expected provisioning times. I pivoted to running Airflow locally with Docker, which gave me more control and faster iteration cycles. This taught me the importance of flexibility when working with cloud-managed services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Handling Large Datasets in BigQuery&lt;/strong&gt;&lt;br&gt;
The Citi Bike dataset, with over 1.1 million records, required careful optimization in BigQuery. I fine-tuned performance by partitioning tables by the &lt;code&gt;started_at&lt;/code&gt; timestamp and clustering by &lt;code&gt;start_station_id&lt;/code&gt;, which significantly reduced query costs and improved performance for downstream analyses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Data Quality Issues&lt;/strong&gt;&lt;br&gt;
Some records had null values in critical fields like &lt;code&gt;started_at&lt;/code&gt;, &lt;code&gt;rideable_type&lt;/code&gt;, and &lt;code&gt;start_station_name&lt;/code&gt;. I addressed this in dbt by filtering out these records during transformation, ensuring the dashboard reflected accurate insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;While I’ve worked with tools like BigQuery, Airflow, and Terraform before, this project deepened my understanding of how to apply them in a real-world context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Integration:&lt;/strong&gt; I gained a deeper appreciation for integrating a full suite of tools (GCP, Airflow, dbt, Terraform) into a cohesive, scalable pipeline. Each tool has its strengths, and orchestrating them effectively is key to a successful project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimization:&lt;/strong&gt; Fine-tuning BigQuery for large datasets with partitioning and clustering was a great exercise in balancing cost and performance, especially for a dataset of this scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practices in Orchestration and IaC:&lt;/strong&gt; I refined my approach to orchestrating complex workflows with Airflow and provisioning infrastructure with Terraform, focusing on modularity and reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storytelling:&lt;/strong&gt; The project reinforced the power of data storytelling—turning raw trip data into actionable insights about urban mobility highlighted the importance of a solid pipeline as the foundation for impactful visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Improvements
&lt;/h3&gt;

&lt;p&gt;This project is a strong foundation, but there are several ways to take it further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Pipeline:&lt;/strong&gt; Implement a CI/CD pipeline using GitHub Actions to automate testing and deployment of the DAGs and Terraform configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Data:&lt;/strong&gt; Explore streaming ingestion for real-time Citi Bike data, enabling live dashboards and more timely insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Analytics:&lt;/strong&gt; Add predictive models (e.g., bike demand forecasting) or anomaly detection (e.g., unusual station usage patterns) to provide deeper insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building the NYC Bike Rides Pipeline for DE Zoomcamp 2025 was an incredible learning experience. It allowed me to apply my data engineering skills to a real-world problem, from ingestion to visualization, while uncovering meaningful insights about urban mobility in NYC. The project also highlighted the importance of automation, optimization, and storytelling in data engineering.&lt;/p&gt;

&lt;p&gt;You can explore the full project on &lt;a href="https://github.com/Ccinaza/nyc_citibike_pipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;br&gt;
I’d love to hear your feedback or ideas for collaboration—feel free to reach out on &lt;a href="https://www.linkedin.com/in/blessingangus/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or email me at &lt;a href="mailto:blangus.c@gmail.com"&gt;blangus.c@gmail.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A huge thank you to Alexey Grigorev, the DataTalksClub team, and the DE Zoomcamp community for an amazing program that pushed me to grow as a data engineer. &lt;/p&gt;

&lt;p&gt;Here’s to more data adventures! 🚀&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>elt</category>
      <category>zoomcamp</category>
      <category>dbt</category>
    </item>
    <item>
&lt;title&gt;Building an End-to-End ELT Pipeline: PostgreSQL → BigQuery → Metabase&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Thu, 27 Mar 2025 23:47:32 +0000</pubDate>
      <link>https://forem.com/ccinaza/building-an-end-to-end-etl-pipeline-postgresql-bigquery-metabase-86e</link>
      <guid>https://forem.com/ccinaza/building-an-end-to-end-etl-pipeline-postgresql-bigquery-metabase-86e</guid>
      <description>&lt;p&gt;ETL/ELT projects are never just about moving data—they’re about designing efficient, scalable, and maintainable pipelines. In this post, I’ll walk through how I built a full ELT process using PostgreSQL, Airflow, BigQuery, and dbt, along with lessons I learned along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data Source: Brazilian e-commerce dataset (CSV files)&lt;/li&gt;
&lt;li&gt;Data Storage: PostgreSQL as the staging database&lt;/li&gt;
&lt;li&gt;Orchestration: Apache Airflow to automate the ETL workflow&lt;/li&gt;
&lt;li&gt;Data Warehousing: Google Cloud Storage, Google BigQuery&lt;/li&gt;
&lt;li&gt;Transformation: dbt (Data Build Tool) for modeling&lt;/li&gt;
&lt;li&gt;Visualization: Metabase for dashboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Data Ingestion
&lt;/h3&gt;

&lt;p&gt;I downloaded the CSV files from Kaggle, then used Airflow to automate loading the dataset into PostgreSQL. The biggest challenge? Handling the geolocation dataset (1M+ rows!) without performance issues.&lt;/p&gt;
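&lt;p&gt;One approach that helps with files this size is loading in fixed-size chunks instead of reading everything into memory at once. A minimal sketch (the CSV here is a toy stand-in, and the actual insert into PostgreSQL is left as a comment):&lt;/p&gt;

```python
import csv
import io

def chunked(rows, size):
    """Yield lists of at most `size` rows so the full file never sits in memory."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# A tiny in-memory CSV standing in for the 1M+ row geolocation file.
data = io.StringIO("zip,lat,lng\n01037,-23.5,-46.6\n01046,-23.5,-46.6\n01041,-23.5,-46.6\n")

loaded = 0
for batch in chunked(csv.DictReader(data), size=2):
    # In the real pipeline this would be a COPY/INSERT into PostgreSQL.
    loaded += len(batch)
```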

&lt;h3&gt;
  
  
  Step 2: Orchestrating with Airflow
&lt;/h3&gt;

&lt;p&gt;I created DAGs to move data from PostgreSQL to BigQuery using Airflow operators (&lt;code&gt;PostgresToGCS&lt;/code&gt; → &lt;code&gt;GCSToBigQuery&lt;/code&gt;). This setup ensured automated and reliable data movement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Transforming with dbt
&lt;/h3&gt;

&lt;p&gt;My dbt models were structured into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging – Cleaning raw data&lt;/li&gt;
&lt;li&gt;Intermediate – Joining and reshaping&lt;/li&gt;
&lt;li&gt;Marts – Final tables for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I added basic tests to ensure data integrity (e.g., uniqueness checks).&lt;/p&gt;
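&lt;p&gt;Those tests live in a dbt &lt;code&gt;schema.yml&lt;/code&gt;. A minimal sketch, with a hypothetical model and column name:&lt;/p&gt;

```yaml
# schema.yml sketch; model and column names here are invented for illustration.
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```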

&lt;h3&gt;
  
  
  Step 4: Visualization in Metabase
&lt;/h3&gt;

&lt;p&gt;After loading transformed data into BigQuery, I built a Metabase dashboard for insights. This was where the pipeline came to life!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc9cmsgzyvlnr0m5hof7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjc9cmsgzyvlnr0m5hof7.png" alt="Ecommerce Sales Insights" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;This project was an eye-opener in many ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration is More Than Just Running Tasks:&lt;/strong&gt; First, working with Airflow tested my patience. Debugging DAG failures—especially silent ones—was frustrating but rewarding when I finally got everything running smoothly. I learned to check logs meticulously and not just rely on UI-based error messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dealing with Large Datasets is Tricky:&lt;/strong&gt; I hit a roadblock when trying to ingest the geolocation dataset into PostgreSQL because it had over a million rows. It made me realize that blindly loading data without thinking about performance can be a nightmare. Next time, I’d explore partitioning, indexing, or even alternative storage formats like Parquet to improve performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt is a Game-Changer for Data Transformations:&lt;/strong&gt; Before this project, I hadn’t fully grasped why dbt is so popular. Now, I see the beauty in modular transformations, clear model dependencies, and automated testing. Writing SQL in a structured, scalable way is a game-changer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization is the Final Mile:&lt;/strong&gt; Building the dashboard in Metabase made me appreciate how important it is to structure data properly for end users. Raw data alone means nothing until it’s turned into insights people can understand. Simplicity is key—too many metrics can overwhelm users instead of helping them make decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation Matters:&lt;/strong&gt; I’ll be honest, writing documentation felt like an afterthought at first. But looking back, having clear, step-by-step documentation makes this project reusable for anyone (including my future self!).&lt;/p&gt;

&lt;p&gt;Overall, this project reinforced my problem-solving skills, patience, and ability to adapt to different tools. Next time, I’d plan error-handling mechanisms upfront instead of treating them as an afterthought. But most importantly, I’ve realized that no data pipeline is truly “done”—there’s always room for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to See the project?&lt;/strong&gt;&lt;br&gt;
I documented the entire process with step-by-step instructions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check it out here: &lt;a href="https://github.com/Ccinaza/ecommerce_elt_pipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to reach out if you have any questions or just want to connect! &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/blessingangus/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Data Ingestion with dlt - Week 3 Bonus</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 16 Feb 2025 22:31:21 +0000</pubDate>
      <link>https://forem.com/ccinaza/data-ingestion-with-dlt-week-3-bonus-25ok</link>
      <guid>https://forem.com/ccinaza/data-ingestion-with-dlt-week-3-bonus-25ok</guid>
      <description>&lt;h2&gt;
  
  
  Data Doesn’t Just Appear—Engineers Make It Happen!
&lt;/h2&gt;

&lt;p&gt;Have you ever opened a dataset and thought, “Wow, this is so clean and structured”? Well, someone worked really hard to make it that way! Welcome to data ingestion—the first step in any powerful data pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Data Pipelines Matter
&lt;/h3&gt;

&lt;p&gt;A data pipeline is more than just moving data from point A to point B. It ensures that raw, unstructured data becomes something usable, reliable, and insightful.&lt;/p&gt;

&lt;p&gt;Here’s what happens under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract: Fetch data from APIs, databases, and files &lt;/li&gt;
&lt;li&gt;Normalize: Clean and structure messy, inconsistent formats &lt;/li&gt;
&lt;li&gt;Load: Store it in data warehouses/lakes for analysis &lt;/li&gt;
&lt;li&gt;Optimize: Use incremental loading to refresh data efficiently&lt;/li&gt;
&lt;/ul&gt;
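&lt;p&gt;The four steps above can be sketched end to end with nothing but the standard library (SQLite stands in for a real warehouse here; a tool like dlt automates this plumbing plus schema inference for you):&lt;/p&gt;

```python
import sqlite3

# Extract: messy records as they might arrive from an API.
raw = [
    {"User Name": "ada", "Signup-Date": "2025-01-03"},
    {"User Name": "grace", "Signup-Date": "2025-01-07"},
]

# Normalize: consistent snake_case keys.
def normalize_key(k):
    return k.lower().replace(" ", "_").replace("-", "_")

records = [{normalize_key(k): v for k, v in row.items()} for row in raw]

# Load: store in a warehouse (an in-memory SQLite database stands in).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_name TEXT, signup_date TEXT)")
con.executemany("INSERT INTO users VALUES (:user_name, :signup_date)", records)

count = con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```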

&lt;h3&gt;
  
  
  Becoming the Data Magician
&lt;/h3&gt;

&lt;p&gt;During our dlt workshop, we explored how to build scalable, self-maintaining pipelines that handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-time and batch ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated schema detection and normalization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Governance and best practices for high-quality data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway?&lt;/strong&gt; If you want to work in data, mastering ingestion pipelines is a game-changer! Whether you’re dealing with messy JSON, SQL databases, or REST APIs, a strong pipeline ensures that data is always ready when you need it.&lt;/p&gt;

&lt;p&gt;What are your favorite tricks for handling messy data? Drop them in the comments! &lt;/p&gt;

&lt;p&gt;#DataEngineering #DLT #ETL #BigData #Python #DataPipelines&lt;/p&gt;

</description>
      <category>dlt</category>
      <category>etl</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
&lt;title&gt;My DE Zoomcamp Journey: Week 3 – Data Warehousing &amp; BigQuery!&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Wed, 12 Feb 2025 09:27:31 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-3-data-warehousing-bigquery-4hih</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-3-data-warehousing-bigquery-4hih</guid>
      <description>&lt;h2&gt;
  
  
  Diving into the World of OLAP, OLTP, and BigQuery
&lt;/h2&gt;

&lt;p&gt;This week was all about &lt;strong&gt;data at scale&lt;/strong&gt;. We explored &lt;strong&gt;data warehousing, OLAP vs. OLTP, and Google BigQuery&lt;/strong&gt;, diving deep into &lt;strong&gt;costs, best practices, and optimization techniques&lt;/strong&gt;. If you're working with &lt;strong&gt;large datasets&lt;/strong&gt;, these concepts are crucial for &lt;strong&gt;efficient querying, cost savings, and performance optimization&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;OLTP vs. OLAP: Understanding the Difference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before jumping into BigQuery, we first differentiated &lt;strong&gt;OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OLTP&lt;/th&gt;
&lt;th&gt;OLAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs essential business operations in real-time&lt;/td&gt;
&lt;td&gt;Supports decision-making, problem-solving, and analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Short, fast user-initiated updates&lt;/td&gt;
&lt;td&gt;Scheduled, long-running batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Normalized for efficiency&lt;/td&gt;
&lt;td&gt;Denormalized for analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Space requirements&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smaller (historical data archived)&lt;/td&gt;
&lt;td&gt;Larger (aggregates vast amounts of data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup &amp;amp; Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Essential for business continuity&lt;/td&gt;
&lt;td&gt;Can reload data from OLTP if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Users&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clerks, customer-facing staff, online shoppers&lt;/td&gt;
&lt;td&gt;Data analysts, executives, business intelligence teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; OLTP handles real-time transactions (e.g., banking apps, e-commerce), while OLAP is for analytics and reporting (e.g., dashboards, trend analysis).&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;BigQuery: A Serverless Data Warehouse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;BigQuery is a &lt;strong&gt;fully managed, serverless data warehouse&lt;/strong&gt; that &lt;strong&gt;scales automatically&lt;/strong&gt;. That means &lt;strong&gt;no servers to manage, no infra headaches&lt;/strong&gt;, and &lt;strong&gt;built-in high availability&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Separation of compute and storage → Scale independently&lt;/li&gt;
&lt;li&gt;On-demand &amp;amp; flat-rate pricing → Pay per query or reserve capacity&lt;/li&gt;
&lt;li&gt;Built-in ML &amp;amp; geospatial analysis → Train models directly in BQ&lt;/li&gt;
&lt;li&gt;Business intelligence (BI) integration → Connect to Looker, Tableau, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;BigQuery Pricing&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pricing Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-demand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5 per TB processed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flat rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$2,000/month for 100 slots (~400TB equivalent)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; &lt;strong&gt;Always estimate query costs&lt;/strong&gt; before running them to avoid unexpected charges.&lt;/p&gt;
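&lt;p&gt;The on-demand arithmetic is simple enough to sanity-check yourself. A small helper (my own illustration; in practice you would take billed bytes from a dry run, and BigQuery rounds bytes up per table):&lt;/p&gt;

```python
TB = 2**40  # on-demand queries are billed per TiB of data processed

def estimate_cost(bytes_processed: int, price_per_tb: float = 5.0) -> float:
    """Rough on-demand cost estimate at the $5/TB rate quoted above."""
    return bytes_processed / TB * price_per_tb

# A query scanning 250 GiB:
cost = estimate_cost(250 * 2**30)  # ≈ $1.22
```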




&lt;h2&gt;
  
  
  &lt;strong&gt;Partitioning &amp;amp; Clustering in BigQuery&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Optimizing queries in BigQuery requires &lt;strong&gt;smart data organization&lt;/strong&gt;. Two key techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Types:&lt;/strong&gt; Time-unit (daily, hourly, monthly), ingestion time, integer range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit:&lt;/strong&gt; Max 4,000 partitions per table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Filtering on a single column (e.g., date-based queries)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Clustering&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses multiple columns to group data together&lt;/li&gt;
&lt;li&gt;Improves filter and aggregate queries&lt;/li&gt;
&lt;li&gt;Can be applied to: &lt;code&gt;DATE&lt;/code&gt;, &lt;code&gt;BOOL&lt;/code&gt;, &lt;code&gt;INT64&lt;/code&gt;, &lt;code&gt;STRING&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;Best for: Multi-column queries and high-cardinality datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Right Strategy:&lt;/strong&gt;  &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;th&gt;Partitioning&lt;/th&gt;
&lt;th&gt;Clustering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querying multiple columns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filter/aggregate on one column&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High cardinality datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automatic Reclustering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;BigQuery automatically re-clusters tables in the background, maintaining efficient query performance without manual intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;BigQuery Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Reduction Strategies:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid &lt;code&gt;SELECT *&lt;/code&gt; – Query only necessary columns
&lt;/li&gt;
&lt;li&gt;Use partitioned/clustered tables – Faster, cheaper queries
&lt;/li&gt;
&lt;li&gt;Stream inserts with caution – Avoid unnecessary real-time data loads
&lt;/li&gt;
&lt;li&gt;Materialize query results – Store intermediate results instead of recomputing
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Performance Optimization:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Filter on partitioned columns – Use partition filters in WHERE clauses
&lt;/li&gt;
&lt;li&gt;Denormalize data – Reduce expensive JOINs when possible
&lt;/li&gt;
&lt;li&gt;Use nested/repeated columns – Optimize for analytics
&lt;/li&gt;
&lt;li&gt;Optimize JOIN order – Start with the largest table, then smaller ones
&lt;/li&gt;
&lt;li&gt;Avoid JavaScript UDFs – They can slow down queries significantly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Golden Rule:&lt;/strong&gt; Just because you can run a query doesn't mean you should. Optimize before you execute! &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts &amp;amp; Gratitude&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;BigQuery is powerful, but with great power comes great responsibility (and costs!). Learning how to structure, query, and optimize data effectively is essential for any data engineer working with large-scale datasets.&lt;/p&gt;

&lt;p&gt;A huge thank you to &lt;strong&gt;Alexey Grigorev&lt;/strong&gt; and the &lt;strong&gt;DataTalks.Club&lt;/strong&gt; team for making this journey possible! &lt;/p&gt;

&lt;p&gt;On to &lt;strong&gt;Week 4!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;#DataEngineering #BigQuery #DataWarehousing #LearningInPublic #DEZoomcamp #GoogleCloud #DataTalksClub&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My DE Zoomcamp Journey: Week 2 - Exploring Workflow Orchestration with Kestra</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Thu, 06 Feb 2025 21:22:07 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-2-exploring-workflow-orchestration-with-kestra-gap</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-2-exploring-workflow-orchestration-with-kestra-gap</guid>
      <description>&lt;p&gt;This week, I took a deep dive into workflow orchestration with Kestra as part of the Data Engineering Zoomcamp by DataTalks.Club. It’s been an insightful journey, even though it started off a bit rough with me battling malaria. Despite that, I pushed through, and I’m really glad I did!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kestra?
&lt;/h2&gt;

&lt;p&gt;Kestra is an open-source, event-driven orchestration platform designed to simplify building scheduled and event-driven workflows. It uses Infrastructure as Code (IaC) practices, which makes creating reliable workflows as easy as writing a few lines of YAML. Think of it like Airflow but with a different flavor. I’m used to Airflow, but I wanted to follow along with Kestra to add another orchestration tool to my repertoire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hands-On with Kestra
&lt;/h3&gt;

&lt;p&gt;Setting up Kestra was straightforward thanks to Docker Compose. With just a few commands, I had a Kestra server and a Postgres database up and running locally.&lt;/p&gt;

&lt;p&gt;Once up, accessing Kestra’s intuitive UI at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt; made managing workflows a breeze. A huge shoutout to Will Russell and his videos — they were instrumental in helping me navigate through Kestra’s features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building ETL Pipelines
&lt;/h2&gt;

&lt;p&gt;One of the main tasks this week was building ETL pipelines for NYC’s Yellow and Green Taxi data. Here’s what we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracted data from CSV files.&lt;/li&gt;
&lt;li&gt;Loaded it into Postgres and later into Google Cloud Storage (GCS) and BigQuery.&lt;/li&gt;
&lt;li&gt;Explored scheduling and backfilling workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was fascinating to see how easily Kestra could schedule tasks and backfill historical data. The YAML configurations were easy to understand, making the flows straightforward to grasp.&lt;/p&gt;
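&lt;p&gt;For a flavor of those YAML configurations, here is a minimal flow with a schedule trigger. Treat the task and trigger type identifiers as illustrative, since they vary between Kestra versions:&lt;/p&gt;

```yaml
# Illustrative only: type identifiers differ between Kestra versions.
id: taxi_etl
namespace: zoomcamp

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log
    message: Starting the monthly taxi load

triggers:
  - id: monthly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"   # 09:00 on the 1st of each month
```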

&lt;h2&gt;
  
  
  Using dbt for Transformation
&lt;/h2&gt;

&lt;p&gt;We also touched on dbt for transforming data in Postgres and BigQuery. Although it was optional, I gave it a shot to see how Kestra handles dbt models. It’s pretty cool that Kestra can sync dbt models from Git and execute them, streamlining the transformation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking it to the Cloud
&lt;/h2&gt;

&lt;p&gt;Moving the ETL pipelines from a local Postgres database to Google Cloud Platform (GCP) was another highlight. Using GCS as a data lake and BigQuery as a data warehouse felt like a natural progression from local development to scalable cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I’m thrilled to add Kestra to my orchestration toolkit. It’s been a rewarding week, and I’m happy to see how versatile this tool is. &lt;/p&gt;

&lt;p&gt;Looking forward to Week 3 and sharing more of my journey with you all. If you’re also taking the DE Zoomcamp, let’s connect and learn together!&lt;/p&gt;

&lt;p&gt;Check out my code for Week 2 on GitHub and feel free to drop your thoughts in the comments! &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Ccinaza" rel="noopener noreferrer"&gt;
        Ccinaza
      &lt;/a&gt; / &lt;a href="https://github.com/Ccinaza/de-zoomcamp-2025" rel="noopener noreferrer"&gt;
        de-zoomcamp-2025
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;
 

</description>
    </item>
    <item>
&lt;title&gt;My DE Zoomcamp Journey: Week 1 - Diving Into Docker, Terraform, and PostgreSQL&lt;/title&gt;
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 26 Jan 2025 18:42:49 +0000</pubDate>
      <link>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-1-diving-into-docker-terraform-and-postgresql-35hl</link>
      <guid>https://forem.com/ccinaza/my-de-zoomcamp-journey-week-1-diving-into-docker-terraform-and-postgresql-35hl</guid>
      <description>&lt;p&gt;Starting 2025 on a high note, I signed up for the DE Zoomcamp program, and wow, what a week it's been! Here’s what I’ve learned and built during my first week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker and Docker-Compose: Laying the Foundation
&lt;/h3&gt;

&lt;p&gt;I’ve always been curious about Docker, and after this week, I can safely say that I’ve got a solid foundation in it. Docker allows you to create isolated environments for running applications, which is a huge deal when working with different tools in data engineering. &lt;code&gt;docker-compose&lt;/code&gt; was a game-changer, enabling me to spin up multiple services in one go. At first, I was a bit confused about how everything connects, but after some trial and error, it clicked!&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres and pgAdmin
&lt;/h3&gt;

&lt;p&gt;Setting up Postgres locally with Docker was straightforward, and I quickly became familiar with using &lt;code&gt;pgcli&lt;/code&gt; to connect to the database. However, I think my preferred way of interacting with Postgres was through pgAdmin — it just felt more intuitive with its GUI. I could easily navigate through tables and execute SQL queries without having to rely on terminal commands.&lt;/p&gt;

&lt;p&gt;Here’s an example of a simple &lt;code&gt;docker-compose.yml&lt;/code&gt; to run Postgres and pgAdmin:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ekyzh737u6rgqmauzax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ekyzh737u6rgqmauzax.png" alt=" " width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;
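&lt;p&gt;In text form, a minimal compose file along those lines looks roughly like this (credentials are placeholders):&lt;/p&gt;

```yaml
services:
  pgdatabase:
    image: postgres:16
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root        # placeholder; never commit real credentials
      POSTGRES_DB: ny_taxi
    ports:
      - "5432:5432"
    volumes:
      - ./ny_taxi_postgres_data:/var/lib/postgresql/data

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8080:80"
```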

&lt;h3&gt;
  
  
  Ingestion Script Optimization
&lt;/h3&gt;

&lt;p&gt;One of my highlights this week was optimizing my ingestion script. Initially, the script was set up to load one table at a time. I refactored it to load both tables in parallel by breaking out the logic into separate functions. This not only made the process more efficient but also gave me better insight into how to structure reusable code in data pipelines.&lt;/p&gt;
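&lt;p&gt;One way to express that kind of parallel load (not necessarily exactly how my script does it) is with &lt;code&gt;concurrent.futures&lt;/code&gt;, with &lt;code&gt;load_table&lt;/code&gt; standing in for the real per-table ingestion logic:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name: str) -> str:
    """Stand-in for the real ingestion logic (download, chunk, insert)."""
    # ... download CSV, insert into Postgres in chunks ...
    return f"{name}: done"

tables = ["yellow_taxi_trips", "taxi_zone_lookup"]

# Run both table loads concurrently instead of one after the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load_table, tables))
```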

&lt;h3&gt;
  
  
  Terraform and Google Cloud
&lt;/h3&gt;

&lt;p&gt;A major highlight for me was Terraform. Before starting this course, my boss had mentioned Terraform for a project I was working on, but I found the concept pretty overwhelming. When I saw it was part of the curriculum, I was pumped! The tutorials were thorough and easy to follow, and now I’m confident with the basics of setting up Google Cloud Storage and BigQuery using Terraform. It’s definitely something I’ll be using in future projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping Up
&lt;/h3&gt;

&lt;p&gt;Overall, week 1 has been an eye-opener! Docker, PostgreSQL, and Terraform — it’s a lot to digest, but it’s been a rewarding journey so far. The hands-on exercises, especially the Docker-Postgres setup and Terraform configuration, were incredibly valuable. I’m already looking forward to what week 2 has in store so stay tuned for more updates as I continue this journey.&lt;/p&gt;

&lt;p&gt;And hey, you should check out my GitHub Repository to see my project! &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Ccinaza" rel="noopener noreferrer"&gt;
        Ccinaza
      &lt;/a&gt; / &lt;a href="https://github.com/Ccinaza/de-zoomcamp-2025" rel="noopener noreferrer"&gt;
        de-zoomcamp-2025
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
&lt;/div&gt;
 

</description>
    </item>
    <item>
      <title>From Concept to Impact: A Journey Through My Fraud Detection Model</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sat, 21 Dec 2024 10:06:20 +0000</pubDate>
      <link>https://forem.com/ccinaza/from-concept-to-impact-a-journey-through-my-fraud-detection-model-2nc3</link>
      <guid>https://forem.com/ccinaza/from-concept-to-impact-a-journey-through-my-fraud-detection-model-2nc3</guid>
      <description>&lt;p&gt;Fraud detection in financial systems is like finding a needle in a haystack—except the haystack is dynamic, ever-changing, and massive. How do you spot these fraudulent transactions? This was the challenge I set out to tackle: developing a fraud detection model designed not only to identify suspicious activity in a vast ocean of data but to adapt and evolve as new patterns of fraud emerge.&lt;/p&gt;

&lt;p&gt;Here’s the story of how I turned a blank slate into a robust fraud detection system, complete with insights, challenges, and breakthroughs along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Spark: Why This Project?
&lt;/h3&gt;

&lt;p&gt;Imagine millions of transactions flowing every second, and hidden among them are activities that could cost businesses billions. My mission was clear: create a system that detects these anomalies without crying wolf at every shadow. With this in mind, I envisioned a solution powered by synthetic data, innovative feature engineering, and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Playground: Data Generation
&lt;/h3&gt;

&lt;p&gt;Great models require great data, but fraud data is rare. So, I built my own. Using Python’s &lt;code&gt;Faker&lt;/code&gt; and &lt;code&gt;NumPy&lt;/code&gt; libraries, I generated a synthetic dataset of &lt;strong&gt;1,000,000&lt;/strong&gt; transactions designed to mimic real-world patterns. Each transaction carried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transaction IDs&lt;/strong&gt;, unique yet random.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Account IDs and Receiver Account IDs&lt;/strong&gt;, with 20% and 15% uniqueness, respectively, ensuring realistic overlaps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transaction Amounts&lt;/strong&gt;, ranging from micro to mega, distributed to reflect plausible scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timestamps&lt;/strong&gt;, to capture hourly, daily, and seasonal trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Categories like &lt;strong&gt;Account Type&lt;/strong&gt; (Personal or Business), &lt;strong&gt;Payment Type&lt;/strong&gt; (Credit or Debit), and &lt;strong&gt;Transaction Type&lt;/strong&gt; (Bank Transfer, Airtime, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3kmwt2kypkysc4cevv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3kmwt2kypkysc4cevv9.png" alt="data generation 1" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6p6n5ox728cbxbkrfv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6p6n5ox728cbxbkrfv1.png" alt="data generation" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset came alive with personal and business accounts, transactions ranging from tiny purchases to hefty transfers, and diverse transaction types like deposits, airtime purchases, and even sports betting.&lt;/p&gt;
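&lt;p&gt;A condensed sketch of that generation step using NumPy and pandas (the original also used Faker for IDs; the field names, pool sizes, and distributions here are illustrative stand-ins for the real script):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # the real dataset used 1,000,000

# Account pools sized at 20% / 15% of n so IDs repeat realistically.
accounts = [f"ACC{i:05d}" for i in range(int(n * 0.20))]
receivers = [f"RCV{i:05d}" for i in range(int(n * 0.15))]

df = pd.DataFrame({
    "transaction_id": [f"TXN{i:07d}" for i in range(n)],
    "account_id": rng.choice(accounts, size=n),
    "receiver_account_id": rng.choice(receivers, size=n),
    # Log-normal amounts: many small transactions, a few very large ones.
    "amount": np.round(rng.lognormal(mean=9, sigma=2, size=n), 2),
    # Random timestamps spread across a year to capture time-of-day trends.
    "timestamp": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, size=n), unit="s"),
    "account_type": rng.choice(["Personal", "Business"], size=n, p=[0.8, 0.2]),
    "payment_type": rng.choice(["Credit", "Debit"], size=n),
    "transaction_type": rng.choice(
        ["Bank Transfer", "Airtime", "Deposit", "Sports Betting"], size=n),
})
```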

&lt;h3&gt;
  
  
  The Art of Transformation: Feature Engineering
&lt;/h3&gt;

&lt;p&gt;With the data ready, I turned my focus to feature engineering—a detective’s toolkit for uncovering hidden patterns. This is where the real excitement began. I calculated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Account Age&lt;/strong&gt;: How long had each account existed? This helps to spot new accounts behaving oddly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daily Transaction Amount&lt;/strong&gt;: How much money flowed through each account daily?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frequency Metrics&lt;/strong&gt;: Tracking how often an account interacted with specific receivers within short windows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Delta&lt;/strong&gt;: Measuring the gap between consecutive transactions to flag bursts of activity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features would serve as clues, helping the model sniff out suspicious activity. For example, a brand-new account making unusually large transfers was worth investigating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwqq1kfs1cnph5mhz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vvwqq1kfs1cnph5mhz9.png" alt="Feature Engineering" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;
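&lt;p&gt;In pandas, these features reduce to a few groupby operations. A self-contained sketch on a toy frame (the column names are assumptions standing in for my real schema):&lt;/p&gt;

```python
import pandas as pd

# Tiny illustrative frame; the real pipeline ran this over 1M rows.
df = pd.DataFrame({
    "account_id": ["A1", "A1", "A2", "A1"],
    "receiver_account_id": ["R1", "R1", "R2", "R1"],
    "amount": [100.0, 250.0, 40.0, 5_000.0],
    "timestamp": pd.to_datetime([
        "2024-03-01 10:00", "2024-03-01 10:20",
        "2024-03-02 09:00", "2024-03-01 10:25"]),
    "account_created": pd.to_datetime([
        "2024-02-25", "2024-02-25", "2023-01-01", "2024-02-25"]),
})
df = df.sort_values(["account_id", "timestamp"])

# Account age in days at transaction time.
df["account_age_days"] = (df["timestamp"] - df["account_created"]).dt.days
# Total amount moved by the account that day.
df["daily_amount"] = df.groupby(
    ["account_id", df["timestamp"].dt.date])["amount"].transform("sum")
# Running count of how often this sender hit the same receiver.
df["pair_frequency"] = df.groupby(
    ["account_id", "receiver_account_id"]).cumcount() + 1
# Gap since the account's previous transaction, in seconds.
df["time_delta_s"] = df.groupby("account_id")["timestamp"].diff().dt.total_seconds()
```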

&lt;p&gt;Drawing from domain knowledge, I crafted rules to classify transactions as suspicious. These rules acted as a watchful guardian over the dataset. Here are a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Big Spender Alert&lt;/strong&gt;: Personal accounts transferring over 5 million in a single transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Fire Transactions&lt;/strong&gt;: More than three transactions to the same account in an hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Midnight Madness&lt;/strong&gt;: Large bank transfers during late-night hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I coded these rules into a function that flagged transactions as suspicious or safe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgcqqyv6f6e6kwn8mlws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgcqqyv6f6e6kwn8mlws.png" alt="Rules" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
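&lt;p&gt;A hedged re-creation of that flagging function: the 5 million threshold for the first rule comes from the post, while the &lt;code&gt;hourly_pair_count&lt;/code&gt; column name and the late-night amount threshold are assumptions for illustration.&lt;/p&gt;

```python
import pandas as pd

def flag_suspicious(row):
    # Big Spender Alert: personal account moving over 5M in one transaction.
    if row["account_type"] == "Personal" and row["amount"] > 5_000_000:
        return 1
    # Rapid Fire Transactions: more than three transfers to the same
    # receiver within an hour (count precomputed per row here).
    if row["hourly_pair_count"] > 3:
        return 1
    # Midnight Madness: large bank transfers in the late-night hours
    # (the 1M cutoff is an illustrative assumption).
    if (row["transaction_type"] == "Bank Transfer"
            and row["amount"] > 1_000_000 and row["hour"] < 5):
        return 1
    return 0

df = pd.DataFrame({
    "account_type": ["Personal", "Business", "Personal", "Business"],
    "amount": [6_000_000, 200_000, 1_500_000, 50_000],
    "hourly_pair_count": [1, 5, 1, 1],
    "transaction_type": ["Bank Transfer", "Airtime", "Bank Transfer", "Airtime"],
    "hour": [14, 10, 2, 12],
})
df["is_suspicious"] = df.apply(flag_suspicious, axis=1)
```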

&lt;h3&gt;
  
  
  Preparing the Model’s Vocabulary
&lt;/h3&gt;

&lt;p&gt;Before teaching a machine learning model to detect fraud, I needed to make the data comprehensible. Think of it like teaching a new language—the model needed to understand categorical variables like account types or transaction methods as numerical values.&lt;/p&gt;

&lt;p&gt;I achieved this by encoding these categories. For instance, the transaction type (“Bank Transfer,” “Airtime,” etc.) was converted into numerical columns using one-hot encoding, where each unique value became its own column with binary indicators. This ensured the model could process the data without losing the meaning behind categorical features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ls6mk72ogi6s0xrvncr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ls6mk72ogi6s0xrvncr.png" alt="Encoding" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;
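&lt;p&gt;With pandas, that one-hot step is a single call; a minimal sketch (the &lt;code&gt;txn&lt;/code&gt; prefix is illustrative):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"transaction_type": ["Bank Transfer", "Airtime", "Bank Transfer"]})

# Each unique category becomes its own binary indicator column.
encoded = pd.get_dummies(df, columns=["transaction_type"], prefix="txn")
```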

&lt;h3&gt;
  
  
  The Workhorses: Model Development
&lt;/h3&gt;

&lt;p&gt;With a dataset enriched by rules and features, it was time to bring in the big guns: machine learning. I trained several models, each with its unique strengths:&lt;br&gt;
1. &lt;strong&gt;Logistic Regression&lt;/strong&gt;: Reliable, interpretable, and a great starting point.&lt;br&gt;
2. &lt;strong&gt;XGBoost&lt;/strong&gt;: A powerhouse for detecting complex patterns.&lt;/p&gt;

&lt;p&gt;But first, I tackled the class imbalance—fraudulent transactions were far outnumbered by legitimate ones. Using the SMOTE oversampling technique, I balanced the scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before SMOTE:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozlww5myvy2lr99bwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozlww5myvy2lr99bwb.png" alt="Imbalanced data" width="566" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After SMOTE&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6kh7ec1v14v2nydyv40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6kh7ec1v14v2nydyv40.png" alt="After SMOTE" width="566" height="393"&gt;&lt;/a&gt;&lt;/p&gt;
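&lt;p&gt;SMOTE (commonly provided by the imbalanced-learn library) balances the classes by interpolating synthetic points between a minority sample and one of its nearest minority neighbours. Here is a minimal NumPy/scikit-learn sketch of that idea, not the library implementation:&lt;/p&gt;

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a picked point and a random one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    picks = rng.integers(0, len(X_min), n_new)
    neigh = idx[picks, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))             # how far along the segment
    return X_min[picks] + gap * (X_min[neigh] - X_min[picks])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))             # 20 minority rows, 3 features
X_new = smote_minority(X_min, n_new=80)      # synthesise 80 extra rows
```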

&lt;h4&gt;
  
  
  Training and Results
&lt;/h4&gt;

&lt;p&gt;The models were evaluated using metrics like &lt;strong&gt;Precision&lt;/strong&gt;, &lt;strong&gt;Recall&lt;/strong&gt;, and &lt;strong&gt;AUC (Area Under the Curve)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logistic Regression&lt;/strong&gt;: AUC of 0.97, Recall of 92%.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopii8j6f7wt406y8tdew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopii8j6f7wt406y8tdew.png" alt="Logistic Reg results" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;XGBoost&lt;/strong&gt;: AUC of 0.99, Recall of 94%.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzfggatn6u27ywzpmcrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzfggatn6u27ywzpmcrg.png" alt="XGBoost results" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The clear winner? XGBoost, with its ability to capture intricate fraud patterns.&lt;/p&gt;
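&lt;p&gt;The evaluation itself boils down to a few scikit-learn calls. A sketch on toy data (the AUC and recall figures above came from the real dataset; xgboost's &lt;code&gt;XGBClassifier&lt;/code&gt; slots in where &lt;code&gt;LogisticRegression&lt;/code&gt; appears here):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the engineered transaction features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model = LogisticRegression().fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]     # scores needed for AUC
precision = precision_score(y_te, pred)
recall = recall_score(y_te, pred)
auc = roc_auc_score(y_te, proba)
```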

&lt;h3&gt;
  
  
  Smarter Every Day: Feedback Loop Integration
&lt;/h3&gt;

&lt;p&gt;A standout feature of my system was its adaptability. I designed a feedback loop where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flagged transactions were reviewed by a fraud team.&lt;/li&gt;
&lt;li&gt;Their feedback updated the training data.&lt;/li&gt;
&lt;li&gt;Models were retrained periodically to stay sharp against new fraud tactics.&lt;/li&gt;
&lt;/ul&gt;
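&lt;p&gt;The loop above can be sketched in a few lines: reviewed labels come back as feature/label pairs and are folded into the training set before a refit (the data shapes and the helper name here are illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def retrain_with_feedback(model, X_train, y_train, X_fb, y_fb):
    """Append analyst-reviewed transactions to the training data and refit."""
    X_all = np.vstack([X_train, X_fb])
    y_all = np.concatenate([y_train, y_fb])
    return model.fit(X_all, y_all), X_all, y_all

# Fraud-team feedback arriving for 20 flagged transactions.
X_fb = rng.normal(size=(20, 3))
y_fb = (X_fb[:, 0] > 0.5).astype(int)
model, X_train, y_train = retrain_with_feedback(model, X_train, y_train, X_fb, y_fb)
```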

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;After a journey filled with data wrangling, feature engineering, and machine learning, the model was ready for deployment. The XGBoost model, saved as a .pkl file, is now a reliable tool for fraud detection.&lt;/p&gt;
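&lt;p&gt;Persisting and restoring the artifact is a short pickle round-trip. A sketch with a stand-in scikit-learn model (the real &lt;code&gt;.pkl&lt;/code&gt; file held the trained XGBoost classifier):&lt;/p&gt;

```python
import os
import pickle
import tempfile
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model, then round-trip it through a .pkl file.
X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.gettempdir(), "fraud_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```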

&lt;h2&gt;
  
  
  Epilogue: Reflections and Future Directions
&lt;/h2&gt;

&lt;p&gt;Building this fraud detection model taught me the power of combining business knowledge, data science, and machine learning. But the journey doesn’t end here. Fraud evolves, and so must the defenses against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;p&gt;This project was more than a technical exercise. It was a journey in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Designing systems that handle vast amounts of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptability&lt;/strong&gt;: Building models that evolve with feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Bridging the gap between technical teams and domain experts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  In the future, I plan to:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Explore deep learning for anomaly detection.&lt;/li&gt;
&lt;li&gt;Implement real-time monitoring systems.&lt;/li&gt;
&lt;li&gt;Continuously refine rules based on new fraud patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fraud detection isn’t just about numbers—it’s about safeguarding trust. And this project, I hope, is a small but meaningful step in that direction.&lt;/p&gt;

&lt;p&gt;Thank you for reading. Feel free to share your thoughts or questions in the comments.&lt;/p&gt;

</description>
      <category>frauddetection</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Unravelling Linearity: My Journey in Regression Modeling</title>
      <dc:creator>Blessing Angus</dc:creator>
      <pubDate>Sun, 14 Apr 2024 13:38:25 +0000</pubDate>
      <link>https://forem.com/ccinaza/unravelling-linearity-my-journey-in-regression-modeling-4k0e</link>
      <guid>https://forem.com/ccinaza/unravelling-linearity-my-journey-in-regression-modeling-4k0e</guid>
      <description>&lt;p&gt;Imagine a detective board filled with clues – features or independent variables – that might help solve a case (the dependent variable). Linearity states that the relationship between these clues and the outcome we're trying to predict should be linear, like a straight line on a graph. If this assumption isn't met, our model's predictions could be skewed. A curved line, for instance, might suggest a more complex relationship.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17i1gcsm3iadqgkah8li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17i1gcsm3iadqgkah8li.png" alt="Graphs showing Linearity and Non-Linearity" width="688" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Case: Unveiling Linearity in My Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where my detective work began. I wanted to build a regression model that would predict the price of houses using a housing dataset.  To ensure the validity of my model, I needed to verify the linearity assumption between the independent and dependent variables. This involved employing a combination of diagnostic techniques, including visual inspection (residual plot) and statistical tests (Rainbow Test). Let's dive into the code snippets and diagnostic plots to gain a deeper understanding of this validation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imports&lt;/strong&gt;&lt;br&gt;
I imported the following libraries: &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;scipy&lt;/code&gt;, &lt;code&gt;statsmodels&lt;/code&gt;, and &lt;code&gt;sklearn&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu56xckaip77h51qxos7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu56xckaip77h51qxos7.png" alt="Imports" width="627" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleaning and Feature Engineering&lt;/strong&gt;&lt;br&gt;
During data cleaning, I hit a pivotal challenge with one of my predictors: the "location" column had over 800 unique values! One-hot-encoding it directly would create a nightmare – a massive explosion in features due to the curse of dimensionality, which could cripple my model's ability to learn.&lt;/p&gt;

&lt;p&gt;To tackle this, I implemented a &lt;code&gt;group_location&lt;/code&gt; function that grouped infrequent locations into an &lt;code&gt;"Other"&lt;/code&gt; category. This approach condensed the number of categorical values, mitigating the adverse effects of high dimensionality and facilitating smoother model training.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehbn9xm3riwv4np6v0mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehbn9xm3riwv4np6v0mg.png" alt="group_location function" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;
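&lt;p&gt;A minimal sketch of that grouping logic (the &lt;code&gt;min_count&lt;/code&gt; threshold and the location names are illustrative; the real function ran over the housing dataframe's location column):&lt;/p&gt;

```python
import pandas as pd

def group_location(series, min_count=10):
    """Replace locations seen fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")

locations = pd.Series(["Lekki"] * 15 + ["Yaba"] * 12 + ["Epe"] * 2 + ["Badagry"])
grouped = group_location(locations)
```

&lt;p&gt;Frequent locations survive as their own categories, while the long tail collapses into a single &lt;code&gt;"Other"&lt;/code&gt; level that one-hot encoding can handle cheaply.&lt;/p&gt;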

&lt;p&gt;&lt;strong&gt;Fitting The Model - An OLS Model&lt;/strong&gt;&lt;br&gt;
After defining the dependent and independent variables and adding an intercept term, I instantiated and fitted the model. The data was then split for evaluation, predictions were made, and residuals were calculated.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs2s9v9ga3acey957n2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs2s9v9ga3acey957n2q.png" alt="fitting the model" width="781" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Residual Plot&lt;/strong&gt;&lt;br&gt;
The residual plot shows a slight curvature, and the residuals spread further from zero at higher fitted values. The curvature hints at some non-linearity, while the widening spread suggests a potential violation of the homoscedasticity assumption: the variance of the errors may not be constant across the range of predicted prices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkimoh2rwo9rjt9t259dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkimoh2rwo9rjt9t259dq.png" alt="Residual Plot" width="790" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rainbow Test&lt;/strong&gt;&lt;br&gt;
A high p-value (typically &amp;gt; 0.05) suggests that there is no evidence against linearity, meaning the linear model is an appropriate fit for the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51c5lc3iz4e6dhzdxhki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51c5lc3iz4e6dhzdxhki.png" alt="Rainbow Test" width="643" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test statistic &lt;code&gt;(0.9555)&lt;/code&gt; is an F-ratio close to 1, meaning the model fits the central portion of the data roughly as well as it fits the full sample. Still, the test might not be entirely reliable given the pattern observed in the residuals.&lt;/p&gt;

&lt;p&gt;The high p-value &lt;code&gt;(0.9104)&lt;/code&gt; means we fail to reject the null hypothesis of linearity based on the test alone. However, the visual evidence from the residuals suggests further investigation is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
While the rainbow test didn't statistically reject linearity, the residual plot's non-random pattern hints at a potential issue with homoscedasticity.  This calls for further investigation!&lt;/p&gt;

&lt;p&gt;To address this, I plan to &lt;strong&gt;explore data transformations&lt;/strong&gt;, &lt;strong&gt;investigate alternative models&lt;/strong&gt;, and &lt;strong&gt;perform additional diagnostics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do you approach validating assumptions in your regression models? Share your strategies, insights, or questions in the comments below – I'm curious to hear your perspective!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>data</category>
    </item>
  </channel>
</rss>
