<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ng'ang'a Njongo</title>
    <description>The latest articles on Forem by Ng'ang'a Njongo (@nganga_njongo).</description>
    <link>https://forem.com/nganga_njongo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3709656%2F282aac17-7fad-4fdc-b875-7d4344712f9e.jpg</url>
      <title>Forem: Ng'ang'a Njongo</title>
      <link>https://forem.com/nganga_njongo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nganga_njongo"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:42:33 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/etl-vs-elt-which-one-should-you-use-and-why-44le</link>
      <guid>https://forem.com/nganga_njongo/etl-vs-elt-which-one-should-you-use-and-why-44le</guid>
      <description>&lt;p&gt;The data landscape has undergone a huge shift over the last few years. As organizations move from on-premise servers to cloud architectures, the methods used to move and process data have evolved. At the heart of this evolution is the debate between two fundamental data integration strategies: &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; and &lt;strong&gt;ELT (Extract, Load, Transform)&lt;/strong&gt;. While they share the same three core components, the order in which these steps occur completely changes the architecture, cost, and performance of a data pipeline. This article provides a technical comparison to help you decide which approach is right for your modern data stack.&lt;/p&gt;

&lt;h1&gt;
  
  
  Understanding ETL: The Traditional Workhorse
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt;, which stands for &lt;strong&gt;Extract, Transform, and Load&lt;/strong&gt;, is the traditional method of data integration that has dominated the industry since the 1970s. In an ETL architecture, data is extracted from one or more source systems, moved to a separate "staging area" or processing server, transformed into a structured format, and finally loaded into a target data warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ETL Workflow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt;: Data is pulled from various sources, such as relational databases (SQL Server, Oracle), CRM systems (Salesforce), or flat files (CSV, XML).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;: This is the most compute-intensive stage. On a dedicated transformation server, the raw data is cleaned, filtered, deduplicated, and formatted. Complex business logic is applied to ensure the data matches the strict schema of the target warehouse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;: The "clean" and fully transformed data is then loaded into the data warehouse, ready for BI tools and analysts to query.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
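
&lt;p&gt;The three stages above can be sketched as a minimal, self-contained Python pipeline. Everything here is illustrative: the sample records, the deduplication and masking rules, and the in-memory SQLite "warehouse" are stand-ins, not a production design.&lt;/p&gt;

```python
import sqlite3

# Extract: pull raw records from a source (here, a hard-coded sample
# standing in for a database query or API call).
def extract():
    return [
        {"id": 1, "name": "Alice", "email": "alice@example.com", "amount": "120.50"},
        {"id": 2, "name": "Bob", "email": "bob@example.com", "amount": "80.00"},
        {"id": 2, "name": "Bob", "email": "bob@example.com", "amount": "80.00"},  # duplicate
    ]

# Transform: deduplicate, cast types, and mask PII on the processing
# side, before anything reaches the warehouse.
def transform(rows):
    seen, clean = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append((row["id"], "***MASKED***", float(row["amount"])))
    return clean

# Load: write only the transformed, PII-free rows into the warehouse.
def load(rows, conn):
    conn.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```

&lt;p&gt;Note the defining property of ETL: the raw, unmasked records never touch the warehouse, only the output of &lt;code&gt;transform&lt;/code&gt; does.&lt;/p&gt;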

&lt;h2&gt;
  
  
  Why Use ETL?
&lt;/h2&gt;

&lt;p&gt;ETL is highly effective for organizations with strict compliance requirements, such as those in healthcare or finance. Because data is transformed before it reaches the warehouse, sensitive information like Personally Identifiable Information (PII) can be masked or removed entirely during the transformation phase. This ensures that sensitive raw data never enters the storage layer. Furthermore, ETL is ideal for legacy on-premise systems where the target data warehouse lacks the processing power to handle large-scale transformations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Understanding ELT: The Cloud-Native Revolution
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;ELT, or Extract, Load, and Transform&lt;/strong&gt;, is a modern approach that has gained massive popularity with the rise of cloud data warehouses like Snowflake, Google BigQuery, and Amazon Redshift. Unlike ETL, which relies on an external processing server, ELT leverages the massive, horizontally scalable compute power of the data warehouse itself to perform transformations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ELT Workflow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt;: Just like ETL, data is pulled from source systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;: Instead of going to a staging server, the raw data is loaded directly into the target data warehouse. Modern cloud warehouses can ingest vast amounts of raw data (structured, semi-structured, or unstructured) at incredibly high speeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;: Once the raw data is inside the warehouse, it is transformed using SQL or specialized tools. The raw data is often preserved in "bronze" or "staging" tables, while transformed versions are created in "silver" or "gold" tables for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
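
&lt;p&gt;As a minimal sketch of the same letters in ELT order, with an in-memory SQLite database standing in for the cloud warehouse (the table names, "bronze"/"gold" layering, and sample events are illustrative; a real stack would target Snowflake, BigQuery, or Redshift):&lt;/p&gt;

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract + Load: raw records land in a "bronze" staging table untouched.
warehouse.execute("CREATE TABLE bronze_events (user_id INTEGER, event TEXT, amount REAL)")
raw = [(1, "purchase", 120.5), (1, "view", 0.0), (2, "purchase", 80.0)]
warehouse.executemany("INSERT INTO bronze_events VALUES (?, ?, ?)", raw)

# Transform: run inside the warehouse with SQL, materializing a "gold"
# table. The bronze rows are preserved, so this step can be dropped and
# re-run with new business logic at any time.
warehouse.execute("""
    CREATE TABLE gold_revenue AS
    SELECT user_id, SUM(amount) AS revenue
    FROM bronze_events
    WHERE event = 'purchase'
    GROUP BY user_id
""")

print(warehouse.execute("SELECT user_id, revenue FROM gold_revenue ORDER BY user_id").fetchall())
# [(1, 120.5), (2, 80.0)]
```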

&lt;h2&gt;
  
  
  Why Use ELT?
&lt;/h2&gt;

&lt;p&gt;ELT offers unparalleled flexibility and speed. Because the raw data is stored within the warehouse, data scientists and analysts can re-query and re-transform it whenever business requirements change without needing to re-extract it from the source. It is the backbone of the "Modern Data Stack," enabling faster ingestion and better support for Big Data and real-time analytics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Key Differences Between ETL and ELT
&lt;/h1&gt;

&lt;p&gt;While both methods achieve the same end goal—making data available for analysis—the technical trade-offs are significant. The following table summarizes the core differences between these two approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ETL (Extract, Transform, Load)&lt;/th&gt;
&lt;th&gt;ELT (Extract, Load, Transform)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformation Location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate dedicated processing server&lt;/td&gt;
&lt;td&gt;Target data warehouse (Cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Format Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Primarily structured data&lt;/td&gt;
&lt;td&gt;Structured, semi-structured, and unstructured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rigid; requires schema-on-write&lt;/td&gt;
&lt;td&gt;Highly flexible; supports schema-on-read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loading Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slower (waits for transformation)&lt;/td&gt;
&lt;td&gt;Faster (direct ingestion of raw data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited by the staging server's capacity&lt;/td&gt;
&lt;td&gt;Highly scalable via cloud MPP architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High; complex pipelines and server management&lt;/td&gt;
&lt;td&gt;Lower; automated ingestion and SQL-based logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Superior for masking PII before storage&lt;/td&gt;
&lt;td&gt;Requires careful management within the warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High upfront hardware/software costs&lt;/td&gt;
&lt;td&gt;Pay-as-you-go compute and storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  Real-World Use Cases
&lt;/h1&gt;

&lt;p&gt;Choosing between ETL and ELT often depends on the specific industry, data volume, and regulatory environment. Below are common real-world applications for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Healthcare Data Integration&lt;/strong&gt;: Healthcare providers often use ETL to merge patient records from fragmented Electronic Health Record (EHR) systems. Before loading this data into a centralized warehouse for clinical research, ETL pipelines must anonymize patient names and other PII.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Financial Fraud Detection&lt;/strong&gt;: Banks use ETL to process transaction logs from legacy mainframes. By transforming this data in a secure staging area, they can detect suspicious patterns and flag anomalies before the data is archived, ensuring that only verified, high-quality data is used for regulatory reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Legacy System Modernization&lt;/strong&gt;: Organizations still running on-premise ERP systems often lack the cloud infrastructure for ELT. ETL allows them to extract data from these older systems, clean it on a mid-tier server, and load it into a structured reporting database without overwhelming their existing hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELT Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. E-commerce Customer 360&lt;/strong&gt;: Modern retailers like Shopify or Amazon-based sellers use ELT to ingest massive streams of behavioral data (clicks, views, and cart additions). By loading this raw data into BigQuery or Snowflake, they can use tools like &lt;strong&gt;dbt&lt;/strong&gt; to build "Customer 360" profiles that drive real-time product recommendations and personalized marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log Analysis and IoT Monitoring&lt;/strong&gt;: Tech companies and manufacturers deal with millions of log entries and sensor readings per second. ELT allows them to "dump" these logs into a cloud data lake or warehouse immediately. Analysts can then perform transformations on specific subsets of that data only when a security audit or system failure occurs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Marketing Attribution&lt;/strong&gt;: Marketing teams pull data from dozens of disparate APIs, including Google Ads, Facebook, and LinkedIn. ELT is used to ingest all this data in its raw form first. This allows analysts to experiment with different attribution models (first-click, last-click, or linear) by re-transforming the same raw data multiple times.&lt;/p&gt;
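
&lt;p&gt;The re-transformation point can be sketched in a few lines of Python. The touchpoint journey, revenue figure, and the two model implementations are hypothetical, but they show how one set of raw touches supports multiple attribution models without re-extraction:&lt;/p&gt;

```python
# Raw data: the ordered channel touchpoints for one conversion,
# loaded once and kept as-is.
touches = ["google_ads", "facebook", "linkedin"]
revenue = 90.0

# First-click attribution: all credit to the first touchpoint.
def first_click(touches, revenue):
    return {touches[0]: revenue}

# Linear attribution: credit split evenly across every touchpoint.
def linear(touches, revenue):
    share = revenue / len(touches)
    return {t: share for t in touches}

print(first_click(touches, revenue))  # {'google_ads': 90.0}
print(linear(touches, revenue))       # each channel credited 30.0
```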

&lt;h1&gt;
  
  
  The Tooling Landscape
&lt;/h1&gt;

&lt;p&gt;The tools you choose will largely define your architecture. The industry has split into traditional ETL vendors and modern ELT-focused platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  1) Traditional ETL Tools
&lt;/h2&gt;

&lt;p&gt;These tools are designed for complex, server-side transformations and often feature "drag-and-drop" visual interfaces.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Informatica PowerCenter&lt;/strong&gt;: The enterprise standard for decades, known for its robustness and complex workflow management.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Talend (Qlik)&lt;/strong&gt;: An open-source-based platform that provides extensive connectors for both on-premise and cloud systems.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Microsoft SSIS&lt;/strong&gt;: A popular choice for organizations already deep within the Microsoft SQL Server ecosystem.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;IBM InfoSphere DataStage&lt;/strong&gt;: A high-performance ETL tool designed for large-scale enterprise data integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  2) Modern ELT Tools
&lt;/h2&gt;

&lt;p&gt;These tools focus on high-speed ingestion and "in-warehouse" transformation, often using SQL as the primary language.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Fivetran &amp;amp; Airbyte&lt;/strong&gt;: These are the leaders in "automated ingestion." They focus on the E and L of ELT, moving data from hundreds of sources into a warehouse with minimal configuration.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;dbt (data build tool)&lt;/strong&gt;: The industry standard for the &lt;strong&gt;T&lt;/strong&gt; in ELT. It allows data analysts to write transformations in SQL and manage them like software code (version control, testing, and documentation).&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Matillion&lt;/strong&gt;: A cloud-native tool that provides a visual interface for building ELT pipelines specifically for Snowflake, Redshift, and BigQuery.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;AWS Glue &amp;amp; Azure Data Factory&lt;/strong&gt;: These cloud-native services are hybrid; they can perform traditional ETL using Spark or ELT by orchestrating warehouse-native commands.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion: Which One Should You Use?
&lt;/h1&gt;

&lt;p&gt;The decision between ETL and ELT is no longer a simple binary choice, but a strategic one based on your organization's maturity and needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose ETL if&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;• You operate in a highly regulated industry (Finance, Healthcare) and must mask sensitive data before it reaches your storage layer.&lt;/p&gt;

&lt;p&gt;• You are working with legacy on-premise systems that cannot handle the compute load of modern transformations.&lt;/p&gt;

&lt;p&gt;• Your data volumes are relatively small and predictable, and you require highly structured, "clean" data from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose ELT if&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;• You are building a "Modern Data Stack" in the cloud and want to leverage the scalability of Snowflake, BigQuery, or Redshift.&lt;/p&gt;

&lt;p&gt;• You deal with high-volume, high-velocity data (Big Data, IoT, or web logs) that requires fast ingestion.&lt;/p&gt;

&lt;p&gt;• Your team values flexibility and wants to retain raw data for future exploration and re-analysis.&lt;/p&gt;

&lt;p&gt;In 2026, the trend is undeniably toward &lt;strong&gt;ELT&lt;/strong&gt;. The cost of cloud storage has plummeted, while the power of cloud compute has skyrocketed. By moving transformations into the warehouse, organizations can empower their analysts, reduce pipeline maintenance, and build a more agile data-driven culture. However, for those with strict security mandates, the tried-and-true ETL approach remains a vital tool in the data engineer's arsenal.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>elt</category>
      <category>etl</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Connecting Power BI to SQL Databases: A Beginner's Guide</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:21:45 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/connecting-power-bi-to-sql-databases-a-beginners-guide-41cd</link>
      <guid>https://forem.com/nganga_njongo/connecting-power-bi-to-sql-databases-a-beginners-guide-41cd</guid>
      <description>&lt;p&gt;&lt;strong&gt;Power BI&lt;/strong&gt; is one of the most powerful tools for data analysis and business intelligence. It allows users to visualize their data through interactive dashboards and reports, making it easier for companies to track performance, identify trends, and make informed decisions.&lt;/p&gt;

&lt;p&gt;While Power BI can import data from simple files like Excel or CSVs, most professional organizations store their data in &lt;strong&gt;SQL databases&lt;/strong&gt;. These databases are essential for managing large volumes of structured data efficiently, ensuring data integrity, and providing a "single source of truth" for the entire company. By connecting Power BI directly to a SQL database, analysts can work with real-time data and build scalable reporting solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to a Local PostgreSQL Database
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a popular open-source relational database. If you have a local instance of PostgreSQL running on your machine, connecting it to Power BI is a straightforward process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Connection:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Open Power BI Desktop&lt;/strong&gt;: Start by launching the application on your computer.&lt;br&gt;
&lt;strong&gt;2. Select Get Data&lt;/strong&gt;: On the Home ribbon, click the "Get Data" icon.&lt;br&gt;
&lt;strong&gt;3. Choose PostgreSQL Database&lt;/strong&gt;: In the "Get Data" window, select "PostgreSQL" from the list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6n7kjgk9snwivubtrxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6n7kjgk9snwivubtrxu.png" alt="Get Data" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Enter Server Details&lt;/strong&gt;: Enter the server name (in this case, localhost) and the name of the database you want to connect to.&lt;br&gt;
&lt;strong&gt;5. Authentication&lt;/strong&gt;: In the authentication window, choose the Database tab and enter your PostgreSQL username and password.&lt;br&gt;
&lt;strong&gt;6. Load Tables&lt;/strong&gt;: Once connected, the Navigator window will display all available tables. Select the ones you need and click Load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxex82dptxdji20t904n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxex82dptxdji20t904n.png" alt="DB Connections" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7327rfpdfjq7fdbbth9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7327rfpdfjq7fdbbth9.png" alt="Load Data" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to a Cloud Database: Aiven PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Many companies use cloud-managed databases like Aiven for PostgreSQL to handle their production data. Cloud databases offer high availability and security but require a few extra steps to connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Connection:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Log in to Aiven&lt;/strong&gt;: Log in to the Aiven console and select "Create service", as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9wqr44ygo9n5cyskzf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9wqr44ygo9n5cyskzf6.png" alt="Aiven Login" width="800" height="365"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Postgres Configuration&lt;/strong&gt;: Select the PostgreSQL service and, on the configuration page, set the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Service Tier&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Region&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Plan&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Create Service&lt;/strong&gt;: Once the above configuration is done, click "Create Service".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx35viy095v9dzmav09m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx35viy095v9dzmav09m.png" alt="Create Service" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtaining Connection Details
&lt;/h2&gt;

&lt;p&gt;Once you've created your PostgreSQL service on Aiven, gather the following information from your service overview:&lt;br&gt;
• Host&lt;br&gt;
• Port&lt;br&gt;
• Database Name&lt;br&gt;
• Username &amp;amp; Password&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See example below&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcogqrfguqpo8q4elzk1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcogqrfguqpo8q4elzk1z.png" alt="Aiven Connection" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of SSL Certificates
&lt;/h2&gt;

&lt;p&gt;Cloud connections often require an SSL (Secure Sockets Layer) certificate; in practice, modern services use its successor, TLS. SSL/TLS encrypts the data moving between the database and Power BI, preventing unauthorized parties from intercepting sensitive information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To include the certificate in Power BI&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Download the ca.pem file from the Aiven console.&lt;/li&gt;
&lt;li&gt; Open Command Prompt or PowerShell as Administrator. Run the following command to add the certificate to the Root Store:
&lt;code&gt;certutil -addstore -f "Root" &amp;lt;path_to_your_ca.pem_file&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; In Power BI, when prompted for the connection, ensure the "Encrypt connections" option is checked.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fityfn4vc627ogopndvjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fityfn4vc627ogopndvjt.png" alt="Aiven SSL" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36sf06eqpync2vrp6vgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36sf06eqpync2vrp6vgh.png" alt="Install SSL" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the certificate installed, connect from Power BI as before:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get Data → PostgreSQL database&lt;/li&gt;
&lt;li&gt;Enter your Aiven connection details (host, port, database name, username, and password):&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkp6s3puldqcjx5fmbla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkp6s3puldqcjx5fmbla.png" alt="Aiven Host" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;NB:&lt;/em&gt;&lt;/strong&gt; Under Advanced options, you can also set the SSL parameters directly in Power Query (M):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;let
    Source = PostgreSQL.Database(
        "your-service.aivencloud.com:12345",
        "defaultdb",
        [
            CreateNavigationProperties = true,
            SSLMode = "Require",
            UseSSL = true
        ]
    )
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  Why SQL Skills are Vital for Power BI Analysts
&lt;/h2&gt;

&lt;p&gt;While Power BI provides a user-friendly interface for connecting to data, having SQL (Structured Query Language) skills is a game-changer for any data analyst.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Data Retrieval&lt;/strong&gt;: SQL allows you to write custom queries to pull only the specific columns and rows you need, reducing the load on Power BI and improving performance.&lt;br&gt;
• &lt;strong&gt;Data Filtering and Aggregation&lt;/strong&gt;: Instead of bringing millions of rows into Power BI, you can use SQL to aggregate data at the database level.&lt;br&gt;
• &lt;strong&gt;Data Cleaning&lt;/strong&gt;: SQL is efficient at handling "messy" data—renaming columns, handling null values, and formatting dates—before the data even reaches your dashboard.&lt;br&gt;
• &lt;strong&gt;Complex Logic&lt;/strong&gt;: Some business logic is easier to write in SQL than in Power BI’s DAX language, especially when involving complex joins or window functions.&lt;/p&gt;
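
&lt;p&gt;The aggregation point can be sketched with a plain GROUP BY query, run here through Python's built-in sqlite3 module so the example is self-contained (the sales table and its columns are illustrative):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("East", 100.0), ("East", 50.0), ("West", 75.0)])

# Aggregate at the database level: only one summary row per region
# reaches the BI tool, instead of every underlying transaction.
summary = db.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(summary)  # [('East', 150.0), ('West', 75.0)]
```

&lt;p&gt;The same query pasted into Power BI's "Advanced options" SQL statement box would bring back the two summary rows rather than the full table.&lt;/p&gt;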

&lt;p&gt;By mastering both SQL and Power BI, you become a versatile analyst capable of handling the entire data pipeline—from the raw database to the final executive dashboard.&lt;/p&gt;

</description>
      <category>database</category>
      <category>microsoft</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>A Beginner's Guide to SQL Joins and Window Functions</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Sat, 07 Mar 2026 09:37:51 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/a-beginners-guide-to-sql-joins-and-window-functions-45db</link>
      <guid>https://forem.com/nganga_njongo/a-beginners-guide-to-sql-joins-and-window-functions-45db</guid>
      <description>&lt;p&gt;Structured Query Language (SQL) is the backbone of data management, enabling us to interact with and extract meaningful insights from relational databases. Two powerful concepts within SQL that are essential for any data professional are Joins and Window Functions. This article will demystify these concepts, providing clear explanations and practical examples based on a hypothetical e-commerce database.&lt;/p&gt;

&lt;p&gt;Our database consists of four tables:&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Customers:&lt;/strong&gt; customer_id, first_name, last_name, email, phone_number, registration_date, membership_status&lt;br&gt;
• &lt;strong&gt;Inventory:&lt;/strong&gt; product_id, stock_quantity&lt;br&gt;
• &lt;strong&gt;Products:&lt;/strong&gt; product_id, product_name, category, price, supplier, stock_quantity&lt;br&gt;
• &lt;strong&gt;Sales:&lt;/strong&gt; sale_id, customer_id, product_id, quantity_sold, sale_date, total_amount&lt;/p&gt;

&lt;h1&gt;
  
  
  Understanding SQL Joins: Connecting Related Data
&lt;/h1&gt;

&lt;p&gt;In relational databases, data is often spread across multiple tables to ensure efficiency and reduce redundancy. Joins are SQL clauses that combine rows from two or more tables based on a related column between them. They allow us to retrieve a complete picture by linking disparate pieces of information.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. INNER JOIN
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;INNER JOIN&lt;/strong&gt; returns only the rows that have matching values in both tables. It's the most common type of join and is used when you want to see data where a relationship exists in both datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Finding customers who have purchased products with a price greater than 1000.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c.first_name || ' ' || c.last_name AS Cust_Above_1000&lt;br&gt;
FROM&lt;br&gt;
    assignment.Customers c&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Sales s ON c.customer_id = s.customer_id&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Products p ON s.product_id = p.product_id&lt;br&gt;
WHERE&lt;br&gt;
    p.price &amp;gt; 1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query combines Customers, Sales, and Products tables. It links Customers to Sales using customer_id and Sales to Products using product_id. The WHERE clause then filters these combined results to show only customers who bought products priced over 1000. Only customers who have made a sale are included, and only products that have been sold are considered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Joining Sales and Products to calculate total sales for each product.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    p.product_name,&lt;br&gt;
    SUM(s.quantity_sold) AS product_sales&lt;br&gt;
FROM&lt;br&gt;
    assignment.Sales s&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Products p ON s.product_id = p.product_id&lt;br&gt;
GROUP BY&lt;br&gt;
    p.product_name;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; Here, we join Sales and Products to get the product names associated with each sale. We then use SUM(s.quantity_sold) and GROUP BY p.product_name to calculate the total quantity sold for each product. This query effectively shows how many units of each product have been sold.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. LEFT JOIN (or LEFT OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;LEFT JOIN&lt;/strong&gt; returns all rows from the left table, and the matching rows from the right table. If there's no match in the right table, NULL values are returned for columns from the right table. This is useful when you want to include all records from one table, even if they don't have a corresponding record in another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; List all customers and any sales they have made.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c.first_name, c.last_name,&lt;br&gt;
    s.sale_id, s.total_amount&lt;br&gt;
FROM&lt;br&gt;
    assignment.Customers c&lt;br&gt;
LEFT JOIN&lt;br&gt;
    assignment.Sales s ON c.customer_id = s.customer_id;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query would list every customer from the Customers table. If a customer has made sales, their sales details will appear alongside their name. If a customer has not made any sales, their name will still appear, but the sale_id and total_amount columns will show NULL.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. RIGHT JOIN (or RIGHT OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;RIGHT JOIN&lt;/strong&gt; is the inverse of a &lt;strong&gt;LEFT JOIN&lt;/strong&gt;. It returns all rows from the right table, and the matching rows from the left table. If there's no match in the left table, NULL values are returned for columns from the left table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; List all products and any sales made for them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    p.product_name,&lt;br&gt;
    s.sale_id, s.quantity_sold&lt;br&gt;
FROM&lt;br&gt;
    assignment.Sales s&lt;br&gt;
RIGHT JOIN&lt;br&gt;
    assignment.Products p ON s.product_id = p.product_id;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query would list every product from the Products table. If a product has been sold, its sales details will appear. If a product has never been sold, its name will still appear, but sale_id and quantity_sold will be NULL.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. FULL JOIN (or FULL OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;FULL JOIN&lt;/strong&gt; returns all rows when there is a match in either the left or the right table. It essentially combines the results of both &lt;strong&gt;LEFT JOIN&lt;/strong&gt; and &lt;strong&gt;RIGHT JOIN&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Show all customers and all products, linking them by sales where applicable.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c.first_name, c.last_name,&lt;br&gt;
    p.product_name, s.sale_id&lt;br&gt;
FROM&lt;br&gt;
    assignment.Customers c&lt;br&gt;
FULL JOIN&lt;br&gt;
    assignment.Sales s ON c.customer_id = s.customer_id&lt;br&gt;
FULL JOIN&lt;br&gt;
    assignment.Products p ON s.product_id = p.product_id;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query would show all customers, all products, and any sales that connect them. If a customer has no sales, their details will appear with NULL for sales and product information. If a product has no sales, its details will appear with NULL for customer and sales information. If both exist and are linked by a sale, all information will be present.&lt;/p&gt;
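&lt;p&gt;Where FULL JOIN is unavailable (again, SQLite only added it in 3.39), the same result can be emulated by combining a LEFT JOIN with the unmatched rows of the reversed join. A minimal sketch with made-up rows, shown here for a single customers-to-sales join:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (customer_id INTEGER, first_name TEXT);
CREATE TABLE Sales (sale_id INTEGER, customer_id INTEGER);
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Sales VALUES (100, 1), (101, 99);  -- sale 101 has no matching customer
""")

# FULL JOIN emulated as: LEFT JOIN plus the right-side rows with no left match.
rows = con.execute("""
SELECT c.first_name, s.sale_id
FROM Customers c LEFT JOIN Sales s ON c.customer_id = s.customer_id
UNION ALL
SELECT c.first_name, s.sale_id
FROM Sales s LEFT JOIN Customers c ON c.customer_id = s.customer_id
WHERE c.customer_id IS NULL
""").fetchall()

# Bob has no sale, and sale 101 has no customer, yet both rows survive.
print(sorted(rows, key=str))
```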

&lt;h2&gt;
  
  
  5. SELF JOIN
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;SELF JOIN&lt;/strong&gt; is a regular join, but the table is joined with itself. This is useful for comparing rows within the same table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Find all pairs of customers who have the same membership status.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c1.first_name AS customer1_first_name,&lt;br&gt;
    c1.last_name AS customer1_last_name,&lt;br&gt;
    c2.first_name AS customer2_first_name,&lt;br&gt;
    c2.last_name AS customer2_last_name,&lt;br&gt;
    c1.membership_status&lt;br&gt;
FROM&lt;br&gt;
    assignment.Customers c1&lt;br&gt;
JOIN&lt;br&gt;
    assignment.Customers c2 ON c1.membership_status = c2.membership_status&lt;br&gt;
WHERE&lt;br&gt;
    c1.customer_id &amp;gt; c2.customer_id;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; We join the Customers table to itself, aliasing it as c1 and c2. The join condition c1.membership_status = c2.membership_status finds customers with the same membership status. The &lt;strong&gt;WHERE&lt;/strong&gt; c1.customer_id &amp;gt; c2.customer_id clause is crucial to avoid duplicate pairs (e.g., (Alice, Bob) and (Bob, Alice)) and to prevent a customer from being paired with themselves.&lt;/p&gt;
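&lt;p&gt;The deduplication effect of the customer_id comparison is easy to verify with a tiny invented dataset, again using Python's sqlite3:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (customer_id INTEGER, first_name TEXT, membership_status TEXT);
INSERT INTO Customers VALUES (1, 'Alice', 'Gold'), (2, 'Bob', 'Gold'), (3, 'Carol', 'Silver');
""")

# Join the table to itself; the customer_id comparison keeps each pair exactly once
# and stops a customer from being paired with themselves.
rows = con.execute("""
SELECT c1.first_name, c2.first_name, c1.membership_status
FROM Customers c1
JOIN Customers c2 ON c1.membership_status = c2.membership_status
WHERE c1.customer_id > c2.customer_id
""").fetchall()

# Only the (Bob, Alice) Gold pair survives; Carol has no Silver partner.
print(rows)  # [('Bob', 'Alice', 'Gold')]
```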

&lt;h1&gt;
  
  
  Exploring SQL Window Functions: Advanced Analytics
&lt;/h1&gt;

&lt;p&gt;Window Functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions &lt;strong&gt;(SUM, AVG, COUNT)&lt;/strong&gt; which collapse rows into a single summary row, window functions return a value for each row, making them incredibly powerful for analytical tasks like ranking, moving averages, and cumulative sums.&lt;/p&gt;

&lt;p&gt;The key to understanding window functions is the &lt;strong&gt;OVER()&lt;/strong&gt; clause, which defines the window or set of rows on which the function operates. The &lt;strong&gt;OVER()&lt;/strong&gt; clause can include:&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;PARTITION BY:&lt;/strong&gt; Divides the rows into groups or partitions. The window function is applied independently to each partition.&lt;br&gt;
• &lt;strong&gt;ORDER BY:&lt;/strong&gt; Orders the rows within each partition. This is crucial for functions that depend on the order of rows.&lt;/p&gt;

&lt;p&gt;Common Window Functions and Their Uses:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Ranking Functions (ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE())
&lt;/h2&gt;

&lt;p&gt;These functions assign a rank to each row within its partition based on the specified ordering. They are invaluable for identifying top performers, most recent entries, or other ordered subsets of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Rank products by total sales within each category.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    p.category,&lt;br&gt;
    p.product_name,&lt;br&gt;
    SUM(s.total_amount) AS total_sales,&lt;br&gt;
    RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.total_amount) DESC) AS sales_rank&lt;br&gt;
FROM&lt;br&gt;
    assignment.Sales s&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Products p ON s.product_id = p.product_id&lt;br&gt;
GROUP BY&lt;br&gt;
    p.category, p.product_name;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query first groups sales by product and category to get total_sales. Then, &lt;strong&gt;RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.total_amount) DESC)&lt;/strong&gt; assigns a rank to each product within its category based on its total_sales in descending order. Products with the same total sales within a category will receive the same rank.&lt;/p&gt;
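&lt;p&gt;Here is the same partitioned ranking run end to end on a few invented rows (Python's bundled SQLite supports window functions from version 3.25 onward):&lt;/p&gt;

```python
import sqlite3  # window functions need SQLite 3.25+, bundled with modern Python

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Products (product_id INTEGER, product_name TEXT, category TEXT);
CREATE TABLE Sales (sale_id INTEGER, product_id INTEGER, total_amount REAL);
INSERT INTO Products VALUES (1, 'Maize', 'Grain'), (2, 'Wheat', 'Grain'), (3, 'Tea', 'Beverage');
INSERT INTO Sales VALUES (100, 1, 500.0), (101, 2, 800.0), (102, 3, 300.0);
""")

# RANK() restarts at 1 inside each category partition.
rows = con.execute("""
SELECT p.category, p.product_name,
       SUM(s.total_amount) AS total_sales,
       RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.total_amount) DESC) AS sales_rank
FROM Sales s
JOIN Products p ON s.product_id = p.product_id
GROUP BY p.category, p.product_name
ORDER BY p.category, sales_rank
""").fetchall()

print(rows)
# [('Beverage', 'Tea', 300.0, 1), ('Grain', 'Wheat', 800.0, 1), ('Grain', 'Maize', 500.0, 2)]
```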

&lt;h2&gt;
  
  
  2. Aggregate Window Functions (SUM(), AVG(), COUNT(), MIN(), MAX())
&lt;/h2&gt;

&lt;p&gt;These are standard aggregate functions used as window functions. When used with &lt;strong&gt;OVER()&lt;/strong&gt;, they perform their aggregation over the defined window, but instead of collapsing rows, they return the aggregate value for each row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Calculate the running total of sales for each customer over time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c.first_name, c.last_name,&lt;br&gt;
    s.sale_date,&lt;br&gt;
    s.total_amount,&lt;br&gt;
    SUM(s.total_amount) OVER (PARTITION BY c.customer_id ORDER BY s.sale_date) AS running_total_sales&lt;br&gt;
FROM&lt;br&gt;
    assignment.Sales s&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Customers c ON s.customer_id = c.customer_id&lt;br&gt;
ORDER BY&lt;br&gt;
    c.customer_id, s.sale_date;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This query calculates a running_total_sales for each customer. &lt;strong&gt;PARTITION BY&lt;/strong&gt; c.customer_id ensures the sum restarts for each new customer, and &lt;strong&gt;ORDER BY&lt;/strong&gt; s.sale_date makes it a cumulative sum based on the sale date. Each row will show the total amount spent by that customer up to that specific sale date.&lt;/p&gt;
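&lt;p&gt;A compact runnable sketch of the running-total pattern, with invented sales rows (the table is simplified to the columns the window function actually needs):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales (customer_id INTEGER, sale_date TEXT, total_amount REAL);
INSERT INTO Sales VALUES (1, '2026-01-01', 100.0), (1, '2026-01-05', 50.0), (2, '2026-01-02', 70.0);
""")

# With ORDER BY inside OVER(), SUM() becomes cumulative; PARTITION BY resets it per customer.
rows = con.execute("""
SELECT customer_id, sale_date, total_amount,
       SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY sale_date) AS running_total
FROM Sales
ORDER BY customer_id, sale_date
""").fetchall()

print(rows)
# [(1, '2026-01-01', 100.0, 100.0), (1, '2026-01-05', 50.0, 150.0), (2, '2026-01-02', 70.0, 70.0)]
```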

&lt;h2&gt;
  
  
  3. Lag/Lead Functions (LAG(), LEAD())
&lt;/h2&gt;

&lt;p&gt;These functions allow you to access data from a previous &lt;strong&gt;(LAG())&lt;/strong&gt; or subsequent &lt;strong&gt;(LEAD())&lt;/strong&gt; row within the same result set without using a self-join. This is particularly useful for comparing values across rows, such as calculating the difference between consecutive sales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Find the previous sale amount for each customer.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
    c.first_name, c.last_name,&lt;br&gt;
    s.sale_date,&lt;br&gt;
    s.total_amount,&lt;br&gt;
    LAG(s.total_amount, 1, 0) OVER (PARTITION BY c.customer_id ORDER BY s.sale_date) AS previous_sale_amount&lt;br&gt;
FROM&lt;br&gt;
    assignment.Sales s&lt;br&gt;
INNER JOIN&lt;br&gt;
    assignment.Customers c ON s.customer_id = c.customer_id&lt;br&gt;
ORDER BY&lt;br&gt;
    c.customer_id, s.sale_date;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; &lt;strong&gt;LAG(s.total_amount, 1, 0)&lt;/strong&gt; retrieves the total_amount from the previous row within each customer's partition, ordered by sale_date. If there is no previous row (e.g., the first sale), it defaults to 0 (the third argument).&lt;/p&gt;
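&lt;p&gt;The default-value behaviour of LAG() is worth seeing in action. A minimal sketch with invented rows:&lt;/p&gt;

```python
import sqlite3  # LAG()/LEAD() need SQLite 3.25+, bundled with modern Python

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales (customer_id INTEGER, sale_date TEXT, total_amount REAL);
INSERT INTO Sales VALUES (1, '2026-01-01', 100.0), (1, '2026-01-05', 50.0), (2, '2026-01-02', 70.0);
""")

rows = con.execute("""
SELECT customer_id, sale_date, total_amount,
       LAG(total_amount, 1, 0) OVER (PARTITION BY customer_id ORDER BY sale_date) AS previous_sale
FROM Sales
ORDER BY customer_id, sale_date
""").fetchall()

# The first sale in each customer's partition has no previous row, so the
# third argument (0) is used as the default.
print(rows)
```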

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;SQL Joins and Window Functions are indispensable tools for anyone working with relational databases. Joins allow you to combine data from multiple tables, creating a comprehensive view of your information. Window functions, on the other hand, provide a powerful way to perform complex analytical calculations over related sets of rows without aggregating them away. Mastering these concepts will significantly enhance your ability to extract, analyze, and report on data effectively, transforming raw data into actionable insights.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Fri, 13 Feb 2026 17:21:37 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-2k7n</link>
      <guid>https://forem.com/nganga_njongo/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-2k7n</guid>
      <description>&lt;p&gt;In this article we'll use a real-world example of data from farms in Kenya to show you how it's done. You'll learn how to take a messy file, clean it up, use some simple but powerful formulas, and create insightful charts—all using Power BI.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Cleaning Up Your Data
&lt;/h1&gt;

&lt;p&gt;The first and most important step in Power BI is to clean and prepare your data. Power BI has a powerful tool for this called the Power Query Editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Your Data into Power BI
&lt;/h2&gt;

&lt;p&gt;We're using a simple CSV file with data about crops in Kenya. To pull data into Power BI, we simply use the "Get data" feature and select our CSV file. Once it's loaded, Power Query shows you a preview of your data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2iuh2jo9z6igsrg8z1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2iuh2jo9z6igsrg8z1j.png" alt="Get Data" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning the Dataset
&lt;/h2&gt;

&lt;p&gt;Our Kenya Crops dataset has a few common problems that we need to fix:&lt;/p&gt;

&lt;p&gt;• "Error" messages: Some cells just say "Error."  We replace these cells with a blank or label it as "Unknown."&lt;br&gt;
• Empty cells: Some cells in our data are just blank. We fill them with a zero, a label like "Unknown".&lt;br&gt;
• Numbers that are text: We need to make sure our numbers are actually numbers.&lt;br&gt;
• Missing information: We notice that some rows are missing the final profit number, even though they have the revenue and cost. &lt;/p&gt;

&lt;p&gt;By taking the time to clean the data, we make sure our final analysis is accurate and trustworthy. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wj3xmsjw42qazam85tr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wj3xmsjw42qazam85tr.png" alt="Cleaned_Data" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Using DAX (Data Analysis Expressions)
&lt;/h1&gt;

&lt;p&gt;DAX (Data Analysis Expressions) is Power BI's formula language for creating custom measures, calculated columns, and tables for advanced data analysis, manipulation, and modeling. It enables dynamic filtering, complex calculations such as time intelligence (e.g., year-over-year comparisons), and row-level security, turning raw data into actionable insights. &lt;/p&gt;

&lt;p&gt;Let's look at some common DAX functions and how they help us understand our Kenya Crops data better.&lt;/p&gt;

&lt;h2&gt;
  
  
  SUM and AVERAGE
&lt;/h2&gt;

&lt;p&gt;• &lt;strong&gt;SUM()&lt;/strong&gt;: This adds up all the numbers in a chosen column. If you want to know the total revenue from all crops, you'd use SUM() on the 'Revenue (KES)' column.&lt;/p&gt;

&lt;p&gt;◦ Example: &lt;code&gt;Total Revenue = SUM('Kenya Crops'[Revenue (KES)])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4wc7eej0wyiztxy07qp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4wc7eej0wyiztxy07qp.png" alt="SUM" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;AVERAGE()&lt;/strong&gt;: This calculates the average of all numbers in a chosen column. To find the average amount of crop harvested (yield), you'd use AVERAGE() on the 'Yield (Kg)' column.&lt;/p&gt;

&lt;p&gt;◦ Example: &lt;code&gt;Average Yield = AVERAGE('Kenya Crops'[Yield (Kg)])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd8idz3b9htlkdau40sr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd8idz3b9htlkdau40sr.png" alt="AVG" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SUMX and AVERAGEX
&lt;/h2&gt;

&lt;p&gt;Sometimes, you need to do a calculation for each individual row before adding or averaging them up. This is where SUMX() and AVERAGEX() are incredibly powerful. &lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;SUMX(Table, Expression)&lt;/strong&gt;: This function goes through each row of a specified Table, performs a calculation (Expression) for that row, and then adds up all those individual results.&lt;/p&gt;

&lt;p&gt;◦ Example: &lt;code&gt;Total Calculated Profit = SUMX('Kenya Crops', 'Kenya Crops'[Revenue (KES)] - 'Kenya Crops'[Cost of Production (KES)])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4inbwscyoscplawoi1hp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4inbwscyoscplawoi1hp.png" alt="SUMX" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;AVERAGEX(Table, Expression)&lt;/strong&gt;: Similar to SUMX(), this goes through each row, performs a calculation, and then finds the average of those results.&lt;/p&gt;

&lt;p&gt;◦ Example: &lt;code&gt;Average Profit per Acre = AVERAGEX('Kenya Crops', ('Kenya Crops'[Revenue (KES)] - 'Kenya Crops'[Cost of Production (KES)]) / 'Kenya Crops'[Planted Area (Acres)])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkymubmhkhcdm690mdc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkymubmhkhcdm690mdc2.png" alt="AVERAGEX" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;
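&lt;p&gt;DAX only runs inside Power BI, but the row-by-row evaluation that SUMX and AVERAGEX perform can be sketched in plain Python as a mental model. The dictionary rows below are invented stand-ins for the 'Kenya Crops' table:&lt;/p&gt;

```python
# Tiny made-up stand-in for the 'Kenya Crops' table.
kenya_crops = [
    {"revenue": 120000, "cost": 80000, "acres": 2.0},
    {"revenue": 90000, "cost": 50000, "acres": 1.0},
]

# SUMX: evaluate the expression for each row, then add the per-row results.
total_profit = sum(row["revenue"] - row["cost"] for row in kenya_crops)

# AVERAGEX: evaluate the expression for each row, then average the per-row results.
profit_per_acre = [(row["revenue"] - row["cost"]) / row["acres"] for row in kenya_crops]
avg_profit_per_acre = sum(profit_per_acre) / len(profit_per_acre)

print(total_profit)         # 80000
print(avg_profit_per_acre)  # 30000.0
```

&lt;p&gt;Note how AVERAGEX averages the per-row ratios rather than dividing the totals, which is exactly why the iterator functions exist.&lt;/p&gt;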

&lt;h2&gt;
  
  
  CALCULATE
&lt;/h2&gt;

&lt;p&gt;CALCULATE() helps you focus your calculations on specific parts of your data by modifying the filter context of an expression.&lt;/p&gt;

&lt;p&gt;◦ Example: What was the total revenue only from 'Potatoes'?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total Revenue Potatoes = CALCULATE(SUM('Kenya Crops'[Revenue (KES)]), 'Kenya Crops'[Crop Type] = "Potatoes")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq8q5f481bqzloiazvhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq8q5f481bqzloiazvhl.png" alt="CALCULATE" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;◦ Here, CALCULATE tells Power BI to look only at rows where the 'Crop Type' is 'Potatoes', and then SUM the 'Revenue'.&lt;/p&gt;
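&lt;p&gt;As a rough mental model (not runnable DAX), CALCULATE's "filter first, then aggregate" behaviour looks like this in plain Python, with invented crop rows:&lt;/p&gt;

```python
# Invented stand-in rows for the 'Kenya Crops' table.
kenya_crops = [
    {"crop_type": "Potatoes", "revenue": 120000},
    {"crop_type": "Maize", "revenue": 90000},
    {"crop_type": "Potatoes", "revenue": 60000},
]

# CALCULATE(SUM(...), filter): narrow the rows to the filter first, then aggregate.
total_revenue_potatoes = sum(
    row["revenue"] for row in kenya_crops if row["crop_type"] == "Potatoes"
)

print(total_revenue_potatoes)  # 180000
```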

&lt;h2&gt;
  
  
  Joining Text Together: Concatenation with &amp;amp;
&lt;/h2&gt;

&lt;p&gt;Sometimes you want to combine text from different columns. The ampersand (&amp;amp;) symbol lets you do this easily.&lt;/p&gt;

&lt;p&gt;• Example: To create a clear label like "Potatoes - Organic" by combining the 'Crop Type' and 'Crop Variety' columns:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Crop Identifier = 'Kenya Crops'[Crop Type] &amp;amp; " - " &amp;amp; 'Kenya Crops'[Crop Variety]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijy1e9kif5svbykfuy9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijy1e9kif5svbykfuy9k.png" alt="Concatenation" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• This makes our data easier to read and understand at a glance.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Making Dashboards That Tell a Story
&lt;/h1&gt;

&lt;p&gt;After cleaning and calculating, we use Power BI's visualizations to turn the numbers into engaging charts and graphs that anyone can understand. These visuals are what help people make smart decisions quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lknpxocusa6m1jw767t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lknpxocusa6m1jw767t.png" alt="Full_Report" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cards
&lt;/h2&gt;

&lt;p&gt;In Power BI, a card is a type of visual specifically designed to display a single, important data point or a small set of related summary numbers. Its purpose is to provide an immediate, at-a-glance summary of performance, cutting through the complexity of larger charts and tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar and Column Charts
&lt;/h2&gt;

&lt;p&gt;Bar and column charts are fantastic for comparing different things. Column charts usually compare things over time or across different groups, while bar charts are great when you have long names for your categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Line Charts
&lt;/h2&gt;

&lt;p&gt;Line charts are perfect for showing how something changes over a period, like days, months, or years. They connect the dots to reveal patterns and trends.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;From a jumbled spreadsheet to clear, actionable insights—that's the magic a Power BI analyst performs. By carefully cleaning data, using the powerful DAX language to create smart calculations, and then building engaging visuals, analysts transform raw numbers into compelling stories. These stories help businesses understand what's happening, why it's happening, and what they should do next. Our Kenya Crops example shows how these technical skills aren't just about numbers; they're about making a real difference in the world, helping farmers and businesses thrive.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>microsoft</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Understanding Schemas and Data Modelling in Power BI</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Tue, 03 Feb 2026 07:55:59 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/understanding-schemas-and-data-modelling-in-power-bi-3odb</link>
      <guid>https://forem.com/nganga_njongo/understanding-schemas-and-data-modelling-in-power-bi-3odb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The core of any business intelligence solution lies in its data model. The same applies to Power BI. A quality data model allows you to build solid and powerful solutions that don't break. The speed, reliability and power of a solution all stem from a great data model. Let's have a look at some concepts that Power BI data modelers interact with to build models optimized for performance and usability.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Dimension tables describe business entities—the things you model. Entities can include products, people, places, and concepts including time itself. The most consistent table you'll find in a star schema is a date dimension table. A dimension table contains a key column (or columns) that acts as a unique identifier, and other columns. Other columns support filtering and grouping your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Fact Tables
&lt;/h3&gt;

&lt;p&gt;A fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema surrounded by dimension tables. A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Relationships
&lt;/h3&gt;

&lt;p&gt;These links between tables allow Power BI to group, filter, and aggregate data correctly. Example: the Sales (fact) table connects to the Products (dimension) table using Product ID.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Star Schema
&lt;/h3&gt;

&lt;p&gt;The star schema or star model is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgviuvd6xl2a6m7rpty5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgviuvd6xl2a6m7rpty5n.png" alt="Star_Schema" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is an example of a star schema that has a Sales table (fact table) referencing other dimension tables such as Employee, Date, Product etc.&lt;/p&gt;
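&lt;p&gt;The mechanics behind such a schema can be sketched outside Power BI too: the fact table holds measures plus foreign keys, and a join to a dimension supplies the labels used for grouping. A minimal sketch in Python's sqlite3 with invented table names and rows:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimProduct (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE FactSales (sale_id INTEGER, product_key INTEGER, amount REAL);
INSERT INTO DimProduct VALUES (1, 'Maize'), (2, 'Tea');
INSERT INTO FactSales VALUES (100, 1, 500.0), (101, 2, 300.0), (102, 1, 200.0);
""")

# The fact table carries the measure (amount) and a foreign key; the dimension
# supplies the human-readable label to group by, much like a Power BI relationship.
rows = con.execute("""
SELECT d.product_name, SUM(f.amount)
FROM FactSales f JOIN DimProduct d ON f.product_key = d.product_key
GROUP BY d.product_name
ORDER BY d.product_name
""").fetchall()

print(rows)  # [('Maize', 700.0), ('Tea', 300.0)]
```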

&lt;h3&gt;
  
  
  5. Snowflake Schema
&lt;/h3&gt;

&lt;p&gt;A snowflake schema or snowflake model is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. Centralized fact tables connect to multiple dimensions, and those dimensions are themselves normalized into related sub-dimension tables. See illustration below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pqiiflgpcabj4yke4t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pqiiflgpcabj4yke4t4.png" alt="Snowflake_Schema" width="522" height="315"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Key Benefits of a Good Data Model
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Faster Performance – Organized data reduces Power BI’s workload, making reports load faster. Example: Removing duplicate data speeds up calculations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easier Reporting – A clear structure simplifies visualization and calculations. Example: Using fact and dimension tables makes chart creation intuitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accurate Insights – Proper relationships prevent errors in calculations.  Example: Incorrect joins can cause double counting of sales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handles Large Data Efficiently – Optimized models process millions of rows smoothly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A well-structured Power BI data model ensures better performance, accurate insights, and efficient reporting. By following best practices like Star Schema, reducing complexity, and using proper relationships, you can unlock the full potential of Power BI and make data-driven decisions with confidence.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>microsoft</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Introduction to Linux for Data Engineers, Including Practical Use of Vi and Nano with Examples</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Sun, 25 Jan 2026 13:17:21 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-4bf</link>
      <guid>https://forem.com/nganga_njongo/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-4bf</guid>
      <description>&lt;h1&gt;
  
  
  Why Linux?
&lt;/h1&gt;

&lt;p&gt;Shifting from Windows to Linux for the first time can be daunting. There is no graphical interface to maneuver around, no icons or folders to click; instead, there's a black screen waiting for you to key in commands. So why is Linux essential in a data engineer's day-to-day?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Servers run Linux&lt;/strong&gt;&lt;br&gt;
Nearly all public cloud workloads and the majority of servers powering data systems run on Linux.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data engineering tools&lt;/strong&gt;&lt;br&gt;
Hadoop/Spark/Kafka were built for Unix-like systems. These core data engineering tools are designed and optimized to run on Linux. Development, testing, and production deployment naturally happen there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Performance &amp;amp; Stability&lt;/strong&gt;&lt;br&gt;
Linux servers can run for years without reboots, crucial for long-running data pipelines and streaming jobs.&lt;/p&gt;

&lt;p&gt;Linux is also free &amp;amp; open-source, which is critical for scalable, cost-effective data infrastructure. These are just some of the reasons why Linux is crucial for a data engineer. Let's look at some basic Linux commands and relate them to what we're used to on Windows.&lt;/p&gt;

&lt;h1&gt;
  
  
  Basic Linux Commands
&lt;/h1&gt;

&lt;p&gt;To demonstrate this, we're going to connect to a remote server from Git Bash. We do this by using SSH (Secure Shell) which allows us to access and use the server's resources and run commands. The syntax to connect to the server is: &lt;br&gt;
&lt;code&gt;ssh user@ip&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;See below on connecting to a remote server provisioned on DigitalOcean:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0zlridtp0vsbjuy6fcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0zlridtp0vsbjuy6fcx.png" alt="SSH" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After successfully connecting to the remote server, let's run some commands. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;whoami&lt;/code&gt;: This prints out the current user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy45sg4fik907osk86qoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy45sg4fik907osk86qoh.png" alt="whoami" width="676" height="127"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;df -h&lt;/code&gt;: Displays disk usage of all mounted filesystems in human-readable units&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp41rxfnovdaie7n301w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp41rxfnovdaie7n301w.png" alt="df -h" width="763" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pwd&lt;/code&gt;: Prints the current working directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxn0ziwdexke8scqsjti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxn0ziwdexke8scqsjti.png" alt="pwd" width="646" height="85"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls&lt;/code&gt;: Lists all files and folders in your current directory. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3xmjniyuwpzm0wxm2cy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3xmjniyuwpzm0wxm2cy.png" alt="ls" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB&lt;/strong&gt;: Files are highlighted in white, folders in blue, and zipped (compressed) files in orange or red.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cd&lt;/code&gt;: Changes directory to the folder you specify&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7jk1p9l6plfqtr07kgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7jk1p9l6plfqtr07kgt.png" alt="cd" width="798" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the illustration above, we changed directory from /root to /root/eveningClass. eveningClass is a folder within /root, and we confirmed the move by printing the current working directory (&lt;code&gt;pwd&lt;/code&gt;) and then listing (&lt;code&gt;ls&lt;/code&gt;) all files and folders within /root/eveningClass.&lt;/p&gt;
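&lt;p&gt;The navigation above can be sketched as a single session. This is a minimal, reproducible example: it first creates the folder it navigates into, and eveningClass is just an illustrative name borrowed from the screenshots:&lt;/p&gt;

```shell
# Create a practice folder first so the session is reproducible
# (eveningClass mirrors the folder name in the screenshots).
mkdir -p eveningClass
pwd                 # print the current working directory
ls                  # list files and folders here, including eveningClass
cd eveningClass     # change into the subfolder
pwd                 # the path now ends in /eveningClass
cd ..               # ".." takes you back up to the parent directory
```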

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cat&lt;/code&gt;: Allows a user to read / display the contents of a file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq9cn7taqxmjuz7mntex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq9cn7taqxmjuz7mntex.png" alt="cat" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sudo adduser username&lt;/code&gt;: This creates a new user account with the specified username&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfwalcny81vkanhu1rmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfwalcny81vkanhu1rmv.png" alt="adduser" width="790" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we've created a new user account 'nganga'. We can verify the user has been added by displaying (&lt;code&gt;cat&lt;/code&gt;) the contents of the file /etc/passwd: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs35dtppxfvjfvxipkuf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs35dtppxfvjfvxipkuf0.png" alt="cat" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2z8hefidqlgeler0e47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2z8hefidqlgeler0e47.png" alt="user_nganga" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mkdir&lt;/code&gt;: Creates a new directory / folder. You also need to specify the name of the directory you want to create&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3r47d7mwgnvlmqe5v7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3r47d7mwgnvlmqe5v7i.png" alt="mkdir" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;touch&lt;/code&gt;: Creates a new, empty file (or updates the timestamp if the file already exists)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jlg3o0w3cbdfx9cr0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jlg3o0w3cbdfx9cr0u.png" alt="touch" width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;echo&lt;/code&gt;: Prints text to the terminal; combined with the &lt;code&gt;&amp;gt;&lt;/code&gt; redirection operator, it can be used to write to a file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffeb4moj5kjtx8ybf3dc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffeb4moj5kjtx8ybf3dc9.png" alt="echo" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scp&lt;/code&gt;: Stands for Secure Copy, which allows you to copy files from your local machine to a remote server and vice versa. Let's begin with copying from the local machine:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. scp from local machine to remote host&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwpxqa7aeygtk2f8k15o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwpxqa7aeygtk2f8k15o.png" alt="scp_local" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the illustration above, we have a file SecondFile.txt on the local machine, and we've copied it to the remote host into the directory /root/eveningClass/new_dir, as seen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhd3ir5e14rsskez3rgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhd3ir5e14rsskez3rgf.png" alt="second_file" width="703" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. scp from remote host to local machine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can copy a file from the remote host as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qb3y4vhjphnc4m8fgax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qb3y4vhjphnc4m8fgax.png" alt="scp_remote" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cp&lt;/code&gt;: This copies a file to a specified destination directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85xhs5xotexg77fwa721.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85xhs5xotexg77fwa721.png" alt="cp" width="800" height="484"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The file NewFile.txt has been copied from /root/eveningClass/new_dir/ to /root/eveningClass/.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mv&lt;/code&gt;: This moves a file from one directory to another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvq6ttnm2h14tbf8sy4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvq6ttnm2h14tbf8sy4s.png" alt="mv" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we've moved the file NewFile.txt from /root/eveningClass/ to /root/eveningClass/new_dir/.&lt;/p&gt;
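&lt;p&gt;The &lt;code&gt;cp&lt;/code&gt; and &lt;code&gt;mv&lt;/code&gt; steps can be reproduced end to end. This is a self-contained sketch that first creates the folders and file it works on, mirroring the paths from the screenshots under the current directory:&lt;/p&gt;

```shell
mkdir -p eveningClass/new_dir
echo "demo" > eveningClass/new_dir/NewFile.txt
# cp SOURCE DESTINATION: copy NewFile.txt up into eveningClass/ (the original stays put)
cp eveningClass/new_dir/NewFile.txt eveningClass/
ls eveningClass                  # shows both new_dir and NewFile.txt
# mv SOURCE DESTINATION: move the copy back into new_dir/ (removing it from eveningClass/)
mv eveningClass/NewFile.txt eveningClass/new_dir/
ls eveningClass                  # NewFile.txt is gone from here again
```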

&lt;h1&gt;
  
  
  Creating and Editing Files with Nano and Vi
&lt;/h1&gt;

&lt;p&gt;Vi and Nano are text editors used in Linux/Unix terminal environments. Nano is simple and intuitive (like Notepad), while Vi/Vim is more powerful but has a learning curve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Nano Editor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Opening/Creating Files&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgjv9vn62ak2214kktzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgjv9vn62ak2214kktzb.png" alt="creating_file" width="595" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91n765jamidm30yezyp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91n765jamidm30yezyp3.png" alt="nano_editing" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8al1r85bmfpr49n0om5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8al1r85bmfpr49n0om5.png" alt="nano_save" width="687" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type normally&lt;/strong&gt; - just start typing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Move cursor&lt;/strong&gt; - Arrow keys work as expected&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save file&lt;/strong&gt; - Ctrl + O (Write Out), then press Enter&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exit&lt;/strong&gt; - Ctrl + X&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search&lt;/strong&gt; - Ctrl + W, type word, press Enter&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Vi Editor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vi has 3 main modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal mode (default)&lt;/strong&gt; - for navigation and commands&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insert mode&lt;/strong&gt; - for typing text&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual mode&lt;/strong&gt; - for selecting text&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Opening/Creating Files&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j4j5y1xixbvx2mm7tnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j4j5y1xixbvx2mm7tnw.png" alt="Vi_create_file" width="513" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vp92kqa2epplu3ir615.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vp92kqa2epplu3ir615.png" alt="Vi_edit_file" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms0bs7wtrdmifhasvjwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms0bs7wtrdmifhasvjwr.png" alt="Vi_save_file" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopcytszp8wya5u1fniya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopcytszp8wya5u1fniya.png" alt="Vi_check_content" width="670" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Essential Vi Commands&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From Normal mode to Insert mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;i - insert before cursor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a - append after cursor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;o - open new line below&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I - insert at beginning of line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A - append at end of line&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Saving and Quitting (Normal mode):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;:w - save (write)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;:q - quit&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;:wq or ZZ - save and quit&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;:q! - quit without saving&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;:w filename - save as new file&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Navigation (Normal mode):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;h - left&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;j - down&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;k - up&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;l - right&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;0 - beginning of line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;$ - end of line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;gg - top of file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;G - bottom of file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;:5 - go to line 5&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Editing (Normal mode):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;x - delete character&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dd - delete line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;yy - copy line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;p - paste below&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;P - paste above&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;u - undo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ctrl+r - redo&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article we have covered the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Explained why Linux is important for data engineers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Demonstrated basic Linux commands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Showed practical usage of Vi and Nano (e.g., creating and editing files)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>beginners</category>
      <category>terminal</category>
    </item>
    <item>
      <title>Basics of Git and GitHub</title>
      <dc:creator>Ng'ang'a Njongo</dc:creator>
      <pubDate>Sat, 17 Jan 2026 07:45:26 +0000</pubDate>
      <link>https://forem.com/nganga_njongo/basics-of-git-and-github-1bh1</link>
      <guid>https://forem.com/nganga_njongo/basics-of-git-and-github-1bh1</guid>
      <description>&lt;h1&gt;
  
  
  What is Git?
&lt;/h1&gt;

&lt;p&gt;Git is a version control tool that tracks changes to the files in your project folder. Every change made in the project folder is tracked and saved using commits. Each commit saves the state of your files at that moment and records who made the change and why, so each commit corresponds to a new version of the project folder.&lt;/p&gt;

&lt;h1&gt;
  
  
  GitHub
&lt;/h1&gt;

&lt;p&gt;GitHub is a web-based hosting service that safely stores all the versions that have been committed and pushed from Git, making them available to collaborators.&lt;/p&gt;

&lt;h1&gt;
  
  
  Importance of Version Control
&lt;/h1&gt;

&lt;p&gt;From the Git definition earlier, one benefit is that it allows you to track the history of changes to a project folder (who, when, why). Others are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can experiment freely and safely, as you can always revert to a previous version without the fear of breaking things&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It provides a clear, documented history of changes, so you are not caught in the chaos of saving files as "final_project_v7.zip"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It allows members / developers to collaborate on a project simultaneously without overwriting each other's work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It avoids a single point of failure, as each developer has a copy of the entire project&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  1. Pushing Code to GitHub
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Create an empty folder&lt;/strong&gt;&lt;br&gt;
We'll begin by creating an empty folder on the local machine from the terminal and navigating into it, as shown below: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wbj1r16rbsy4kds5tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wbj1r16rbsy4kds5tm.png" alt="Creating a folder" width="636" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize Git&lt;/strong&gt;&lt;br&gt;
In the current directory, we then need to initialize Git. This is done with the &lt;code&gt;git init&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a3z5kfubzueqv3svztj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a3z5kfubzueqv3svztj.png" alt="Initialize Git" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a repository on GitHub&lt;/strong&gt;&lt;br&gt;
Before we push code to GitHub, we'll need to create a repository that will store our pushed code from Git. You can do this by navigating to GitHub and selecting "New Repository" as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst2ecceo6uac0iu5gys3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst2ecceo6uac0iu5gys3.png" alt="Creating a Repository" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connecting Git to GitHub repository&lt;/strong&gt;&lt;br&gt;
Once the repository is created, we can then connect Git with the specific repository we created using the code below:&lt;br&gt;
&lt;code&gt;git remote add origin https://github.com/Nganga7/LuxDev_Assignments.git&lt;/code&gt;&lt;/p&gt;
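&lt;p&gt;You can confirm the remote was registered with &lt;code&gt;git remote -v&lt;/code&gt;. A minimal sketch (remote-demo is an illustrative folder name; the URL is the repository created above):&lt;/p&gt;

```shell
mkdir -p remote-demo
cd remote-demo
git init -q                   # initialize Git in the folder
git remote add origin https://github.com/Nganga7/LuxDev_Assignments.git
git remote -v                 # lists "origin" with its fetch and push URLs
```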

&lt;p&gt;&lt;strong&gt;Creating a new file&lt;/strong&gt;&lt;br&gt;
In the directory "C:\Nganga\LuxDevHQ\Assignment1\", we created an empty text file "Doc_one.txt" and added the line "First Doc" as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4awmo7pk8o1f069phvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4awmo7pk8o1f069phvf.png" alt="Text File Creation" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing to the new file&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s5c3z2i579ljltt1rlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s5c3z2i579ljltt1rlt.png" alt="Text File Contents" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track Git changes&lt;/strong&gt; &lt;br&gt;
Since we've made changes to our project folder (creating a new file &amp;amp; writing to the file), Git will track the changes. Initially there was nothing to be tracked. See below:  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before creating the file&lt;/strong&gt;&lt;br&gt;
We use &lt;code&gt;git status&lt;/code&gt; to check for changes made in our project folder&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivzr75vgrdez7gc4hzmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivzr75vgrdez7gc4hzmc.png" alt="Before file creation" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After creating the file&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgdxgs71lti8xxs9qcnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgdxgs71lti8xxs9qcnb.png" alt="After file creation" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staging and committing changes&lt;/strong&gt;&lt;br&gt;
After making the changes, we notice that they are untracked. We therefore need to add the files to the staging area and commit these changes before we push our code. We do this using &lt;code&gt;git add Doc_one.txt&lt;/code&gt; and commit (with a message) using &lt;code&gt;git commit -m "First Commit"&lt;/code&gt; as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr568589360h6c1438nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr568589360h6c1438nr.png" alt="Git add &amp;amp; commit" width="800" height="741"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can view what has been committed using &lt;code&gt;git log&lt;/code&gt;. As shown below, we see the commit ID, the author (who), the date &amp;amp; time of the commit, and the message added during the commit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww3pwvj3u744vs0d1y9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww3pwvj3u744vs0d1y9q.png" alt="git log" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pushing to GitHub repository&lt;/strong&gt;&lt;br&gt;
We're now ready to push our file with its contents to GitHub. We do this using the command &lt;code&gt;git push -u origin master&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga0zj80jiucmzge4lez3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga0zj80jiucmzge4lez3.png" alt="Pushed from Git" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File available on GitHub&lt;/strong&gt;&lt;br&gt;
The file has now been pushed to GitHub and is available to any collaborators who need to pull it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg2nv91mmq4xqdlmh92w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg2nv91mmq4xqdlmh92w.png" alt="Pushed to Git" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Pulling Code from GitHub
&lt;/h1&gt;

&lt;p&gt;To pull a file from GitHub, we'll create a new file in our repository called "Doc_two.txt" and we'll write to it "Second Doc" as shown below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a file&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9l04x4b0xdgct9155q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9l04x4b0xdgct9155q8.png" alt="Adding a file" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the file&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F920varl1qv9zdy6e8gd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F920varl1qv9zdy6e8gd8.png" alt="Creating the file" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then need to commit these changes with a message, similar to what we did when we pushed "Doc_one.txt" to GitHub using the command &lt;code&gt;git commit -m "First Commit"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4saraae2uu5ilrcjjwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4saraae2uu5ilrcjjwm.png" alt="Second Commit" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our GitHub repository now has both files "Doc_one.txt" and "Doc_two.txt", but our local repository only has the "Doc_one.txt" file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcesy011npnwhp6fevh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcesy011npnwhp6fevh1.png" alt="GitHub Repo" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local Git repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6xb80do86qzwldfg17k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6xb80do86qzwldfg17k.png" alt="Local Repo" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To finally pull the new file to our local repository, we use the &lt;code&gt;git pull&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09j9fk77k5nsr6ba6xbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09j9fk77k5nsr6ba6xbr.png" alt="Git Pull" width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown above, we have successfully pulled the "Doc_two.txt" file from GitHub to our local Git repository.&lt;/p&gt;
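&lt;p&gt;The push/pull round trip can also be rehearsed entirely offline by letting a local bare repository stand in for GitHub. This is a sketch with hypothetical folder names and identity, not the exact setup above:&lt;/p&gt;

```shell
git init -q --bare hub.git           # plays the role of the GitHub repository
git clone -q hub.git copy_a          # first collaborator's copy
git clone -q hub.git copy_b          # second collaborator's copy

cd copy_a
git config user.name "Example User"  # hypothetical identity for the demo
git config user.email "user@example.com"
echo "Second Doc" > Doc_two.txt
git add Doc_two.txt
git commit -q -m "Second Commit"
branch=$(git symbolic-ref --short HEAD)   # default branch name (master or main)
git push -q origin "$branch"              # "push to GitHub"

cd ../copy_b
git pull -q origin "$branch"         # fetch the new file into the second copy
cat Doc_two.txt                      # prints: Second Doc
```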

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, we covered the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What Git is and why version control is important&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to push code to GitHub&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to pull code from GitHub&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to track changes using Git&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
