<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: jim kinyua</title>
    <description>The latest articles on Forem by jim kinyua (@jim_kinyua_3f7d191b865bed).</description>
    <link>https://forem.com/jim_kinyua_3f7d191b865bed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3717512%2F58e17c42-7e6c-4be7-990b-ddf40e98f2fb.jpg</url>
      <title>Forem: jim kinyua</title>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jim_kinyua_3f7d191b865bed"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Understanding the Two Pillars of Modern Data Engineering</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:15:55 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/etl-vs-elt-understanding-the-two-pillars-of-data-engineering-27m4</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/etl-vs-elt-understanding-the-two-pillars-of-data-engineering-27m4</guid>
      <description>&lt;p&gt;If you work with data in any capacity, you've almost certainly encountered the terms ETL and ELT. They sound nearly identical, differ by just one swapped letter, and yet the architectural implications of choosing one over the other can shape your entire data infrastructure. Getting this choice wrong doesn't just slow things down — it can quietly bloat your cloud bill, bottleneck your analysts, or lock your pipelines into patterns that made sense a decade ago but feel painful today.&lt;/p&gt;

&lt;p&gt;This article breaks down both approaches from the ground up: what they are, how they differ, when to reach for each one, and which tools the industry actually uses to implement them. Whether you're a data engineer designing pipelines, a backend developer touching data for the first time, or an analyst trying to understand why your dashboard takes forty minutes to refresh, this is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ETL?
&lt;/h2&gt;

&lt;p&gt;ETL stands for &lt;strong&gt;Extract, Transform, Load&lt;/strong&gt;. It's the elder statesman of data integration, born in an era when storage was expensive and compute lived on dedicated, on-premise servers. The core idea is straightforward: pull data from source systems, reshape it into the format your destination expects, and then load the cleaned result into your target data store.&lt;/p&gt;

&lt;p&gt;Let's walk through each stage so you know exactly what's happening at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Stages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; is the first step. Your pipeline reaches into one or more source systems — a transactional database, a REST API, flat files sitting on an SFTP server, spreadsheets, and so on — and pulls raw data out. The goal here is simple retrieval. You're not filtering or fixing anything yet; you're just getting the data out of wherever it lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt; is where the heavy lifting happens, and it's the stage that &lt;strong&gt;defines ETL's character&lt;/strong&gt;. Before any data lands in your target warehouse or database, it passes through a transformation layer. Think of this as a "cleaning station" that sits between your data sources and your destination. This might be a dedicated server, an ETL tool's processing engine, or a script running on a scheduled virtual machine.&lt;/p&gt;

&lt;p&gt;What kind of cleaning happens here? Things like: removing duplicate records, filling in or flagging missing values, joining data from multiple sources into a single table, applying business rules (like converting prices from one currency to another), and making sure data types are consistent (so a date column actually contains dates, not random text). The key point is that all of this happens &lt;em&gt;in transit&lt;/em&gt;, on intermediate infrastructure, before the data reaches its final home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; is the final step. The now-clean, structured, validated data gets written into its destination — traditionally an on-premise data warehouse like Oracle or SQL Server. Because storage was expensive and warehouse compute was precious, you only wanted to land data that was already in its final, queryable shape. Loading messy data meant wasting resources that cost real money per terabyte.&lt;/p&gt;
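&lt;p&gt;To make the three stages concrete, here is a minimal sketch in Python. Everything in it is illustrative (the source rows, the field names, and the in-memory SQLite database standing in for the warehouse), but the shape of the flow is the point: nothing lands until it has been cleaned.&lt;/p&gt;

```python
import sqlite3

# --- Extract: pull raw rows out of a source system (shape is hypothetical) ---
raw_rows = [
    {"id": 1, "price_usd_cents": "1999", "country": "KE"},
    {"id": 1, "price_usd_cents": "1999", "country": "KE"},  # duplicate record
    {"id": 2, "price_usd_cents": None,   "country": "UG"},  # missing value
]

# --- Transform: clean the data in transit, before it touches the warehouse ---
seen_ids = set()
clean_rows = []
for row in raw_rows:
    if row["id"] in seen_ids or row["price_usd_cents"] is None:
        continue  # drop exact duplicates and records missing a price
    seen_ids.add(row["id"])
    # apply a business rule: convert cents (stored as text) to typed dollars
    clean_rows.append((row["id"], int(row["price_usd_cents"]) / 100, row["country"]))

# --- Load: only the cleaned, typed result lands in the warehouse ---
wh = sqlite3.connect(":memory:")  # stand-in for an on-premise warehouse
wh.execute("CREATE TABLE products (id INTEGER, price_usd REAL, country TEXT)")
wh.executemany("INSERT INTO products VALUES (?, ?, ?)", clean_rows)
print(wh.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # prints 1
```

&lt;p&gt;Notice that the duplicate and the broken record never reached the warehouse at all — that gatekeeping is exactly what the transform stage is for.&lt;/p&gt;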

&lt;h2&gt;
  
  
  Why ETL Became the Standard
&lt;/h2&gt;

&lt;p&gt;ETL dominated for decades because it matched the constraints of its time. Storage cost a lot, so you filtered and compressed before loading. Warehouse compute was limited, so you offloaded transformation to cheaper middleware. Data governance teams wanted guarantees that nothing dirty or non-conforming ever entered the warehouse, and ETL gave them a gatekeeping layer to enforce that.&lt;/p&gt;

&lt;p&gt;This approach works well when your data volumes are predictable, your schemas are stable, and you have a team capable of maintaining the transformation logic in a centralized tool. It's battle-tested, well-understood, and still runs inside thousands of organizations today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ELT?
&lt;/h2&gt;

&lt;p&gt;ELT stands for &lt;strong&gt;Extract, Load, Transform&lt;/strong&gt;. Read that order carefully — the swap isn't cosmetic. Instead of transforming data on intermediate infrastructure before it reaches the warehouse, ELT loads raw data directly into the destination and transforms it &lt;em&gt;there&lt;/em&gt;, using the warehouse's own compute power.&lt;/p&gt;

&lt;p&gt;This approach emerged alongside a fundamental shift in how data infrastructure is priced and provisioned. When cloud data warehouses like BigQuery, Snowflake, and Redshift entered the picture, two things changed simultaneously: storage became absurdly cheap, and warehouse compute became elastic (meaning you could scale it up or down on demand). Suddenly, the core economic argument behind ETL — "clean it first because warehouse resources are scarce" — stopped holding up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Stages, Reordered
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; works the same way it does in ETL. You connect to source systems — databases, APIs, SaaS platforms, event streams, file stores — and pull data out. Nothing unusual here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; comes next, and this is where ELT diverges sharply. Instead of routing data through a transformation layer, you dump the raw, unprocessed data straight into your target warehouse or data lake. Every column, every null, every weird encoding artifact — it all goes in. The destination becomes the staging area. You might land it into a &lt;code&gt;raw&lt;/code&gt; schema or a dedicated landing zone, but the data arrives essentially untouched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt; happens last, and it happens &lt;em&gt;inside&lt;/em&gt; the warehouse. This is the defining move. You write SQL queries, dbt models, stored procedures, or warehouse-native scripts to clean, reshape, join, and enrich the data after it's already been loaded. The warehouse's own compute engine — which can be scaled up or down based on demand — does the heavy lifting. Your transformation logic lives as version-controlled code (often SQL) rather than as configuration inside a proprietary ETL tool.&lt;/p&gt;

&lt;p&gt;To put that in concrete terms: imagine your app sends raw JSON payment data into your warehouse. In ELT, that JSON lands as-is in a raw table. Then an analytics engineer writes a SQL file that reads from that raw table, extracts the fields they need, converts cents to dollars, filters out failed transactions, and writes the result into a clean table. That SQL file is saved in Git, reviewed by teammates, and runs on a schedule. The warehouse does all the computation. No separate server needed.&lt;/p&gt;
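&lt;p&gt;That flow can be sketched end to end with SQLite standing in for the cloud warehouse. The table and field names here are invented, and the example assumes a SQLite build with the JSON1 functions compiled in (true of recent Python distributions):&lt;/p&gt;

```python
import json
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# --- Extract + Load: raw JSON payloads land untouched in a raw table ---
wh.execute("CREATE TABLE raw_payments (payload TEXT)")
events = [
    {"payment_id": "p1", "amount_cents": 2500, "status": "succeeded"},
    {"payment_id": "p2", "amount_cents": 900,  "status": "failed"},
]
wh.executemany(
    "INSERT INTO raw_payments VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# --- Transform: a SQL "model" runs inside the warehouse, reading raw data ---
# (in a real stack this SQL would live in a version-controlled dbt model)
wh.execute("""
    CREATE TABLE clean_payments AS
    SELECT
        json_extract(payload, '$.payment_id')            AS payment_id,
        json_extract(payload, '$.amount_cents') / 100.0  AS amount_usd
    FROM raw_payments
    WHERE json_extract(payload, '$.status') = 'succeeded'
""")

for row in wh.execute("SELECT payment_id, amount_usd FROM clean_payments"):
    print(row)  # ('p1', 25.0)
```

&lt;p&gt;The raw table keeps the failed transaction too — if the business later decides failed payments matter, the transformation changes, not the ingestion.&lt;/p&gt;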

&lt;h2&gt;
  
  
  Why ELT Gained Momentum
&lt;/h2&gt;

&lt;p&gt;Several forces came together to make ELT the preferred pattern for modern data teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage economics flipped.&lt;/strong&gt; Storing a terabyte in Snowflake or BigQuery costs a fraction of what the same terabyte cost on an on-premise appliance. When storage is cheap, the argument for filtering data before loading it weakens dramatically. You can afford to keep everything, and "keep everything" turns out to be a powerful default — it means you never lose source data, and you can always go back and re-transform it when requirements change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warehouse compute became elastic.&lt;/strong&gt; You're no longer sharing a fixed pool of processing power with every other query in the building. Cloud warehouses let you spin up dedicated compute clusters, run a transformation pipeline, and shut them down. You pay for what you use. This made it practical to run heavy transformations inside the warehouse rather than on separate middleware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tooling ecosystem matured.&lt;/strong&gt; dbt (data build tool) arguably did more to accelerate ELT adoption than any single technology. It gave analysts and analytics engineers a framework for writing, testing, and documenting transformation logic in pure SQL, managed through version control like any other codebase. Before dbt, "transform inside the warehouse" often meant a mess of undocumented stored procedures. After dbt, it meant a structured, testable, reviewable project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema flexibility became a feature, not a bug.&lt;/strong&gt; Modern data teams frequently deal with semi-structured data — JSON payloads from APIs, nested event data from product analytics, log files with variable fields. ETL pipelines struggle with this because they need to flatten and conform the data before loading. ELT pipelines can load the raw JSON directly and parse it later using the warehouse's native JSON functions, which most modern platforms handle well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Raw Layer Advantage
&lt;/h2&gt;

&lt;p&gt;One of ELT's underappreciated strengths is the raw layer itself. Because you're loading unprocessed data first, you always have a source of truth that reflects exactly what the source system sent you. When a transformation has a bug — and it will, eventually — you don't have to re-extract from the source. You re-run the transformation against the raw data already sitting in your warehouse. This makes debugging faster, recovery simpler, and iterating on business logic far less risky.&lt;/p&gt;

&lt;p&gt;It also decouples extraction from transformation in a way that matters day-to-day. The team responsible for getting data into the warehouse doesn't need to know how that data will be used downstream. Analysts can build new models on top of existing raw data without filing a ticket to modify an extraction pipeline. This separation of concerns is one of the reasons ELT meshes well with the "analytics engineering" movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences Between ETL and ELT
&lt;/h2&gt;

&lt;p&gt;Now that you understand how each approach works individually, let's place them side by side. Some of these differences are architectural, some are economic, and some are about team workflow — but they all matter when you're deciding which pattern fits your situation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Transformation Happens
&lt;/h3&gt;

&lt;p&gt;This is the foundational difference, and everything else flows from it. In &lt;strong&gt;ETL&lt;/strong&gt;, data passes through a separate processing layer — a dedicated server or tool — before it ever touches the warehouse. The warehouse receives pre-cleaned, pre-shaped data.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;ELT&lt;/strong&gt;, the warehouse &lt;em&gt;is&lt;/em&gt; the processing layer. Raw data lands first, and the warehouse's own compute handles every transformation. This single architectural choice has cascading effects on cost, flexibility, team structure, and debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Pipeline Complexity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; pipelines tend to be more complex at the infrastructure level. You're managing source connections, a transformation engine (with its own scaling, monitoring, and failure handling), and the final load into the warehouse. That's three moving parts that can each break independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; simplifies the pipeline's middle section. Extraction and loading become a single, relatively thin operation — just get the data in. Transformation then lives as code inside the warehouse, which is already a system your team monitors and maintains. You trade infrastructure complexity for SQL complexity, which most data teams consider a favorable exchange.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed of Ingestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; is inherently slower at getting data into the warehouse because every record must pass through the transformation layer first. If your transformation logic is heavy — complex joins, lookups against reference tables, currency conversions — that bottleneck sits between your source and your warehouse. Nothing is queryable until the entire pipeline finishes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; decouples ingestion from transformation. Raw data can land in the warehouse within minutes of being extracted, even if the transformation models don't run until later. This means analysts can at least &lt;em&gt;see&lt;/em&gt; fresh data in the raw layer immediately, even before it's been cleaned. For time-sensitive use cases, that gap matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Schema Changes
&lt;/h3&gt;

&lt;p&gt;Source systems change. An API adds a new field. A column gets renamed. A vendor starts sending nested JSON where they used to send flat CSV. &lt;strong&gt;ETL&lt;/strong&gt; pipelines are brittle in the face of these changes because the transformation layer has explicit expectations about what the incoming data looks like. A new field might break a mapping. A type change might crash a conversion step. You discover the problem when the pipeline fails at 3 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; absorbs schema changes more gracefully. Since you're loading raw data without enforcing a strict schema upfront, a new field just appears in the raw table. It doesn't break anything — it sits there until someone writes a transformation that uses it. This makes ELT pipelines more resilient to the kind of upstream changes that are inevitable in any real data ecosystem.&lt;/p&gt;
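&lt;p&gt;A small sketch of that resilience, again with SQLite as a stand-in warehouse and invented field names (it assumes the JSON1 functions are available, as in recent Python builds):&lt;/p&gt;

```python
import json
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_events (payload TEXT)")

# Day 1: the source sends events with two fields
wh.execute("INSERT INTO raw_events VALUES (?)",
           (json.dumps({"user": "a1", "action": "click"}),))

# Day 2: the vendor adds a field. The raw load is untyped, so nothing breaks.
wh.execute("INSERT INTO raw_events VALUES (?)",
           (json.dumps({"user": "b2", "action": "click", "device": "ios"}),))

# The existing transformation keeps working; it simply ignores the new field
rows = wh.execute(
    "SELECT json_extract(payload, '$.user') FROM raw_events"
).fetchall()
print(rows)  # [('a1',), ('b2',)]

# When someone wants the new field, they change a SQL query, not infrastructure
devices = wh.execute(
    "SELECT json_extract(payload, '$.device') FROM raw_events"
).fetchall()
print(devices)  # [(None,), ('ios',)]
```

&lt;p&gt;An equivalent ETL pipeline would have needed its mapping updated (or would have failed) the moment the new field appeared.&lt;/p&gt;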

&lt;h3&gt;
  
  
  Cost Profile
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; costs are weighted toward the middle of the pipeline. You're paying for transformation infrastructure — servers, tool licenses, compute time — that sits between source and warehouse. The warehouse itself can be smaller because it only stores clean data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; shifts costs into the warehouse. You're storing more data (raw plus transformed) and running more compute inside the warehouse for transformations. With cloud pricing, this is usually cheaper overall because storage is pennies per gigabyte and compute is elastic. But it does mean your warehouse bill is larger, and poorly optimized transformation queries can run up costs if no one is watching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Governance and Compliance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; offers a natural gatekeeping layer. Because data is transformed before loading, you can enforce masking, redaction, and compliance rules before sensitive data ever enters the warehouse. If a regulation says certain fields must never be stored in their raw form, ETL makes that straightforward — you strip or hash those fields in the transformation step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; requires more deliberate governance. Raw data hits the warehouse first, which means sensitive fields — personally identifiable information, financial data, health records — exist in the raw layer in their original form, at least temporarily. You need warehouse-level access controls, column-level security policies, and careful management of who can query the raw schema. It's entirely solvable, but it's a responsibility you have to plan for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team Workflow and Ownership
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; pipelines often require specialized engineers who know a particular tool — Informatica, DataStage, Talend. The transformation logic lives inside that tool's proprietary environment, which can create knowledge silos. If the one person who understands the Informatica mappings leaves, the team inherits a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; pushes transformation logic into SQL, which is the most widely known language in the data world. Analytics engineers, analysts, and data engineers can all read, write, and review transformation code. Combined with dbt's project structure and Git-based workflows, this democratizes pipeline ownership across the team rather than concentrating it in a specialist role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reprocessing and Iteration
&lt;/h3&gt;

&lt;p&gt;When business logic changes — and it always does — &lt;strong&gt;ETL&lt;/strong&gt; forces you to re-extract and re-transform from scratch. The warehouse doesn't have the raw data, so there's no shortcut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; makes reprocessing simple. The raw data is already there. You update your transformation SQL, run it again, and the new logic applies to the full historical dataset. This makes ELT far more forgiving when requirements evolve, when you discover a bug in a calculation, or when a new team member spots an improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Knowing the theoretical differences is useful, but the real question is always: which one do I actually pick for &lt;em&gt;this&lt;/em&gt; project? The answer depends on your data volumes, your regulatory environment, your team's skill set, and what your downstream consumers need. Neither approach is universally superior. Here's where each one genuinely earns its place.&lt;/p&gt;

&lt;h3&gt;
  
  
  When ETL is the Right Call
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Banking and Financial Services Compliance
&lt;/h4&gt;

&lt;p&gt;A mid-size bank ingests transaction data from its core banking system, credit card processor, and wire transfer platform. Regulators require that personally identifiable information — Social Security numbers, account numbers, customer addresses — is masked or tokenized before it enters any analytical system. The bank cannot afford even a brief window where raw PII sits in a queryable warehouse table, because an auditor finding unmasked data in a staging schema is a compliance finding, not a technicality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; fits this scenario naturally. The transformation layer sits between source systems and the warehouse, acting as an enforcement boundary. PII gets hashed or redacted in transit. What lands in the warehouse is already clean from a compliance perspective. Trying to achieve this with ELT is possible, but it requires airtight warehouse access controls, column-level security, and constant auditing of the raw layer — a level of operational overhead that many regulated institutions prefer to avoid entirely.&lt;/p&gt;
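&lt;p&gt;A simplified illustration of that enforcement boundary in Python. The field names are invented, and a production pipeline would use salted hashing or a tokenization service rather than a bare digest:&lt;/p&gt;

```python
import hashlib

def mask_pii(record):
    """Hash identifying fields in transit, before the record is loaded.

    Illustrative only: real masking rules depend on the regulation in play.
    """
    masked = dict(record)
    for field in ("ssn", "account_number"):
        if masked.get(field) is not None:
            digest = hashlib.sha256(masked[field].encode("utf-8")).hexdigest()
            masked[field] = digest[:16]  # irreversible stand-in for the raw value

    return masked

record = {"customer": "c-42", "ssn": "123-45-6789", "account_number": "0099881122"}
safe = mask_pii(record)
print(safe["ssn"] == record["ssn"])  # False: the raw SSN never reaches the warehouse
```

&lt;p&gt;The point is where this function runs: in the transformation layer, so the warehouse only ever sees the masked version.&lt;/p&gt;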

&lt;h4&gt;
  
  
  Legacy System Migration
&lt;/h4&gt;

&lt;p&gt;A manufacturing company runs its ERP on a 15-year-old Oracle database with decades of accumulated data. Column names are cryptic (&lt;code&gt;CUST_TYP_CD_3&lt;/code&gt;), data types are inconsistent, and some tables have undocumented business logic baked into triggers and views. The company is migrating to a modern analytics platform and needs to rationalize this data into a clean dimensional model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; shines here because the transformation step is doing genuinely heavy structural work — not just renaming columns, but fundamentally reshaping data that was never designed for analytics. Dedicated ETL tools offer visual mapping interfaces that make it easier to trace how a cryptic source field maps to a clean target column.&lt;/p&gt;

&lt;h4&gt;
  
  
  Embedded Analytics with Tight SLAs
&lt;/h4&gt;

&lt;p&gt;A SaaS company embeds analytics dashboards inside its product. Customers expect their dashboards to refresh every hour with accurate, fully joined data. The underlying data comes from multiple microservices, each with its own database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; works well here because the transformation is stable, well-defined, and performance-critical. Running this transformation on dedicated infrastructure means the warehouse's compute is reserved for serving dashboard queries to end users, rather than competing with transformation workloads. Predictability matters more than flexibility in this case.&lt;/p&gt;

&lt;h3&gt;
  
  
  When ELT is the Right Call
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Startup Building Its First Data Stack
&lt;/h4&gt;

&lt;p&gt;A Series A startup has a product built on PostgreSQL, uses Stripe for billing, HubSpot for CRM, and Mixpanel for product analytics. The data team is two people. They need to build a pipeline that feeds dashboards, supports ad hoc analysis, and can evolve fast as the business model is still shifting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; is the obvious choice. The team uses a tool like Fivetran to sync raw data from all four sources into Snowflake. No transformation happens during ingestion. Then the analytics engineer writes dbt models to clean, join, and shape the data into marts that power dashboards. When the product team adds a new event, it automatically appears in the raw layer. When the finance team asks for a new metric, the analyst writes a new SQL model. No pipeline reconfiguration, no infrastructure changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Large-Scale Event Data Processing
&lt;/h4&gt;

&lt;p&gt;A media streaming platform generates billions of playback events per day. Each event is a JSON payload containing device information, content metadata, playback quality metrics, and user identifiers. The data engineering team needs to make this data available for recommendation model training, A/B test analysis, and executive reporting — three very different use cases with different transformation requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; handles this well because the raw event data, loaded as-is into the warehouse, serves as a shared foundation. The ML team writes transformations that extract feature vectors. The experimentation team writes transformations that aggregate events by test variant. The reporting team writes transformations that summarize engagement metrics. Each team owns their own transformation layer, all reading from the same raw source.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Exploration and Rapid Prototyping
&lt;/h4&gt;

&lt;p&gt;A retail company's analytics team suspects that weather patterns correlate with regional sales spikes, but they're not sure yet. They want to pull in weather API data alongside their point-of-sale data and experiment with different ways of joining the two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; is ideal for this kind of exploratory work. Load the weather data raw, load the POS data raw, and let the analysts experiment with transformations in SQL. If the hypothesis doesn't pan out, you've wasted some warehouse compute but haven't invested in building a dedicated ETL pipeline. If it does pan out, you promote the experimental models into production. The cost of being wrong is low, and the speed of iteration is high.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Used in Both Approaches
&lt;/h2&gt;

&lt;p&gt;Knowing the difference between ETL and ELT is one thing. Knowing which tools actually implement each approach is what turns theory into something you can use on a Monday morning. Don't worry if you haven't heard of most of these — the goal is to understand what role each tool plays so that when you see these names in a job posting or a team discussion, you know where they fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Tools
&lt;/h3&gt;

&lt;p&gt;These tools follow the traditional pattern: they extract data from sources, transform it on their own infrastructure, and load the cleaned result into a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache NiFi&lt;/strong&gt; is a free, open-source tool originally built by the U.S. National Security Agency and later donated to the Apache Software Foundation. It gives you a visual, drag-and-drop interface for building data pipelines. You connect "processors" — small building blocks that each do one thing, like read from a database or rename columns — into a flow. Think of it like snapping together building blocks where each block is a data operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; is one of the oldest and most established names in the ETL world. It's been around since the 1990s and is still running inside thousands of large enterprises — banks, insurance companies, government agencies. It's powerful but expensive, and comes with a steep learning curve. If you see Informatica on a job listing, it usually signals a large company with legacy infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talend Open Studio&lt;/strong&gt; sits between the open-source world and enterprise tooling. Its free version provides a visual interface for designing ETL jobs, and under the hood it generates Java code that runs the pipeline. This means if you outgrow the visual interface, you can customize the generated code directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt; is Amazon's managed ETL service. "Managed" means you don't have to set up or maintain servers — Amazon handles the infrastructure. You write transformation logic in Python or Spark, and Glue runs it for you. It's tightly integrated with other AWS services, so it's a natural fit if you're already building on AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft SSIS (SQL Server Integration Services)&lt;/strong&gt; is Microsoft's ETL tool, bundled with SQL Server. If your company runs on Microsoft technology, SSIS is often the default choice. It's not the most modern option, but it's reliable and has a massive community of practitioners and tutorials.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Tools
&lt;/h3&gt;

&lt;p&gt;ELT tools split into two categories: tools that handle the &lt;strong&gt;Extract and Load&lt;/strong&gt; part (getting raw data into the warehouse) and tools that handle the &lt;strong&gt;Transform&lt;/strong&gt; part (reshaping data inside the warehouse). This split is important to understand — in the ELT world, these are usually separate tools doing separate jobs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Extract and Load Tools
&lt;/h4&gt;

&lt;p&gt;These tools specialize in one thing: connecting to data sources and replicating the data into your warehouse. They deliberately don't transform anything. Their job is to be reliable pipes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; is probably the most well-known managed EL tool. You sign up, select a data source — say Stripe, or PostgreSQL, or Google Ads — enter your credentials, point it at your warehouse, and it starts syncing. Fivetran maintains hundreds of pre-built connectors and handles all the tricky parts: schema changes, API pagination, rate limiting, incremental updates. The trade-off is cost — Fivetran charges based on data volume, and at scale those bills grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; is the open-source alternative to Fivetran. It does the same thing — connects to sources, replicates data into your warehouse — but you can self-host it on your own servers without per-row fees. The downside is that self-hosting means you're responsible for keeping it running and debugging it. There's also a managed cloud version if you'd rather not deal with that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stitch&lt;/strong&gt;, now owned by Talend, is another managed EL tool similar to Fivetran. It's simpler and often cheaper, making it appealing for smaller teams with straightforward data needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformation Tools
&lt;/h4&gt;

&lt;p&gt;Once raw data is in the warehouse, these tools help you shape it into something useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt (data build tool)&lt;/strong&gt; is the tool that arguably made ELT mainstream. In plain terms: it lets you write SQL files that define how raw data should be transformed, and then runs those SQL files against your warehouse in the right order. Each SQL file is called a "model." Models can reference other models, creating a dependency chain. dbt figures out the order automatically. It also lets you write tests — simple checks like "this column should never be null" — that run automatically and alert you when something breaks. dbt comes in two flavors: dbt Core (free, open-source, command-line) and dbt Cloud (paid, with a web interface and scheduling).&lt;/p&gt;
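&lt;p&gt;A full dbt project is out of scope here, but the core idea — SQL models that build on each other, plus automated checks — can be imitated in a few lines of Python against SQLite. The model names and SQL are invented, and real dbt would infer the run order from ref() calls rather than relying on dictionary order:&lt;/p&gt;

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 500), (2, 1250)])

# Each "model" would be its own SQL file in a dbt project. orders_summary
# depends on stg_orders, so it must run second.
models = {
    "stg_orders": """
        CREATE TABLE stg_orders AS
        SELECT id, amount_cents / 100.0 AS amount_usd FROM raw_orders
    """,
    "orders_summary": """
        CREATE TABLE orders_summary AS
        SELECT COUNT(*) AS n, SUM(amount_usd) AS total FROM stg_orders
    """,
}
for name, sql in models.items():  # dicts preserve insertion order in Python 3.7+
    wh.execute(sql)

# A dbt-style "not_null" test: fail loudly if the invariant is violated
nulls = wh.execute(
    "SELECT COUNT(*) FROM stg_orders WHERE amount_usd IS NULL"
).fetchone()[0]
assert nulls == 0, "not_null test failed on stg_orders.amount_usd"

print(wh.execute("SELECT n, total FROM orders_summary").fetchone())  # (2, 17.5)
```

&lt;p&gt;What dbt adds on top of this sketch is exactly the machinery you don't want to write yourself: dependency resolution, incremental builds, documentation, and test reporting.&lt;/p&gt;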

&lt;p&gt;&lt;strong&gt;Warehouse-Native SQL&lt;/strong&gt; is worth mentioning too. Every modern cloud warehouse — Snowflake, BigQuery, Redshift, Databricks — has powerful SQL engines that can handle complex transformations on their own. For smaller teams or simpler pipelines, writing scheduled SQL queries inside the warehouse can be enough. dbt adds structure on top of this, but the underlying capability is the warehouse itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools That Work in Both Worlds
&lt;/h3&gt;

&lt;p&gt;Some tools don't fit neatly into one camp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; can be used for ETL (transforming before loading) or for ELT (transforming data already in a data lake). It's a distributed processing engine built for massive datasets, and it shows up in both architectures depending on how a team deploys it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is a workflow orchestrator — it doesn't extract, transform, or load anything itself, but it schedules and coordinates the tools that do. Think of it as the conductor of an orchestra. Whether your pipeline is ETL or ELT, you might use Airflow to run the steps in order, retry failures, and send alerts.&lt;/p&gt;
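&lt;p&gt;Airflow's central abstraction is the DAG: a set of tasks plus the dependencies between them. The scheduling guarantee it provides can be sketched with Python's standard-library graphlib — note this is not Airflow's actual API, and the task names are invented:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
# An orchestrator like Airflow would wrap each task in an operator.
dag = {
    "extract_stripe":     set(),
    "extract_postgres":   set(),
    "load_warehouse":     {"extract_stripe", "extract_postgres"},
    "run_dbt_models":     {"load_warehouse"},
    "refresh_dashboards": {"run_dbt_models"},
}

def run(task):
    print("running", task)  # a real orchestrator would also retry and alert here

# static_order yields each task only after all of its dependencies,
# which is the core ordering guarantee an orchestrator provides
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```

&lt;p&gt;Everything else Airflow offers — schedules, retries, backfills, alerting — is built around that ordering guarantee.&lt;/p&gt;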

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;ETL and ELT aren't competing philosophies — they're different tools shaped by different constraints. ETL was built for a world where storage was expensive and governance required data to be clean before it landed anywhere. ELT was built for a world where storage is cheap, compute is elastic, and teams want to iterate on transformation logic without re-extracting data from scratch.&lt;/p&gt;

&lt;p&gt;If you're working with strict compliance requirements, stable schemas, and dedicated data engineering teams, ETL still earns its keep. If you're dealing with fast-changing sources, semi-structured data, small teams, or a cloud-first warehouse, ELT will probably serve you better.&lt;/p&gt;

&lt;p&gt;The best advice for a beginner: don't overthink the choice before you've built anything. Pick the approach that matches your current constraints, build a pipeline that works, and refactor when the pain points become obvious. The concepts transfer between both patterns, and understanding &lt;em&gt;why&lt;/em&gt; each exists matters more than memorizing which acronym is "correct."&lt;/p&gt;

&lt;p&gt;Start building. The data is waiting.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>database</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Understanding SQL Joins and Window Functions</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 02 Mar 2026 16:00:25 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/understanding-sql-joins-and-window-functions-30do</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/understanding-sql-joins-and-window-functions-30do</guid>
      <description>&lt;p&gt;If you've ever stared at a SQL query and wondered what PARTITION BY actually does, this article is for you. I'm going to break down two of the most important SQL concepts &lt;strong&gt;Joins&lt;/strong&gt; and &lt;strong&gt;Window Functions&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 1: SQL JOINS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Join?
&lt;/h3&gt;

&lt;p&gt;A Join is how you combine data from two or more tables. Think of it as connecting two spreadsheets through a shared column, such as a customer ID that appears in both a Customers table and an Orders table.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tables We'll Use
&lt;/h3&gt;

&lt;p&gt;Customers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Orders:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that Carol (id = 3) has no orders, and all orders belong to either Alice or Bob. Keep that in mind: it's going to matter a lot when we look at how different JOINs behave.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Left and Right Tables
&lt;/h3&gt;

&lt;p&gt;Before we look at any specific JOIN type, there's one concept you need to lock in first. Every JOIN has a Left table and a Right table, and the difference matters.&lt;/p&gt;

&lt;p&gt;The table written after FROM is the Left table&lt;br&gt;
The table written after JOIN is the Right table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;          &lt;span class="c1"&gt;-- LEFT table&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;        &lt;span class="c1"&gt;-- RIGHT table&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just terminology. SQL uses this position to decide which table's rows get priority when there's no matching data on the other side.&lt;/p&gt;

&lt;h3&gt;
  
  
  INNER JOIN
&lt;/h3&gt;

&lt;p&gt;An INNER JOIN returns rows where there is a match in BOTH tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol is gone. She has no orders, so she doesn't appear. &lt;strong&gt;INNER JOIN&lt;/strong&gt; only shows you where both sides match.&lt;/p&gt;

&lt;h3&gt;
  
  
  LEFT JOIN
&lt;/h3&gt;

&lt;p&gt;A LEFT JOIN returns ALL rows from the left table, and matches from the right. Where there's no match, you get NULL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol is back. She has no orders, so her amount is NULL. Use LEFT JOIN when you don't want to lose rows from your main table.&lt;/p&gt;

&lt;h3&gt;
  
  
  RIGHT JOIN
&lt;/h3&gt;

&lt;p&gt;The opposite of LEFT JOIN. All rows from the right table are kept, and matches come from the left. Less common, but useful in certain situations.&lt;/p&gt;
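&lt;p&gt;Using our two tables (names match the earlier examples), a RIGHT JOIN keeps every order and pulls in the matching customer. It's also worth knowing that any RIGHT JOIN can be rewritten as a LEFT JOIN with the table order swapped, which is why many teams simply standardize on LEFT JOIN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customers.name, orders.amount
FROM customers
RIGHT JOIN orders ON customers.id = orders.customer_id;

-- Equivalent LEFT JOIN, with the tables swapped:
SELECT customers.name, orders.amount
FROM orders
LEFT JOIN customers ON customers.id = orders.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With our sample data, every order has a customer, so both queries return the same three rows as the INNER JOIN.&lt;/p&gt;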

&lt;h3&gt;
  
  
  FULL OUTER JOIN
&lt;/h3&gt;

&lt;p&gt;Returns all rows from both tables, with NULLs where there's no match on either side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;OUTER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this when you want to see unmatched records from both sides.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Simple Way to Remember Joins
&lt;/h3&gt;

&lt;p&gt;INNER  = Only the overlap between the two tables&lt;br&gt;
LEFT   = All of left + whatever matches on the right&lt;br&gt;
RIGHT  = All of right + whatever matches on the left&lt;br&gt;
FULL   = Everything from both, matched or not&lt;/p&gt;


&lt;h3&gt;
  
  
  Key Takeaways for Joins
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Joins combine tables using a shared column&lt;/li&gt;
&lt;li&gt;INNER JOIN is the most common — only matched rows&lt;/li&gt;
&lt;li&gt;LEFT JOIN is your friend when you don't want to lose rows from your primary table&lt;/li&gt;
&lt;li&gt;Always think: "What happens to rows with no match?" That question determines which join to use&lt;/li&gt;
&lt;/ol&gt;
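&lt;p&gt;That last question points to one of the most useful join patterns in practice: finding rows with &lt;em&gt;no&lt;/em&gt; match. For example, using our earlier tables, you can list customers who have never placed an order by LEFT JOINing and keeping only the rows where the right side came back NULL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customers.name
FROM customers
LEFT JOIN orders ON customers.id = orders.customer_id
WHERE orders.id IS NULL;   -- returns only Carol
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;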
&lt;h2&gt;
  
  
  PART 2: WINDOW FUNCTIONS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is a Window Function?
&lt;/h3&gt;

&lt;p&gt;A window function performs a calculation across a set of rows related to the current row, without collapsing them the way GROUP BY does.&lt;/p&gt;

&lt;p&gt;This is the key difference:&lt;/p&gt;

&lt;p&gt;GROUP BY: Collapses multiple rows into one result per group&lt;br&gt;
Window Function: Keeps all rows, and adds a calculated column to each one&lt;/p&gt;

&lt;p&gt;Think of it like this: for each row, you "open a window" into the data, run a calculation across that window, and write the result back into that row, without losing any rows.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Syntax
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;OVER is what makes it a window function. Without OVER, it's just a regular function.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Table We'll Use
&lt;/h3&gt;

&lt;p&gt;Sales:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  ROW_NUMBER() — Rank Rows Within a Group
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What happened here?&lt;/strong&gt;&lt;br&gt;
PARTITION BY dept splits the data into two groups: IT and HR.&lt;br&gt;
ORDER BY sales DESC sorts each group from highest to lowest sales.&lt;br&gt;
ROW_NUMBER() assigns 1, 2, 3... to each row within that group.&lt;/p&gt;

&lt;p&gt;Both Alice and Dave get rank 1 — because the ranking restarted for each department. That's the power of PARTITION BY.&lt;/p&gt;
&lt;h3&gt;
  
  
  RANK() vs DENSE_RANK() vs ROW_NUMBER()
&lt;/h3&gt;

&lt;p&gt;These three are easy to confuse. Here's the difference when two people tie (for this example, assume Carol's sales are 500 instead of 400, so she ties with Alice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dense_rnk&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;row_num&lt;/th&gt;
&lt;th&gt;rnk&lt;/th&gt;
&lt;th&gt;dense_rnk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ROW_NUMBER: Always unique. No ties. Just counts 1, 2, 3, 4.&lt;br&gt;
RANK: Ties get the same number, then skips. Alice and Carol both get 2, next is 4.&lt;br&gt;
DENSE_RANK: Ties get the same number, but does NOT skip. Next after 2 is 3.&lt;/p&gt;
&lt;h3&gt;
  
  
  SUM() OVER()
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;dept_total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every row still exists. But now each row also shows the total for their department. With GROUP BY, you'd lose the individual rows. With SUM() OVER(), you keep them all.&lt;/p&gt;


&lt;h3&gt;
  
  
  LAG() and LEAD()
&lt;/h3&gt;

&lt;p&gt;LAG() lets you look at the value from the previous row. LEAD() looks ahead to the next row. Perfect for comparing month-over-month changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;previous_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;previous_sales&lt;/th&gt;
&lt;th&gt;next_sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first row has no previous row, so it's NULL. The last row has no next row, so that's NULL too.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROWS BETWEEN
&lt;/h3&gt;

&lt;p&gt;This is more advanced, but incredibly useful for moving averages. ROWS BETWEEN controls the size of your window: how many rows before and after the current row are included in the calculation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
    &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;moving_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This calculates the average of the current row and the one before it. Change the numbers to control how many rows back or forward your window looks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways for Window Functions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Window functions do NOT collapse rows — this is the biggest difference from GROUP BY&lt;/li&gt;
&lt;li&gt;OVER() is what makes it a window function&lt;/li&gt;
&lt;li&gt;PARTITION BY is like GROUP BY but keeps all rows&lt;/li&gt;
&lt;li&gt;ORDER BY inside OVER() controls the order within the window&lt;/li&gt;
&lt;/ol&gt;
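&lt;p&gt;One common follow-up worth knowing: you can't filter on a window function directly in WHERE, because window functions are evaluated after WHERE. Wrap the query in a subquery (or CTE) first. For example, to get the top seller in each department from our sales table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT employee, dept, sales
FROM (
  SELECT employee, dept, sales,
    ROW_NUMBER() OVER (PARTITION BY dept ORDER BY sales DESC) AS rn
  FROM sales
) ranked
WHERE rn = 1;   -- Alice for IT, Dave for HR
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;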

</description>
      <category>beginners</category>
      <category>database</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Turn Messy Data, DAX Headaches, and Ugly Dashboards into Decisions Using Power BI</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 09 Feb 2026 15:20:17 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/how-to-turn-messy-data-dax-headaches-and-ugly-dashboards-into-decisions-using-power-bi-1fgd</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/how-to-turn-messy-data-dax-headaches-and-ugly-dashboards-into-decisions-using-power-bi-1fgd</guid>
      <description>&lt;p&gt;Let’s be honest, no dataset has ever arrived on an analyst's desk completely clean. Not once. In the real world, data usually shows up in a chaotic state. Dates saved as text, revenue strings mixed with currency symbols, twelve different spellings for the same county, blanks that should be zeros, and zeros that are actually missing values.&lt;/p&gt;

&lt;p&gt;Inevitably, this is followed by a stakeholder asking, "Can you build a dashboard by Friday?" The answer is usually "sure," but the gap between receiving that raw data and delivering a "wow" dashboard is where the actual work happens. It is a journey that moves from invisible data engineering to strategic business storytelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Invisible Foundation: Power Query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first reality of analytics is that building the dashboard is often the easiest part; dragging charts onto a canvas takes minutes. The real labor lies in ensuring those charts tell the truth, and that work begins in Power Query. This is the "cleaning room" where we fix data types (because revenue should never be text), standardize categories, remove duplicates, and handle null values properly. It is also where we create derived fields, such as "Age Group" or "Price Band," to make analysis easier later on.&lt;/p&gt;

&lt;p&gt;If you skip this step, you will inevitably try to fix data quality issues using DAX measures. This is a mistake. DAX is a calculation engine, not a cleanup tool, and it will punish you with slow performance and overly complex formulas if your data isn't prepared correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Backbone: Data Modeling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most beginners make the mistake of dumping all their data into one massive, flat table. While this might work for simple spreadsheets, Power BI’s engine is optimized for a Star Schema. This means separating your data into a Fact table (containing transactions, visits, or sales numbers) and Dimension tables (containing descriptive context like dates, counties, products, or departments).&lt;/p&gt;

&lt;p&gt;When your relationships are modeled correctly in a star schema, filters flow logically, totals don't double-count, and performance improves significantly. A bad model forces you to write complex DAX to work around the structure; a good model allows for elegant, simple DAX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Logic: Context Over Formulas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the model is solid, we move to DAX. On the surface, it looks simple—a formula like &lt;code&gt;Total Revenue = SUM(Sales[Revenue])&lt;/code&gt; seems straightforward. However, the real power of DAX is context. That single measure will return different results based on slicers, filters, relationships, and the visual it is placed in.&lt;/p&gt;

&lt;p&gt;For example, a measure like &lt;code&gt;Revenue per Visit = DIVIDE([Total Revenue], [Total Visits])&lt;/code&gt; does more than just report a number; it measures performance. Understanding "filter context" (how the user's interaction with the report changes the calculation on the fly) is the moment DAX stops being frustrating and starts making sense.&lt;/p&gt;
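&lt;p&gt;As a small DAX illustration (the &lt;code&gt;Sales[Status]&lt;/code&gt; column is hypothetical, not from any specific dataset), CALCULATE is the function that modifies filter context explicitly, letting you re-scope a base measure without rewriting it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;Total Revenue = SUM(Sales[Revenue])

-- Same measure, but CALCULATE overrides part of the filter context
Completed Revenue = CALCULATE([Total Revenue], Sales[Status] = "Completed")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;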

&lt;p&gt;&lt;strong&gt;The Art of Visualization: How to Choose the Right Chart&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The visual layer is where your data meets the user’s eye. The wrong visual can obscure the truth, while the right one illuminates it. To choose the best visual, you must first identify the question you are trying to answer.&lt;/p&gt;

&lt;p&gt;Here is a framework for selecting the right visual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Comparison&lt;/strong&gt;&lt;br&gt;
If you want to compare values across categories (e.g., Sales by Department), use a Bar Chart. If you are comparing values over time (e.g., Sales by Month), use a Line Chart or Area Chart to show the trend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Correlation&lt;/strong&gt;&lt;br&gt;
If you need to see whether two variables are related (for instance, "Do higher patient visits always mean higher revenue?"), use a Scatter Plot. With "Total Visits" on the X-axis and "Total Revenue" on the Y-axis, you can instantly see positive correlations, outliers, or underperforming counties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Composition&lt;/strong&gt;&lt;br&gt;
If you need to show how parts make up a whole (e.g., Market Share), use a &lt;strong&gt;Donut Chart&lt;/strong&gt; or &lt;strong&gt;Treemap&lt;/strong&gt;. Use these sparingly; too many slices make them unreadable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. KPIs&lt;/strong&gt;&lt;br&gt;
If you just need to show a single, critical number (e.g., Total Year-to-Date Revenue), use a Card or a KPI Visual that shows the number alongside a trend indicator.&lt;/p&gt;

&lt;p&gt;Dashboards are storytelling tools. Each visual should answer exactly one question. If a chart requires a paragraph of explanation, it is likely the wrong chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Insight to Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ultimate goal of this entire process is not to build a dashboard that looks "clean," but to drive action. We transform messy data into business strategy by highlighting anomalies and trends that require intervention.&lt;/p&gt;

&lt;p&gt;For example, seeing high visits but low revenue might indicate a pricing issue. Noticing high medication usage in a specific age group drives inventory planning. A declining trend over time signals operational risk. A good analyst reports the numbers; a great analyst explains what they mean and what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Power BI is not just a dashboard tool; it is a thinking tool. The workflow—moving from Messy Data to Clean Transformations, building a Strong Model, writing Smart DAX, designing Clear Visuals, and finally driving Business Action—is the true craft of analytics.&lt;/p&gt;

&lt;p&gt;It isn’t about knowing every DAX function by heart. It’s about knowing when to clean, how to simplify logic, and how to explain insights to stakeholders. That is what makes a good analyst, and honestly, that is what makes analytics fun.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>microsoft</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Schemas and Data Modelling in Power BI</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 02 Feb 2026 12:34:24 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/schemas-and-data-modelling-in-power-bi-34k5</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/schemas-and-data-modelling-in-power-bi-34k5</guid>
      <description>&lt;p&gt;Everyone loves building flashy visuals and writing clever DAX measures in Power BI, and those things definitely matter. But the truth is, the single biggest factor in whether your report performs well, scales nicely, and gives trustworthy results is data modelling.&lt;/p&gt;

&lt;p&gt;Get the model right, and everything else becomes easier: calculations run faster, measures are simpler to write, filters behave as expected, and totals add up correctly. Get it wrong, and you’ll spend endless hours fighting performance issues, chasing phantom duplicates, or explaining why the numbers don’t match what the business expects.&lt;/p&gt;

&lt;p&gt;Data modelling is basically about answering three core questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where do the numbers live? (sales amounts, quantities, revenue)&lt;/li&gt;
&lt;li&gt;Where do the descriptive attributes live? (product names, customer details, dates, etc.)&lt;/li&gt;
&lt;li&gt;How do all these tables connect?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Fact Tables vs. Dimension Tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt;&lt;br&gt;
They hold the measurable events. These are the things that happened in the business. Think sales transactions, orders, inventory movements, clicks, or support tickets. They’re usually full of numeric values (facts) and foreign keys that reference other tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt;&lt;br&gt;
They provide the "who, what, where, when, why" context. They contain descriptive fields and are typically much smaller than fact tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Star Schema&lt;/strong&gt;&lt;br&gt;
In a star schema, you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One central fact table&lt;/li&gt;
&lt;li&gt;Several dimension tables connected directly to it (usually via single-direction relationships filtering from dimension to fact)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence the name star.&lt;br&gt;
The star schema is the gold standard for Power BI (and most modern BI tools): it’s simple, fast, and plays perfectly to how Power BI’s engine (VertiPaq) works.&lt;/p&gt;
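&lt;p&gt;The shape is easy to see in miniature. Here is a hypothetical pandas sketch (all table and column names are invented for illustration): one fact table holds the numbers and the foreign keys, and each dimension is a single join away.&lt;/p&gt;

```python
import pandas as pd

# Central fact table: one row per sale, with numeric facts and foreign keys
sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id":    [10, 11, 10],
    "amount":     [100.0, 150.0, 80.0],
})

# Dimension tables: descriptive context, exactly one row per key
products = pd.DataFrame({"product_id": [1, 2], "product": ["Laptop", "Mouse"]})
dates = pd.DataFrame({"date_id": [10, 11], "month": ["Jan", "Feb"]})

# Star schema: every dimension is one join away from the fact table
enriched = sales.merge(products, on="product_id").merge(dates, on="date_id")
revenue_by_product = enriched.groupby("product")["amount"].sum()
print(revenue_by_product)  # Laptop 250.0, Mouse 80.0
```

&lt;p&gt;In Power BI you model the same thing with relationships instead of explicit merges, but the shape — one hop from the fact table to any dimension — is identical.&lt;/p&gt;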

&lt;p&gt;&lt;strong&gt;The Snowflake Schema&lt;/strong&gt;&lt;br&gt;
A snowflake schema is basically a normalized star schema. Instead of keeping everything in one dimension table, you split dimensions into multiple related tables (e.g., Product → Category → Subcategory → Department).&lt;/p&gt;
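&lt;p&gt;The cost of snowflaking shows up as extra joins. A hypothetical pandas sketch (names invented) of a Product → Subcategory → Category chain, next to the flat dimension a star schema would use instead:&lt;/p&gt;

```python
import pandas as pd

# Snowflaked product dimension: normalized into three related tables
products = pd.DataFrame({
    "product_id": [1, 2],
    "product": ["Laptop", "Mouse"],
    "subcategory_id": [100, 101],
})
subcategories = pd.DataFrame({
    "subcategory_id": [100, 101],
    "subcategory": ["Computers", "Accessories"],
    "category_id": [5, 5],
})
categories = pd.DataFrame({"category_id": [5], "category": ["Electronics"]})

# Getting the category next to the product now takes two extra joins...
chain = (products
         .merge(subcategories, on="subcategory_id")
         .merge(categories, on="category_id"))

# ...whereas a star schema denormalizes it into one flat dimension table
flat_dim = chain[["product_id", "product", "subcategory", "category"]]
print(flat_dim)
```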

&lt;p&gt;&lt;strong&gt;Comparison Between Star and Snowflake Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple and intuitive&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query performance&lt;/td&gt;
&lt;td&gt;Faster (fewer joins)&lt;/td&gt;
&lt;td&gt;Slightly slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAX development&lt;/td&gt;
&lt;td&gt;Much easier&lt;/td&gt;
&lt;td&gt;More complicated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bottom line: use a star schema in Power BI unless you have a massive warehouse where the storage savings from normalization outweigh the cost of the extra joins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Good Data Modelling Actually Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good model is not a “nice to have”; it affects every aspect of your reporting experience. Performance comes first: Power BI’s column-store engine (VertiPaq) is highly optimized for star schema models. With fewer tables to join and simpler relationships, queries run much faster, especially as users begin to slice and dice the data. A flat table or a snowflaked mess will slow things down considerably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy is a must&lt;/strong&gt;. When relationships are defined correctly (with the right cardinality and cross-filtering), filters work as expected, totals aren’t double-counted, and grand totals match what’s expected at lower levels. Get this wrong, and you’ll get inconsistent or wrong numbers — the quickest way to undermine trust in your reports.&lt;/p&gt;
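&lt;p&gt;The double-counting failure mode is easy to reproduce. In this invented pandas sketch, a duplicate key in a dimension — the equivalent of getting cardinality wrong in the model — silently inflates the fact total after a join:&lt;/p&gt;

```python
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 2], "amount": [100.0, 50.0]})

# Dimension accidentally has a duplicate key (should be one row per customer)
customers = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "region": ["East", "East", "West"],
})

# Customer 1's sale matches two dimension rows, so its amount appears twice
joined = sales.merge(customers, on="customer_id")

print(sales["amount"].sum())   # 150.0 — the true total
print(joined["amount"].sum())  # 250.0 — inflated by the duplicate key
```

&lt;p&gt;Power BI flags this when a one-to-many relationship fails validation; a model that quietly allows it produces exactly this kind of wrong grand total.&lt;/p&gt;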

&lt;p&gt;&lt;strong&gt;Simpler DAX formulas are a massive benefit&lt;/strong&gt;. In a properly modeled star schema, most calculations can be simple aggregations (SUM, COUNT, AVERAGE) with CALCULATE filters as needed. No more complex workarounds, heavy use of iterators, or “treat the table like an Excel sheet.” Debugging is simpler too — usually, problems are visible in seconds, not hours.&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;maintainability and scalability&lt;/strong&gt;. A model that’s clean and logical is easy for others (or for you, later) to understand. It scales well as you add more data or subject areas, and you can often reuse the same model in multiple reports or even share it in a Power BI dataset for the whole team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short&lt;/strong&gt;: Take the time to model your data properly. The difference between a report that feels snappy and reliable and one that’s always fighting you is huge. The visuals and DAX formulas are great, but they’re only as good as the foundation they’re built on.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>data</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Git for Beginners: How to Push &amp; Pull Code, Track Changes, and Understand Version Control</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Sun, 18 Jan 2026 07:38:53 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/git-for-beginners-how-to-push-pull-code-track-changes-and-understand-version-control-4443</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/git-for-beginners-how-to-push-pull-code-track-changes-and-understand-version-control-4443</guid>
      <description>&lt;h1&gt;
  
  
  Git Basics for Beginners: Push, Pull, and Track Code Changes
&lt;/h1&gt;

&lt;p&gt;Git is a version control tool that helps developers track changes in their code and collaborate safely. Instead of creating multiple versions of the same file, Git records what changed, when it changed, and who made the change. This makes it easy to work in teams and recover from mistakes.&lt;/p&gt;

&lt;p&gt;Git works in a project folder called a &lt;strong&gt;repository&lt;/strong&gt;. You work locally on your computer, then sync your changes to a remote repository such as GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Git Repository
&lt;/h2&gt;

&lt;p&gt;Start tracking a new project:&lt;br&gt;
git init&lt;/p&gt;

&lt;p&gt;Copy an existing project:&lt;br&gt;
git clone &lt;a href="https://github.com/username/project.git" rel="noopener noreferrer"&gt;https://github.com/username/project.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Changes
&lt;/h2&gt;

&lt;p&gt;Check the current state of your project:&lt;br&gt;
git status&lt;/p&gt;

&lt;p&gt;Stage a specific file:&lt;br&gt;
git add filename.js&lt;/p&gt;

&lt;p&gt;Stage all changes:&lt;br&gt;
git add .&lt;/p&gt;

&lt;p&gt;Save a checkpoint:&lt;br&gt;
git commit -m "Add login validation"&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing and Pulling Code
&lt;/h2&gt;

&lt;p&gt;Send your commits to the remote repository:&lt;br&gt;
git push&lt;/p&gt;

&lt;p&gt;First-time push:&lt;br&gt;
git push -u origin main&lt;/p&gt;

&lt;p&gt;Get the latest updates from the remote repository:&lt;br&gt;
git pull&lt;/p&gt;

&lt;p&gt;Always pull before starting work to avoid conflicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Typical Daily Git Workflow
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;git pull
# write code
git status
git add .
git commit -m "Fix payment validation bug"
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  Viewing History and Changes
&lt;/h2&gt;

&lt;p&gt;View commit history:&lt;br&gt;
git log&lt;/p&gt;

&lt;p&gt;See what changed in files:&lt;br&gt;
git diff&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Beginner Mistakes
&lt;/h2&gt;

&lt;p&gt;Common beginner mistakes include forgetting to pull before working, writing unclear commit messages, and committing broken code. Don’t be afraid of making mistakes, though: Git is designed so you can recover from errors safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Git is simply about tracking changes, saving checkpoints, and syncing work with others. Learn these core commands first: git status, git add, git commit, git push, and git pull. Everything else builds on them.&lt;/p&gt;

</description>
      <category>webdev</category>
    </item>
  </channel>
</rss>
