<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: jim kinyua</title>
    <description>The latest articles on Forem by jim kinyua (@jim_kinyua_3f7d191b865bed).</description>
    <link>https://forem.com/jim_kinyua_3f7d191b865bed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3717512%2F58e17c42-7e6c-4be7-990b-ddf40e98f2fb.jpg</url>
      <title>Forem: jim kinyua</title>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jim_kinyua_3f7d191b865bed"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Understanding the Two Pillars of Modern Data Engineering</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:15:55 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/etl-vs-elt-understanding-the-two-pillars-of-data-engineering-27m4</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/etl-vs-elt-understanding-the-two-pillars-of-data-engineering-27m4</guid>
      <description>&lt;p&gt;If you work with data in any capacity, you've almost certainly encountered the terms ETL and ELT. They sound nearly identical, differ by just one swapped letter, and yet the architectural implications of choosing one over the other can shape your entire data infrastructure. Getting this choice wrong doesn't just slow things down — it can quietly bloat your cloud bill, bottleneck your analysts, or lock your pipelines into patterns that made sense a decade ago but feel painful today.&lt;/p&gt;

&lt;p&gt;This article breaks down both approaches from the ground up: what they are, how they differ, when to reach for each one, and which tools the industry actually uses to implement them. Whether you're a data engineer designing pipelines, a backend developer touching data for the first time, or an analyst trying to understand why your dashboard takes forty minutes to refresh, this is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ETL?
&lt;/h2&gt;

&lt;p&gt;ETL stands for &lt;strong&gt;Extract, Transform, Load&lt;/strong&gt;. It's the elder statesman of data integration, born in an era when storage was expensive and compute lived on dedicated, on-premise servers. The core idea is straightforward: pull data from source systems, reshape it into the format your destination expects, and then load the cleaned result into your target data store.&lt;/p&gt;

&lt;p&gt;Let's walk through each stage so you know exactly what's happening at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Stages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; is the first step. Your pipeline reaches into one or more source systems — a transactional database, a REST API, flat files sitting on an SFTP server, spreadsheets, and so on — and pulls raw data out. The goal here is simple retrieval. You're not filtering or fixing anything yet; you're just getting the data out of wherever it lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt; is where the heavy lifting happens, and it's the stage that &lt;strong&gt;defines ETL's character&lt;/strong&gt;. Before any data lands in your target warehouse or database, it passes through a transformation layer. Think of this as a "cleaning station" that sits between your data sources and your destination. This might be a dedicated server, an ETL tool's processing engine, or a script running on a scheduled virtual machine.&lt;/p&gt;

&lt;p&gt;What kind of cleaning happens here? Things like: removing duplicate records, filling in or flagging missing values, joining data from multiple sources into a single table, applying business rules (like converting prices from one currency to another), and making sure data types are consistent (so a date column actually contains dates, not random text). The key point is that all of this happens &lt;em&gt;in transit&lt;/em&gt;, on intermediate infrastructure, before the data reaches its final home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; is the final step. The now-clean, structured, validated data gets written into its destination — traditionally an on-premise data warehouse like Oracle or SQL Server. Because storage was expensive and warehouse compute was precious, you only wanted to land data that was already in its final, queryable shape. Loading messy data meant wasting resources that cost real money per terabyte.&lt;/p&gt;
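&lt;p&gt;To make the three stages concrete, here is a minimal sketch in Python. Everything in it is illustrative (the source rows, the field names, and the in-memory SQLite database standing in for the warehouse), but the shape of the flow is the point: nothing lands until it has been cleaned.&lt;/p&gt;

```python
import sqlite3

# --- Extract: pull raw rows out of a source system (shape is hypothetical) ---
raw_rows = [
    {"id": 1, "price_usd_cents": "1999", "country": "KE"},
    {"id": 1, "price_usd_cents": "1999", "country": "KE"},  # duplicate record
    {"id": 2, "price_usd_cents": None,   "country": "UG"},  # missing value
]

# --- Transform: clean the data in transit, before it touches the warehouse ---
seen_ids = set()
clean_rows = []
for row in raw_rows:
    if row["id"] in seen_ids or row["price_usd_cents"] is None:
        continue  # drop exact duplicates and records missing a price
    seen_ids.add(row["id"])
    # apply a business rule: convert cents (stored as text) to typed dollars
    clean_rows.append((row["id"], int(row["price_usd_cents"]) / 100, row["country"]))

# --- Load: only the cleaned, typed result lands in the warehouse ---
wh = sqlite3.connect(":memory:")  # stand-in for an on-premise warehouse
wh.execute("CREATE TABLE products (id INTEGER, price_usd REAL, country TEXT)")
wh.executemany("INSERT INTO products VALUES (?, ?, ?)", clean_rows)
print(wh.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # prints 1
```

&lt;p&gt;Notice that the duplicate and the broken record never reached the warehouse at all — that gatekeeping is exactly what the transform stage is for.&lt;/p&gt;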

&lt;h2&gt;
  
  
  Why ETL Became the Standard
&lt;/h2&gt;

&lt;p&gt;ETL dominated for decades because it matched the constraints of its time. Storage cost a lot, so you filtered and compressed before loading. Warehouse compute was limited, so you offloaded transformation to cheaper middleware. Data governance teams wanted guarantees that nothing dirty or non-conforming ever entered the warehouse, and ETL gave them a gatekeeping layer to enforce that.&lt;/p&gt;

&lt;p&gt;This approach works well when your data volumes are predictable, your schemas are stable, and you have a team capable of maintaining the transformation logic in a centralized tool. It's battle-tested, well-understood, and still runs inside thousands of organizations today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ELT?
&lt;/h2&gt;

&lt;p&gt;ELT stands for &lt;strong&gt;Extract, Load, Transform&lt;/strong&gt;. Read that order carefully — the swap isn't cosmetic. Instead of transforming data on intermediate infrastructure before it reaches the warehouse, ELT loads raw data directly into the destination and transforms it &lt;em&gt;there&lt;/em&gt;, using the warehouse's own compute power.&lt;/p&gt;

&lt;p&gt;This approach emerged alongside a fundamental shift in how data infrastructure is priced and provisioned. When cloud data warehouses like BigQuery, Snowflake, and Redshift entered the picture, two things changed simultaneously: storage became absurdly cheap, and warehouse compute became elastic (meaning you could scale it up or down on demand). Suddenly, the core economic argument behind ETL — "clean it first because warehouse resources are scarce" — stopped holding up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Stages, Reordered
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; works the same way it does in ETL. You connect to source systems — databases, APIs, SaaS platforms, event streams, file stores — and pull data out. Nothing unusual here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; comes next, and this is where ELT diverges sharply. Instead of routing data through a transformation layer, you dump the raw, unprocessed data straight into your target warehouse or data lake. Every column, every null, every weird encoding artifact — it all goes in. The destination becomes the staging area. You might land it into a &lt;code&gt;raw&lt;/code&gt; schema or a dedicated landing zone, but the data arrives essentially untouched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt; happens last, and it happens &lt;em&gt;inside&lt;/em&gt; the warehouse. This is the defining move. You write SQL queries, dbt models, stored procedures, or warehouse-native scripts to clean, reshape, join, and enrich the data after it's already been loaded. The warehouse's own compute engine — which can be scaled up or down based on demand — does the heavy lifting. Your transformation logic lives as version-controlled code (often SQL) rather than as configuration inside a proprietary ETL tool.&lt;/p&gt;

&lt;p&gt;To put that in concrete terms: imagine your app sends raw JSON payment data into your warehouse. In ELT, that JSON lands as-is in a raw table. Then an analytics engineer writes a SQL file that reads from that raw table, extracts the fields they need, converts cents to dollars, filters out failed transactions, and writes the result into a clean table. That SQL file is saved in Git, reviewed by teammates, and runs on a schedule. The warehouse does all the computation. No separate server needed.&lt;/p&gt;
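&lt;p&gt;That flow can be sketched end to end with SQLite standing in for the cloud warehouse. The table and field names here are invented, and the example assumes a SQLite build with the JSON1 functions compiled in (true of recent Python distributions):&lt;/p&gt;

```python
import json
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# --- Extract + Load: raw JSON payloads land untouched in a raw table ---
wh.execute("CREATE TABLE raw_payments (payload TEXT)")
events = [
    {"payment_id": "p1", "amount_cents": 2500, "status": "succeeded"},
    {"payment_id": "p2", "amount_cents": 900,  "status": "failed"},
]
wh.executemany(
    "INSERT INTO raw_payments VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# --- Transform: a SQL "model" runs inside the warehouse, reading raw data ---
# (in a real stack this SQL would live in a version-controlled dbt model)
wh.execute("""
    CREATE TABLE clean_payments AS
    SELECT
        json_extract(payload, '$.payment_id')            AS payment_id,
        json_extract(payload, '$.amount_cents') / 100.0  AS amount_usd
    FROM raw_payments
    WHERE json_extract(payload, '$.status') = 'succeeded'
""")

for row in wh.execute("SELECT payment_id, amount_usd FROM clean_payments"):
    print(row)  # ('p1', 25.0)
```

&lt;p&gt;The raw table keeps the failed transaction too — if the business later decides failed payments matter, the transformation changes, not the ingestion.&lt;/p&gt;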

&lt;h2&gt;
  
  
  Why ELT Gained Momentum
&lt;/h2&gt;

&lt;p&gt;Several forces came together to make ELT the preferred pattern for modern data teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage economics flipped.&lt;/strong&gt; Storing a terabyte in Snowflake or BigQuery costs a fraction of what the same terabyte cost on an on-premise appliance. When storage is cheap, the argument for filtering data before loading it weakens dramatically. You can afford to keep everything, and "keep everything" turns out to be a powerful default — it means you never lose source data, and you can always go back and re-transform it when requirements change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warehouse compute became elastic.&lt;/strong&gt; You're no longer sharing a fixed pool of processing power with every other query in the building. Cloud warehouses let you spin up dedicated compute clusters, run a transformation pipeline, and shut them down. You pay for what you use. This made it practical to run heavy transformations inside the warehouse rather than on separate middleware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tooling ecosystem matured.&lt;/strong&gt; dbt (data build tool) arguably did more to accelerate ELT adoption than any single technology. It gave analysts and analytics engineers a framework for writing, testing, and documenting transformation logic in pure SQL, managed through version control like any other codebase. Before dbt, "transform inside the warehouse" often meant a mess of undocumented stored procedures. After dbt, it meant a structured, testable, reviewable project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema flexibility became a feature, not a bug.&lt;/strong&gt; Modern data teams frequently deal with semi-structured data — JSON payloads from APIs, nested event data from product analytics, log files with variable fields. ETL pipelines struggle with this because they need to flatten and conform the data before loading. ELT pipelines can load the raw JSON directly and parse it later using the warehouse's native JSON functions, which most modern platforms handle well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Raw Layer Advantage
&lt;/h2&gt;

&lt;p&gt;One of ELT's underappreciated strengths is the raw layer itself. Because you're loading unprocessed data first, you always have a source of truth that reflects exactly what the source system sent you. When a transformation has a bug — and it will, eventually — you don't have to re-extract from the source. You re-run the transformation against the raw data already sitting in your warehouse. This makes debugging faster, recovery simpler, and iterating on business logic far less risky.&lt;/p&gt;

&lt;p&gt;It also decouples extraction from transformation in a way that matters day-to-day. The team responsible for getting data into the warehouse doesn't need to know how that data will be used downstream. Analysts can build new models on top of existing raw data without filing a ticket to modify an extraction pipeline. This separation of concerns is one of the reasons ELT meshes well with the "analytics engineering" movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences Between ETL and ELT
&lt;/h2&gt;

&lt;p&gt;Now that you understand how each approach works individually, let's place them side by side. Some of these differences are architectural, some are economic, and some are about team workflow — but they all matter when you're deciding which pattern fits your situation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Transformation Happens
&lt;/h3&gt;

&lt;p&gt;This is the foundational difference, and everything else flows from it. In &lt;strong&gt;ETL&lt;/strong&gt;, data passes through a separate processing layer — a dedicated server or tool — before it ever touches the warehouse. The warehouse receives pre-cleaned, pre-shaped data.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;ELT&lt;/strong&gt;, the warehouse &lt;em&gt;is&lt;/em&gt; the processing layer. Raw data lands first, and the warehouse's own compute handles every transformation. This single architectural choice has cascading effects on cost, flexibility, team structure, and debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Pipeline Complexity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; pipelines tend to be more complex at the infrastructure level. You're managing source connections, a transformation engine (with its own scaling, monitoring, and failure handling), and the final load into the warehouse. That's three moving parts that can each break independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; simplifies the pipeline's middle section. Extraction and loading become a single, relatively thin operation — just get the data in. Transformation then lives as code inside the warehouse, which is already a system your team monitors and maintains. You trade infrastructure complexity for SQL complexity, which most data teams consider a favorable exchange.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed of Ingestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; is inherently slower at getting data into the warehouse because every record must pass through the transformation layer first. If your transformation logic is heavy — complex joins, lookups against reference tables, currency conversions — that bottleneck sits between your source and your warehouse. Nothing is queryable until the entire pipeline finishes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; decouples ingestion from transformation. Raw data can land in the warehouse within minutes of being extracted, even if the transformation models don't run until later. This means analysts can at least &lt;em&gt;see&lt;/em&gt; fresh data in the raw layer immediately, even before it's been cleaned. For time-sensitive use cases, that gap matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Schema Changes
&lt;/h3&gt;

&lt;p&gt;Source systems change. An API adds a new field. A column gets renamed. A vendor starts sending nested JSON where they used to send flat CSV. &lt;strong&gt;ETL&lt;/strong&gt; pipelines are brittle in the face of these changes because the transformation layer has explicit expectations about what the incoming data looks like. A new field might break a mapping. A type change might crash a conversion step. You discover the problem when the pipeline fails at 3 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; absorbs schema changes more gracefully. Since you're loading raw data without enforcing a strict schema upfront, a new field just appears in the raw table. It doesn't break anything — it sits there until someone writes a transformation that uses it. This makes ELT pipelines more resilient to the kind of upstream changes that are inevitable in any real data ecosystem.&lt;/p&gt;
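&lt;p&gt;A small sketch of that resilience, again with SQLite as a stand-in warehouse and invented field names (it assumes the JSON1 functions are available, as in recent Python builds):&lt;/p&gt;

```python
import json
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_events (payload TEXT)")

# Day 1: the source sends events with two fields
wh.execute("INSERT INTO raw_events VALUES (?)",
           (json.dumps({"user": "a1", "action": "click"}),))

# Day 2: the vendor adds a field. The raw load is untyped, so nothing breaks.
wh.execute("INSERT INTO raw_events VALUES (?)",
           (json.dumps({"user": "b2", "action": "click", "device": "ios"}),))

# The existing transformation keeps working; it simply ignores the new field
rows = wh.execute(
    "SELECT json_extract(payload, '$.user') FROM raw_events"
).fetchall()
print(rows)  # [('a1',), ('b2',)]

# When someone wants the new field, they change a SQL query, not infrastructure
devices = wh.execute(
    "SELECT json_extract(payload, '$.device') FROM raw_events"
).fetchall()
print(devices)  # [(None,), ('ios',)]
```

&lt;p&gt;An equivalent ETL pipeline would have needed its mapping updated (or would have failed) the moment the new field appeared.&lt;/p&gt;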

&lt;h3&gt;
  
  
  Cost Profile
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; costs are weighted toward the middle of the pipeline. You're paying for transformation infrastructure — servers, tool licenses, compute time — that sits between source and warehouse. The warehouse itself can be smaller because it only stores clean data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; shifts costs into the warehouse. You're storing more data (raw plus transformed) and running more compute inside the warehouse for transformations. With cloud pricing, this is usually cheaper overall because storage is pennies per gigabyte and compute is elastic. But it does mean your warehouse bill is larger, and poorly optimized transformation queries can run up costs if no one is watching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Governance and Compliance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; offers a natural gatekeeping layer. Because data is transformed before loading, you can enforce masking, redaction, and compliance rules before sensitive data ever enters the warehouse. If a regulation says certain fields must never be stored in their raw form, ETL makes that straightforward — you strip or hash those fields in the transformation step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; requires more deliberate governance. Raw data hits the warehouse first, which means sensitive fields — personally identifiable information, financial data, health records — exist in the raw layer in their original form, at least temporarily. You need warehouse-level access controls, column-level security policies, and careful management of who can query the raw schema. It's entirely solvable, but it's a responsibility you have to plan for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team Workflow and Ownership
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; pipelines often require specialized engineers who know a particular tool — Informatica, DataStage, Talend. The transformation logic lives inside that tool's proprietary environment, which can create knowledge silos. If the one person who understands the Informatica mappings leaves, the team inherits a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; pushes transformation logic into SQL, which is the most widely known language in the data world. Analytics engineers, analysts, and data engineers can all read, write, and review transformation code. Combined with dbt's project structure and Git-based workflows, this democratizes pipeline ownership across the team rather than concentrating it in a specialist role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reprocessing and Iteration
&lt;/h3&gt;

&lt;p&gt;When business logic changes — and it always does — &lt;strong&gt;ETL&lt;/strong&gt; forces you to re-extract and re-transform from scratch. The warehouse doesn't have the raw data, so there's no shortcut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; makes reprocessing simple. The raw data is already there. You update your transformation SQL, run it again, and the new logic applies to the full historical dataset. This makes ELT far more forgiving when requirements evolve, when you discover a bug in a calculation, or when a new team member spots an improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Knowing the theoretical differences is useful, but the real question is always: which one do I actually pick for &lt;em&gt;this&lt;/em&gt; project? The answer depends on your data volumes, your regulatory environment, your team's skill set, and what your downstream consumers need. Neither approach is universally superior. Here's where each one genuinely earns its place.&lt;/p&gt;

&lt;h3&gt;
  
  
  When ETL is the Right Call
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Banking and Financial Services Compliance
&lt;/h4&gt;

&lt;p&gt;A mid-size bank ingests transaction data from its core banking system, credit card processor, and wire transfer platform. Regulators require that personally identifiable information — Social Security numbers, account numbers, customer addresses — is masked or tokenized before it enters any analytical system. The bank cannot afford even a brief window where raw PII sits in a queryable warehouse table, because an auditor finding unmasked data in a staging schema is a compliance finding, not a technicality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; fits this scenario naturally. The transformation layer sits between source systems and the warehouse, acting as an enforcement boundary. PII gets hashed or redacted in transit. What lands in the warehouse is already clean from a compliance perspective. Trying to achieve this with ELT is possible, but it requires airtight warehouse access controls, column-level security, and constant auditing of the raw layer — a level of operational overhead that many regulated institutions prefer to avoid entirely.&lt;/p&gt;
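&lt;p&gt;A simplified illustration of that enforcement boundary in Python. The field names are invented, and a production pipeline would use salted hashing or a tokenization service rather than a bare digest:&lt;/p&gt;

```python
import hashlib

def mask_pii(record):
    """Hash identifying fields in transit, before the record is loaded.

    Illustrative only: real masking rules depend on the regulation in play.
    """
    masked = dict(record)
    for field in ("ssn", "account_number"):
        if masked.get(field) is not None:
            digest = hashlib.sha256(masked[field].encode("utf-8")).hexdigest()
            masked[field] = digest[:16]  # irreversible stand-in for the raw value

    return masked

record = {"customer": "c-42", "ssn": "123-45-6789", "account_number": "0099881122"}
safe = mask_pii(record)
print(safe["ssn"] == record["ssn"])  # False: the raw SSN never reaches the warehouse
```

&lt;p&gt;The point is where this function runs: in the transformation layer, so the warehouse only ever sees the masked version.&lt;/p&gt;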

&lt;h4&gt;
  
  
  Legacy System Migration
&lt;/h4&gt;

&lt;p&gt;A manufacturing company runs its ERP on a 15-year-old Oracle database with decades of accumulated data. Column names are cryptic (&lt;code&gt;CUST_TYP_CD_3&lt;/code&gt;), data types are inconsistent, and some tables have undocumented business logic baked into triggers and views. The company is migrating to a modern analytics platform and needs to rationalize this data into a clean dimensional model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; shines here because the transformation step is doing genuinely heavy structural work — not just renaming columns, but fundamentally reshaping data that was never designed for analytics. Dedicated ETL tools offer visual mapping interfaces that make it easier to trace how a cryptic source field maps to a clean target column.&lt;/p&gt;

&lt;h4&gt;
  
  
  Embedded Analytics with Tight SLAs
&lt;/h4&gt;

&lt;p&gt;A SaaS company embeds analytics dashboards inside its product. Customers expect their dashboards to refresh every hour with accurate, fully joined data. The underlying data comes from multiple microservices, each with its own database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; works well here because the transformation is stable, well-defined, and performance-critical. Running this transformation on dedicated infrastructure means the warehouse's compute is reserved for serving dashboard queries to end users, rather than competing with transformation workloads. Predictability matters more than flexibility in this case.&lt;/p&gt;

&lt;h3&gt;
  
  
  When ELT is the Right Call
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Startup Building Its First Data Stack
&lt;/h4&gt;

&lt;p&gt;A Series A startup has a product built on PostgreSQL, uses Stripe for billing, HubSpot for CRM, and Mixpanel for product analytics. The data team is two people. They need to build a pipeline that feeds dashboards, supports ad hoc analysis, and can evolve fast as the business model is still shifting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; is the obvious choice. The team uses a tool like Fivetran to sync raw data from all four sources into Snowflake. No transformation happens during ingestion. Then the analytics engineer writes dbt models to clean, join, and shape the data into marts that power dashboards. When the product team adds a new event, it automatically appears in the raw layer. When the finance team asks for a new metric, the analyst writes a new SQL model. No pipeline reconfiguration, no infrastructure changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Large-Scale Event Data Processing
&lt;/h4&gt;

&lt;p&gt;A media streaming platform generates billions of playback events per day. Each event is a JSON payload containing device information, content metadata, playback quality metrics, and user identifiers. The data engineering team needs to make this data available for recommendation model training, A/B test analysis, and executive reporting — three very different use cases with different transformation requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; handles this well because the raw event data, loaded as-is into the warehouse, serves as a shared foundation. The ML team writes transformations that extract feature vectors. The experimentation team writes transformations that aggregate events by test variant. The reporting team writes transformations that summarize engagement metrics. Each team owns their own transformation layer, all reading from the same raw source.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Exploration and Rapid Prototyping
&lt;/h4&gt;

&lt;p&gt;A retail company's analytics team suspects that weather patterns correlate with regional sales spikes, but they're not sure yet. They want to pull in weather API data alongside their point-of-sale data and experiment with different ways of joining the two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; is ideal for this kind of exploratory work. Load the weather data raw, load the POS data raw, and let the analysts experiment with transformations in SQL. If the hypothesis doesn't pan out, you've wasted some warehouse compute but haven't invested in building a dedicated ETL pipeline. If it does pan out, you promote the experimental models into production. The cost of being wrong is low, and the speed of iteration is high.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Used in Both Approaches
&lt;/h2&gt;

&lt;p&gt;Knowing the difference between ETL and ELT is one thing. Knowing which tools actually implement each approach is what turns theory into something you can use on a Monday morning. Don't worry if you haven't heard of most of these — the goal is to understand what role each tool plays so that when you see these names in a job posting or a team discussion, you know where they fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Tools
&lt;/h3&gt;

&lt;p&gt;These tools follow the traditional pattern: they extract data from sources, transform it on their own infrastructure, and load the cleaned result into a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache NiFi&lt;/strong&gt; is a free, open-source tool originally built by the U.S. National Security Agency and later donated to the Apache Software Foundation. It gives you a visual, drag-and-drop interface for building data pipelines. You connect "processors" — small building blocks that each do one thing, like read from a database or rename columns — into a flow. Think of it like snapping together building blocks where each block is a data operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; is one of the oldest and most established names in the ETL world. It's been around since the 1990s and is still running inside thousands of large enterprises — banks, insurance companies, government agencies. It's powerful but expensive, and comes with a steep learning curve. If you see Informatica on a job listing, it usually signals a large company with legacy infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talend Open Studio&lt;/strong&gt; sits between the open-source world and enterprise tooling. Its free version provides a visual interface for designing ETL jobs, and under the hood it generates Java code that runs the pipeline. This means if you outgrow the visual interface, you can customize the generated code directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt; is Amazon's managed ETL service. "Managed" means you don't have to set up or maintain servers — Amazon handles the infrastructure. You write transformation logic in Python or Spark, and Glue runs it for you. It's tightly integrated with other AWS services, so it's a natural fit if you're already building on AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft SSIS (SQL Server Integration Services)&lt;/strong&gt; is Microsoft's ETL tool, bundled with SQL Server. If your company runs on Microsoft technology, SSIS is often the default choice. It's not the most modern option, but it's reliable and has a massive community of practitioners and tutorials.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Tools
&lt;/h3&gt;

&lt;p&gt;ELT tools split into two categories: tools that handle the &lt;strong&gt;Extract and Load&lt;/strong&gt; part (getting raw data into the warehouse) and tools that handle the &lt;strong&gt;Transform&lt;/strong&gt; part (reshaping data inside the warehouse). This split is important to understand — in the ELT world, these are usually separate tools doing separate jobs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Extract and Load Tools
&lt;/h4&gt;

&lt;p&gt;These tools specialize in one thing: connecting to data sources and replicating the data into your warehouse. They deliberately don't transform anything. Their job is to be reliable pipes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; is probably the most well-known managed EL tool. You sign up, select a data source — say Stripe, or PostgreSQL, or Google Ads — enter your credentials, point it at your warehouse, and it starts syncing. Fivetran maintains hundreds of pre-built connectors and handles all the tricky parts: schema changes, API pagination, rate limiting, incremental updates. The trade-off is cost — Fivetran charges based on data volume, and at scale those bills grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; is the open-source alternative to Fivetran. It does the same thing — connects to sources, replicates data into your warehouse — but you can self-host it on your own servers without per-row fees. The downside is that self-hosting means you're responsible for keeping it running and debugging it. There's also a managed cloud version if you'd rather not deal with that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stitch&lt;/strong&gt;, now owned by Talend, is another managed EL tool similar to Fivetran. It's simpler and often cheaper, making it appealing for smaller teams with straightforward data needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformation Tools
&lt;/h4&gt;

&lt;p&gt;Once raw data is in the warehouse, these tools help you shape it into something useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt (data build tool)&lt;/strong&gt; is the tool that arguably made ELT mainstream. In plain terms: it lets you write SQL files that define how raw data should be transformed, and then runs those SQL files against your warehouse in the right order. Each SQL file is called a "model." Models can reference other models, creating a dependency chain. dbt figures out the order automatically. It also lets you write tests — simple checks like "this column should never be null" — that run automatically and alert you when something breaks. dbt comes in two flavors: dbt Core (free, open-source, command-line) and dbt Cloud (paid, with a web interface and scheduling).&lt;/p&gt;
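&lt;p&gt;A full dbt project is out of scope here, but the core idea — SQL models that build on each other, plus automated checks — can be imitated in a few lines of Python against SQLite. The model names and SQL are invented, and real dbt would infer the run order from ref() calls rather than relying on dictionary order:&lt;/p&gt;

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 500), (2, 1250)])

# Each "model" would be its own SQL file in a dbt project. orders_summary
# depends on stg_orders, so it must run second.
models = {
    "stg_orders": """
        CREATE TABLE stg_orders AS
        SELECT id, amount_cents / 100.0 AS amount_usd FROM raw_orders
    """,
    "orders_summary": """
        CREATE TABLE orders_summary AS
        SELECT COUNT(*) AS n, SUM(amount_usd) AS total FROM stg_orders
    """,
}
for name, sql in models.items():  # dicts preserve insertion order in Python 3.7+
    wh.execute(sql)

# A dbt-style "not_null" test: fail loudly if the invariant is violated
nulls = wh.execute(
    "SELECT COUNT(*) FROM stg_orders WHERE amount_usd IS NULL"
).fetchone()[0]
assert nulls == 0, "not_null test failed on stg_orders.amount_usd"

print(wh.execute("SELECT n, total FROM orders_summary").fetchone())  # (2, 17.5)
```

&lt;p&gt;What dbt adds on top of this sketch is exactly the machinery you don't want to write yourself: dependency resolution, incremental builds, documentation, and test reporting.&lt;/p&gt;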

&lt;p&gt;&lt;strong&gt;Warehouse-Native SQL&lt;/strong&gt; is worth mentioning too. Every modern cloud warehouse — Snowflake, BigQuery, Redshift, Databricks — has powerful SQL engines that can handle complex transformations on their own. For smaller teams or simpler pipelines, writing scheduled SQL queries inside the warehouse can be enough. dbt adds structure on top of this, but the underlying capability is the warehouse itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools That Work in Both Worlds
&lt;/h3&gt;

&lt;p&gt;Some tools don't fit neatly into one camp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; can be used for ETL (transforming before loading) or for ELT (transforming data already in a data lake). It's a distributed processing engine built for massive datasets, and it shows up in both architectures depending on how a team deploys it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is a workflow orchestrator — it doesn't extract, transform, or load anything itself, but it schedules and coordinates the tools that do. Think of it as the conductor of an orchestra. Whether your pipeline is ETL or ELT, you might use Airflow to run the steps in order, retry failures, and send alerts.&lt;/p&gt;
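&lt;p&gt;Airflow's central abstraction is the DAG: a set of tasks plus the dependencies between them. The scheduling guarantee it provides can be sketched with Python's standard-library graphlib — note this is not Airflow's actual API, and the task names are invented:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
# An orchestrator like Airflow would wrap each task in an operator.
dag = {
    "extract_stripe":     set(),
    "extract_postgres":   set(),
    "load_warehouse":     {"extract_stripe", "extract_postgres"},
    "run_dbt_models":     {"load_warehouse"},
    "refresh_dashboards": {"run_dbt_models"},
}

def run(task):
    print("running", task)  # a real orchestrator would also retry and alert here

# static_order yields each task only after all of its dependencies,
# which is the core ordering guarantee an orchestrator provides
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```

&lt;p&gt;Everything else Airflow offers — schedules, retries, backfills, alerting — is built around that ordering guarantee.&lt;/p&gt;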

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;ETL and ELT aren't competing philosophies — they're different tools shaped by different constraints. ETL was built for a world where storage was expensive and governance required data to be clean before it landed anywhere. ELT was built for a world where storage is cheap, compute is elastic, and teams want to iterate on transformation logic without re-extracting data from scratch.&lt;/p&gt;

&lt;p&gt;If you're working with strict compliance requirements, stable schemas, and dedicated data engineering teams, ETL still earns its keep. If you're dealing with fast-changing sources, semi-structured data, small teams, or a cloud-first warehouse, ELT will probably serve you better.&lt;/p&gt;

&lt;p&gt;The best advice for a beginner: don't overthink the choice before you've built anything. Pick the approach that matches your current constraints, build a pipeline that works, and refactor when the pain points become obvious. The concepts transfer between both patterns, and understanding &lt;em&gt;why&lt;/em&gt; each exists matters more than memorizing which acronym is "correct."&lt;/p&gt;

&lt;p&gt;Start building. The data is waiting.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>database</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Understanding SQL Joins and Window Functions</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 02 Mar 2026 16:00:25 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/understanding-sql-joins-and-window-functions-30do</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/understanding-sql-joins-and-window-functions-30do</guid>
      <description>&lt;p&gt;If you've ever stared at a SQL query and wondered what PARTITION BY actually does, this article is for you. I'm going to break down two of the most important SQL concepts &lt;strong&gt;Joins&lt;/strong&gt; and &lt;strong&gt;Window Functions&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 1: SQL JOINS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Join?
&lt;/h3&gt;

&lt;p&gt;A Join is how you combine data from two or more tables. Think of it as connecting two spreadsheets through a shared column, such as a customer ID that appears in both a Customers table and an Orders table.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tables We'll Use
&lt;/h3&gt;

&lt;p&gt;Customers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Orders:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that Carol (id = 3) has no orders, and all orders belong to either Alice or Bob. Keep that in mind: it's going to matter a lot when we look at how different JOINs behave.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Left and Right Tables
&lt;/h3&gt;

&lt;p&gt;Before we look at any specific JOIN type, there's one concept you need to lock in first. Every JOIN has a Left table and a Right table, and the difference matters.&lt;/p&gt;

&lt;p&gt;The table written after FROM is the Left table&lt;br&gt;
The table written after JOIN is the Right table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;          &lt;span class="c1"&gt;-- LEFT table&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;        &lt;span class="c1"&gt;-- RIGHT table&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just terminology. SQL uses this position to decide which table's rows get priority when there's no matching data on the other side.&lt;/p&gt;

&lt;h3&gt;
  
  
  INNER JOIN
&lt;/h3&gt;

&lt;p&gt;An INNER JOIN returns rows where there is a match in BOTH tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol is gone. She has no orders, so she doesn't appear. &lt;strong&gt;INNER JOIN&lt;/strong&gt; only shows you where both sides match.&lt;/p&gt;

&lt;h3&gt;
  
  
  LEFT JOIN
&lt;/h3&gt;

&lt;p&gt;A LEFT JOIN returns ALL rows from the left table, and matches from the right. Where there's no match, you get NULL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol is back. She has no orders, so her amount is NULL. Use LEFT JOIN when you don't want to lose rows from your main table.&lt;/p&gt;

&lt;h3&gt;
  
  
  RIGHT JOIN
&lt;/h3&gt;

&lt;p&gt;The opposite of LEFT JOIN. All rows from the right table are kept, and matches come from the left. Less common, but useful in certain situations.&lt;/p&gt;
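&lt;p&gt;Using our two tables (names match the earlier examples), a RIGHT JOIN keeps every order and pulls in the matching customer. It's also worth knowing that any RIGHT JOIN can be rewritten as a LEFT JOIN with the table order swapped, which is why many teams simply standardize on LEFT JOIN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customers.name, orders.amount
FROM customers
RIGHT JOIN orders ON customers.id = orders.customer_id;

-- Equivalent LEFT JOIN, with the tables swapped:
SELECT customers.name, orders.amount
FROM orders
LEFT JOIN customers ON customers.id = orders.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With our sample data, every order has a customer, so both queries return the same three rows as the INNER JOIN.&lt;/p&gt;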

&lt;h3&gt;
  
  
  FULL OUTER JOIN
&lt;/h3&gt;

&lt;p&gt;Returns all rows from both tables, with NULLs where there's no match on either side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;OUTER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this when you want to see unmatched records from both sides.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Simple Way to Remember Joins
&lt;/h3&gt;

&lt;p&gt;INNER  = Only the overlap between the two tables&lt;br&gt;
LEFT   = All of left + whatever matches on the right&lt;br&gt;
RIGHT  = All of right + whatever matches on the left&lt;br&gt;
FULL   = Everything from both, matched or not&lt;/p&gt;


&lt;h3&gt;
  
  
  Key Takeaways for Joins
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Joins combine tables using a shared column&lt;/li&gt;
&lt;li&gt;INNER JOIN is the most common — only matched rows&lt;/li&gt;
&lt;li&gt;LEFT JOIN is your friend when you don't want to lose rows from your primary table&lt;/li&gt;
&lt;li&gt;Always think: "What happens to rows with no match?" That question determines which join to use&lt;/li&gt;
&lt;/ol&gt;
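&lt;p&gt;That last question points to one of the most useful join patterns in practice: finding rows with &lt;em&gt;no&lt;/em&gt; match. For example, using our earlier tables, you can list customers who have never placed an order by LEFT JOINing and keeping only the rows where the right side came back NULL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customers.name
FROM customers
LEFT JOIN orders ON customers.id = orders.customer_id
WHERE orders.id IS NULL;   -- returns only Carol
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;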
&lt;h2&gt;
  
  
  PART 2: WINDOW FUNCTIONS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is a Window Function?
&lt;/h3&gt;

&lt;p&gt;A window function performs a calculation across a set of rows related to the current row, without collapsing them the way GROUP BY does.&lt;/p&gt;

&lt;p&gt;This is the key difference:&lt;/p&gt;

&lt;p&gt;GROUP BY: Collapses multiple rows into one result per group&lt;br&gt;
Window Function: Keeps all rows, and adds a calculated column to each one&lt;/p&gt;

&lt;p&gt;Think of it like this: for each row, you "open a window" into the data, run a calculation across that window, and write the result back into that row, without losing any rows.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Syntax
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;OVER is what makes it a window function. Without OVER, it's just a regular function.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Table We'll Use
&lt;/h3&gt;

&lt;p&gt;Sales:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  ROW_NUMBER() — Rank Rows Within a Group
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What happened here?&lt;/strong&gt;&lt;br&gt;
PARTITION BY dept splits the data into two groups: IT and HR.&lt;br&gt;
ORDER BY sales DESC sorts each group from highest to lowest sales.&lt;br&gt;
ROW_NUMBER() assigns 1, 2, 3... to each row within that group.&lt;/p&gt;

&lt;p&gt;Both Alice and Dave get rank 1 — because the ranking restarted for each department. That's the power of PARTITION BY.&lt;/p&gt;
&lt;h3&gt;
  
  
  RANK() vs DENSE_RANK() vs ROW_NUMBER()
&lt;/h3&gt;

&lt;p&gt;These three are easy to confuse. Here's the difference when two people tie (for this example, assume Carol's sales are 500 instead of 400, so she ties with Alice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dense_rnk&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;row_num&lt;/th&gt;
&lt;th&gt;rnk&lt;/th&gt;
&lt;th&gt;dense_rnk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ROW_NUMBER: Always unique. No ties. Just counts 1, 2, 3, 4.&lt;br&gt;
RANK: Ties get the same number, then skips. Alice and Carol both get 2, next is 4.&lt;br&gt;
DENSE_RANK: Ties get the same number, but does NOT skip. Next after 2 is 3.&lt;/p&gt;
&lt;h3&gt;
  
  
  SUM() OVER()
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;dept_total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;IT&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every row still exists. But now each row also shows the total for their department. With GROUP BY, you'd lose the individual rows. With SUM() OVER(), you keep them all.&lt;/p&gt;


&lt;h3&gt;
  
  
  LAG() and LEAD()
&lt;/h3&gt;

&lt;p&gt;LAG() lets you look at the value from the previous row. LEAD() looks ahead to the next row. Perfect for comparing month-over-month changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;previous_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;employee&lt;/th&gt;
&lt;th&gt;sales&lt;/th&gt;
&lt;th&gt;previous_sales&lt;/th&gt;
&lt;th&gt;next_sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first row has no previous row, so it's NULL. The last row has no next row, so that's NULL too.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROWS BETWEEN
&lt;/h3&gt;

&lt;p&gt;This is more advanced, but incredibly useful for moving averages. ROWS BETWEEN controls the size of your window: how many rows before and after the current row are included in the calculation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
    &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;moving_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This calculates the average of the current row and the one before it. Change the numbers to control how many rows back or forward your window looks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways for Window Functions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Window functions do NOT collapse rows — this is the biggest difference from GROUP BY&lt;/li&gt;
&lt;li&gt;OVER() is what makes it a window function&lt;/li&gt;
&lt;li&gt;PARTITION BY is like GROUP BY but keeps all rows&lt;/li&gt;
&lt;li&gt;ORDER BY inside OVER() controls the order within the window&lt;/li&gt;
&lt;/ol&gt;
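&lt;p&gt;One common follow-up worth knowing: you can't filter on a window function directly in WHERE, because window functions are evaluated after WHERE. Wrap the query in a subquery (or CTE) first. For example, to get the top seller in each department from our sales table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT employee, dept, sales
FROM (
  SELECT employee, dept, sales,
    ROW_NUMBER() OVER (PARTITION BY dept ORDER BY sales DESC) AS rn
  FROM sales
) ranked
WHERE rn = 1;   -- Alice for IT, Dave for HR
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;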

</description>
      <category>beginners</category>
      <category>database</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Turn Messy Data, DAX Headaches, and Ugly Dashboards into Decisions Using Power BI</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 09 Feb 2026 15:20:17 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/how-to-turn-messy-data-dax-headaches-and-ugly-dashboards-into-decisions-using-power-bi-1fgd</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/how-to-turn-messy-data-dax-headaches-and-ugly-dashboards-into-decisions-using-power-bi-1fgd</guid>
      <description>&lt;p&gt;Let’s be honest, no dataset has ever arrived on an analyst's desk completely clean. Not once. In the real world, data usually shows up in a chaotic state. Dates saved as text, revenue strings mixed with currency symbols, twelve different spellings for the same county, blanks that should be zeros, and zeros that are actually missing values.&lt;/p&gt;

&lt;p&gt;Inevitably, this is followed by a stakeholder asking, "Can you build a dashboard by Friday?" The answer is usually "sure," but the gap between receiving that raw data and delivering a "wow" dashboard is where the actual work happens. It is a journey that moves from invisible data engineering to strategic business storytelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Invisible Foundation: Power Query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first reality of analytics is that building the dashboard is often the easiest part; dragging charts onto a canvas takes minutes. The real labor lies in ensuring those charts tell the truth, and that work begins in Power Query. This is the "cleaning room" where we fix data types (because revenue should never be text), standardize categories, remove duplicates, and handle null values properly. It is also where we create derived fields, such as "Age Group" or "Price Band," to make analysis easier later on.&lt;/p&gt;

&lt;p&gt;If you skip this step, you will inevitably try to fix data quality issues using DAX measures. This is a mistake. DAX is a calculation engine, not a cleanup tool, and it will punish you with slow performance and overly complex formulas if your data isn't prepared correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Backbone: Data Modeling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most beginners make the mistake of dumping all their data into one massive, flat table. While this might work for simple spreadsheets, Power BI’s engine is optimized for a Star Schema. This means separating your data into a Fact table (containing transactions, visits, or sales numbers) and Dimension tables (containing descriptive context like dates, counties, products, or departments).&lt;/p&gt;

&lt;p&gt;When your relationships are modeled correctly in a star schema, filters flow logically, totals don't double-count, and performance improves significantly. A bad model forces you to write complex DAX to work around the structure; a good model allows for elegant, simple DAX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Logic: Context Over Formulas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the model is solid, we move to DAX. On the surface, it looks simple—a formula like &lt;code&gt;Total Revenue = SUM(Sales[Revenue])&lt;/code&gt; seems straightforward. However, the real power of DAX is context. That single measure will return different results based on slicers, filters, relationships, and the visual it is placed in.&lt;/p&gt;

&lt;p&gt;For example, a measure like &lt;code&gt;Revenue per Visit = DIVIDE([Total Revenue], [Total Visits])&lt;/code&gt; does more than just report a number; it measures performance. Understanding "filter context" (how the user's interaction with the report changes the calculation on the fly) is the moment DAX stops being frustrating and starts making sense.&lt;/p&gt;
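&lt;p&gt;As a small DAX illustration (the &lt;code&gt;Sales[Status]&lt;/code&gt; column is hypothetical, not from any specific dataset), CALCULATE is the function that modifies filter context explicitly, letting you re-scope a base measure without rewriting it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;Total Revenue = SUM(Sales[Revenue])

-- Same measure, but CALCULATE overrides part of the filter context
Completed Revenue = CALCULATE([Total Revenue], Sales[Status] = "Completed")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;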

&lt;p&gt;&lt;strong&gt;The Art of Visualization: How to Choose the Right Chart&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The visual layer is where your data meets the user’s eye. The wrong visual can obscure the truth, while the right one illuminates it. To choose the best visual, you must first identify the question you are trying to answer.&lt;/p&gt;

&lt;p&gt;Here is a framework for selecting the right visual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Comparison&lt;/strong&gt;&lt;br&gt;
If you want to compare values across categories (e.g., Sales by Department), use a Bar Chart. If you are comparing values over time (e.g., Sales by Month), use a Line Chart or Area Chart to show the trend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Correlation&lt;/strong&gt;&lt;br&gt;
If you need to see whether two variables are related (for instance, "Do higher patient visits always mean higher revenue?"), use a Scatter Plot. With "Total Visits" on the X-axis and "Total Revenue" on the Y-axis, you can instantly see positive correlations, outliers, or underperforming counties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Composition&lt;/strong&gt;&lt;br&gt;
If you need to show how parts make up a whole (e.g., Market Share), use a &lt;strong&gt;Donut Chart&lt;/strong&gt; or &lt;strong&gt;Treemap&lt;/strong&gt;. Use these sparingly; too many slices make them unreadable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. KPIs&lt;/strong&gt;&lt;br&gt;
If you just need to show a single, critical number (e.g., Total Year-to-Date Revenue), use a Card or a KPI Visual that shows the number alongside a trend indicator.&lt;/p&gt;

&lt;p&gt;Dashboards are storytelling tools. Each visual should answer exactly one question. If a chart requires a paragraph of explanation, it is likely the wrong chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Insight to Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ultimate goal of this entire process is not to build a dashboard that looks "clean," but to drive action. We transform messy data into business strategy by highlighting anomalies and trends that require intervention.&lt;/p&gt;

&lt;p&gt;For example, seeing high visits but low revenue might indicate a pricing issue. Noticing high medication usage in a specific age group drives inventory planning. A declining trend over time signals operational risk. A good analyst reports the numbers; a great analyst explains what they mean and what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Power BI is not just a dashboard tool; it is a thinking tool. The workflow—moving from Messy Data to Clean Transformations, building a Strong Model, writing Smart DAX, designing Clear Visuals, and finally driving Business Action—is the true craft of analytics.&lt;/p&gt;

&lt;p&gt;It isn’t about knowing every DAX function by heart. It’s about knowing when to clean, how to simplify logic, and how to explain insights to stakeholders. That is what makes a good analyst, and honestly, that is what makes analytics fun.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>microsoft</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Schemas and Data Modelling in Power BI</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Mon, 02 Feb 2026 12:34:24 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/schemas-and-data-modelling-in-power-bi-34k5</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/schemas-and-data-modelling-in-power-bi-34k5</guid>
      <description>&lt;p&gt;Everyone loves building flashy visuals and writing clever DAX measures in Power BI, and those things definitely matter. But the truth is, the single biggest factor in whether your report performs well, scales nicely, and gives trustworthy results is data modelling.&lt;/p&gt;

&lt;p&gt;Get the model right, and everything else becomes easier: calculations run faster, measures are simpler to write, filters behave as expected, and totals add up correctly. Get it wrong, and you’ll spend endless hours fighting performance issues, chasing phantom duplicates, or explaining why the numbers don’t match what the business expects.&lt;/p&gt;

&lt;p&gt;Data modelling is basically about answering three core questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where do the numbers live? (sales amounts, quantities, revenue)&lt;/li&gt;
&lt;li&gt;Where do the descriptive attributes live? (product names, customer details, dates, etc.)&lt;/li&gt;
&lt;li&gt;How do all these tables connect?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Fact Tables vs. Dimension Tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt;&lt;br&gt;
They hold the measurable events. These are the things that happened in the business. Think sales transactions, orders, inventory movements, clicks, or support tickets. They’re usually full of numeric values (facts) and foreign keys that reference other tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt;&lt;br&gt;
They provide the "who, what, where, when, why" context. They contain descriptive fields and are typically much smaller than fact tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Star Schema&lt;/strong&gt;&lt;br&gt;
In a star schema, you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One central fact table&lt;/li&gt;
&lt;li&gt;Several dimension tables connected directly to it (usually via single-direction relationships filtering from dimension to fact)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence the name star.&lt;br&gt;
The star schema is the gold standard for Power BI (and most modern BI tools): it’s simple, fast, and plays perfectly to how Power BI’s engine (VertiPaq) works.&lt;/p&gt;
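&lt;p&gt;The shape is easy to see in miniature. Here is a hypothetical pandas sketch (all table and column names are invented for illustration): one fact table holds the numbers and the foreign keys, and each dimension is a single join away.&lt;/p&gt;

```python
import pandas as pd

# Central fact table: one row per sale, with numeric facts and foreign keys
sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id":    [10, 11, 10],
    "amount":     [100.0, 150.0, 80.0],
})

# Dimension tables: descriptive context, exactly one row per key
products = pd.DataFrame({"product_id": [1, 2], "product": ["Laptop", "Mouse"]})
dates = pd.DataFrame({"date_id": [10, 11], "month": ["Jan", "Feb"]})

# Star schema: every dimension is one join away from the fact table
enriched = sales.merge(products, on="product_id").merge(dates, on="date_id")
revenue_by_product = enriched.groupby("product")["amount"].sum()
print(revenue_by_product)  # Laptop 250.0, Mouse 80.0
```

&lt;p&gt;In Power BI you model the same thing with relationships instead of explicit merges, but the shape — one hop from the fact table to any dimension — is identical.&lt;/p&gt;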

&lt;p&gt;&lt;strong&gt;The Snowflake Schema&lt;/strong&gt;&lt;br&gt;
A snowflake schema is basically a normalized star schema. Instead of keeping everything in one dimension table, you split dimensions into multiple related tables (e.g., Product → Category → Subcategory → Department).&lt;/p&gt;
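&lt;p&gt;The cost of snowflaking shows up as extra joins. A hypothetical pandas sketch (names invented) of a Product → Subcategory → Category chain, next to the flat dimension a star schema would use instead:&lt;/p&gt;

```python
import pandas as pd

# Snowflaked product dimension: normalized into three related tables
products = pd.DataFrame({
    "product_id": [1, 2],
    "product": ["Laptop", "Mouse"],
    "subcategory_id": [100, 101],
})
subcategories = pd.DataFrame({
    "subcategory_id": [100, 101],
    "subcategory": ["Computers", "Accessories"],
    "category_id": [5, 5],
})
categories = pd.DataFrame({"category_id": [5], "category": ["Electronics"]})

# Getting the category next to the product now takes two extra joins...
chain = (products
         .merge(subcategories, on="subcategory_id")
         .merge(categories, on="category_id"))

# ...whereas a star schema denormalizes it into one flat dimension table
flat_dim = chain[["product_id", "product", "subcategory", "category"]]
print(flat_dim)
```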

&lt;p&gt;&lt;strong&gt;Comparison Between Star and Snowflake Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple and intuitive&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query performance&lt;/td&gt;
&lt;td&gt;Faster (fewer joins)&lt;/td&gt;
&lt;td&gt;Slightly slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAX development&lt;/td&gt;
&lt;td&gt;Much easier&lt;/td&gt;
&lt;td&gt;More complicated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bottom line: use a star schema in Power BI unless you have a massive warehouse where the storage savings from normalization outweigh the cost of the extra joins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Good Data Modelling Actually Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good model is not a “nice to have”; it affects every aspect of your reporting experience. Performance comes first: Power BI’s column-store engine (VertiPaq) is highly optimized for star schema models. With fewer tables to join and simpler relationships, queries run much faster, especially as users begin to slice and dice the data. A flat table or a snowflaked mess will slow things down considerably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy is a must&lt;/strong&gt;. When relationships are defined correctly (with the right cardinality and cross-filtering), filters work as expected, totals aren’t double-counted, and grand totals match what’s expected at lower levels. Get this wrong, and you’ll get inconsistent or wrong numbers — the quickest way to undermine trust in your reports.&lt;/p&gt;
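&lt;p&gt;The double-counting failure mode is easy to reproduce. In this invented pandas sketch, a duplicate key in a dimension — the equivalent of getting cardinality wrong in the model — silently inflates the fact total after a join:&lt;/p&gt;

```python
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 2], "amount": [100.0, 50.0]})

# Dimension accidentally has a duplicate key (should be one row per customer)
customers = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "region": ["East", "East", "West"],
})

# Customer 1's sale matches two dimension rows, so its amount appears twice
joined = sales.merge(customers, on="customer_id")

print(sales["amount"].sum())   # 150.0 — the true total
print(joined["amount"].sum())  # 250.0 — inflated by the duplicate key
```

&lt;p&gt;Power BI flags this when a one-to-many relationship fails validation; a model that quietly allows it produces exactly this kind of wrong grand total.&lt;/p&gt;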

&lt;p&gt;&lt;strong&gt;Simpler DAX formulas are a massive benefit&lt;/strong&gt;. In a properly modeled star schema, most calculations can be simple aggregations (SUM, COUNT, AVERAGE) with CALCULATE filters as needed. No more complex workarounds, heavy use of iterators, or “treat the table like an Excel sheet.” Debugging is simpler too — usually, problems are visible in seconds, not hours.&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;maintainability and scalability&lt;/strong&gt;. A model that’s clean and logical is easy for others (or for you, later) to understand. It scales well as you add more data or subject areas, and you can often reuse the same model in multiple reports or even share it in a Power BI dataset for the whole team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short&lt;/strong&gt;: Take the time to model your data properly. The difference between a report that feels snappy and reliable and one that’s always fighting you is huge. The visuals and DAX formulas are great, but they’re only as good as the foundation they’re built on.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>data</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Git for Beginners: How to Push &amp; Pull Code, Track Changes, and Understand Version Control</title>
      <dc:creator>jim kinyua</dc:creator>
      <pubDate>Sun, 18 Jan 2026 07:38:53 +0000</pubDate>
      <link>https://forem.com/jim_kinyua_3f7d191b865bed/git-for-beginners-how-to-push-pull-code-track-changes-and-understand-version-control-4443</link>
      <guid>https://forem.com/jim_kinyua_3f7d191b865bed/git-for-beginners-how-to-push-pull-code-track-changes-and-understand-version-control-4443</guid>
      <description>&lt;h1&gt;
  
  
  Git Basics for Beginners: Push, Pull, and Track Code Changes
&lt;/h1&gt;

&lt;p&gt;Git is a version control tool that helps developers track changes in their code and collaborate safely. Instead of creating multiple versions of the same file, Git records what changed, when it changed, and who made the change. This makes it easy to work in teams and recover from mistakes.&lt;/p&gt;

&lt;p&gt;Git works in a project folder called a &lt;strong&gt;repository&lt;/strong&gt;. You work locally on your computer, then sync your changes to a remote repository such as GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Git Repository
&lt;/h2&gt;

&lt;p&gt;Start tracking a new project:&lt;br&gt;
git init&lt;/p&gt;

&lt;p&gt;Copy an existing project:&lt;br&gt;
git clone &lt;a href="https://github.com/username/project.git" rel="noopener noreferrer"&gt;https://github.com/username/project.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Changes
&lt;/h2&gt;

&lt;p&gt;Check the current state of your project:&lt;br&gt;
git status&lt;/p&gt;

&lt;p&gt;Stage a specific file:&lt;br&gt;
git add filename.js&lt;/p&gt;

&lt;p&gt;Stage all changes:&lt;br&gt;
git add .&lt;/p&gt;

&lt;p&gt;Save a checkpoint:&lt;br&gt;
git commit -m "Add login validation"&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing and Pulling Code
&lt;/h2&gt;

&lt;p&gt;Send your commits to the remote repository:&lt;br&gt;
git push&lt;/p&gt;

&lt;p&gt;First-time push:&lt;br&gt;
git push -u origin main&lt;/p&gt;

&lt;p&gt;Get the latest updates from the remote repository:&lt;br&gt;
git pull&lt;/p&gt;

&lt;p&gt;Always pull before starting work to avoid conflicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Typical Daily Git Workflow
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;git pull
# write code
git status
git add .
git commit -m "Fix payment validation bug"
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  Viewing History and Changes
&lt;/h2&gt;

&lt;p&gt;View commit history:&lt;br&gt;
git log&lt;/p&gt;

&lt;p&gt;See what changed in files:&lt;br&gt;
git diff&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Beginner Mistakes
&lt;/h2&gt;

&lt;p&gt;Common beginner mistakes include forgetting to pull before working, writing unclear commit messages, and committing broken code. Don’t be afraid of making mistakes, though: Git is designed so you can recover from errors safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Git is simply about tracking changes, saving checkpoints, and syncing work with others. Learn these core commands first: git status, git add, git commit, git push, and git pull. Everything else builds on them.&lt;/p&gt;

</description>
      <category>webdev</category>
    </item>
  </channel>
</rss>
