Forem: Lawrence Murithi

A Beginner’s Guide to Apache Kafka: The Engine of Real-Time Data

Lawrence Murithi — Thu, 14 May 2026 07:34:43 +0000

Introduction

Imagine you are running a massive online store. Every second, hundreds of users are clicking items, adding them to carts and making purchases. Your inventory system needs to know about the purchases, your recommendation engine needs to know about the clicks and your security system needs to monitor for fraud.
If you connect every single system directly to each other, you get a tangled, unmanageable mess.
This is the exact problem Apache Kafka was built to solve. Instead of systems talking directly to each other, they all send their data to a central hub (Kafka) and any system that needs that data simply reads it from the hub. This creates a completely decoupled architecture; the system sending the data doesn't need to know anything about the systems receiving it.
This article delves into everything you need to know to understand Apache Kafka, from its history to running your first commands.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform.
Let's break that down.
• Event - This is a record of something that happened (e.g. User A clicked button B at 12:00 PM). These events, also known as messages or records, are the fundamental immutable data structures consisting of a key, value, timestamp and headers that are continuously transmitted.
• Streaming - The data flows continuously in real-time, rather than waiting to be processed in daily batches. Kafka allows you to publish (write) and subscribe to (read) streams of events, store them indefinitely, and process them as they occur.
• Distributed - It doesn't just run on one computer. It runs across many computers working together, which makes it fast and lets it keep running when an individual machine fails.
Think of Kafka as a massive, high-speed, highly organized post office. Senders drop off packages (data) and the post office holds onto them until the receivers come to pick them up. This complete journey of data, from generation and publishing to storage, consumption, and eventual deletion, represents the Kafka lifecycle.

Kafka was originally created at LinkedIn by software engineers Jay Kreps, Neha Narkhede and Jun Rao, and was open-sourced in 2011.
LinkedIn was generating billions of data points daily (profile views, messages, connections), and their existing databases and message queues couldn't keep up. They needed a system that could handle these massive amounts of data in real-time without slowing down.
It was named after the author Franz Kafka because he was a writer and the software is a system optimized for writing data. LinkedIn later donated it to the Apache Software Foundation, where it remains free and open-source.

Core Characteristics of Kafka

Thousands of companies, such as Netflix, Uber and Airbnb, use Kafka for the following reasons.
1. High Throughput - Kafka can handle millions of messages per second.
2. Scalable - Kafka expands seamlessly without any downtime. If you need more power, you just add another computer (node) to the Kafka system.
3. Permanent (Durable) - Unlike traditional message queues that delete a message once it is read, Kafka writes data to a hard drive and keeps it for a set amount of time (days, weeks, or forever).
4. Fault-Tolerant - Kafka keeps copies (replicas) of your data on different computers. If one computer crashes or catches fire, the data is still safe on another, and the system automatically switches to the backup without missing a beat.

When to Use Kafka

• Real-time tracking - Tracking website activity (page views, clicks) as it happens.
• Log aggregation - Collecting logs from hundreds of different servers into one central place for monitoring, debugging, or auditing.
• Location tracking - Apps like Uber use Kafka to process the real-time GPS locations of drivers and riders.
• Stream processing - Transforming data on the fly such as using the Kafka Streams API to convert currencies in real-time as transactions happen.
• Event sourcing - Storing state-changing events. Instead of overwriting existing data to save the current state, you permanently record every individual action that led to that state in an append-only log. For example, rather than updating a shopping cart to show only its final contents (1 Hat), the system records the exact history: Added Shirt, Added Hat, Removed Shirt.
• Data Integration - Using Kafka Connect to continuously pull data from an old database and push it into a new cloud warehouse. Kafka Connect is a framework for streaming data between Kafka and external systems scalably and reliably, without custom code.
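The event sourcing idea from the list above can be sketched in a few lines of Python. This is only an illustration: the event tuples and the replay function are hypothetical names, not part of any Kafka API. The current cart is never stored directly; it is rebuilt by replaying the recorded history.

```python
from collections import Counter

# Hypothetical append-only history for one shopping cart, in arrival order.
events = [
    ("added", "Shirt"),
    ("added", "Hat"),
    ("removed", "Shirt"),
]


def replay(history):
    """Rebuild the current cart state by replaying every event in order."""
    cart = Counter()
    for action, item in history:
        if action == "added":
            cart[item] += 1
        elif action == "removed":
            cart[item] -= 1
    # Keep only items that are still in the cart.
    return {item: qty for item, qty in cart.items() if qty > 0}


print(replay(events))  # {'Hat': 1}
```

Because the history is kept, you can also replay only part of it to see what the cart looked like at an earlier point.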

When NOT to use Kafka

• You just need a standard database to search for specific records (use SQL or NoSQL). Kafka is designed for sequential reading, not searching for a specific item.
• You only have a small amount of data (Kafka is complex; setting it up for low traffic is overkill).
• You need simple task routing. A traditional message queue is usually a lighter fit for handing individual jobs to workers.

Key Kafka Concepts and Rules

To understand Kafka, you need to know its vocabulary and the strict rules that govern how data is managed.
• Event (Message/Record) - This is the actual piece of data: an immutable record of something that happened. An event consists of a Key, a Value, a Timestamp, and optional metadata headers. The Value is the core payload containing the business data. The Key is an optional identifier used for routing the event to a specific partition, which makes it important for organizing data. Before an event is sent over the network, it is translated into a binary format (bytes) in a process called serialization.
Serialization converts readable data objects into bytes for efficient network transmission and storage. Deserialization is the reverse process, used by consumers to convert those bytes back into readable data.
• Producer - The application that sends data into Kafka (e.g. the website frontend sending click data). Producers send/write (publish) messages to topics and decide which partition the data should be sent to.
• Consumer - The application that reads data from Kafka (e.g. the analytics dashboard). Consumers receive/read (subscribe) messages from topics.
• Topic - A named stream of events: a logical category or channel where related events are continuously published and stored.
If you send user clicks to Kafka, you would send them to a topic named user_clicks. Consumers read from specific topics. Unlike traditional queues, topics have a retention policy. You configure a topic to keep data for 24 hours, 7 days, or until the disk reaches a certain size. Once the limit is hit, the oldest data is automatically deleted.
• Partition - This is the secret to Kafka's speed. A single topic is split into multiple parts called Partitions. They split data across multiple brokers/servers, enabling multiple consumers to read data in parallel. This is the mechanism that allows Kafka to scale and process massive amounts of data concurrently.
Imagine a grocery store with only one checkout lane (one partition): a line forms. If you open 10 checkout lanes (10 partitions), 10 times as many people can check out at once.
- Rule 1 - Order is only guaranteed within a single partition. If you send messages to Partition 0 and Partition 1, you cannot guarantee which one gets read first. But messages inside each individual partition are read in the exact order they arrive.
- Rule 2 - Keys determine the partition. If a Producer sends an event without a key, Kafka spreads events across the partitions for load balancing (older clients used round-robin; newer clients fill a batch for one partition before moving to the next). If an event has a key (like customer_123), Kafka hashes the key so that every event with that same key goes to the same partition, as long as the number of partitions does not change. This guarantees all purchases by customer_123 are processed in the correct order.
• Offset - Inside a partition, every single message is assigned a unique, sequential ID number called an Offset (e.g., 0, 1, 2, 3...). These offsets act as unique, ever-increasing integers used to accurately maintain reading positions. Offsets only go up and are never reused. Consumers use offsets to keep a bookmark of their specific reading position so they can resume reading safely after a crash or restart.
• Consumer Groups - A team of consumers working together to read a topic.
- The Golden Rule - A single partition can only be read by one consumer within the same group. If a topic has 4 partitions, and your group has 4 consumers, each gets exactly one partition. If you have 5 consumers in the group, the 5th one sits idle. This is how Kafka scales reading without two consumers in the same group working on the same partition.
• Broker/Server - A single Kafka node. This individual node is responsible for receiving messages from producers, storing them on disk, and serving them to consumers upon request.
• Cluster - A group of Brokers (servers) working together. Brokers are linked together to operate as a single distributed system, providing fault tolerance, high availability, and scale. Within a cluster, data is duplicated across brokers according to a Replication Factor: the number of copies of each partition that are maintained on different brokers. If your Replication Factor is 3, three different brokers have a copy of the data.
For each partition, one broker is assigned as the Leader (primary broker). By default it handles all read and write requests for that partition, which keeps the data consistent.
The other brokers become Followers (backup brokers). They passively replicate the data from the Leader, and followers that are fully caught up are counted among the In-Sync Replicas (ISR), so one of them can take over quickly if the Leader fails.
• KRaft (Kafka Raft) - The internal manager of Kafka. KRaft is the built-in consensus protocol that manages broker states, leader elections, and metadata within Kafka. It keeps track of which brokers are online, which broker holds which partition, and handles the recovery if a broker crashes.
Historically, ZooKeeper served as an external coordination service that acted as the cluster manager. Kafka 4.0 removed ZooKeeper entirely in favor of KRaft, which is built directly into Kafka and makes the cluster's state faster and easier to manage.
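To make the key, partition and offset rules above concrete, here is a minimal Python sketch. It is not Kafka's implementation: zlib.crc32 is a stand-in hash (Kafka's Java client uses murmur2), and partition_for, send and the partition count of 3 are assumptions chosen for illustration.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this illustration

# Each partition is an append-only list; a message's offset is its index.
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition: hash the key bytes, then take the remainder."""
    # crc32 is a stand-in hash; the real Java client uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send(key, value):
    """Append the event to its partition and return (partition, offset)."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


for key, value in [
    ("customer_123", "cart_created"),
    ("customer_456", "cart_created"),
    ("customer_123", "item_added"),
    ("customer_123", "order_paid"),
]:
    p, offset = send(key, value)
    print(f"{key}: {value} -> partition {p}, offset {offset}")
```

Every customer_123 event lands in the same partition with increasing offsets, which is what preserves per-customer ordering. Note that the remainder depends on the partition count, so changing the number of partitions changes where a key lands.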

How Kafka Works

Here is a detailed flow of how data moves through Kafka, showing Partitions, Offsets, and Consumer Groups:

NB: Two different Consumer Groups can read the exact same messages without interfering with each other. Because Kafka stores the data on disk, Consumer Group 2 (Receipt System) can read the message hours after Consumer Group 1 (Inventory System) read it, simply by starting at an older Offset.
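The note above, together with the Golden Rule for consumer groups, can be sketched as follows. This is a simplified model with hypothetical names (log, committed, poll, assign); real Kafka stores committed offsets on the brokers and supports several assignment strategies.

```python
# One partition's log: the list index is the offset.
log = ["order-1", "order-2", "order-3"]

# Each consumer group keeps its own committed offset (the next offset to read).
committed = {"inventory": 0, "receipts": 0}


def poll(group, max_records=10):
    """Return unread records for a group and advance only that group's offset."""
    start = committed[group]
    records = log[start:start + max_records]
    committed[group] = start + len(records)
    return records


def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


print(poll("inventory"))    # ['order-1', 'order-2', 'order-3']
print(poll("receipts", 2))  # ['order-1', 'order-2']
print(committed)            # {'inventory': 3, 'receipts': 2}

# 4 partitions shared by 5 consumers: the fifth consumer gets nothing.
print(assign([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5"]))
```

Reading never removes anything from the log, so the receipts group can catch up later from its own offset without affecting the inventory group.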

Kafka architecture

To understand Apache Kafka’s architecture, we need to examine its internal design and core components.
Kafka architecture explains how Kafka does what it does. It is designed to do three things well: never lose data, handle millions of messages a second, and scale up without turning the system off.
To understand how it works, we can break Kafka’s architecture into four main areas.
1. The Network Architecture (The Physical Components)
Kafka is a distributed system, meaning it is not just one big computer. It is a network of smaller computers working together as a single unit, comprising the Kafka Cluster, Brokers, the KRaft Controller (the manager), Producers and Consumers.

2. The Data Architecture/Storage (Logical Components)
Kafka does not store data in tables like a standard database. It stores data using a concept called an Append-Only Commit Log.
Imagine a physical logbook. When a new message arrives, Kafka writes it at the very bottom of the page. You cannot erase, edit, or insert a message in the middle. You can only append (add) to the end.
This architectural choice is a big part of Kafka's speed. Because it never wastes time searching for a record to update or delete, writing to Kafka is incredibly fast.

How Topics and Partitions fit into the Log

A Topic is just a logical name for a group of these logbooks.
A Partition is the actual physical logbook file sitting on a Broker's hard drive.
An Offset is the line number in that logbook.
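The logbook analogy maps onto a very small data structure. The sketch below is a toy model, assuming an in-memory list instead of files on disk; PartitionLog and its methods are hypothetical names, not Kafka classes.

```python
class PartitionLog:
    """A toy append-only log: records can be appended and read, never edited."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write a record at the end and return its offset (its line number)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        """Read sequentially, starting at the given offset."""
        return self._records[offset:offset + max_records]


# A topic is just a name for a group of partition logs.
topic = {"user_clicks": [PartitionLog(), PartitionLog()]}

log0 = topic["user_clicks"][0]
first = log0.append(b"click:home")
second = log0.append(b"click:cart")

print(first, second)  # 0 1
print(log0.read(1))   # [b'click:cart']
```

There is deliberately no update or delete method: the only write operation is an append, which is why the broker never has to search for a record before writing.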

3. High Availability Architecture (Fault Tolerance)
Because hardware fails, Kafka’s architecture assumes that brokers will eventually crash. It protects your data using Replication.
When you create a topic, you set a Replication Factor. If you set a replication factor of 3, Kafka guarantees that three different Brokers will have an exact copy of the data.
For every single partition, Kafka elects one Broker to be the Leader, which is the broker that Producers and Consumers talk to. The other Brokers become Followers. They do not talk to producers or consumers but copy everything the Leader does in real-time.
If the Leader broker crashes, the KRaft Controller detects the failure and promotes one of the in-sync Followers to become the new Leader. The Producers and Consumers automatically reconnect to the new Leader, and the system continues without losing messages that were already fully replicated.
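As a rough sketch of the failover step described above, the function below promotes a follower when the leader fails. It is a simplification under stated assumptions: the partition dictionary, the broker ids and the rule of picking the first in-sync replica are all illustrative, and the real controller logic is more involved.

```python
# Hypothetical state for one partition with a replication factor of 3.
partition = {
    "replicas": [1, 2, 3],  # broker ids that hold a copy
    "isr": [1, 2, 3],       # replicas currently caught up with the leader
    "leader": 1,
}


def handle_broker_failure(state, failed_broker):
    """Drop the failed broker from the ISR and promote a follower if needed."""
    isr = [broker for broker in state["isr"] if broker != failed_broker]
    leader = state["leader"]
    if leader == failed_broker:
        if not isr:
            raise RuntimeError("no in-sync replica available to take over")
        leader = isr[0]  # simplification: pick the first in-sync follower
    return {**state, "isr": isr, "leader": leader}


after = handle_broker_failure(partition, failed_broker=1)
print(after["leader"], after["isr"])  # 2 [2, 3]
```

Only replicas in the ISR are candidates here, because a follower that has fallen behind would not hold every message the old leader had accepted.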

4. The Philosophy (Core Design Rules)
Kafka’s architecture relies on a few specific design choices that make it different from almost every other messaging system.

A. Smart Consumers, Dumb Brokers/Servers
In traditional message queues, the server (the queue) is smart. It remembers which consumer read which message and deletes the message after it is read. This puts a heavy workload on the server.
Kafka flipped this architecture. The Kafka Broker is dumb while the Consumer is smart. The server just stores the data and deletes it after a certain time while the consumer tracks its own Offset (its place in the logbook). This takes a massive load off the brokers, allowing them to handle millions of messages per second.

B. The Pull Model
Many systems push data to the consumers. If a sudden spike in traffic happens, the system pushes so much data that it crashes the consumer application.
Kafka uses a Pull architecture. Producers push data into Kafka, but Kafka never pushes data to Consumers. The Consumers pull data from Kafka only when they are ready for it. If the consumer gets overloaded, it just slows down its pulling. The data waits safely on Kafka’s hard drive until the consumer catches up.
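The pull model can be illustrated with a small simulation. This is a sketch, not client code: the Consumer class, its poll method and the batch sizes are hypothetical and only mimic the idea that the consumer decides how much to fetch.

```python
class Consumer:
    """A toy consumer that pulls from a log at its own pace."""

    def __init__(self, log):
        self.log = log
        self.position = 0  # next offset this consumer will read

    def poll(self, max_records):
        """Ask for at most max_records; the broker never pushes anything."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def lag(self):
        """How many stored records this consumer has not read yet."""
        return len(self.log) - self.position


log = [f"event-{i}" for i in range(10)]  # records stored on the broker
consumer = Consumer(log)

# An overloaded consumer simply asks for smaller batches.
for batch_size in (4, 1, 1):
    consumer.poll(batch_size)

print(consumer.position, consumer.lag())  # 6 4
```

The four unread records are not lost or forced onto the consumer; they stay in the log until the consumer polls again.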

C. Using Disk instead of RAM (The OS Page Cache)
While most messaging systems try to keep data in RAM (memory) because it is faster, Kafka writes straight to the hard drive.
This is still fast because Kafka relies on the Operating System's Page Cache. The OS automatically uses free RAM to temporarily hold the most recently written data. When a consumer asks for the latest data, Kafka can usually serve it straight from the OS memory without reading from the disk, giving you memory-like speeds with hard-drive-like storage capacity.

Visualizing the Architecture

Here is how all these pieces connect in a real-world scenario.

How to Install Apache Kafka (Locally)

To run Kafka on your computer, you need to have Java installed (Kafka 4.x brokers require Java 17 or newer).
Step 1: Download Kafka
Go to the official Apache Kafka Downloads page and download the latest .tgz file (binaries).

Step 2: Extract the file
Open your terminal and extract the folder

tar -xzf kafka_2.13-4.2.0.tgz

# Optionally rename the folder
mv kafka_2.13-4.2.0/ kafka

# Navigate into the folder
cd kafka

Step 3: Generate a Cluster UUID
Since we are using modern Kafka (KRaft mode), first generate a unique ID for the cluster

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

Step 4: Format the Storage

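# Kafka 4.x ships the KRaft config as config/server.properties
# (3.x releases used config/kraft/server.properties instead).
# --standalone formats this node as a single-node KRaft cluster.
bin/kafka-storage.sh format --standalone -t "$KAFKA_CLUSTER_ID" -c config/server.properties

Step 5: Start the Kafka Server

bin/kafka-server-start.sh config/server.properties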

Leave this terminal window open. Kafka is now running!

Key Kafka Commands for Beginners

Open a new terminal window (keep the server running in the first one) to run these commands.

1. Create a Topic
Before you can send data, you need to create a topic. Let's create one called first_topic.

bin/kafka-topics.sh --create --topic first_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

NB: localhost:9092 is the default address where your local Kafka broker is running. We set --partitions 3 to split the topic into three lanes for speed, and --replication-factor 1 because we only have one local broker running right now.

2. Start a Producer (Send Data)
This command opens a prompt where you can type messages.

bin/kafka-console-producer.sh --topic first_topic --bootstrap-server localhost:9092

Once it starts, type a few lines and hit Enter after each.

Hello Kafka!
This is my first message.

3. Start a Consumer (Read Data)
Open a third terminal window and run the below command to read the data.

bin/kafka-console-consumer.sh --topic first_topic --from-beginning --bootstrap-server localhost:9092

The --from-beginning flag tells the consumer to start reading from the earliest available offset (Offset 0 on a fresh topic). You will see the messages you typed in the Producer terminal appear here. If you go back to the Producer terminal and type a new message, it will pop up in the Consumer terminal in real-time.

Conclusion

Apache Kafka is the nervous system of modern data engineering. By sitting in the middle of your architecture, it decouples the applications that create data from the applications that need to use data.
While it can be complex to manage at a massive scale, the core concept remains remarkably simple: Producers write events to Topics, those Topics are split into Partitions for speed, and Consumers use Offsets to read those events whenever they are ready. By leveraging Consumer Groups, Kafka lets many consumers share the work of processing that data in parallel and at massive scale.

The Great Data Debate: Should You Build Your Warehouse Top-Down or Bottom-Up?

Lawrence Murithi — Mon, 11 May 2026 16:06:03 +0000

Introduction

Imagine you have a massive, disorganized garage. You need to clean it up so you can actually find things. You have two ways to tackle this.
The first way is to take every single item out of the garage, build a perfect, custom-sized shelving unit for the entire space, categorize every loose screw and tool into a master list, and then put everything in its exact, permanent place.
The second way is to just clean out the corner where you keep your gardening tools because it’s spring and that’s what you need right now. Later, when winter comes, you can clean out the corner for your snow shovels.
This is exactly how the data engineering world looks at building a Data Warehouse. The "clean the whole garage first" method is the Inmon approach. The "clean corner by corner" method is the Kimball approach.
If your company wants to store data to make smart business decisions, you will inevitably bump into these two names: Bill Inmon and Ralph Kimball.
This article looks at how the architectures work, the good, the bad, and which one you should actually use.

The Inmon Architecture (The Top-Down Master Plan)

Bill Inmon is often called the father of the data warehouse. His philosophy is that a data warehouse should be the single, ultimate source of truth for the entire business.

How it works

Inmon uses a top-down approach. You start by looking at the entire company, pull data from all the different software systems (sales, HR, finance) and clean it up. Then, you store all of it in one massive, highly organized central database.
Because of this design, the Inmon approach requires that business requirements are defined first. You must have a complete understanding of the enterprise's overarching data needs before building the model. Furthermore, it relies on strong governance, meaning there are strict, centralized rules controlling data quality, security, and standardization across the board.
Inmon uses a normalized structure. This means data is stored with as little duplication as possible. If a customer's name changes, you only have to update it in one single place.
Building a centralized warehouse first is the core of this method. Once this giant central warehouse is built, you carve out smaller pieces of it, Data Marts, for specific departments to use. Each department gets their own data mart, but that data mart is fed strictly by the central warehouse.
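The "update it in one single place" property of a normalized structure can be shown with a small in-memory SQLite sketch. The table names, columns and sample rows are hypothetical and only illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer's name is stored once and referenced by id.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sales_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Jane Doe');
    INSERT INTO sales_order VALUES (100, 1, 25.0), (101, 1, 40.0);
""")

# A name change touches exactly one row.
conn.execute("UPDATE customer SET name = 'Jane Smith' WHERE customer_id = 1")

rows = conn.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM sales_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(100, 'Jane Smith', 25.0), (101, 'Jane Smith', 40.0)]
```

The trade-off is visible in the query: reading the data back requires a join, which is part of why normalized models are harder for business users to work with.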

Below is a flowchart showing multiple source systems feeding into a single Staging Area, which flows into a large central Enterprise Data Warehouse, which then splits into smaller Data Marts serving the end users.

Source → ETL → Data Warehouse → Data Marts → Reports

Pros

- Single source of truth - Because everything flows from one central hub, the different teams will never have conflicting numbers.
- High consistency - Due to strong governance and a centralized structure, definitions and metrics mean the exact same thing across the entire enterprise.
- Good for large organizations - The robust, highly structured foundation is capable of handling vast amounts of complex, enterprise-wide data efficiently over the long term.
- Easy to update - Since data isn't duplicated, updating records or fixing errors is very clean and simple.
- Built for the future - If the company grows or adds new departments, the foundation is already solid.

Cons

- Slow to implement - Designing a perfect system for an entire enterprise takes months, sometimes years, before anyone sees real value.
- It’s expensive - You need highly specialized database experts and a massive upfront budget to build and maintain the central hub.
- Hard for business users to read - The normalized database is great for computers, but very confusing for a regular business person trying to run a report.
- Hard to change - Because the entire enterprise is highly integrated and normalized, pivoting the architecture to accommodate new, unforeseen business models is difficult and time-consuming.

The Kimball Architecture (The Bottom-Up Quick Win)

Ralph Kimball felt Inmon's method was slow and expensive and proposed an alternative. His philosophy is that a data warehouse should focus on business processes and answer specific business questions as quickly as possible.

How it works

Kimball uses a bottom-up approach prioritized around fast delivery. Instead of building a giant central warehouse first, you start by building individual Data Marts.
For example, if the sales team needs a report urgently, you pull data just for the sales team, run it through ETL and build a Sales Data Mart. Then later, you build an HR Data Mart.
Kimball uses a denormalized structure known as the Star Schema, which accepts some duplicated data in exchange for simpler, faster queries. It organizes data into Facts (numbers such as sales amount) and Dimensions (context such as time, location, or customer name).
Rather than being isolated silos, these individual Data Marts are eventually linked together to form an Integrated Warehouse. To keep things from getting chaotic, Kimball uses conformed dimensions (an enterprise bus). This is a strict rule that says if both the Sales mart and the HR mart use a Date or a Customer, they must use the exact same definition, allowing the data marts to connect logically for company-wide reporting.
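A star schema is easier to picture with a tiny example. The sketch below uses in-memory SQLite; the table names (fact_sales, dim_date, dim_customer), keys and sample rows are hypothetical. A conformed dimension would simply be a table like dim_date that every data mart reuses with the same definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: descriptive context.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT
    );
    -- Fact: numeric measurements, keyed by the dimensions around it.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        sales_amount REAL
    );
    INSERT INTO dim_date VALUES (20260101, '2026-01'), (20260201, '2026-02');
    INSERT INTO dim_customer VALUES (1, 'Jane', 'Nairobi'), (2, 'Ali', 'Mombasa');
    INSERT INTO fact_sales VALUES
        (20260101, 1, 100.0), (20260101, 2, 50.0), (20260201, 1, 70.0);
""")

# A typical business question: total sales per month.
rows = conn.execute("""
    SELECT d.month, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(rows)  # [('2026-01', 150.0), ('2026-02', 70.0)]
```

Every question follows the same shape: start from the fact table, join one hop out to a dimension, then group. That regularity is what makes the model approachable for reporting tools.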

Below is a flowchart showing source systems feeding into an ETL process, which builds independent Data Marts (Star Schemas) first. These marts are linked together by shared conformed dimensions to form a logical Integrated Warehouse, which is then used for End-User Reports.

Source → ETL → Data Marts → Integrated Warehouse → Reports

Pros

- Faster implementation - You can get a single department up and running with data in a matter of weeks, delivering immediate ROI.
- Cheaper to start - You don't need a massive upfront budget.
- Business-friendly - The Star Schema is incredibly easy for regular business users to understand. They can drag and drop fields in software like Tableau or Power BI easily.
- Flexible - It is much easier to add new data marts or modify existing ones as business needs change without breaking a massive central database.

Cons

- Data duplication - Because data is stored in multiple different marts, you use up more storage space.
- Harder to update - Because Kimball favors speed and query performance over strict organization, the same piece of data is intentionally stored in multiple places. For example, if a customer's address changes, you might have to update it in five different data marts.
- Risk of inconsistency - If you aren't strictly enforcing conformed dimensions, your data marts will drift apart. Because data is duplicated across different marts, sales and finance might end up reporting different total revenue numbers.
- Integration challenges - Because the system is built piece-by-piece rather than centrally planned from the start, tying all the disparate data marts together into a unified, integrated warehouse later on can become technically complex and messy.
For example, if the Sales mart is built in January and the HR mart in July, the teams might design their databases differently. A user trying to generate a combined report showing Sales Revenue vs. Employee Training Costs might realize that Sales measures time in Weeks, while HR measures time in Months. Joining the two data marts to answer enterprise-wide questions then becomes technically complex.

Which Architecture is better?

If you ask a room full of data engineers this question, you will probably start an argument. But realistically, neither is better. It entirely depends on what your company needs.

You should use Inmon if

- You work in a highly regulated industry (like banking, insurance, or healthcare) where data accuracy and audit trails are more important than speed.
- You have a large budget, a big team of data engineers, and plenty of time.
- Your company's data is incredibly complex and changes constantly.

You should use Kimball if

- You are a startup, a retail business, or a fast-moving company that needs data right now.
- You want your non-technical business teams to build their own reports without asking IT for help every time.
- You are on a tight budget and need to prove the value of the data warehouse to your boss quickly.

The Modern Reality

It is worth mentioning that technology has changed a lot since Inmon and Kimball wrote their books in the 1990s.
Back then, computer storage was incredibly expensive and Inmon’s method of not duplicating data saved money.
Today, cloud storage is incredibly cheap. Because storage is cheap, many companies lean heavily toward Kimball's Star Schema because the cost of duplicating data just doesn't matter much anymore.
Furthermore, new hybrid approaches have popped up. The Data Vault architecture (by Dan Linstedt) is becoming very popular. It essentially takes the best of Inmon’s strict central storage and pairs it with Kimball’s easy-to-read data marts.

The Bottom Line

When it comes to building a data warehouse, don't get caught up in treating Inmon or Kimball like a religion. You aren't building a monument but a tool to help your company make money.
If your company has the patience to build a bulletproof foundation, go top-down with Inmon. If your company needs answers tomorrow to keep the lights on, go bottom-up with Kimball.
Pick the approach that fits your business reality, not the one that looks prettiest on a whiteboard.

Docker for Data Professionals: From Zero to Containerizing Your First Project

Lawrence Murithi — Mon, 11 May 2026 13:39:03 +0000

Introduction

If you work with data, you probably have spent hours writing a Python script, training a machine learning model or building a data pipeline. It runs perfectly on your laptop but when you send the same code to a teammate or try to run it on a company server, it instantly crashes.
Usually, the error has nothing to do with your code. It crashes because the other computer has a different version of Python, is missing a library like pandas, or uses a different operating system.
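The three usual culprits can be checked with a few lines of standard-library Python. This is only a diagnostic sketch: environment_report is a hypothetical helper, and pandas is used purely as an example package name.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the details that most often differ between two machines."""
    report = {
        "python": platform.python_version(),
        "os": platform.system(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # the library is missing on this machine
    return report


print(environment_report(["pandas"]))
```

Running this on your laptop and on the server, then comparing the two dictionaries, usually shows which of the three differences is behind the crash.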
Docker was created to solve this exact problem.
This article delves into what Docker is, why data scientists and analysts should care about it, and how to use it step-by-step.

What is Docker?

Before the 1950s, global shipping was a mess. Loading and unloading a ship was slow and unstandardized because cargo such as barrels, sacks, cars and boxes came in different shapes, weights and sizes.
Then the shipping industry adopted the standardized steel shipping container. It didn't matter if you were shipping cars, coffee, or electronics: you just put your contents in a standard box. Ships, trains, and cranes were now built to handle that box.
Docker does the exact same thing for software.
Instead of just moving your code from one computer to another, Docker allows you to package your code, the programming language, the exact libraries you used, and the system settings into one standard box.
Because everything your code needs is inside that box, it will run exactly the same way on your laptop, your coworker's laptop, or a cloud server.

Docker vs. Virtual Machines

You might be thinking, "Isn't that just a Virtual Machine (VM)?"
It’s a fair assumption, as both provide isolated environments for your applications, but Docker is fundamentally lighter and more efficient.
A traditional VM relies on software called a hypervisor to bundle your code and libraries with a complete, dedicated guest operating system. Booting up a whole new copy of Windows or Linux makes VMs massive, resource-heavy, and slow to start.
Docker, by contrast, only virtualizes the application layer. It uses a background service called the Docker Engine to share your host computer's underlying operating system kernel. By stripping away the bulky guest OS, Docker packages only the essentials (the code, runtime, and settings) into a highly portable container. This isolation helps your app run reliably across different infrastructure. Docker containers also take up a fraction of the disk space and typically launch in seconds or less.

Core Docker Terminologies

Before writing any code, it's critical to understand the basic vocabulary of the Docker ecosystem.
1. Docker Engine - This is the underlying background program running on your machine. It does the actual heavy lifting required to build, run and manage your containers.
2. Dockerfile - Think of this as a recipe. It is a plain text document that contains a step-by-step list of commands. Docker reads this file to know exactly which software to install and which files to copy to build your environment.
3. Images - An image is a frozen, unchangeable blueprint built from a Dockerfile. It holds your code, tools, and system libraries in one package. Images are used to spawn active containers. Think of it as the static mold used to make identical products.
4. Containers - This is the live, running version of an image. A container isolates your application and its requirements from the rest of your computer, guaranteeing it behaves the exact same way no matter what machine it runs on.
5. Docker Hub - This is a massive online library for Docker images. Just like GitHub is used for sharing code, Docker Hub is a public platform where people can upload their own custom images or download pre-made environments to save time.
6. Volumes - Because containers are temporary, any data saved inside them is lost when the container is removed. Volumes fix this by linking a folder inside the container to storage that lives outside it on your actual hard drive, preventing data loss.
7. Networks - This is the system that allows multiple standalone containers to talk to each other safely. For example, a network lets a container holding your Python code securely send data to a separate container running a database.

Why Data Professionals Need Docker

While software developers have used Docker for years to run websites, it has now become a required skill for data teams for several reasons.
• Reproducibility - In data science, if someone cannot reproduce your results, your results are not valid. Docker guarantees that anyone who runs your container will get the exact same output.
• Easy Handoffs - A predictive model is usually handed over to a data engineer or a software team to put into production. A Docker container eases their work since they don't have to guess how to set up the environment. They just run it.
• Working with Old Code - Sometimes you need to run a script written three years ago using Python 3.6. Instead of messing up your current computer by downgrading your software, you just spin up a Docker container with the old versions, run the job, and delete it.

Pros of Using Docker

1. Portability
Docker packages your application along with all its dependencies, libraries, and configuration files into a single image. Because the environment is locked inside this image, the application will run exactly the same way on a developer's laptop, a testing server, or in the production cloud. It eliminates the problem of "It Works on My Machine".
2. Resource Efficiency
Traditional Virtual Machines (VMs) require a full, heavy guest Operating System for every application. Docker containers, however, share the host machine's OS kernel, so they are lightweight, take up significantly less hard drive space, and typically start in seconds or less. You can also run many more containers on a single server than you could VMs.
3. Isolation of Environments
Every container runs in its own isolated environment. This means you can have one container running an application that requires Python 2.7, and another container running an application that requires Python 3.10 on the exact same server.
4. Ideal for Microservices and Scalability
Docker is the foundation for modern microservices architectures. Instead of building one massive, monolithic application, you can build small, independent services (e.g., a database container, a web server container, an authentication container). If your web traffic spikes, you can quickly spin up 10 extra web server containers without having to duplicate the database.
5. Faster Deployment and CI/CD Integration
Docker images are pre-configured, so deploying them is as simple as downloading the image and running it. This makes Docker incredibly popular for Continuous Integration/Continuous Deployment (CI/CD) pipelines. If a new version of an app has a bug, rolling back is as easy as running the previous Docker image tag.

Cons of Using Docker

1. Steep Learning Curve
For beginners, Docker introduces a lot of new concepts, so it takes time to become proficient. Developers have to learn how to write Dockerfiles, manage docker-compose configurations, understand container networking (how containers talk to each other), and grasp how images are built.
2. Data Persistence Complexity
By design, containers are ephemeral (temporary). If a container is deleted, all the data written inside it is permanently lost. To save data permanently (like a database), you have to learn how to manage Docker Volumes or Bind Mounts to connect container storage to the host machine's hard drive.
3. Cross-Platform Performance and Quirks
Docker is natively a Linux technology. While Docker Desktop allows you to run it on macOS and Windows, it actually does this by running a lightweight, hidden Linux Virtual Machine in the background. This can lead to heavy RAM/CPU usage on Mac and Windows machines, and file-syncing between the host and the container can sometimes be slow.
4. Security Concerns (Shared Kernel)
Because containers share the host's Operating System kernel, they are less isolated than full Virtual Machines. If a hacker finds a vulnerability in the host OS kernel, they might be able to break out of the container and access the host machine or other containers. Additionally, poorly configured containers running as the root user pose a significant security risk.
5. Not Ideal for Desktop/GUI Applications
Docker is heavily optimized for backend services, APIs, databases, and command-line tools. While it is technically possible to run graphical desktop applications (GUI) inside Docker, it is highly complex, clunky, and generally not recommended.

First Docker Project

Let’s build a simple data project and put it inside a Docker container.

Step 1: Install Docker
Download Docker Desktop for Windows, Mac, or Linux from the official Docker website. Docker Desktop is a helpful graphical interface that includes the underlying Docker Engine. Install it and open the application. It runs quietly in the background.

Step 2: Set Up Your Project Files
Create a new folder on your computer e.g. myproject. Inside this folder, create three files.
File 1: main.py
This is the Python script. Write a simple program that uses the pandas library to create a small dataset and print it out.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Role': ['Data Analyst', 'Data Engineer', 'Data Scientist']
}

df = pd.DataFrame(data)

print("--- Team Data ---")
print(df)

File 2: requirements.txt
This file tells Python which libraries are needed. Since pandas was used, list it here.

pandas==2.1.0

File 3: Dockerfile
This is the magic file. Create a file named Dockerfile. Open it in a text editor and paste the following code.

# 1. Start with a base image pulled from Docker Hub that already has Python
FROM python:3.10-slim

# 2. Create a working directory inside the container
WORKDIR /app

# 3. Copy our requirements file into the container
COPY requirements.txt .

# 4. Install the libraries listed in requirements.txt
RUN pip install -r requirements.txt

# 5. Copy the rest of our code into the container
COPY main.py .

# 6. Tell the container what to do when it starts
CMD ["python", "main.py"]

Step 3: Build the Docker Image
Now, turn those three files into a Docker Image.
Open your computer's terminal (Command Prompt on Windows, Terminal on Mac/Linux), navigate to the myproject folder, and run the command below.

docker build -t my-first-data-app .

NB: The period . at the end sets the build context to the current folder, which is also where Docker looks for the Dockerfile by default.
• docker build tells Docker to read the recipe.
• -t my-first-data-app gives the image a name (tag) so it is easier to find later.

Step 4: Run the Container
Once the build is finished, the image is ready and can be run using the command below.

docker run my-first-data-app

This displays the output of the Python script on the screen.

--- Team Data ---
      Name  Age            Role
0    Alice   25    Data Analyst
1      Bob   30   Data Engineer
2  Charlie   35  Data Scientist

The image can now be sent to anyone in the world, and it will print the exact same table, even if they don't have Python installed on their computer. All they need is Docker.

The Multi-Container Problem

Running a single container is great. However, modern applications are rarely just one piece of software.
A standard web application usually consists of:

• A frontend application
• A backend API
• A database
• A caching system

Using basic Docker commands means you have to build and run each of these containers manually, figure out how to connect them to the same network so they can talk to each other and manage their startup order. Doing this by typing long commands into the terminal every single day is frustrating and prone to human error.

Docker Compose

Docker Compose is a tool designed specifically to solve the multi-container problem.
Instead of typing a bunch of manual terminal commands, Docker Compose allows you to define your entire multi-container application in a single text file called docker-compose.yml.
YAML (originally "Yet Another Markup Language", now officially "YAML Ain't Markup Language") is just a way to write configuration data in a clean, readable format using indentation.
With Compose, you define services. Each service represents one container in your application setup. You can define what image the service should use, what ports it should open, and how it connects to the other services.

A Practical Docker Compose Example

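The basic structure of a docker-compose.yml file looks like this:

services:
  app:
    build: .
    ports:
      - "8080:8080"

  postgres:
    image: postgres
    ports:
      - "5432:5432"

A fuller example, with a PostgreSQL database and an ETL service that waits for it, looks like this:

services: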
  postgres:
    image: postgres:latest
    container_name: postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: 12345
      POSTGRES_DB: postgres
    ports:
      - "5433:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 15s
      timeout: 10s
      retries: 5

  etl:
    build: .
    container_name: etl
    environment:
      DB_USER: postgres
      DB_PASSWORD: 12345
      DB_HOST: postgres
      DB_PORT: 5432
      DB_NAME: postgres
    depends_on:
      postgres:
        condition: service_healthy

Let's break it down.
• services - We have two services defined: postgres (the database) and etl (the process interacting with the database).
• image - For the postgres service, we are not building our own image. We are downloading the official postgres:latest image directly from Docker Hub.
• container_name - Explicitly sets the names of the running containers to postgres and etl instead of letting Docker auto-generate names.
• environment - This passes variables (like passwords and database names) into the containers. The etl service's DB_HOST is simply postgres. Docker Compose automatically creates a network so the etl container can talk to the database using its service name.
• ports - Maps port 5433 on your host machine to port 5432 inside the postgres container. This allows you to connect to the database from outside of Docker using port 5433.
• healthcheck - Tells Docker how to test if the postgres database is actually ready to accept connections. It runs the command pg_isready every 15 seconds, waits up to 10 seconds for a response, and marks the container as unhealthy after 5 consecutive failures.
• build - For the etl service, we tell Compose to look in the current folder (.) for a Dockerfile and build the image from it.
• depends_on - This tells Docker not to start the etl service until the postgres container is fully up and has successfully passed its healthcheck (condition: service_healthy).
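
To make the environment entry more concrete, here is a minimal sketch of how the code inside the etl container might read those variables and assemble a connection URL. The function name build_db_url and the URL layout are assumptions for illustration; the article does not show the actual ETL script.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def build_db_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a PostgreSQL connection URL from the DB_* variables set by Compose.

    Hypothetical helper: the variable names match the etl service's
    environment block, but the real ETL code is not shown in the article.
    """
    user = env["DB_USER"]
    password = quote_plus(env["DB_PASSWORD"])  # escape special characters
    host = env.get("DB_HOST", "localhost")     # "postgres" inside the Compose network
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    # Values copied from the etl service in the Compose file above.
    example_env = {
        "DB_USER": "postgres",
        "DB_PASSWORD": "12345",
        "DB_HOST": "postgres",
        "DB_PORT": "5432",
        "DB_NAME": "postgres",
    }
    print(build_db_url(example_env))
```

Because the host comes from DB_HOST, the same script can run on a laptop (localhost) or inside Compose (postgres) without any code change.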

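Once you have written the above file, you can start your entire application (the database, the custom network, and the ETL service) with just one command. On recent Docker installations, the same command is also available as docker compose up (with a space instead of a hyphen).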

docker-compose up

When you are done working and want to shut everything down and clean up the network, run:

docker-compose down

Using Docker Volumes for Data

Volumes persist data outside of the container's lifecycle. Why is this important for data professionals?
In the example above, we packed our Python script directly into the container. But what if you are processing a 10-gigabyte CSV file? You do not want to pack a massive data file inside your Docker image. Images are supposed to be lightweight. Furthermore, if your code generates a cleaned CSV, and the container stops running, that new file will be lost forever.
A Volume fixes this by acting as a bridge between your actual computer and the container.
Imagine you have a folder called data on your laptop, and you want your Docker container to read a file inside it. You would run your container like this:

docker run -v /path/to/your/local/data:/app/data my-first-data-app

The -v flag maps a folder on your computer to a folder inside the container (Docker calls this particular form a bind mount). Now, your Python script can read heavy datasets and save output files directly to your laptop, without making the Docker image bloated.
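
As a rough illustration of what the script inside the container could do with that mounted folder, the sketch below reads a CSV from the data directory and writes a cleaned copy next to it. The file names raw.csv and cleaned.csv, the DATA_DIR variable, and the cleaning rule are all assumptions made for this example.

```python
import csv
import os
from pathlib import Path

# Assumed mount point inside the container; it matches /app/data in the -v example.
DATA_DIR = Path(os.environ.get("DATA_DIR", "/app/data"))


def clean_csv(data_dir: Path, src_name: str = "raw.csv", dst_name: str = "cleaned.csv") -> int:
    """Copy rows that have no empty fields from src to dst and return the row count.

    Both files live in data_dir, so when data_dir is a mounted host folder
    the cleaned file survives after the container is removed.
    """
    src = data_dir / src_name
    dst = data_dir / dst_name
    written = 0
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            if all(cell.strip() for cell in row):  # skip rows with missing values
                writer.writerow(row)
                written += 1
    return written

# Inside the container you would call: clean_csv(DATA_DIR)
```

Since the output lands in the mounted folder rather than in the container's own filesystem, it appears directly in the data folder on your laptop.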

Summary

Docker is an incredibly powerful tool that has revolutionized software engineering by making apps fast, portable, and scalable. However, for a very simple, static website or a solo developer building a basic script, adding Docker might introduce unnecessary complexity and overhead.
If you want to start using Docker in your daily data work, follow these rules.
1. Use official base images - When writing a Dockerfile, start with an official image from Docker Hub like python:3.10-slim, or a well-maintained image from a trusted project such as Jupyter. These are generally kept up to date and regularly patched.
2. Keep it small - Use versions of Linux that have slim or alpine in the name. They take up less space on your hard drive.
3. Pin your versions - Always use a requirements.txt file and specify the exact version of the library you used (e.g., scikit-learn==1.3.0). If you just write scikit-learn, pip will download the newest version during the build, which might break your code.
4. Don't put passwords in Dockerfiles - If your script connects to a database, never hardcode your password into the script or the Dockerfile. Use environment variables instead.
5. Level up with Docker Compose - Once you are comfortable running a single container, look into Docker Compose. While Docker commands handle individual containers, Docker Compose allows you to define and manage multi-container applications. By writing a single docker-compose.yml file, you can use Networks to connect multiple containers (for example, a Python script in one container and a PostgreSQL database in another) and spin them all up with just one simple command (docker-compose up).
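
Rule 3 is easy to check mechanically. The following is a small sketch, not an official tool, that flags lines in a requirements.txt that are not pinned with ==. The function name unpinned_requirements is made up for this example, and the check is a simple heuristic that ignores more exotic requirement formats such as URLs.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = "pandas==2.1.0\nscikit-learn\n"
    print(unpinned_requirements(sample))
```

Running a check like this before docker build is one way to catch an unpinned library before it silently changes version in a later build.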
Mastering Docker could save you hundreds of hours of debugging. Once you learn how to containerize your data projects, "it works on my machine" will be a phrase you never have to say again.

Folders, Apartments, and Fake Computers: A Guide to Virtual Environments, Docker, and VMs

Lawrence Murithi — Thu, 07 May 2026 12:23:35 +0000

Introduction

If you have spent a substantial amount of time writing code, you have probably run into a frustrating problem: "It works on my computer, but it doesn't work on yours."
This happens because computers are set up differently. You might have a different operating system, a different version of a programming language, or different background software running. When a website or app breaks because of this, developers can lose hours or even days trying to figure out what the problem is.
To solve this, developers came up with ways to isolate software. Instead of installing an app directly onto your main computer, you put it inside a protective bubble. This bubble tricks the software into thinking it has its own private space, with exactly what it needs to run, so it won't mess with the rest of your system.
There are three main tools we use to create these bubbles: Virtual Environments, Virtual Machines (VMs) and Docker. While they all aim to solve similar problems, they do it in completely different ways, using completely different layers of your computer.
Let's break down exactly what each one is, how they compare and when you should use them.

1. Virtual Environments

A Virtual Environment is a localized directory that contains a specific version of a programming language and the specific software packages required for a project. It is the simplest and lightest way to isolate a project and is most commonly used in Python (using tools like venv or virtualenv) although similar concepts exist in other languages.

How Virtual Environments work

A Virtual Environment provides no system-level isolation. It does not virtualize hardware, nor does it isolate the OS. It simply changes the PATH variable in your terminal so that when you install a package or run a script, it uses the isolated folder instead of the computer's global system files.
Imagine you are building two different websites on your laptop. Website A is older and needs version 2.0 of a web framework like Django. Website B is brand new and needs version 4.0 of that exact same framework. If you install these tools directly onto your main computer system, they will conflict and one of your websites will stop working.
A virtual environment fixes this by creating a dedicated, private folder for your project. When you activate the virtual environment, it temporarily rewrites your computer's internal GPS, known as the system PATH. Because of this, your computer temporarily ignores its main, global list of tools. Instead, it looks first at the tools installed inside that specific project folder.
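
The PATH trick can be imitated in a few lines. The sketch below is an illustration for Linux and macOS, not how venv is implemented: it creates two fake folders, each holding a dummy executable called mytool (a made-up name), and shows that putting the project folder first on the search path changes which copy is found.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_fake_tool(directory: Path, name: str = "mytool") -> Path:
    """Create a tiny executable script so the lookup has something to find."""
    directory.mkdir(parents=True, exist_ok=True)
    tool = directory / name
    tool.write_text("#!/bin/sh\necho hello\n")
    tool.chmod(0o755)  # mark the file as executable
    return tool


def demo_path_lookup():
    """Return which folder wins before and after 'activation'."""
    with tempfile.TemporaryDirectory() as tmp:
        global_bin = Path(tmp) / "global" / "bin"
        venv_bin = Path(tmp) / "project" / ".venv" / "bin"
        make_fake_tool(global_bin)
        make_fake_tool(venv_bin)

        # Before activation: only the global folder is on the search path.
        before = shutil.which("mytool", path=str(global_bin))
        # After activation: the project's folder is placed in front of it.
        activated_path = os.pathsep.join([str(venv_bin), str(global_bin)])
        after = shutil.which("mytool", path=activated_path)

        return Path(before).parent.parent.name, Path(after).parent.parent.name


if __name__ == "__main__":
    print(demo_path_lookup())
```

The global copy is still on the path after activation; it is simply shadowed because the project folder is searched first.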

Pros

• Extremely fast - Creating and activating a virtual environment takes only a few seconds because it just creates a folder and adjusts a few paths.
• Lightweight - It only takes up a few megabytes of space on your hard drive. There is no heavy software running in the background.
• Simple to use - Usually, it just takes one or two simple commands in your terminal to get started and shut down.
• No dependency conflicts - It solves the problem of dependency conflicts between different projects.

Cons

• Weak isolation - It only isolates programming packages (like Python libraries). It does not isolate the operating system, the system clock, or your hardware settings.
• "It works on my machine" can still happen - Because the isolation is weak, hidden problems can sometimes slip through. If your code secretly relies on a specific font or a hidden system tool installed on your Mac, and you send your virtual environment code to a friend on a Windows PC, the code might still break.

Virtual environments are used on a local computer for day-to-day coding, when you are working on multiple projects in the same programming language and want to keep their dependencies separate from one another.

2. Virtual Machines (VMs)

A Virtual Machine is a complete software emulation of a physical computer. It runs its own full Operating System (Guest OS) entirely separate from the host computer's Operating System. It is the heaviest, most complete, and oldest form of isolation. Software like VirtualBox, VMware, or Microsoft Hyper-V allows you to do this.

How Virtual Machines work

If a virtual environment is like putting your code in a separate folder, a Virtual Machine is like buying an entirely new physical computer, shrinking it down, and putting it inside your current computer.
It uses a piece of software called a Hypervisor (like VMware, VirtualBox, or Hyper-V). The hypervisor carves out a specific amount of your physical computer's RAM, CPU, and storage and dedicates it to the VM. You then install a full Operating System (like Windows or Ubuntu) onto that carved-out space. This new system is called the Guest OS and behaves like a real computer, while the main computer is called the Host.

Pros

• Complete isolation - What happens inside a VM stays inside a VM. Because the hypervisor locks the hardware, if a VM gets infected with a severe virus, your main host computer is almost always completely safe.
• Run different operating systems - You can run a full Windows computer inside a Mac, or a Linux computer inside Windows, allowing you to use software made for different platforms.
• Highly secure - Because the hardware is strictly separated at a deep level, it is trusted by banks, governments, and massive corporations for highly sensitive tasks.

Cons

• Massive resource hog - Since you are running a second operating system on top of your current one, VMs eat up a lot of RAM, CPU power, and battery life. Even if the VM is just sitting idle, it is still running background updates, managing a clock, and keeping a digital desktop alive, which wastes power.
• Huge files - A VM can easily take up 20 to 100 gigabytes of storage space just to hold the basic operating system files.
• Slow - Booting up a VM takes just as long as turning on a physical computer, and moving files in and out of it can be tedious.

VMs are used in large corporate cloud servers or on a local machine when strict security is needed. They are critical when you need to test software on a completely different operating system, or when a business is running older, legacy applications that require an outdated OS to survive.

3. Docker (Containers)

Docker is a platform that uses containerization to package an application and all its necessary dependencies (libraries, frameworks, etc.) into a single, standardized unit called a container. Containers are the clever middle ground between the lightness of a Virtual Environment and the strict, heavy isolation of a Virtual Machine.

How Docker works

Every operating system is made of two main parts: the core engine (Kernel), which physically tells your RAM and CPU what to do, and the user files/tools that make up the desktop experience you see on screen.
A Virtual Machine duplicates both parts, which is what makes it so heavy. Docker only duplicates the user files and tools. All Docker containers share the main host computer's Kernel.
Think of it like an apartment building. A Virtual Machine is like giving everyone their own separate house with their own separate plumbing and electricity. Docker is like an apartment complex where everyone has their own locked, private room (container) and can decorate however they want, but they all share the building's central plumbing and electrical systems hidden in the walls (Host OS Kernel).

To use Docker, you write a simple text file called a Dockerfile. It reads like a recipe: start with a bare-bones version of Linux, set up some default database passwords, download the latest PostgreSQL and start the database server. Docker reads this file and builds an image from it, and that image is then run as a container. The image can be handed to anyone, and it will run the same way regardless of what computer they have.

Pros

• Consistent everywhere - It solves the "it works on my machine" problem perfectly. A Docker container behaves exactly the same on a Mac, a Windows PC or a cloud server because the environment inside the container never changes.
• Fast and lightweight - Because they don't boot up a full operating system kernel, containers start in seconds and usually only take up a few hundred megabytes of space.
• Easy to share and scale - You can run dozens or even hundreds of containers on the same computer without them fighting over resources. This allows developers to build microservices. Instead of building one massive app, you put the shopping cart in one container, the user login in another, and the payment system in a third. If the payment container crashes, the rest of the website stays up.

Cons

• Steeper learning curve - You have to learn Docker-specific terminal commands, how to write Dockerfiles and how networking works to let containers talk to each other.
• OS limitations - Because Docker shares the host's kernel, you generally run Linux containers on Linux machines. Docker can still run containers on Mac and Windows, but it usually does so by installing a tiny, hidden Linux Virtual Machine in the background to provide the Linux kernel, which makes Docker slightly heavier on Mac and Windows than it is on native Linux.
• Less secure than VMs - Because containers share the host kernel, the wall between them is thinner, so a critical vulnerability in the host OS could potentially affect all containers.

Docker is used almost everywhere: on a developer's laptop, in automated testing environments, and in production running live websites on the open internet. It's used when building modern web applications, working with a team of developers who all use different computers, or breaking a large app down into smaller microservices.
It gives developers an isolated, highly reliable environment that is identical across all machines, without wasting your computer's RAM and hard drive space.

Similarities between the tools

The core similarity between all three is the concept of isolation.
They all exist to create boundaries between projects and software.
They also all make it easier to delete a project without leaving junk files behind; you just delete the virtual environment folder, the VM file, or the container image, and everything associated with that project is instantly gone, leaving your main computer perfectly clean.
In the real world, they are often used together. A large company might run a giant Virtual Machine in the cloud to provide security, put Docker inside that Virtual Machine to manage different web apps easily, and a developer might use a Virtual Environment inside that Docker container to organize their Python code.

The Major Differences

The difference lies in how much they isolate and how heavy they are.
• Virtual Environment (Lightest) - Isolates only the language packages but relies entirely on your computer for everything else.
• Docker (Middle) - Isolates the application and the operating system files, but shares the core OS engine (the kernel) to save resources and start faster.
• Virtual Machine (Heaviest) - Isolates absolutely everything. It clones the physical hardware and runs a 100% separate operating system, taking up a lot of space and power to provide maximum security.

Conclusion

If you are just writing a quick Python script to scrape a website, analyze some data, and need to install a few libraries without breaking your computer, use a Virtual Environment.
If you are building a web app, working with a database, collaborating with other developers, and need to make sure your code runs exactly the same way on your laptop as it will on your company's live servers, use Docker.
If you are on a Mac but absolutely need to run a piece of Windows-only enterprise software, or you are testing dangerous malware and need maximum security to protect your real computer, use a Virtual Machine.

The Medallion Architecture: Turning Messy Data into Business Gold

Lawrence Murithi — Wed, 06 May 2026 16:06:09 +0000

Introduction

Imagine drawing water from a muddy river. You would never scoop a glass of water from the bank and drink it straight down. You would want that water pumped into a treatment plant, filtered to remove the debris, and chemically purified until it is crystal clear and safe to consume.
Data requires the exact same treatment.
Ever seen raw data pulled directly from a company’s servers? It's usually a complete mess. Website logs, sales applications, customer service chatbots, and payment gateways all generate endless streams of information. If you take all that raw information, dump it into a single pile, and try to build a revenue report, the results will be a disaster. Your numbers will be wrong, your system will crawl to a halt, and nobody will trust the data.
To process this information safely, data engineers build systems with specific layers that clean and organize records step-by-step. Historically, this was done using traditional data warehouse layers. Today, a modern framework called the Medallion Architecture has taken over the industry.
Here is a deep dive into how data layers work, why the Medallion concept was invented, and how it refines digital mud into a clear, single source of truth.

The Old Way (Traditional Data Warehouse Layers)

Before the Medallion Architecture existed, engineers used a classic three-step method to move data from external software into a company dashboard.
The traditional data warehouse architecture is grounded in the frameworks introduced by W.H. Inmon (often called the father of the data warehouse) and Ralph Kimball.
Historically known as the Three-Tier Enterprise Data Warehouse (EDW) Architecture, this system was designed to separate operational systems (where data is created) from analytical systems (where data is analyzed).

1. The Staging Layer (The Transient Extraction Zone)
This was the receiving dock. The staging area is a temporary, intermediate storage zone between operational data sources and the data warehouse.
Data from a Shopify store or a Salesforce database was copied and temporarily dropped here. The main goal was speed: get the data out of the live application quickly so the app wouldn't slow down for regular users.

Attributes of staging layer

Decoupling OLTP and OLAP - The primary architectural goal of this layer is to isolate Online Transaction Processing (OLTP) systems (like Salesforce or Shopify) from Online Analytical Processing (OLAP) workloads. Analytical queries are highly resource-intensive; running them directly on a live database can cause catastrophic latency for end-users.
Extraction Mechanics - Data is pulled into this layer using methodologies such as batch processing or Change Data Capture (CDC). The data here is typically stored in its raw, native format.
Volatility - According to traditional DWH design principles, data in the staging layer is transient. Once the data is successfully moved to the next tier, it is generally purged or overwritten in the next batch cycle to conserve expensive storage space.

2. The Integration Layer (The Core Enterprise Data Warehouse)
This is where the heavy lifting happened. Engineers wrote scripts to clean the data and match up records.
This layer represents what W.H. Inmon famously defined as the subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
If your billing system called a customer Client_001 and your website called them User_001, the Integration layer linked them together into a central, highly structured database.

Attributes of integration layer

Semantic Reconciliation - The heavy lifting is known as semantic reconciliation and Master Data Management (MDM). Engineers must resolve heterogeneous data formats (e.g., merging Client_001 from an Oracle database and User_001 from a JSON web log) into a unified entity.
Data Cleansing and Normalization - In this layer, data undergoes rigorous cleansing (handling null values, standardizing date formats). Structurally, Inmon advocated for storing this data in the Third Normal Form (3NF). This highly normalized structure reduces data redundancy and ensures mathematical consistency across the enterprise, creating a Single Version of Truth (SVOT).
The Bottleneck - Because of the complex normalization rules, writing data into this layer requires highly complex, tightly coupled SQL scripts, making the Integration Layer notoriously slow to update or modify.
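
To make the Client_001 / User_001 example tangible, here is a toy sketch of reconciliation using a lookup table. The master key CUST-1, the ID_MAP table, and the record fields are all invented for illustration; real Master Data Management involves far more than a dictionary lookup.

```python
# Hypothetical lookup table: each (source system, local ID) pair maps to one master key.
ID_MAP = {
    ("billing", "Client_001"): "CUST-1",
    ("website", "User_001"): "CUST-1",
}


def reconcile(records):
    """Merge records from different sources under a single master customer key."""
    unified = {}
    for rec in records:
        key = ID_MAP.get((rec["source"], rec["id"]))
        if key is None:
            continue  # unknown IDs would need manual review in a real pipeline
        # Keep only the business attributes, not the source-specific identifiers.
        attributes = {k: v for k, v in rec.items() if k not in ("source", "id")}
        unified.setdefault(key, {}).update(attributes)
    return unified


if __name__ == "__main__":
    sample = [
        {"source": "billing", "id": "Client_001", "balance": 120.0},
        {"source": "website", "id": "User_001", "email": "a@example.com"},
    ]
    print(reconcile(sample))
```

Both source records end up as one entity, which is the core idea behind the unified customer view this layer provides.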

3. The Presentation Layer (Data Marts and Dimensional Modeling)
The highly normalized 3NF data in the Integration layer is too complex for business users to query efficiently, so it must be reshaped for consumption. Engineers would pre-package specific tables for specific teams, e.g. creating a Marketing Table or a Sales Table that connected easily to dashboard software.

Attributes of presentation layer

The Data Mart - The Presentation layer is composed of Data Marts: subsets of the data warehouse that each focus on a specific business unit (e.g., Sales, HR, Marketing).
Dimensional Modeling (The Kimball Method) - In this layer, engineers apply Ralph Kimball’s dimensional modeling techniques, organizing data into Star Schemas or Snowflake Schemas. Data is divided into Facts (measurable, quantitative data, e.g. sales amount) and Dimensions (descriptive attributes, e.g. time, store, or customer).
Optimized for Read-Heavy Workloads - By pre-joining and denormalizing the data, this layer allows Business Intelligence tools like Power BI to execute complex analytical queries rapidly without requiring end-users to understand underlying SQL structures.
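
The split between Facts and Dimensions can be sketched with plain Python structures. The store names and sales figures below are invented sample data, and a real star schema would live in database tables rather than dictionaries.

```python
from collections import defaultdict

# Dimension table: descriptive attributes keyed by an ID (hypothetical data).
dim_store = {
    1: {"city": "Nairobi"},
    2: {"city": "Mombasa"},
}

# Fact table: one row per sale, holding a measure plus a key into the dimension.
fact_sales = [
    {"store_id": 1, "amount": 100.0},
    {"store_id": 2, "amount": 40.0},
    {"store_id": 1, "amount": 60.0},
]


def sales_by_city(facts, stores):
    """Join each fact row to its store dimension and total the amounts per city."""
    totals = defaultdict(float)
    for row in facts:
        city = stores[row["store_id"]]["city"]
        totals[city] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(sales_by_city(fact_sales, dim_store))
```

The fact rows stay narrow and numeric, while anything descriptive is looked up from the dimension, which is what keeps dashboard queries simple.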

The Problem with the Old Way

This system relied heavily on a process called ETL (Extract, Transform, Load). Engineers would extract the data, transform/clean it and then load it into the warehouse. The fatal flaw was that the raw data was often discarded after it was cleaned to save storage space. If a data engineer accidentally deleted a crucial column during the clean phase, that historical data was gone forever.
1. The Schema-on-Write Constraint
Traditional DWHs operated on a Schema-on-Write paradigm. This means that before data could be loaded into the warehouse, the warehouse's schema (tables, columns, data types) had to be rigidly defined. If a new column was added to the source software, the ETL pipeline would fail, or simply drop the unrecognized data, until an engineer manually updated the database schema.
2. Destructive Transformations and Storage Costs
On-premise relational database storage such as Teradata or Oracle appliances used to be very expensive. To save disk space, raw data was deemed expendable. Data was extracted, transformed to fit the strict schema, and the raw source data was then discarded.
This model had some downsides which included:
• Loss of Auditability and Lineage - If a transformation logic error occurred (e.g., a script incorrectly rounded up financial figures), there was no historical raw data to refer back to since the original data was permanently lost.
• Lack of Flexibility for Machine Learning - Modern Data Science requires massive amounts of raw, unstructured or semi-structured data to train machine learning models. The traditional integration layer stripped away the granular, raw anomalies that data scientists actually need, leaving only highly aggregated, structured data.
The loss of raw data and the rigidity of ETL paved the way for Data Lakes, for a shift from ETL to ELT (where cheap cloud storage allows raw data to be stored before it is transformed), and ultimately for the modern Medallion Architecture (Bronze, Silver, Gold), which preserves raw data while still providing structured analytics.
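
The rounding problem described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any real warehouse is implemented: the records, the buggy_transform and fixed_transform functions, and the list names are all hypothetical.

```python
import math

# Hypothetical raw records as they arrive from a source system.
raw_events = [
    {"order_id": 1, "amount": "19.49"},
    {"order_id": 2, "amount": "5.50"},
]

def buggy_transform(event):
    # Bug: rounds every amount up to a whole number, losing the cents.
    return {"order_id": event["order_id"], "amount": math.ceil(float(event["amount"]))}

def fixed_transform(event):
    return {"order_id": event["order_id"], "amount": float(event["amount"])}

# ETL style: only the transformed rows are kept, so the rounding error is permanent.
etl_warehouse = [buggy_transform(e) for e in raw_events]

# ELT style: a raw copy is stored first, then transformed.
elt_raw_zone = [dict(e) for e in raw_events]
elt_warehouse = [buggy_transform(e) for e in elt_raw_zone]
# Once the bug is found, the table is simply rebuilt from the preserved raw copy.
elt_warehouse = [fixed_transform(e) for e in elt_raw_zone]

print(etl_warehouse)
print(elt_warehouse)
```

In the ETL case the original cents can no longer be recovered from the warehouse, while in the ELT case the raw zone makes the fix a simple re-run.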

The Modern Shift(The Medallion Architecture)

As cloud storage became much cheaper, companies stopped throwing away their raw data and began dumping everything into massive, low-cost storage areas (Data Lakes).
Eventually, companies pioneered the Lakehouse, which combines the cheap, highly scalable storage of a Data Lake with the strict organization of a traditional Data Warehouse.
The need to help companies organize the massive amounts of data inside a Lakehouse therefore gave birth to the Medallion Architecture.
The Medallion Architecture separates data into three specific stages: Bronze, Silver, and Gold. It mimics the logical flow of the traditional layers but fundamentally changes how data is treated, preserved, and upgraded.

How do the three layers work?

1. The Bronze Layer(The Raw Zone)
This is where all the raw data lands from the various sources.
The data is saved exactly as it arrived. You do not fix typos. You do not rename columns. You just capture it.

Features of bronze layer

• Safety and Troubleshooting - Since the raw data is completely untouched, you never have to worry about accidentally destroying information. If an engineer writes a bad piece of code that ruins the data in the later layers, they can simply go back to the Bronze layer and restart the process.
• Historical Archive - The Bronze layer acts as a permanent record of everything that ever happened in the business. It is usually append-only, meaning new records are just added to the pile without overwriting old records.
• Speed - Getting data into the Bronze layer is fast since the computer isn't doing any complex math, translations or cleaning. Engineers often use a technique called Change Data Capture (CDC) to stream this raw data in near real time.
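
Here is a minimal sketch of what "append-only and untouched" can look like, using a plain Python list as a stand-in for a Bronze table. The ingest function, the field names and the sample payload are hypothetical.

```python
from datetime import datetime, timezone

bronze = []  # stand-in for an append-only Bronze table or folder of files

def ingest(raw_payload: str, source: str) -> None:
    """Append the payload exactly as received, plus some ingestion metadata."""
    bronze.append({
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": raw_payload,  # untouched: no parsing, renaming or fixing
    })

# The typo-like column name, the trailing space and the odd age are all kept as-is.
ingest('{"cust_nm": "alice ", "age": 999}', source="crm")
ingest('{"cust_nm": "alice ", "age": 999}', source="crm")  # duplicates are kept too

print(len(bronze))
```

Nothing is ever updated or deleted here; cleaning, deduplication and validation are left to the later layers.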

2. The Silver Layer(The Cleaned Zone)
Once the data is safely locked away in the Bronze layer, it is copied and moved into the Silver layer. The goal of the Silver layer is to create a Single Source of Truth for the entire enterprise.

What happens here?

• Cleaning and Standardization - Engineers fix the formatting. For example, if one source system writes dates as DD-MM-YYYY and another writes MM-DD-YYYY, the Silver layer converts them all into one standard format.
• Filtering and Quarantining - Junk data is handled here. If a user accidentally enters an age like 999, the system spots it and instead of deleting it, engineers push that bad record into a separate quarantine table so it doesn't ruin the main data set, but can still be investigated later.
• Deduplication - Sometimes, source systems can glitch and send the same receipt twice. The Silver layer strips out duplicates so every row is unique.
• Joining - Data from different tables is connected using relationships. A log of customer purchases is joined with a product inventory table so that you can see exactly what item was bought, not just a random product ID number.
• Security - This is where sensitive information (like passwords, social security numbers, or personal emails) is scrambled or hidden so that analysts using the data later on cannot see private customer details.
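
The first three steps above (standardizing, quarantining and deduplicating) can be sketched as follows. This is a simplified illustration: the sample rows, the "fmt" field that records each source's date format, and the 0 to 120 age range used as the validity rule are all assumptions.

```python
from datetime import datetime

# Hypothetical Bronze rows; "fmt" records which date format each source system uses.
bronze_rows = [
    {"receipt_id": 1, "date": "25-12-2023", "fmt": "%d-%m-%Y", "age": 34},
    {"receipt_id": 1, "date": "25-12-2023", "fmt": "%d-%m-%Y", "age": 34},   # duplicate
    {"receipt_id": 2, "date": "12-26-2023", "fmt": "%m-%d-%Y", "age": 999},  # bad age
    {"receipt_id": 3, "date": "12-27-2023", "fmt": "%m-%d-%Y", "age": 51},
]

silver, quarantine, seen = [], [], set()
for row in bronze_rows:
    # Deduplication: skip receipts that were already processed.
    if row["receipt_id"] in seen:
        continue
    seen.add(row["receipt_id"])

    clean = {
        "receipt_id": row["receipt_id"],
        # Standardization: every date becomes YYYY-MM-DD.
        "date": datetime.strptime(row["date"], row["fmt"]).date().isoformat(),
        "age": row["age"],
    }

    # Quarantining: implausible values are set aside, not deleted.
    if 0 <= clean["age"] <= 120:
        silver.append(clean)
    else:
        quarantine.append(clean)

print(silver)
print(quarantine)
```

The bad record still exists in the quarantine list, so it can be investigated later without polluting the main data set.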

Data scientists and analysts spend a lot of time in the Silver layer. It is clean and trustworthy, but it is still highly detailed. Every single individual action is visible, which makes it the perfect place to look for hidden trends or train machine learning models.

3. The Gold Layer(The Action Zone)
The Gold layer is the final destination. The data here is no longer meant for deep exploration but is designed to answer specific business questions immediately.

What happens here?

In the Silver layer, you might have a table with ten million individual rows. If a user tries to load the rows into a dashboard, the software will freeze. However, in the Gold layer, those millions of rows are turned into highly summarized, bite-sized metrics.
• Aggregations - Instead of listing every single sale, engineers create a Gold table that simply shows Total Sales per Store per Day.
• Business Logic - This is where specific company rules live. If your marketing team defines an active subscriber as someone who has opened an email in the last 30 days, that exact mathematical rule is applied to a Gold table.
• Performance - Data loads quickly since it is heavily summarized and simplified using a Star Schema layout. When you connect a Business Intelligence tool to the Gold layer, the charts populate almost immediately.
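
As a rough sketch of the "Total Sales per Store per Day" aggregation, the snippet below summarizes a handful of Silver rows into a Gold table. The row layout and the names silver_sales and gold_daily_sales are hypothetical.

```python
from collections import defaultdict

# Hypothetical Silver rows: one row per individual sale.
silver_sales = [
    {"store": "Nairobi", "day": "2024-01-01", "amount": 20.0},
    {"store": "Nairobi", "day": "2024-01-01", "amount": 30.0},
    {"store": "Nairobi", "day": "2024-01-02", "amount": 10.0},
    {"store": "Mombasa", "day": "2024-01-01", "amount": 15.0},
]

# Sum the amounts for each (store, day) pair.
totals = defaultdict(float)
for sale in silver_sales:
    totals[(sale["store"], sale["day"])] += sale["amount"]

# Gold table: one summarized row per store per day.
gold_daily_sales = [
    {"store": store, "day": day, "total_sales": total}
    for (store, day), total in sorted(totals.items())
]
print(gold_daily_sales)
```

Four detailed rows become three summary rows here; at real scale, millions of rows collapse into a table small enough for a dashboard to load comfortably.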

Why the Medallion Architecture Wins

Many modern data teams are adopting this structure because it solves some of the biggest headaches that have plagued developers for decades.
1. Bulletproof Data Lineage
When an executive looks at a Gold dashboard and sees that monthly revenue dropped by 50%, panic sets in. The data team needs to find out if the business is actually failing, or if the system is just broken.
With this architecture, they can trace the flow backward. They check the rules in the Gold layer. If those are correct, they look at the cleaned data in the Silver layer. If that looks fine, they check the raw files in the Bronze layer.
2. Extreme Flexibility
If the finance department suddenly requests a completely new way to calculate annual growth, the data team doesn't have to panic. They do not have to go back to the original software sources, and they do not have to re-clean everything. They simply build a new Gold table on top of the already clean Silver data.
3. System Reliability (ACID Transactions)
Modern Medallion architectures are built on specialized table formats like Delta Lake or Apache Iceberg which support ACID transactions. That means if a server crashes halfway through moving data from Silver to Gold, it won't leave you with a half-finished, corrupted table. The system will automatically roll back to the last safe state, preventing bad data from leaking into executive reports.

Conclusion

If you want to remember how the Medallion Architecture functions, just remember these three phrases:
Bronze - Here is everything we found (Messy, huge, exact copies).
Silver - Here is what actually happened (Clean, standardized, truthful).
Gold - Here is what we should do about it (Summarized, fast, ready for action).
The Foundation of Trust
A data platform is only as useful as the trust people put into it. If employees constantly find missing numbers, broken charts, or conflicting reports, they will abandon the dashboards and go back to guessing.
The Medallion Architecture is much more than a way to organize servers; it is a framework for building organizational trust. By moving information systematically through the Bronze, Silver, and Gold layers, a company guarantees that every digital footprint is captured securely, cleaned relentlessly, and presented flawlessly. Just like turning muddy water into something safe to drink, the Medallion Architecture takes the chaos of raw information and refines it into the exact clarity a business needs to survive.

Transactional Power Vs Analytical Precision: The Essential Guide to OLTP and OLAP

Lawrence Murithi — Fri, 01 May 2026 19:58:49 +0000

Introduction

Behind every digital interaction is a fundamental divide in how data is handled. The system required to process your grocery checkout with lightning speed is radically different from the system a corporation uses to analyze a decade of sales growth. This is the core distinction between Transactional Power and Analytical Precision. To understand the backbone of modern technology, you must understand OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing).
Though they sound like technical jargon, they are simple concepts that define how businesses operate and grow.
This article serves as your roadmap to understanding how these systems function, their unique strengths, and why the balance between them is the secret to data-driven success.

OLTP(Online Transaction Processing): Handling the Day-to-Day
OLTP is the engine that runs traditional databases. It is designed to manage everyday business operations and process thousands of short, fast interactions per second. It is the system that handles the daily, minute-by-minute work of a business. Whenever a specific action or transaction takes place, OLTP is the system taking care of it.
In a database, a transaction is any small unit of work such as changing your password.
Transaction systems follow important rules called ACID properties.
ACID Properties are a set of four fundamental principles that guarantee reliable database transactions. They ensure data integrity and accuracy, preventing corruption even during system failures or concurrent operations.
The four principles are:
Atomicity(All-or-Nothing) - A transaction is treated as a single unit. It either fully completes or entirely fails and rolls back.
Consistency(Data Integrity) - A transaction ensures the database moves from one valid state to another, adhering to all constraints and rules. That means data remains valid before and after the transaction.
Isolation(Concurrent Control) - Concurrent transactions are isolated from each other, ensuring they don’t interfere with each other.
Durability(Permanent Data) - Once a transaction is committed, its changes are permanently saved and will survive system failures or crashes.
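
Atomicity and consistency are easier to see in action. The sketch below uses Python's built-in sqlite3 module; the accounts table, the transfer function and the balances are hypothetical, and SQLite is used only because it ships with Python, not because the article prescribes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: no balance may go below zero.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('current', 100), ('savings', 0)")
conn.commit()

def transfer(amount: int) -> bool:
    """Move money from 'current' to 'savings' as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'savings'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'current'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(50)         # both updates succeed and are committed together
rejected = transfer(500)  # second update breaks the rule, so the first is undone too
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, rejected, balances)
```

The failed transfer leaves no trace: the money that was briefly added to savings is rolled back along with the rejected withdrawal.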

Examples of OLTP in real life

Adding an item to your online shopping cart.
Booking an airline ticket.
Sending a text message.
Banking systems (Mpesa, ATM transactions)

Think of OLTP like the cashier at a busy grocery store. The cashier’s job is to scan items quickly, take your money, hand you a receipt, and move on to the next person.

How OLTP Works

OLTP systems prioritize speed and accuracy. They use a design concept called normalization. This means the database organizes data into many small tables to avoid saving the same piece of information twice. Because the data is spread out neatly, the system can insert a new record, update a row, or delete a piece of data almost instantly.
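
A small sketch of normalization, again with Python's built-in sqlite3 module. The customers and orders tables and the sample e-mail addresses are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The customer's e-mail lives in exactly one place.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
    -- Orders point at the customer instead of copying the e-mail.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total       REAL
    );
    INSERT INTO customers VALUES (1, 'john@old.example');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
""")

# Changing the e-mail touches a single row, no matter how many orders exist.
cursor = conn.execute(
    "UPDATE customers SET email = 'john@new.example' WHERE customer_id = 1"
)
rows_touched = cursor.rowcount

# Every order sees the new value through the relationship.
emails = conn.execute("""
    SELECT o.order_id, c.email
    FROM orders o JOIN customers c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
print(rows_touched, emails)
```

Because nothing is stored twice, small writes stay small, which is exactly what a write-heavy OLTP system needs.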

Example

Imagine you want to withdraw $50 from an ATM. The bank's OLTP system immediately checks your balance, approves the withdrawal, and updates your account to show $50 less. This has to happen in seconds, and it has to be 100% accurate so you cannot overdraw your account.

Key features of OLTP

• Low latency/Fast response time - When you swipe your card, you expect it to be approved in seconds. OLTP databases are built to respond instantly.
• High number of users - The system ensures that thousands of users can read from and write to the same database at the same time without failure.
• Normalized Data - Databases are typically highly normalized to reduce redundancy and ensure fast data entry. A single OLTP transaction does not require much data.
• Real-time processing/Accuracy - If you transfer $50 from your current account to your savings account, the system must subtract $50 from one and add $50 to the other. If the system crashes halfway through, the OLTP system cancels the whole thing so your data does not get corrupted. OLTP systems are built to be perfectly accurate and fail-safe.
• Write-heavy operations - Thousands of users might be doing things at the exact same time, so the system is constantly writing, updating or deleting information in the database.
• Highly available - Because OLTP systems handle the immediate, day-to-day operations of a business, they are designed to be online, working, and accessible virtually 100% of the time. Downtime is not an option.
OLTP systems are usually built with backup servers and fail-safes. If one server crashes, another one instantly takes over so the customer doesn't notice a glitch.

Pros of OLTP

• Efficiency in Data Entry - Highly optimized for adding, modifying, or deleting records.
• Data Integrity - High reliability due to ACID compliance.
• Availability - Designed for 24/7 uptime for business-critical applications.

Cons of OLTP

• Inefficient for complex Analysis - If you ask an OLTP database to calculate the average sales of a product over the last five years, it will have to scan millions of everyday records. This takes a lot of computing power and can slow down the system for people trying to use it for normal tasks.
• Limited History - To keep things fast, OLTP systems usually only hold current or recent data. Old data is often moved somewhere else to save space.

OLAP (Online Analytical Processing)
OLAP is the engine behind data warehouses. If OLTP is the system for doing things, OLAP is the system for analyzing things. While OLTP only looks at a tiny slice of data at a time, OLAP is the brain used for strategic planning, since it is designed for data mining and complex reporting: processing huge amounts of information to find patterns, trends and summaries. Managers, data scientists, and business owners use OLAP to spot trends, build reports and make big decisions.

Making Sense of OLAP

Think of OLAP as the manager in the back office of the grocery store. They aren't ringing up customers. They are sitting at a desk, looking at charts and graphs of past sales to decide if they need to order more apples for next week.

How OLAP Works

OLAP systems are not built to process quick, small updates. They are built to read and summarize huge volumes of data. To make those reads faster, OLAP uses denormalization. Instead of spreading data across many tiny tables like OLTP, OLAP groups massive amounts of related data together into large tables. This takes up more storage space, but it means the system can read through billions of records very quickly to find patterns.

Key features of OLAP

• Read-heavy operations - Unlike OLTP, which is constantly writing new data (new orders, new users), OLAP mostly just reads old data. It looks at what already happened.
• Complex Queries - OLAP tasks involve complex math: adding, averaging, and grouping massive lists of numbers.
• Multidimensional Analysis - Users can slice and dice data (e.g. viewing sales by region, then by month, then by product category) using data cubes.
• Denormalized Data - Databases often use Star or Snowflake schemas to reduce the number of table joins needed for queries.
• Slower response time - While nobody wants to wait all day, an OLAP report might take a few minutes or even a few hours to run. This usually is not a concern since the person waiting is usually a business manager, not a customer standing at a checkout counter.

Pros of OLAP

• Handles Massive Data - It can easily process millions or billions of rows of historical data.
• Does Not Disrupt the Business - Because OLAP lives in a data warehouse, running a massive, heavy report will not slow down the cash registers running on the OLTP database.
• High Performance for Reporting - Optimized for complex analytical queries.
• Strategic Insights - Allows businesses to identify trends, patterns, and anomalies to drive decision-making.
• User-Friendly - The system is often integrated with Business Intelligence tools like Power BI for visualization.

Cons of OLAP

• Data is Not Real-Time - OLAP systems are usually updated in batches, often overnight. If you look at an OLAP report at 2:00 PM, it usually only includes data up until the night before.
• Slow to Update - Adding new data to an OLAP system takes time because the data has to be heavily organized and formatted before it is saved.
• Expensive and Complex - Building and maintaining a data warehouse requires specialized engineers and large amounts of server storage.
• Latency - Queries can take seconds, minutes, or even hours because of the massive volume of data being scanned.

Example

A regional manager for a coffee shop chain wants to know, "Between hot chocolate or dark roast coffee, which sold better on rainy days last year?" To answer this, the system has to look at weather data, sales data from fifty stores and a whole year of dates. An OLAP system can pull this specific report together without breaking a sweat.
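
The rainy-day question can be sketched as a single analytical query. The example below uses Python's built-in sqlite3 module as a tiny stand-in for a warehouse; the weather and sales tables and every number in them are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather (day TEXT PRIMARY KEY, rainy INTEGER);
    CREATE TABLE sales   (day TEXT, product TEXT, cups INTEGER);
    INSERT INTO weather VALUES ('2023-04-01', 1), ('2023-04-02', 0), ('2023-04-03', 1);
    INSERT INTO sales VALUES
        ('2023-04-01', 'hot chocolate', 40), ('2023-04-01', 'dark roast', 25),
        ('2023-04-02', 'hot chocolate', 10), ('2023-04-02', 'dark roast', 60),
        ('2023-04-03', 'hot chocolate', 35), ('2023-04-03', 'dark roast', 30);
""")

# Combine sales with weather, keep only rainy days, and total each product.
rainy_day_totals = conn.execute("""
    SELECT s.product, SUM(s.cups) AS total_cups
    FROM sales s
    JOIN weather w ON w.day = s.day
    WHERE w.rainy = 1
    GROUP BY s.product
    ORDER BY total_cups DESC
""").fetchall()
print(rainy_day_totals)
```

In this made-up data set, hot chocolate wins on rainy days. A real OLAP system runs the same kind of query, only over a full year of data from all fifty stores.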

Examples of OLAP in real life

Netflix figuring out what genres of movies are most popular in different countries during the summer.
A hospital analyzing patient records over ten years to see if a specific treatment is working.
A retail store deciding how much inventory to buy for Black Friday based on the last three years of sales.

Common OLAP Operations

OLAP systems organize massive amounts of data into multi-dimensional structures, often referred to as OLAP cubes. These cubes allow users to view business metrics from any angle. To explore, analyze, and make sense of this complex data, OLAP systems support several powerful analytical operations.

Here is a detailed look at the five core OLAP operations:
1. Roll-Up (Consolidation)
Roll-up is also known as consolidation or aggregation and involves summarizing data to a higher, more generalized level. This operation reduces the detail of the data by climbing up a concept hierarchy or by removing a dimension entirely. It is primarily used by upper management to view macro-level business trends.
It uses mathematical functions such as summing, averaging or counting to group smaller data points into larger, overarching categories.
Example (Time Hierarchy)
Daily sales → Monthly sales → Yearly sales.

If a company has millions of records of individual daily transactions, viewing them all at once can be overwhelming. Using a roll-up operation, an executive can consolidate these daily records to see total sales by month, and then roll up again to see the total gross revenue for the entire year.
Business Value - Roll-up provides a big picture view of business performance, stripping away unnecessary granular details to highlight overarching trends.
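
A roll-up along the time hierarchy can be sketched in plain Python. The daily_sales figures and the roll_up helper are hypothetical; the helper relies on dates being written as YYYY-MM-DD so that the first 7 characters give the month and the first 4 give the year.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date (YYYY-MM-DD).
daily_sales = {
    "2024-01-15": 100,
    "2024-01-16": 150,
    "2024-02-01": 200,
    "2025-01-10": 50,
}

def roll_up(sales, key_length):
    """Sum the values by the first key_length characters of the date."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        totals[day[:key_length]] += amount
    return dict(totals)

monthly_sales = roll_up(daily_sales, 7)  # group by 'YYYY-MM'
yearly_sales = roll_up(daily_sales, 4)   # group by 'YYYY'
print(monthly_sales)
print(yearly_sales)
```

Each step up the hierarchy produces fewer, larger numbers, which is the "big picture" view described above.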

2. Drill-Down
Drill-down is the exact opposite of roll-up. It involves navigating from highly summarized, macro-level data down to highly detailed, micro-level data. This is done by stepping down a concept hierarchy or by adding a new dimension to the dataset.
It breaks a larger aggregated number into the smaller components that make it up, allowing analysts to uncover the root causes behind a specific metric.
Example (Geography & Time Hierarchy)
Yearly sales → Monthly sales → Daily sales (or Country → Region → Individual Store).

Imagine an annual report shows that total yearly sales are significantly lower than expected. A manager can drill down from the yearly view to the monthly view and discover in what specific month sales plummeted. They can then drill down further into that month's daily sales to find which specific days caused the drop.
Business Value - It is essential for root-cause analysis, troubleshooting anomalies, and investigating sudden spikes or drops in performance.
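
The investigation described above might look like this in plain Python. The figures and the totals_by helper are hypothetical, and the helper assumes dates written as YYYY-MM-DD.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date (YYYY-MM-DD).
daily_sales = {
    "2024-03-04": 120,
    "2024-03-05": 110,
    "2024-04-08": 20,
    "2024-04-09": 115,
}

def totals_by(sales, key_length, within=""):
    """Total the sales at a chosen level of detail, restricted to one period."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        if day.startswith(within):
            totals[day[:key_length]] += amount
    return dict(totals)

by_year = totals_by(daily_sales, 4)                         # the summary view
by_month = totals_by(daily_sales, 7, within="2024")         # drill into the year
weakest_month = min(by_month, key=by_month.get)             # find the weak spot
by_day = totals_by(daily_sales, 10, within=weakest_month)   # drill into that month
print(by_year, by_month, weakest_month, by_day)
```

Each step adds detail only where it is needed, so the analyst ends up at the handful of days that explain the weak month.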

3. Slice
The slice operation performs a selection on one specific dimension of the OLAP cube, resulting in a new, smaller slice of the data.
Think of it like slicing a single piece of bread from a whole loaf. It locks one variable in place so you can analyze the rest of the data in a two-dimensional table.
You isolate a single value within one dimension (e.g., Time, Geography, or Product) while keeping the other dimensions open.
Example
Show sale records for Nairobi city only.

If a data cube contains sales data across Products, Time, and Cities, applying a slice on the City dimension for Nairobi isolates that market. The resulting view will show the sales of all products over all time periods, but exclusively for the Nairobi location.
Business Value - It allows regional managers, department heads or specific product owners to filter out irrelevant data and focus entirely on the one area of the business they are responsible for.
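
A slice can be sketched with a tiny cube stored as a list of tuples. The cube contents and the slice_cube helper are hypothetical.

```python
# Hypothetical cube: one tuple per (product, month, city, sales) cell.
cube = [
    ("Laptop", "Jan", "Nairobi", 10),
    ("Laptop", "Jan", "Mombasa", 4),
    ("Phone", "Jan", "Nairobi", 25),
    ("Phone", "Feb", "Kisumu", 7),
    ("Laptop", "Feb", "Nairobi", 12),
]

def slice_cube(rows, city):
    """Fix the City dimension to one value; Product and Time stay open."""
    return [(product, month, sales) for product, month, c, sales in rows if c == city]

nairobi = slice_cube(cube, "Nairobi")
print(nairobi)
```

The result no longer needs a city column at all, which is why a slice of a three-dimensional cube can be shown as a simple two-dimensional table.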

4. Dice
While a slice filters data based on a single condition, a dice operation isolates a highly specific sub-cube by applying multiple filters across two or more dimensions simultaneously.
Think of it like cutting a smaller block out of a larger block of cheese.
It selects specific ranges or values across multiple dimensions to create a highly targeted subset of the original data.
Example
Show laptop sales in Nairobi and Mombasa during January and February.

Here, the user is applying filters across three separate dimensions: the Product dimension (Laptops only), the Geography dimension (Nairobi and Mombasa only) and the Time dimension (January and February only).
Business Value - Dicing is used for highly specialized, multi-faceted analysis. It allows data scientists and marketers to look at exact intersections of data, such as evaluating the success of a specific promotion for a specific tech product in selected cities during a specific period.
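
The laptop example can be sketched the same way. The cube contents and the dice helper are hypothetical.

```python
# Hypothetical cube: one tuple per (product, month, city, sales) cell.
cube = [
    ("Laptop", "Jan", "Nairobi", 10),
    ("Laptop", "Feb", "Mombasa", 6),
    ("Laptop", "Mar", "Nairobi", 9),
    ("Phone", "Jan", "Nairobi", 25),
    ("Laptop", "Jan", "Kisumu", 3),
]

def dice(rows, products, months, cities):
    """Keep only the cells that match the allowed values on all three dimensions."""
    return [
        row for row in rows
        if row[0] in products and row[1] in months and row[2] in cities
    ]

sub_cube = dice(
    cube,
    products={"Laptop"},
    months={"Jan", "Feb"},
    cities={"Nairobi", "Mombasa"},
)
print(sub_cube)
```

Unlike a slice, the result is still a cube with all three dimensions, just a much smaller one.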

5. Pivot (Rotate)
Pivot, sometimes called rotation, does not filter or change the underlying data. Instead, it changes the visual perspective. It rotates the data axes to provide an alternative presentation, making different relationships easier to spot.
It rearranges the layout of the data, typically by swapping rows and columns, or by moving a dimension from the background into the foreground.
Example
Swapping Products and Time periods.

A manager might be looking at a table where Products (Laptops, Phones, Tablets) are listed in the rows and Months (January, February, March) are the columns. By pivoting the data, they can make Months the rows and Products the columns.
Business Value - Different layouts highlight different trends. A pivot makes it easier to compare data side-by-side depending on what the analyst is trying to prove, ensuring the final report is as readable and impactful as possible.
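
A pivot can be sketched with nested dictionaries, where the outer keys are the rows and the inner keys are the columns. The figures and the pivot helper are hypothetical.

```python
# Hypothetical table: products as rows, months as columns.
by_product = {
    "Laptops": {"January": 10, "February": 12, "March": 9},
    "Phones": {"January": 25, "February": 20, "March": 30},
}

def pivot(table):
    """Swap the row and column dimensions without changing any value."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

by_month = pivot(by_product)  # months as rows, products as columns
print(by_month)
```

Pivoting twice returns the original layout, which confirms that only the presentation changes and never the data.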
NB: OLAP is not mainly about recording what is happening right now. It is about understanding what has happened and what it means.

OLTP vs. OLAP

The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) boils down to two distinct phases of business: execution and strategy. Simply put, OLTP runs the business, while OLAP analyzes the business.
These two systems are designed for fundamentally different jobs. Understanding how they differ and how they work together comes down to understanding their relationship with time, purpose, and data architecture.

Here is a detailed comparison of how the two systems operate.
1. Main Purpose and System Goals
OLTP - Its primary objective is to handle daily business operations and execute transactions seamlessly. Its core focus is on accuracy, transaction safety, and ensuring the day-to-day business continues without interruption.
OLAP - Its primary objective is to extract valuable insights from data to help leadership make smart, strategic decisions. Instead of facilitating transactions, it focuses on reporting, identifying long-term trends, and planning for the future.

2. The User Profiles
OLTP - These systems are used by everyday customers, cashiers, front-line staff, and mobile applications. These are the people actively interacting with the business in real time: buying items, logging into portals or booking appointments.
OLAP - These systems are utilized by business analysts, managers, and corporate executives. These users interact with data using dashboards, Business Intelligence reports and complex spreadsheets to evaluate business performance.

3. Data State and Architectural Design
OLTP - Data is current, real-time, and highly operational. Since the data is constantly changing, the database is highly normalized to ensure efficiency and eliminate data redundancy. It is optimized to handle a constant stream of inserting, updating, and deleting small bits of data.
OLAP - Data is historical, static, and rarely changes. It consists of summarized data spanning months or years. Because the goal is fast analysis rather than fast updates, the database is often denormalized allowing the system to efficiently read millions of rows of data at once without altering them.

4. Query Dynamics and Performance Needs
OLTP - Queries are short, simple, and require incredibly fast response times per transaction. They generally touch only a few records at a time.
Example Queries - Update bread's price to $10. What is John's email address? Update a specific customer's order.
OLAP - Queries are heavy, long, and highly complex. While speed is still important, the system is built to process massive analytical workloads rather than split-second individual actions.
Example Query - What is the average age of customers who bought bread in November of 2022? or Show the global sales trends broken down by region over the past 5 years.

5. Real-World Examples
OLTP Systems - ATMs, retail checkout registers, airline booking systems, and e-commerce shopping carts.
OLAP Systems - Corporate data dashboards, annual financial reports, and Business Intelligence (BI) platforms.

The Synergy(How OLTP and OLAP Work Together)

A successful business relies on a symbiotic relationship between both systems. You cannot accurately analyze a business if you do not have an OLTP system reliably recording the daily sales. Conversely, you cannot grow a business if you lack an OLAP system to look back at your history and determine what strategies are actually working.

So, how do the two systems connect?
They are linked through a pipeline process known as ETL (Extract, Transform, Load).
Every day, the OLTP database handles the rapid work of serving customers and processing transactions. At the end of the day, usually at night when customer traffic and system strain are at their lowest, an automated batch script runs.
Extract - The script pulls a copy of the day's newly generated operational data from the OLTP database.
Transform - It cleans, formats, and aggregates that raw data to ensure it is properly structured for analysis.
Load - Finally, the script deposits that formatted data into the OLAP data warehouse.
By the time the business analysts and executives log into their dashboards the next morning, the OLAP warehouse is fully updated with yesterday's finalized numbers. The data is now perfectly prepped to be searched, graphed, and studied.
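
The nightly job described above can be sketched with two in-memory SQLite databases standing in for the OLTP system and the OLAP warehouse. The table names, the run_nightly_etl function and the sample orders are hypothetical, and a real pipeline would normally use a scheduler and dedicated tooling instead of a single script.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")  # stand-in for the operational database
olap = sqlite3.connect(":memory:")  # stand-in for the data warehouse

oltp.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store TEXT, amount REAL, day TEXT);
    INSERT INTO orders VALUES
        (1, ' nairobi', 20.0, '2024-05-01'),
        (2, 'Nairobi',  30.0, '2024-05-01'),
        (3, 'Mombasa',  15.0, '2024-05-01'),
        (4, 'Mombasa',  99.0, '2024-04-30');
""")
olap.execute("CREATE TABLE daily_store_sales (day TEXT, store TEXT, total REAL)")

def run_nightly_etl(day):
    # Extract: pull a copy of that day's orders from the OLTP database.
    rows = oltp.execute(
        "SELECT store, amount FROM orders WHERE day = ?", (day,)
    ).fetchall()

    # Transform: tidy up store names and aggregate to one total per store.
    totals = {}
    for store, amount in rows:
        name = store.strip().title()
        totals[name] = totals.get(name, 0.0) + amount

    # Load: write the summarized rows into the warehouse in one transaction.
    with olap:
        olap.executemany(
            "INSERT INTO daily_store_sales VALUES (?, ?, ?)",
            [(day, store, total) for store, total in sorted(totals.items())],
        )

run_nightly_etl("2024-05-01")
loaded = olap.execute("SELECT * FROM daily_store_sales ORDER BY store").fetchall()
print(loaded)
```

The warehouse receives two tidy summary rows for the day, while the OLTP database is only read from and never modified.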

The Bottom Line

The difference between OLTP and OLAP simply comes down to time. While OLTP handles the exact moment a transaction occurs, OLAP handles months or years of historical data that the transactions leave behind. Together, they allow a business to operate today while intelligently planning for tomorrow.

Conclusion

Every time you interact with a screen, you are leaving a digital footprint. Databases are the safe spaces that hold those footprints. OLTP ensures daily transactions are fast and secure. Data warehouses collect all those footprints over time. Finally, OLAP helps businesses look at the giant trail of footprints to figure out where they should step next.
These tools might be invisible, but they are the engine running modern business, keeping our digital lives fast, organized, and constantly improving.

From Tables to Tides: Navigating Databases, Warehouses, Marts, Lakes, and the Lakehouse Revolution

Lawrence Murithi — Fri, 01 May 2026 17:13:47 +0000

Introduction

Every time you buy a coffee with a card, "like" a post on social media, withdraw money from an ATM or buy a shirt online, you are interacting with a database. Behind the scenes of every app and website, data is constantly being created, moved, stored and read.
However, not all data storage is the same. The way a system stores your checkout items at a grocery store is very different from the way that same grocery chain analyzes ten years of sales trends.
To understand how modern software handles data, we need to look at the main types of storage: traditional databases, data warehouses, data marts, data lakes and lakehouses.

If you are not a computer guru, these terms might sound very technical, but once you break them down, they make perfect sense.
This article gives a simple but detailed breakdown of what these are, how they work, and why modern software relies on them.

The Basics of Data Storage

In today's data-driven world, organizations generate massive amounts of information. To effectively store, manage, and analyze this data, businesses use different architectural models based on their specific needs.
Before we look at the specific processing types, it helps to understand the physical or virtual places where data lives.

1. The Database (The Daily Worker/Operational Engine)
Think of how you keep track of your personal budget. You might use a spreadsheet. A spreadsheet is great for one person looking at a few hundred rows of information. Now imagine a company like Amazon trying to use a spreadsheet to track millions of orders happening every minute. The spreadsheet would freeze and crash instantly.
A database is like a highly advanced, incredibly secure digital filing cabinet built to store massive amounts of information without crashing. Databases are primarily designed for OLTP (Online Transaction Processing). They are the workhorses that power day-to-day operations, such as processing bank transactions, managing inventory, or storing user profiles. A database's main job is to quickly record new information, update existing information, and allow users to quickly look up specific details. More importantly, it is organized so that users can find exactly what they are looking for in a fraction of a second.
Information in a standard database is usually organized into tables with rows and columns. For example, an online store might have one table for Customers, one for Products, and one for Orders. The database connects these tables so the system knows exactly which customer bought which product. Think of a traditional database like a busy cash register. It needs to be fast, accurate, and handle hundreds of transactions at once without freezing.

Key Characteristics of a Database

ACID Compliance – Traditional relational databases follow strict rules (Atomicity, Consistency, Isolation, Durability) to ensure that transactions are processed reliably and that data remains accurate even in the event of a system crash.
Normalized Structure – Data is organized into tables to reduce redundancy. For example, a customer’s address is stored in one place rather than being repeated for every order they place.
Real-Time Interaction – Databases are designed to handle thousands of concurrent users making small, rapid changes to the data simultaneously.

Types of Databases

Relational (SQL) - Uses tables with rows and columns (e.g., MySQL, PostgreSQL, Oracle). Ideal for structured data where relationships are clearly defined.
Non-Relational (NoSQL) - Uses flexible structures like documents, wide columns or graphs (e.g., MongoDB, Cassandra). Ideal for rapidly changing data types and massive scaling.

2. The Data Warehouse (The Long-Term Archive/Analytical Hub)
As a business runs, over time, its database fills up with millions of past transactions. After a few years, a company manager might want to know, "Which of our stores sold the most winter coats in December over the last five years?"
To answer that question, the database has to dig through millions of old records, which slows it down. In the worst case the system freezes, and people trying to buy things on the website at that moment cannot check out.
Using the grocery store analogy, a store manager walking up to a cashier who has a long line of customers and asking them to calculate the store's total profit for the last decade would cause a crisis and bring the whole store to a halt. To fix this, companies build Data Warehouses.
A data warehouse is a massive storage system designed to hold historical data from many different sources, such as operational databases, CRM systems and flat files, to provide a comprehensive, historical view of the entire organization. Periodically, usually overnight, the company copies all the new data from these sources and loads it into the data warehouse.
From the previous example, if the database is the cash register, the data warehouse is the company's central filing room. A data warehouse takes the daily receipts from all the different cash registers, organizes them and stores them for years.
The data warehouse acts as the company's long-term memory. It doesn't handle everyday customer actions. Instead, it is a quiet, organized space where business analysts can run massive queries and reports without interrupting the live website.
Data warehouses utilize OLAP (Online Analytical Processing). Instead of focusing on individual transactions, they are optimized to scan millions of rows to find trends, averages and insights.

The ETL Process (The Warehouse Engine)

Before data enters a warehouse, it must undergo ETL (Extract, Transform, Load).
Extract - Pulling data from multiple, often messy, source systems.
Transform - Cleaning, deduplicating, and formatting the data into a standardized structure.
Load - Moving the clean data into the warehouse.
This is known as Schema-on-Write, meaning the structure of the data must be defined and validated before it can be stored.

Key Benefits of a Data Warehouse

Data Integration – It breaks down data silos by combining information from marketing, sales, and finance into one single source of truth.
Historical Context – While databases often only show current data, warehouses store years of historical records, allowing for year-over-year comparisons.
Optimized for Performance – Warehouses often use columnar storage, which allows them to answer complex analytical questions such as "What was the total revenue for 2023?" significantly faster than a standard database.
High Quality & Accuracy – Because data is cleaned during the ETL process, business leaders can trust that the reports they generate are based on accurate, non-conflicting information.

Why use a Data Warehouse?

NB: A data warehouse is the foundation for Business Intelligence. It allows executives to run complex "What if?" scenarios and generate reports that inform long-term strategy. It also ensures that the operational databases are not slowed down by heavy analytical queries.

3. Data Marts (The Departmental Lens)
A data mart is a highly focused, specialized subset of a data warehouse designed to serve the specific needs of a single department or business unit.
While a traditional Data Warehouse acts as a massive, centralized repository containing all of an organization's structured data, a data mart isolates only the information relevant to a specific team.

Key Benefits of a Data Mart

Enhanced Performance - Because the data mart is smaller and queries are highly specific, reports and dashboards load much faster.
Improved Security - By isolating data, companies can strictly control who has access to sensitive departmental information.
Ease of Use - Business users and analysts do not have to sift through irrelevant enterprise data to find what they need.
Data marts can be Dependent (built by drawing data from an existing enterprise data warehouse) or Independent (built directly from operational systems).

4. Data Lakes (The Raw Data Reservoir)
A data lake is a massive, highly scalable storage system designed to hold vast amounts of raw, unprocessed data in its native format.
Unlike a data warehouse, which requires data to be cleaned, transformed, and structured into strict tables before it can be stored (Schema-on-Write), a data lake stores data exactly as it is generated, assigning structure only when the data is eventually read or queried (Schema-on-Read).
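
The difference between Schema-on-Write and Schema-on-Read can be sketched in plain Python. This is only an illustration: the SALE_SCHEMA dictionary, the function names and the sample events are all assumptions made up for the example, and plain lists stand in for the warehouse table and the lake.

```python
import json

# Hypothetical schema for a "sale" record: field name -> required type.
SALE_SCHEMA = {"store": str, "item": str, "amount": float}


def write_to_warehouse(table, record):
    """Schema-on-Write: validate the structure BEFORE storing the record."""
    for field, expected_type in SALE_SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    table.append({field: record[field] for field in SALE_SCHEMA})


def write_to_lake(lake, raw_line):
    """Schema-on-Read: store the raw text exactly as it arrived."""
    lake.append(raw_line)


def read_sales_from_lake(lake):
    """Structure is applied only now, at query time; unusable lines are skipped."""
    sales = []
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
            sales.append({"store": str(record["store"]), "amount": float(record["amount"])})
        except (ValueError, KeyError, TypeError):
            continue
    return sales


raw_events = [
    '{"store": "Nairobi", "item": "coat", "amount": 59.5}',
    '{"store": "Mombasa", "amount": "20"}',   # missing "item", amount is a string
    'not even json',
]

warehouse_table, lake = [], []
rejected = 0
for line in raw_events:
    write_to_lake(lake, line)                 # the lake accepts everything
    try:
        write_to_warehouse(warehouse_table, json.loads(line))
    except ValueError:                        # also covers json.JSONDecodeError
        rejected += 1

print(len(warehouse_table), rejected, len(lake), read_sales_from_lake(lake))
```

The warehouse-style writer rejects two of the three events up front, while the lake keeps all three and lets the reader decide later which ones can be interpreted.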

What do Data Lakes store?

Structured Data - Traditional tables and relational databases.
Semi-Structured Data - JSON files, XML, CSVs, and server logs.
Unstructured Data - Emails, documents, PDFs.
Binary/Media Data - Images, audio files, and videos.
Streaming Data - Real-time IoT sensor data and website clickstreams.

Why use a Data Lake?

A data lake is ideal when an organization wants to capture and retain everything, even data they don't immediately need. It is highly cost-effective because it utilizes cheap cloud storage. Furthermore, having raw, unmanipulated data is essential for training artificial intelligence (AI) and complex Machine Learning (ML) models.
NB: Without proper organization and governance, a data lake can become a messy, unsearchable Data Swamp.

5. Data Lakehouse (The Modern Hybrid)
For years, companies had to maintain a two-tier architecture: a Data Lake for raw data and machine learning, and a separate Data Warehouse for clean data and business reporting. This resulted in high storage costs, data duplication, and complex maintenance.
A Data Lakehouse is a modern architectural design that merges the best concepts of both systems. It is built directly on top of cheap data lake storage, but it applies the organizational structures, management tools, and performance speeds of a data warehouse.

Key Features of a Lakehouse

Flexibility & Scale - Like a data lake, it can store massive amounts of structured, semi-structured, and unstructured data.
Reliability & Structure - Like a data warehouse, it supports ACID transactions (meaning data is reliable, updates don't break the system, and multiple people can read/write simultaneously).
Single Source of Truth - Teams no longer have to copy data from the lake to the warehouse. Business analysts can build BI dashboards, and data scientists can run machine learning models directly on the exact same data platform.

Summary of the Storage systems

The Bottom Line

In today's modern economy, data is a company’s most valuable asset. However, data only provides value if it can be accessed, analyzed, and trusted. By understanding the distinctions between these storage methods, organizations can build a robust infrastructure that avoids the Data Swamp, reduces operational costs, and ultimately turns raw information into a competitive advantage.

Conclusion

Choosing the right data storage architecture is no longer about finding a one-size-fits-all solution but matching the right tool to the specific needs of the business. As organizations evolve from simple record-keeping to complex artificial intelligence and real-time analytics, their data strategy must also mature.
For Day-to-Day Operations, the Database remains the essential engine, ensuring that transactions are processed accurately and instantly.
For Strategic Reporting, the Data Warehouse and its specialized Data Marts provide the single source of truth needed for executive decision-making and departmental efficiency.
For Big Data & Innovation, the Data Lake serves as the vital reservoir for raw information, fueling the next generation of Machine Learning and AI development.
For the Future of Scalability, the Data Lakehouse represents the convergence of these approaches, offering the best of both worlds: the speed of a warehouse with the massive flexibility of a lake.

Apache Airflow for Beginners: DAGs, Tasks, Operators, and Scheduling Explained

Lawrence Murithi — Wed, 29 Apr 2026 20:24:12 +0000

Introduction

Being a beginner in data engineering can seem very scary. People use technical words like ETL, pipelines, data warehouses, architecture and orchestration. At that point, it is very easy to feel like you need a computer science degree just to understand what they mean. However, most of these terms sound technical but are not as complicated as they seem.
Data engineering, in simple terms, involves extracting data from a source such as websites, social media pages, Excel/CSV files or payment systems, cleaning it, and storing it somewhere (a database, data warehouse or data lake). If you need this done once, you can run a simple Python script. However, if the job must run every hour, every day, or every week, you need a tool that can manage it for you. That's where Apache Airflow comes in.

What is Apache Airflow?

To understand Apache Airflow, think about a process like baking a cake. You do not just throw everything into the oven. You follow steps:

Buy the ingredients
Prepare the dough
Put the dough in the oven
Bake the cake
Let it cool
Add frosting
Serve the cake

Some steps must happen before others. You cannot frost the cake before baking it. You cannot bake the cake before preparing the dough. You also need to know how long each step should take and what to do if something goes wrong.
This kind of process is called a workflow or pipeline and Airflow helps you manage that workflow.
NB: Airflow does not usually do the heavy data processing itself but tells other tools when to do the work.
A workflow may be a data pipeline, a machine learning pipeline, a reporting process, or any process made up of several steps.
Example

extract_data >> clean_data >> load_data >> send_email

Apache Airflow is an open-source platform used to schedule, monitor and manage workflows. It was originally created by Maxime Beauchemin at Airbnb in 2014 to manage increasingly complex data workflows. It helps you decide what task should run first, what should follow, what should happen if something fails, and when the whole process should run again.

Airflow as an Orchestrator

Orchestration refers to arranging many tasks so they run in the right order and at the scheduled time. It makes sure that task B does not run before task A has finished. It also records whether each task succeeded or failed. Without orchestration, you end up with many scripts running manually or through separate cron jobs, which become difficult to manage as your project grows.

Why Airflow?

While a normal Python script could run fine with simple tasks, you need more control as the number of tasks increases. Airflow is useful because data jobs often have many moving parts.

Its main benefits include:

1. Scheduling
Since most data work is repetitive, scheduling lets workflows run automatically at set times or intervals. Airflow also supports time-zone-aware schedules, which helps pipelines that span several regions run when they should.

2. Catchup and Backfilling
If your pipeline breaks over the weekend and you don't fix it until Monday, Airflow knows it missed Saturday and Sunday. With catchup enabled, it will automatically go back and run the missed jobs in order. You can also trigger a backfill for a chosen date range manually.
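
The idea behind catchup can be sketched with plain Python dates. This is a simplified illustration, not Airflow's own scheduling logic (which works with data intervals); the missed_runs function and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def missed_runs(last_successful_run, now, interval):
    """Return the scheduled run times after the last success, up to `now`, oldest first."""
    runs = []
    next_run = last_successful_run + interval
    while next_run <= now:
        runs.append(next_run)
        next_run += interval
    return runs


# Hypothetical daily pipeline: last good run on Friday, fixed on Monday morning.
last_ok = datetime(2026, 4, 24)           # Friday 00:00
fixed_at = datetime(2026, 4, 27, 9, 0)    # Monday 09:00
backlog = missed_runs(last_ok, fixed_at, timedelta(days=1))
print([run.strftime("%a %d %b") for run in backlog])
```

The backlog lists the Saturday, Sunday and Monday runs in chronological order, which is the order a catchup would process them in.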

3. Task Orchestration
Tasks are arranged depending on which task runs first, second, and last.
Example

extract >> transform >> load

This order is critical because if the load task runs before the transform task, the database may receive dirty data. If the transform task runs before extract, there will be no data to clean.
Airflow has parallel execution capabilities to run several tasks simultaneously and wait for all of them to finish before moving to the next step.

4. Monitoring
Knowing whether a standalone script ran successfully often means SSH-ing into a server and digging through log files. Airflow instead provides a centralized web interface for monitoring all your pipelines.
The Web Dashboard/Task Statuses - Airflow comes with an easy-to-read user interface (UI) with color-coded views. You can log in and see exactly which tasks succeeded (green), which are currently running (light green), which are queued (gray) and which failed (red).
Gantt Charts - Visual representations of task duration, helping you identify bottlenecks in your pipeline.
Historical Trends - View the history of a specific pipeline over time to spot intermittent failures or slowing performance.

5. Automated Retries
In the real world, tasks can fail for temporary reasons. An API may be rate-limited, a database might briefly drop a connection, or a network hiccup might occur.
Instead of waking up at 3:00 AM to manually restart a failed script, Airflow handles transient errors gracefully by trying the task again based on the number of retries set.
Example

"retries": 3,
"retry_delay": timedelta(minutes=5)

In this scenario, if the task fails, Airflow waits 5 minutes before trying again and retries up to three times (so the task runs at most four times in total) before marking it as failed.
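
To make the retry behaviour concrete, here is a small plain-Python sketch of the same idea. It is not how Airflow implements retries internally; run_with_retries and flaky_api_call are hypothetical names, and the delay is set to zero so the example finishes immediately.

```python
import time


def run_with_retries(task, retries=3, retry_delay_seconds=0.0):
    """Call `task`; on failure wait and try again, at most `retries` extra times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:      # first try plus all retries are used up
                raise
            time.sleep(retry_delay_seconds)


calls = {"count": 0}


def flaky_api_call():
    """Hypothetical task that fails twice (e.g. rate-limited) and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result, attempts_used = run_with_retries(flaky_api_call, retries=3, retry_delay_seconds=0.0)
print(result, attempts_used)
```

The task succeeds on its third attempt, so nobody has to restart it by hand. A task that keeps failing would raise its error only after the first try and all three retries have been used.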

6. Accessible Logs
Finding out why and when a pipeline breaks is very critical. Airflow attaches isolated logs to every single task execution eliminating the need to hunt through an entire server log file.
A user is also able to click on a failed task directly in the web UI and instantly read the error message for that specific run, reducing debugging time.

7. Failure Handling
When a task fails, letting the rest of the script run can result in corrupt data or crashed databases. Airflow thus stops execution of the downstream tasks preventing bad data from moving through the pipeline.
Airflow can also be configured to send an automated email, Slack message, or other alert when a pipeline fails, ensuring the team is quickly aware of critical data outages.

8. Clear Pipeline Structure
Airflow workflows are written in Python, so the pipeline configuration can be treated like any other software project. Workflows are also visible: anyone can see how tasks connect to each other, so a new person joining the team can open the Airflow UI and understand the pipeline flow.
Workflows can be committed to Git, peer-reviewed, and rolled back if a mistake is made.

The Core Assets (Airflow Terminology)

Before writing any Airflow code, it's important to understand its building blocks and the main terms used in the Airflow world because they describe parts of a workflow system.

1. DAG
In Airflow, a full workflow is called a DAG (Directed Acyclic Graph).
Directed - the workflow moves in one direction. The process has a starting point and an ending point and does not move backward.
Acyclic - there are no loops. A workflow must have a clear start and a clear end, so loops are not allowed: they would create endless cycles and the pipeline might never finish running.
Graph - a structure made up of points and connections. The points are tasks and the connections are dependencies.
A DAG is, therefore, a workflow made up of tasks arranged in a clear order and indicating how they connect with each other.
Example

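from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="stock_etl_dag",
    start_date=datetime(2026, 4, 20),
    schedule=timedelta(hours=1),  # run once every hour
    catchup=False,  # skip runs that were missed in the past
) as dag:
    ...  # tasks are defined inside this block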
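
To see why the "acyclic" rule matters, here is a small sketch that uses Python's standard graphlib module rather than Airflow itself. The task names and the is_valid_dag helper are assumptions for illustration; the point is only that a loop between tasks leaves no valid running order.

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_dag(upstream):
    """`upstream` maps each task to the set of tasks that must finish before it."""
    try:
        TopologicalSorter(upstream).prepare()   # raises CycleError on a loop
        return True
    except CycleError:
        return False


# Hypothetical task names: a straight-line pipeline with no loops.
pipeline = {"transform": {"extract"}, "load": {"transform"}}
# Same pipeline, but "extract" now waits for "load": an endless cycle.
looped = {"transform": {"extract"}, "load": {"transform"}, "extract": {"load"}}

print(is_valid_dag(pipeline), is_valid_dag(looped))
```

The first graph has a clear start and end, while in the second every task is waiting for another one, so nothing could ever begin.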

2. Task
A task is one step inside a DAG or one job inside a pipeline. A task should usually do one clear job. Creating one huge task that does everything makes debugging hard thus work should be split into separate tasks.
Example

fetch = PythonOperator(
    task_id="fetch_stock_data",
    python_callable=fetch_stock
)

fetch is the task object, and fetch_stock_data is the task name shown in Airflow.

3. Operator
An operator is the tool used to create and run a task. Different operators are used for different types of jobs.

4. Dependency
A dependency defines the order of tasks by telling Airflow which task must run before another task. In simple terms, a dependency is the relationship between tasks.
Example

extract >> transform >> load

This means extract runs first, transform runs after extract succeeds and load runs after transform succeeds.
You can also define parallel dependencies to show which tasks should run simultaneously.
Example

download >> [clean_data, backup_data] >> send_email

This means download runs first, clean_data and backup_data run after download then send_email executes after both clean_data and backup_data finish.
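
The parallel dependency above can also be modelled outside Airflow to see which tasks become ready together. The sketch below uses Python's standard graphlib module; the execution_waves helper is a made-up name, and real Airflow scheduling involves more than this ordering.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on (its upstream tasks).
upstream = {
    "clean_data": {"download"},
    "backup_data": {"download"},
    "send_email": {"clean_data", "backup_data"},
}


def execution_waves(upstream):
    """Group tasks into waves; tasks in the same wave could run in parallel."""
    sorter = TopologicalSorter(upstream)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # every task whose upstream is done
        waves.append(ready)
        sorter.done(*ready)                  # mark the whole wave as finished
    return waves


waves = execution_waves(upstream)
print(waves)
```

Here download forms the first wave, clean_data and backup_data share the second wave, and send_email only appears once both of them are done.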

5. Scheduler
The scheduler is the brain of Airflow. It checks the DAGs and decides which tasks should run and when.
If the scheduler is not running, DAGs may appear in the UI but tasks may stay queued or show no status.
The scheduler constantly checks:

which DAGs exist
whether a DAG is due to run
whether a task’s upstream tasks have succeeded
whether a task should be queued
whether a failed task should retry
whether a DAG run is complete

The scheduler does not usually execute the task itself but decides which task is ready and sends it to the executor.

6. Executor
The executor is the part of Airflow that decides how tasks are actually run. Different Airflow setups use different executors.
Common executors include:
SequentialExecutor - This runs one task at a time thus cannot run many tasks in parallel. It is simple and often used for learning or testing.

LocalExecutor - This runs tasks locally on the same machine, and it can run more than one task at the same time. It's useful when Airflow is installed on one server and you want tasks to run on that server.

CeleryExecutor - This is used for larger setups. The scheduler sends tasks to a queue, and workers pick them up and run them. This setup usually needs a message broker such as Redis or RabbitMQ.

KubernetesExecutor - This runs each task in a separate Kubernetes pod. It's more advanced and usually used in cloud or production environments.

NB: The scheduler decides that a task should run, while the executor determines how and where the task is run.

7. Worker
A worker is the process that actually executes tasks. This term is particularly important when using CeleryExecutor.
In a Celery setup, the flow looks like:

Scheduler >> Queue >> Worker >> Task runs

The scheduler decides the task is ready, the executor sends the task to a queue and the worker picks it up and runs it.
NB: The scheduler decides what should run while the worker does the actual execution.
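
The queue-and-worker pattern can be sketched with Python's standard library. This is only a stand-in: threads and an in-process queue replace Celery workers and a message broker such as Redis, and the task names and functions are invented for the example.

```python
import queue
import threading

task_queue = queue.Queue()
results = {}
results_lock = threading.Lock()


def worker(worker_name):
    """Pull tasks off the queue and run them until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            task_queue.task_done()
            break
        task_id, func = item
        outcome = func()
        with results_lock:
            results[task_id] = (worker_name, outcome)
        task_queue.task_done()


# Two threads stand in for separate worker processes or machines.
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(2)]
for thread in workers:
    thread.start()

# The scheduler/executor side: put tasks that are ready onto the queue.
for task_id in ("fetch_stock_data", "clean_data", "load_data"):
    task_queue.put((task_id, lambda name=task_id: f"{name} done"))

for _ in workers:
    task_queue.put(None)      # one stop signal per worker
task_queue.join()             # block until every queued item was processed
for thread in workers:
    thread.join()

print(sorted(results))
```

The side that queues the tasks never runs them; whichever worker is free picks up the next task, which is what lets this design spread work across many machines.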

8. XCom
XCom means cross-communication. It allows tasks to pass small pieces of data to each other.
XCom is meant for small messages, not for moving large datasets. Passing large datasets through XCom slows down Airflow and fills the metadata database.

In PythonOperator, you can push data to XCom:

kwargs["ti"].xcom_push(key="raw_data", value=data)

Then another task can pull it:

data = kwargs["ti"].xcom_pull(task_ids="extract", key="raw_data")

For large data, save it somewhere else and pass only its location through XCom.
Example

extract task saves data to /tmp/raw_stock_data.csv
XCom passes "/tmp/raw_stock_data.csv"
transform task reads the file
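
A minimal sketch of this "pass the path, not the data" pattern is shown below. It runs without Airflow: FakeTaskInstance is a hypothetical in-memory stand-in for the real task instance (its xcom_pull is simplified and takes only a key), the file is written to a temporary folder, and the rows are made up.

```python
import csv
import os
import tempfile


class FakeTaskInstance:
    """Tiny in-memory stand-in for Airflow's task instance (`ti`); not the real class."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, key):
        return self._store.get(key)


def extract(ti):
    """Write the (possibly large) dataset to disk and share only its path."""
    path = os.path.join(tempfile.mkdtemp(), "raw_stock_data.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["symbol", "price"])
        writer.writerows([["AAA", "10.5"], ["BBB", "20.0"]])   # made-up rows
    ti.xcom_push(key="raw_data_path", value=path)


def transform(ti):
    """Read the file whose location was passed through XCom."""
    path = ti.xcom_pull(key="raw_data_path")
    with open(path, newline="") as f:
        return [
            {"symbol": row["symbol"], "price": float(row["price"])}
            for row in csv.DictReader(f)
        ]


ti = FakeTaskInstance()
extract(ti)
rows = transform(ti)
print(rows)
```

Only a short string travels between the two tasks, so the amount of data stored for the hand-off stays small no matter how large the CSV file grows.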

9. Sensors
A Sensor is a special type of Operator that just waits.
Imagine you are expecting an important package. You stand at the window waiting for the mail truck. A Sensor does this in software. It can wait for a file to drop into a folder, or wait for another database to finish an update before letting the DAG move on to the next step.
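
The waiting behaviour of a sensor can be sketched as a simple polling loop. This is plain Python rather than an Airflow sensor class; the wait_for_file function and the file name are assumptions, and the short intervals are chosen only so the example runs quickly.

```python
import os
import tempfile
import time


def wait_for_file(path, poke_interval=0.1, timeout=2.0):
    """Check for `path` every `poke_interval` seconds; give up after `timeout`."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


folder = tempfile.mkdtemp()
expected = os.path.join(folder, "daily_export.csv")   # hypothetical file name

missing = wait_for_file(expected, poke_interval=0.05, timeout=0.2)

with open(expected, "w") as f:      # the "mail truck" arrives
    f.write("id,value\n")

arrived = wait_for_file(expected, poke_interval=0.05, timeout=0.2)
print(missing, arrived)
```

The first call times out because the file is not there yet, and the second returns as soon as it finds the file. A timeout matters in practice so that a pipeline does not wait forever for something that never arrives.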

10. Metadata Database
The metadata database is Airflow’s internal database. Airflow uses it to remember what happened, for example the results of each DAG run.
It stores information such as:

DAGs
DAG runs
Task runs
Task states
Schedules
Retries
Users
Roles
Variables
Connections
XCom values

This database is very important because Airflow needs to remember state between runs.
For example, Airflow needs to know:

Did this task succeed?
Did this task fail?
How many times has it retried?
When did the DAG last run?
What logs belong to this task?
What DAGs exist?
Which users can log in?

What Airflow is NOT

To fully understand Airflow, you also need to know its limits.
It's not a data streaming tool - Airflow is built for batch processing (running jobs every hour, every day, or every week). It is not designed to process live data happening by the second, like tracking live mouse clicks on a website.
It's not a data processing engine - Airflow is the manager, not the worker. You should not use Airflow to process a 100-gigabyte CSV file in its own memory. Instead, Airflow should send a command to a tool like Apache Spark or Snowflake to do the heavy lifting.

Conclusion

Apache Airflow may look difficult when you first encounter it because it comes with a lot of technical jargon, which makes data engineering feel more complicated than it really is. At its core, Airflow is simply a workflow manager that helps you organise work that must happen in a specific order. Apache Airflow is about control: it helps you control timing, order, failure handling, retries, logs, and monitoring.
Just like baking a cake, you must follow the right sequence. A data pipeline works the same way. You extract data, transform it, load it, check it, and sometimes send a notification. Each step depends on the previous one. Airflow gives you a clean way to define these steps and make sure they run correctly.

ETL vs ELT: Which One Should You Use and Why?

Lawrence Murithi — Sat, 11 Apr 2026 17:51:32 +0000

Introduction

Imagine you are running a massive kitchen. Every day, trucks arrive carrying raw ingredients from different farms. Some boxes have dirty potatoes, some tomatoes are bruised, and the meat needs to be separated from the bone.
Can you just throw all of this straight onto a customer’s plate? Definitely not. You have to wash, chop, season, and cook the ingredients first.

In the business world, data works the same way. Every day, companies generate tons of raw data from apps, websites, payment gateways, customer service logs etc. This raw data is usually dirty and messy. It has errors, missing fields and mismatched formats. Before it can be used for reporting or decision-making, it needs to be moved, processed and organized. This process of moving and cleaning data is called data integration.
The two main approaches used in data integration are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). Although both methods aim to prepare data for analysis, they follow different steps and are suited for different situations.
If you are just stepping into data engineering, software engineering or backend development, ETL and ELT are common terms you will encounter.
This article explains both approaches in detail, compares them, and helps you understand when to use each one.

What is ETL?

ETL stands for Extract, Transform, Load. It is the traditional method used to move and prepare data.
The key idea in ETL is that data is cleaned and transformed before it is stored in the final system. This means that by the time the data reaches the data warehouse, it is already structured, organized, and ready for use.
This approach was developed at a time when computing resources were limited, and companies had to be very careful about what data they stored.

Steps in ETL

1. Extract
This step involves collecting raw data from different sources such as:

Databases
APIs
Excel files

In real-world scenarios, data rarely comes from a single source. A company may have customer data in one system, sales data in another, and marketing data in a third system. This extraction step pulls all this data together.

2. Transform
In this stage, data is processed in a separate system before being stored. This transformation step ensures that all data is consistent, accurate, and usable.

Common transformations include:

Standardizing data formats
Handling missing values
Removing duplicate records
Fixing errors in data
Masking sensitive data such as credit card numbers
Combining data from different sources

This step is where raw data is made meaningful. Without transformation, data would remain inconsistent and difficult to analyze.

3. Load
After transformation, the cleaned data is loaded into a data warehouse or database.
At this stage, the data is ready for carrying out analysis, creating dashboards and reporting.
Figure: Simple Diagram of ETL (Extract → Transform → Load)
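
The three ETL steps described above can be sketched in a few lines of Python. This is a toy illustration under clear assumptions: the raw records are made up, an in-memory SQLite database stands in for the data warehouse, and the cleaning rules (drop rows with missing amounts, keep only the last four card digits) are just examples of the transformations listed earlier.

```python
import sqlite3

# Made-up raw records; note the duplicate and the messy formats.
raw_orders = [
    {"customer": " alice ", "card": "4111111111111111", "amount": "19.99"},
    {"customer": "BOB", "card": "5500000000000004", "amount": "5"},
    {"customer": " alice ", "card": "4111111111111111", "amount": "19.99"},
    {"customer": "carol", "card": "340000000000009", "amount": None},
]


def extract():
    """Extract: a real pipeline would query databases, call APIs or read files."""
    return list(raw_orders)


def transform(records):
    """Transform: standardize, handle missing values, deduplicate, mask card numbers."""
    cleaned, seen = [], set()
    for record in records:
        if record["amount"] is None:          # handle missing values by dropping
            continue
        row = (
            record["customer"].strip().title(),
            "*" * 12 + record["card"][-4:],   # keep only the last four digits
            round(float(record["amount"]), 2),
        )
        if row not in seen:                   # remove duplicate records
            seen.add(row)
            cleaned.append(row)
    return cleaned


def load(rows, conn):
    """Load: only the cleaned, masked rows ever reach the warehouse table."""
    conn.execute("CREATE TABLE orders (customer TEXT, card TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)
stored = warehouse.execute(
    "SELECT customer, card, amount FROM orders ORDER BY customer"
).fetchall()
print(stored)
```

Because the transformation runs before loading, the full card numbers never touch the warehouse, which is the property that makes ETL attractive for sensitive data.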

Why ETL Was Popular

In the past, data warehouses were physical servers sitting in basements. Storage space was incredibly expensive and computing power was very limited. Companies, therefore, could not afford to store raw, useless data. They had to clean it up and shrink it down before loading it into the warehouse.

What is ELT?

ELT stands for Extract, Load, Transform. It is a modern approach made possible by cloud computing. Here data is loaded first and transformed later inside the destination system, typically a cloud data warehouse or data lake.
This approach takes advantage of modern systems that can store large amounts of data and process it quickly.

Steps in ELT

1. Extract
Data is collected from different sources just like in ETL.

2. Load
This is a major shift from ETL. Instead of first cleaning the data, you load the raw data directly into your target data warehouse or data lake without any changes.

3. Transform
The transformation happens inside the destination system. This means analysts can use the warehouse's own computing power to clean, format, and organize the data.

Figure: Simple Diagram of ELT (Extract → Load → Transform)
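
The ELT order of steps can be sketched the same way. In this illustration an in-memory SQLite database stands in for a cloud warehouse, and the table names, columns and sample events are assumptions. The raw rows are loaded untouched, and the cleaning is done afterwards with SQL running inside the database.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw events are stored exactly as received, with no cleaning.
warehouse.execute("CREATE TABLE raw_events (user_name TEXT, event TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        (" alice ", "purchase", "19.99"),
        ("BOB", "purchase", "5"),
        ("bob", "view", None),
        (" alice ", "purchase", "19.99"),    # duplicate row from the source
    ],
)

# Transform: runs later, inside the warehouse, using its own SQL engine.
warehouse.execute("""
    CREATE TABLE purchases AS
    SELECT DISTINCT
        LOWER(TRIM(user_name)) AS user_name,
        CAST(amount AS REAL)   AS amount
    FROM raw_events
    WHERE event = 'purchase' AND amount IS NOT NULL
""")
warehouse.commit()

raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
purchases = warehouse.execute(
    "SELECT user_name, amount FROM purchases ORDER BY user_name"
).fetchall()
print(raw_count, purchases)
```

The raw_events table still holds every original row, so if the business later wants a different view of the data, a new SQL query can build it without extracting anything again.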

Why ELT Became Popular

The emergence of modern cloud data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift changed the game. Today, storing data in the cloud is incredibly cheap. Furthermore, these cloud warehouses have massive, scalable computing power.
Companies no longer need to buy a separate, expensive server just to transform data (as in ETL), and they no longer need to clean data before storing it. They can store everything and process it later.

Differences Between ETL and ELT

1. Order of Steps
In ETL, transformation happens before loading, while in ELT transformation happens after loading.

2. Where the Transformation Happens
In ETL, transformation happens in a separate server outside the warehouse while in ELT, the transformation happens right inside the destination data warehouse.

3. Speed of Loading
ELT is usually much faster at the loading stage since there is no cleaning of the data. ETL takes longer because the data has to wait in line to be processed before it can be loaded into the warehouse.

4. Maintenance and Flexibility
ETL is less flexible and changes require rebuilding pipelines. If a mistake is made in an ETL pipeline, or if you want to format the data differently, you have to go back to the source, re-extract the data, and run it through the whole pipeline again.
With ELT, the raw data is already sitting in your warehouse. If a mistake is made during transformation, you simply write a new SQL query and transform the raw data afresh.

5. The Skills Required
ETL often requires specialized tools and programming skills, for example software engineers who know Java or Python, or who can work with drag-and-drop ETL tools. ELT mostly uses SQL, and since the data is transformed inside the warehouse, it is accessible to analysts.
NB:

ETL focuses on control, structure, and quality before storage.
ELT focuses on speed, flexibility, and scalability after storage.

Advantages and Disadvantages

ETL

Advantages
Security and Compliance - If you are dealing with highly sensitive data (like medical records or credit cards), ETL allows you to strip out/mask the sensitive parts before storage in the main warehouse.
Reduced and Cheaper Storage - Because you are only loading refined data, you take up much less storage space in your destination database.

Disadvantages
Rigid - Setting up an ETL pipeline takes a lot of time. If a source system needs to make a change, the whole ETL pipeline might break and need to be rewritten.
Bottlenecks - If you have massive amounts of data, the processing server can easily get overwhelmed and slow down the whole operation.

ELT

Advantages
Agility - Since raw data is loaded quickly and directly into the warehouse, analysts do not have to wait for engineers to build complex pipelines to access the raw data.
Future-Proof - Because you keep a copy of the exact raw data, reprocessing of raw data is always possible. You can also go back and answer new business questions that you hadn't thought of previously.
Scalability - Cloud warehouses are designed to scale automatically thus are able to support large datasets.

Disadvantages
Security Risks - Since you are loading raw, unfiltered data into your warehouse, you have to be careful about who has access to the warehouse if that data contains sensitive information such as passwords, personal addresses or financial details.
Higher computing costs - While cloud storage is cheap, cloud computing can get expensive. If you have bad SQL code running inefficient transformations inside your warehouse every hour, your monthly cloud bill will skyrocket.

ETL Tools

These tools are designed for structured, enterprise-level data pipelines.

Informatica
IBM DataStage
Talend

ELT Tools

Modern ELT uses different tools for each step, and these tools allow analysts to work directly with data using SQL:

Fivetran / Airbyte → Extract and Load
dbt (Data Build Tool) → Transform
Cloud Warehouses → Snowflake, BigQuery, Redshift

Real-World Use Cases

Banking System (ETL)
A bank handles sensitive data from mobile banking apps, ATMs and physical branch locations. This data contains raw account numbers, account balances, passwords and PINs, personal details and financial transactions, so it must be secured before storage. ETL fits here because sensitive fields can be masked or removed before the data reaches the warehouse.

E-commerce Startup (ELT)
An online store that wants to track user behavior will generate large amounts of data daily just from people clicking around the website, viewing products and adding items to carts. The marketing team constantly changes what it wants to measure: one week they may want to track abandoned carts, while the following week they may want to track how long people look at a specific product. Because the raw data is kept in the warehouse, ELT lets them run new transformations without rebuilding the pipeline.

Which One Should You Use and Why?

If you are starting a new project and trying to choose between ETL and ELT, here is a practical guide to help you decide.
Choose ETL if
- You are bound by strict privacy laws - If you work with sensitive data (healthcare, banking), the ability to scrub data before it lands in a database should be key.
- Your system uses on-premise databases - If your company still keeps its servers in a physical server room, your database may not have the processing power required to do transformations internally, so you will need a separate ETL server.
- Your data source is unstructured - If you are extracting data from highly complex, old mainframes that output weird file types, standard ELT tools might not know how to read them. You will need a custom ETL script to decode and format the data before it can be saved.

Choose ELT if
- You are using a cloud data warehouse - If you have Snowflake, BigQuery, or Redshift, ELT is the most convenient option since it takes advantage of what you are already paying for.
- You work with large volumes of diverse data - If you are tracking millions of tiny events (like website clicks, product views or IoT sensor readings), pushing them directly to the cloud is often the most practical way to keep up with the volume.
- You need flexibility in analysis and fast data processing - ELT allows data engineers to focus purely on moving data from point A to point B, while empowering data analysts to handle the business logic and formatting using SQL.

Conclusion

The debate between ETL and ELT is less about which one is better and more about matching your business needs, data size, and system architecture. Understanding both approaches helps you design better data pipelines and make smarter decisions when working with data.

Advanced SQL Techniques for Data Analytics Every Data Analyst Should Know

Lawrence Murithi — Thu, 09 Apr 2026 13:21:19 +0000

Introduction

In today’s data-driven environment, organizations rely heavily on data to make decisions. Businesses collect large amounts of information from different sources such as sales systems, customer platforms, and operational databases. However, raw data alone is not useful unless it can be analyzed and transformed into meaningful insights.

SQL (Structured Query Language) plays a central role in this process. It allows analysts to retrieve, clean, and analyze data stored in relational databases. While basic SQL skills are important, advanced SQL techniques are what truly enable analysts to solve complex business problems.

This article explains advanced SQL concepts in simple terms and shows how they are applied in real-world data analytics scenarios. The goal is to help you understand not just how to write SQL queries, but how to use them effectively in practical situations.

The Role of SQL in Data Analytics

SQL is the foundation of data analytics. Most business data is stored in databases, and SQL is the language used to interact with that data.

Data analysts use SQL to:

Extract data from databases
Filter and clean datasets
Combine data from multiple tables
Perform calculations and aggregations
Prepare data for reporting tools like Power BI

SQL is often the first step before using any visualization tools. If the data is not properly prepared using SQL, the final reports may be inaccurate or misleading.

Working with Complex Queries

As data becomes more complex, simple queries are no longer enough. Advanced SQL introduces techniques that break complex problems into manageable steps.
In real-world data analysis, datasets are often large and spread across multiple related tables, and analysts are expected to answer questions that involve comparisons, calculations and several layers of logic. These techniques let analysts solve such problems step by step instead of trying to do everything in one single query.
They also help analysts organize their queries in a way that is easier to understand, maintain, and scale.

They are useful when:

Comparing values against aggregated results
Reusing part of a query
Working with multi-step transformations
Simplifying long and confusing SQL statements

Some of the advanced SQL techniques include:

Subqueries

A subquery is a query inside another query. Subqueries are useful when you need to perform a calculation first and then use that result in another query. They allow you to embed logic directly inside your main query.

SELECT name
FROM employees
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
);

Explanation:
- The inner query calculates the average salary
- The outer query returns employees earning above average

Subqueries can be used in different parts of a query:

In the WHERE clause
In the SELECT clause
In the FROM clause (called derived tables)
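
The WHERE form is shown earlier, so here is a minimal sketch of the other two placements. It runs the SQL through Python's built-in sqlite3 module so it can be tried without a database server. The employees rows and the department column are made up for illustration.

```python
import sqlite3

# In-memory database with a small, hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Amina', 'Sales', 3000),
        ('Brian', 'Sales', 5000),
        ('Carol', 'IT',    7000),
        ('David', 'IT',    9000);
""")

# Subquery in the SELECT clause: attach the company-wide average to every row.
select_rows = conn.execute("""
    SELECT name,
           salary,
           (SELECT AVG(salary) FROM employees) AS company_avg
    FROM employees
    ORDER BY name
""").fetchall()

# Subquery in the FROM clause (a derived table): aggregate first, then filter.
from_rows = conn.execute("""
    SELECT department, avg_salary
    FROM (
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    ) AS dept_avg
    WHERE avg_salary > 5000
""").fetchall()

print(select_rows)
print(from_rows)
```

The derived table needs an alias (dept_avg here) in most databases, even if the outer query never refers to it by name.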

Real-World Case Scenarios:

Identifying high-performing employees based on salary or performance metrics
Finding customers who spend more than the average customer
Identifying products priced above the average price

NB: While subqueries are powerful, they can become slow if used incorrectly, especially with large datasets.

Common Table Expressions (CTEs)

A CTE (Common Table Expression) is a named, temporary result set defined within an SQL query. It behaves like a temporary table that exists only while the query is running, and it improves readability and organization.

CTEs allow you to define a query once and then use it in the main query. This makes complex queries easier to read and understand, especially when working with multiple steps.

WITH sales_summary AS (
    SELECT product_id, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product_id
)
SELECT *
FROM sales_summary
WHERE total_sales > 1000;

Types of CTEs:
- Recursive CTE: A specialized CTE that references itself, which is essential for querying hierarchical data like organizational charts or family trees.
- Non-Recursive CTE: The most common type, used to simplify standard queries by creating manageable logical steps.
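
As a rough illustration of the recursive type, the sketch below walks a tiny organizational chart from the top down. The staff table, its columns and its rows are hypothetical, and SQLite (through Python's sqlite3 module) stands in for whichever database you use. Some databases, such as SQL Server, omit the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO staff VALUES
        (1, 'Grace',  NULL),
        (2, 'Hassan', 1),
        (3, 'Irene',  1),
        (4, 'James',  2);
""")

org_chart = conn.execute("""
    WITH RECURSIVE chain AS (
        -- Anchor member: the person with no manager (top of the chart)
        SELECT id, name, 0 AS level
        FROM staff
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: anyone reporting to a person already in chain
        SELECT s.id, s.name, c.level + 1
        FROM staff s
        JOIN chain c ON s.manager_id = c.id
    )
    SELECT name, level
    FROM chain
    ORDER BY level, name
""").fetchall()

print(org_chart)
```

The recursion stops on its own once a pass finds no new rows, which is why the data must not contain reporting loops.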
Benefits:

Makes queries clean and easier to read
Breaks complex logic into steps, thus easier to debug and modify
Improves maintainability

NB: You can also have multiple CTEs in one query, which is useful for complex data transformations.

In business reporting, analysts often build layered queries. CTEs allow them to structure their logic clearly when working with large datasets.
Step 1: Calculate total sales per product
Step 2: Filter high-performing products
Step 3: Join with other tables for reporting
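
The three steps above can be sketched as two chained CTEs followed by a final join. The products and sales tables and their rows are invented for the example. The 1000 threshold mirrors the earlier sales_summary query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
    INSERT INTO products VALUES (1, 'Laptop'), (2, 'Mouse'), (3, 'Monitor');
    INSERT INTO sales VALUES (1, 900), (1, 600), (2, 40), (2, 25), (3, 1200);
""")

report = conn.execute("""
    WITH sales_summary AS (               -- Step 1: total sales per product
        SELECT product_id, SUM(amount) AS total_sales
        FROM sales
        GROUP BY product_id
    ),
    top_products AS (                     -- Step 2: keep high performers
        SELECT product_id, total_sales
        FROM sales_summary
        WHERE total_sales > 1000
    )
    SELECT p.product_name, t.total_sales  -- Step 3: join for reporting
    FROM top_products t
    JOIN products p ON p.product_id = t.product_id
    ORDER BY t.total_sales DESC
""").fetchall()

print(report)
```

Each CTE can be run on its own while debugging, which is the main practical gain over one deeply nested query.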

Advanced Joins

Joins are used to combine data from multiple tables. In advanced SQL, joins become more powerful when dealing with complex relationships.

SELECT c.customer_name, o.order_date, p.product_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN products p ON o.product_id = p.product_id;

In a retail company:

Customers table stores customer details
Orders table stores transactions
Products table stores product information

Using joins, analysts can create a full view of customer purchases.

Poor joins can lead to:

Duplicate data
Incorrect totals
Misleading reports
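
To make the "incorrect totals" risk concrete, here is a small sketch of a join that silently repeats rows. The orders and shipments tables are hypothetical. The point is only that a one-to-many join multiplies the "one" side before SUM runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 50);
    -- Order 1 was shipped in two parcels, so it matches two shipment rows.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining to shipments repeats order 1 once per parcel, inflating the sum.
inflated_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# One possible fix: collapse shipments to one row per order before joining.
fixed_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN (SELECT DISTINCT order_id FROM shipments) s
      ON s.order_id = o.order_id
""").fetchone()[0]

print(true_total, inflated_total, fixed_total)
```

A quick habit that catches this early is comparing the row count before and after each join.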

Window Functions

Window functions allow us to perform advanced calculations across a group of related rows while keeping the original data. They are useful for ranking, running totals, moving averages, and analytical reporting.
Window functions often remove the need for complex self-joins and provide an analytical layer within SQL.
Window functions:

Keep every row
Add calculated values to each row

SELECT column_1,
       function() OVER (
           PARTITION BY column
           ORDER BY column
       ) AS output_column
FROM table_name;

Window functions are widely used in business intelligence and reporting for:

Rankings within a group
Calculating running totals
Comparing rows (current vs previous)
Analyzing trends over time
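
As an example of ranking within a group, the sketch below ranks salaries separately inside each department by adding PARTITION BY. The table contents are made up, and it assumes an SQLite build with window function support (3.25 or later), which current Python releases normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Amina', 'Sales', 3000),
        ('Brian', 'Sales', 5000),
        ('Carol', 'IT',    7000),
        ('David', 'IT',    9000);
""")

ranked = conn.execute("""
    SELECT name,
           department,
           salary,
           -- The rank restarts at 1 for every department
           RANK() OVER (
               PARTITION BY department
               ORDER BY salary DESC
           ) AS dept_rank
    FROM employees
    ORDER BY department, dept_rank
""").fetchall()

for row in ranked:
    print(row)
```

Every employee row is still present in the output. The rank is simply an extra column, which is the difference from GROUP BY.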

Companies use ranking to:

Identify top performers
Allocate bonuses
Compare employee performance

Ranking employees

SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;

Businesses use running totals to:

Track revenue growth
Monitor daily or monthly performance
Forecast future trends

Running totals

SELECT date, sales,
SUM(sales) OVER (ORDER BY date) AS running_total
FROM sales;

Aggregations and Grouping

Raw data is often too detailed to understand directly. Aggregation summarizes large datasets into meaningful totals, averages and counts.

SELECT region, product_id, SUM(sales) as total_sales
FROM sales
GROUP BY region, product_id;

Aggregation allows analysts to answer questions such as:

Total sales by region
Sales by product category
Monthly revenue trends

Aggregation is often used together with:

Filtering (HAVING)
Sorting (ORDER BY)
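
A minimal sketch of aggregation combined with HAVING and ORDER BY follows. The regions and figures are invented. WHERE filters rows before grouping, while HAVING filters the groups after the totals have been computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, sales INTEGER);
    INSERT INTO sales VALUES
        ('East', 700), ('East', 600), ('West', 400), ('North', 2000);
""")

big_regions = conn.execute("""
    SELECT region, SUM(sales) AS total_sales
    FROM sales
    GROUP BY region
    HAVING SUM(sales) > 1000      -- keep only regions whose total exceeds 1000
    ORDER BY total_sales DESC     -- largest region first
""").fetchall()

print(big_regions)
```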

Data Cleaning and Transformation

Data cleaning is one of the most important steps in analytics. Since raw data is usually dirty and messy, SQL helps clean and prepare it before analysis.
Raw data may contain:

Duplicates
Missing values
Incorrect formats
Inconsistent entries

Removing Duplicates

Removes repeated values and ensures each entry appears only once.

SELECT DISTINCT customer_id
FROM customers;

Handling Missing Values

Replaces NULL values with a default value, thus preventing errors in reports.

SELECT COALESCE(phone, 'Not Available')
FROM customers;

Data Transformation

Data transformation reshapes existing columns into the form needed for analysis. It includes:

Changing data types
Formatting dates
Standardizing values
Creating new calculated columns, as in the query below

SELECT price, quantity, price * quantity AS total_sales
FROM sales;
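
The other three transformations can be sketched in the same way. The raw_orders table is hypothetical, and the functions used (CAST, strftime, UPPER, TRIM) are the SQLite spellings. Date formatting in particular differs between databases, for example TO_CHAR in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (price_text TEXT, order_date TEXT, city TEXT);
    INSERT INTO raw_orders VALUES
        ('12.50', '2024-03-15', '  nairobi '),
        ('7',     '2024-04-02', 'Mombasa');
""")

cleaned = conn.execute("""
    SELECT CAST(price_text AS REAL)    AS price,        -- changing data types
           strftime('%Y-%m', order_date) AS order_month, -- formatting dates
           UPPER(TRIM(city))           AS city           -- standardizing values
    FROM raw_orders
    ORDER BY order_date
""").fetchall()

print(cleaned)
```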

Using SQL for Real-World Business Problems

Advanced SQL is not just about writing queries but solving real problems.
In organizations, SQL is used daily to answer business questions and support decisions.

Customer Segmentation

Businesses use customer segmentation to:

Target high-value customers
Design marketing strategies
Improve customer retention

Grouping customers based on spending

SELECT customer_id,
CASE 
    WHEN total_spent > 1000 THEN 'High Value'
    WHEN total_spent > 500 THEN 'Medium Value'
    ELSE 'Low Value'
END AS segment
FROM customer_sales;
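
The query above assumes a customer_sales table that already holds one total per customer. As a sketch of where that might come from, the example below builds it on the fly with a CTE over a hypothetical orders table and then applies the same CASE logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 700), (1, 500), (2, 600), (3, 100), (4, 1000);
""")

segments = conn.execute("""
    WITH customer_sales AS (
        -- One row per customer with their total spend
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id,
           total_spent,
           CASE
               WHEN total_spent > 1000 THEN 'High Value'
               WHEN total_spent > 500  THEN 'Medium Value'
               ELSE 'Low Value'
           END AS segment
    FROM customer_sales
    ORDER BY customer_id
""").fetchall()

for row in segments:
    print(row)
```

CASE checks its conditions from top to bottom and stops at the first match, so a customer who spent exactly 1000 falls into 'Medium Value' because the first test is strictly greater than 1000.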

Sales Performance Analysis

Total sales are calculated per product, and the products are then sorted by performance to identify the best sellers.

SELECT product_id, SUM(sales) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC;

Analyses like these help organizations to:

Understand performance
Identify opportunities
Solve operational problems

Performance Optimization

SQL queries must be clean, easy to understand and efficient.
In large databases, poor queries can slow down systems and delay reports.

Best Practices:

Use indexes on important columns to speed up data retrieval
Avoid selecting unnecessary columns
Filter data early to reduce data size
Use CTEs instead of repeated subqueries
Avoid unnecessary joins
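
One way to see the effect of the first practice is to compare the query plan before and after adding an index. The sketch below uses SQLite's EXPLAIN QUERY PLAN. The table, the index name and the data are hypothetical, and other databases expose plans through their own commands, such as EXPLAIN in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT amount FROM sales WHERE customer_id = 42"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)   # expected to describe a full table scan
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_after = plan(query)    # expected to mention the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, so the useful signal is whether the index name appears, not the precise text.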

Conclusion

Advanced SQL is a critical skill for data analysts. It goes beyond basic queries and allows analysts to work with complex datasets, perform advanced calculations and solve real-world business problems.

In this article, we explored key advanced SQL techniques such as subqueries, CTEs, joins, window functions, aggregations, and data transformation, and how they are applied in real business scenarios.

In data analytics, SQL is not just a tool but a core skill that connects raw data to meaningful insights. Mastering advanced SQL allows analysts to move from basic reporting to deeper, more impactful analysis.

Connecting Power BI to SQL Databases: A Practical Guide for Data Analysts

Lawrence Murithi — Tue, 17 Mar 2026 12:03:47 +0000

Introduction

In most modern organizations, data is one of the most valuable assets. Companies collect large amounts of information from sales systems, websites, customer platforms, and operational databases. To make sense of this information, businesses use tools that can transform this raw data into clear insights. One of the most widely used tools for this purpose is the Microsoft Power BI platform.

What is Power BI

Power BI is a business intelligence and data visualization tool developed by Microsoft. It allows users to connect to different data sources, analyze data, and create interactive dashboards and reports. These reports help organizations monitor performance, understand trends, and support decision-making among other uses.

Power BI is commonly used by data analysts, business managers, and decision makers because it can present complex data in simple visual forms such as charts, tables, maps, and dashboards.

Most organizations store their operational and analytical data in SQL databases. SQL databases are designed to store large amounts of structured data in tables. They allow users to query, filter, update, and analyze data efficiently using Structured Query Language (SQL). SQL databases are reliable, secure, and scalable, hence they are widely used in business systems such as sales platforms, customer management systems, and inventory systems.

Connecting Power BI to a database allows analysts to access this stored data directly. Instead of manually exporting data into spreadsheets, Power BI can retrieve the data automatically, refresh it when the database changes, and build dashboards that always reflect the latest information.

This article explains how Power BI connects to SQL databases, how to connect to a local PostgreSQL database, how to connect to a cloud database such as Aiven PostgreSQL, and how the loaded data is modeled for analysis.

Understanding the Power BI Interface

Before connecting to a database, it is helpful to understand the Power BI Desktop interface. Power BI Desktop is the main application used for building reports and dashboards.

The Power BI Desktop interface includes several sections such as:

Ribbon (Top Menu) – Contains commands and tabs such as Get Data, Transform Data, and Publish.
Report Canvas – The workspace where charts and dashboards are created.
Visualizations Pane – Used to select and customize charts.
Fields Pane – Displays the tables and columns loaded into Power BI.

Power BI Desktop is available as a free download from Microsoft.

Connecting Power BI to a Local PostgreSQL Database

PostgreSQL is one of the most popular open-source relational databases used in data analytics. Many organizations run databases locally on their own servers.
The steps below explain how to connect Power BI to a local PostgreSQL database.

Step 1: Open Power BI Desktop

Start by opening Power BI Desktop on your computer.
When the application opens, a blank report canvas appears. This is where you will build your report after loading the data.

Step 2: Get Data

On the Home tab of the ribbon, click Get Data.
This button opens a list of available data sources. Power BI supports many data sources including:

Excel
SQL Server
PostgreSQL
Web APIs

The Get Data feature is the starting point for connecting Power BI to any data source, and the same list shows the other available connectors.

Step 3: Select PostgreSQL Database

From the list of available data connectors, click More to view all options. Scroll down, select PostgreSQL database and click Connect.

Step 4: Enter the Database Connection Details

After you select PostgreSQL and click Connect, Power BI opens a window asking for the connection details:

Server – The location of the database server. If the database is on your computer, use localhost.

Database – The name of the database you want to connect to.

Step 5: Provide Login Credentials

Once the server and database are entered, Power BI asks for authentication details.
You will need to provide:
Username
Password

These credentials were set up during the installation of PostgreSQL and allow Power BI to securely access the database.

Once the credentials are entered, click Connect.

Step 6: Select Tables to Import

After connecting successfully, Power BI opens the Navigator Window which displays all available tables in the database.
You can preview the contents of each table before loading them.
There are two options:
Load – Import the data directly.
Transform Data – Clean or modify the data before loading it.

Connecting Power BI to a Cloud Database (Aiven PostgreSQL)

Many organizations now store their databases in the cloud. Cloud databases are accessible through the internet and provide benefits such as scalability, backups, and easier management.
Aiven is a cloud platform that provides managed PostgreSQL databases.
Connecting Power BI to a cloud database works the same way as connecting to a local database, except that additional security steps are required.

The steps below explain how to connect Power BI to an Aiven PostgreSQL database.

Step 1: Get the Database Connection Details from Aiven

Log in to Aiven. Inside the dashboard, you will find the connection information for your PostgreSQL service. These details are used by Power BI to locate and connect to the database.

Step 2: Download and install the SSL Certificate

Cloud database providers often require SSL encryption to secure the connection.
An SSL certificate ensures:

Data transferred between Power BI and the database is encrypted
Unauthorized users cannot intercept the connection
The database server identity is verified

In Aiven, download the certificate file (CA Certificate) from the Connection Information section of the service dashboard.
Rename the downloaded file from ca.pem to ca.crt and install the certificate on your PC.

Choose Local Machine as the store location and click Next.

Choose "Place all certificates in the following store", then browse to Trusted Root Certification Authorities.
Click OK and then Finish.

Step 3: Connect Power BI

Open Power BI Desktop as before, click Get Data, click More, scroll down and select PostgreSQL database.
Copy the server name from the service URL (host_name:port_number) under Connection Information and paste it into Power BI.
Enter the name of your database and click OK.
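
If your provider shows a single service URI instead of separate fields, the pieces Power BI asks for can be pulled out of it. The sketch below assumes a URI of the common form postgres://user:password@host:port/database. The host, port, user and database shown are placeholders invented for illustration, not real Aiven values.

```python
from urllib.parse import urlparse

# Hypothetical service URI; copy the real one from your provider's dashboard.
service_uri = (
    "postgres://avnadmin:secret@pg-demo.example.com:12345/defaultdb?sslmode=require"
)

parts = urlparse(service_uri)

server = f"{parts.hostname}:{parts.port}"  # value for the Server box
database = parts.path.lstrip("/")          # value for the Database box
username = parts.username                  # asked for in the credentials window

print("Server:  ", server)
print("Database:", database)
print("Username:", username)
```

The password is deliberately not printed. Treat the full URI as a secret, since it embeds the credentials.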

Copy the username and password from Aiven, enter them in the Power BI credentials window that opens and click Connect.
Once the connection is successful, the Navigator window opens and displays all tables in the database.
Select the tables you want to work with and click Load or Transform Data, depending on what you wish to do with the data.
The Transform Data option is used to clean raw data, e.g. to delete duplicates and handle null values using the most appropriate method.

Successfully loaded tables appear in the Data pane.

Creating Relationships Between Tables

Once the tables are loaded, Power BI automatically detects relationships between them based on matching columns, typically primary and foreign keys. Any relationship it misses can be created manually by dragging a column from one table onto the matching column in another table.
These relationships allow Power BI to combine information across multiple tables.

In the Model View, the tables appear as connected boxes. The relationships show how data flows between tables.

Data Modeling and Why It's Important

Data modeling is the process of defining how data is stored, structured, and related within a database. It ensures that Power BI understands how different tables are related.

Good data modeling allows Power BI to:

Filter data correctly
Calculate totals accurately
Create meaningful visualizations
Avoid duplicated values

For example, when analyzing sales:

The sales table stores transaction records.
The customers table provides customer information.
The products table describes the items sold.
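
On the database side, the relationships that Power BI detects usually come from primary and foreign keys. The sketch below shows what such a model might look like, using SQLite as a stand-in for PostgreSQL. The table and column names are assumptions chosen to match the sales example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        product_id  INTEGER REFERENCES products (product_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian');
    INSERT INTO products  VALUES (10, 'Laptop'), (20, 'Mouse');
    INSERT INTO sales VALUES (100, 1, 10, 900), (101, 1, 20, 25), (102, 2, 20, 30);
""")

# A sale pointing at a customer that does not exist is rejected by the model.
try:
    conn.execute("INSERT INTO sales VALUES (103, 99, 10, 500)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Because every sale maps to exactly one customer, totals are not duplicated.
totals = conn.execute("""
    SELECT c.customer_name, SUM(s.amount) AS total_spent
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()

print(rejected, totals)
```

This is the one-to-many shape (one customer, many sales) that report filters and totals rely on.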

Why SQL Skills Are Important for Power BI Analysts

Power BI is a powerful tool for building reports and dashboards, but it does not replace the need for strong data handling skills. Most business data is stored in SQL databases, and before that data can be visualized in Power BI, it must first be retrieved, cleaned, and structured properly.
SQL skills give Power BI analysts a real edge: they can retrieve exactly the data they need instead of pulling everything into Power BI.
Without SQL, analysts may rely too much on raw data, which can lead to slow reports, incorrect results, and inefficient workflows.
SQL allows analysts to:

1. Retrieve Data
Analysts can write queries to select specific rows and columns relevant to their analysis from a database.

-- Selecting only the product name and price columns from the products table

SELECT product_name, price
FROM products;

Why this matters in Power BI:

Reduces the amount of data imported
Improves performance
Makes the model easier to manage
Avoids unnecessary columns

2. Filter Data
In real-world scenarios, not all data is useful for analysis. Analysts often need to focus on specific time periods, regions, or business conditions. SQL makes it easy to filter datasets based on specific criteria before loading them into Power BI.

-- Retrieving only sales from 2024 onwards.

SELECT *
FROM sales
WHERE sale_date >= '2024-01-01'

Why this matters:

Reduces dataset size
Speeds up report loading
Focuses analysis on relevant data
Avoids unnecessary processing inside Power BI

3. Perform Aggregations
Aggregation is the process of summarizing data. In business analysis, analysts often need totals, averages, counts, and other summary metrics. SQL can summarize large datasets quickly by using the GROUP BY clause together with aggregate functions such as SUM, COUNT, and AVG.

-- Calculating total quantity sold per product

SELECT product_id, SUM(quantity) AS total_quantity
FROM sales
GROUP BY product_id;

Why aggregation in SQL is important:

Reduces data volume before loading
Improves Power BI performance
Simplifies data models
Avoids heavy calculations in DAX

Preparing data for Analysis

Raw data must often be cleaned or transformed before it is ready for visualization.
SQL can be used for this in several ways:

Joining Tables and Combining Data

Business data is usually stored in multiple tables.
SQL allows analysts to combine these tables using joins. Joined datasets in SQL can simplify the data model.

SELECT c.customer_name, s.sales_amount
FROM customers c
JOIN sales s
ON c.customer_id = s.customer_id;

Why this is important:

Combines related data into one dataset
Reduces the need for complex relationships in Power BI
Makes analysis easier
Prevents duplication errors

Data Cleaning and Preparation

Raw data is often messy. It may contain:

Missing values
Duplicate records
Incorrect formats

SQL helps clean and prepare the data before it is loaded into Power BI, which leads to better insights.

-- Eliminating duplicates

SELECT DISTINCT customer_id
FROM customers;

-- Handling missing values

SELECT 
  COALESCE(phone_number, 'Not Provided') AS phone
FROM customers;

Why data cleaning matters:

Ensures data accuracy
Improves report reliability
Reduces cleaning work in Power BI
Prevents errors in calculations

Creating Calculated Fields

SQL allows analysts to create new columns based on existing data.

-- Calculate total sales

SELECT 
  product_name,
  price,
  quantity,
  price * quantity AS total_sales
FROM sales;

Why calculated fields are useful:

Prepares key metrics before loading
Reduces need for DAX calculations
Keeps logic centralized in the database

Supporting Advanced Analysis

SQL also supports more advanced operations such as:

Window functions (running totals, ranking)
Subqueries
Common Table Expressions (CTEs)
Data transformations

SELECT 
  sale_date,
  SUM(sales_amount) OVER (ORDER BY sale_date) AS running_total
FROM sales;

Conclusion

Power BI is a powerful tool that helps organizations transform raw data into meaningful insights. By connecting directly to SQL databases, Power BI allows analysts to access structured data stored in business systems and convert it into interactive dashboards and reports.
SQL prepares the foundation, and Power BI builds the story on top of it. Strong SQL skills allow analysts to work more efficiently, produce accurate reports, and deliver better insights for decision-making.
When SQL and Power BI are used together, they provide a powerful combination for modern data analysis and business intelligence.

Mastering SQL Joins and Window Functions

Lawrence Murithi — Tue, 03 Mar 2026 10:36:56 +0000

Introduction

SQL (Structured Query Language) is a powerful tool used to search, manage, and analyze large amounts of data. It is widely used by data enthusiasts, software developers and even marketing professionals.
In real-world databases, data is not stored in one large table. It is divided into multiple related tables. This makes storage efficient and avoids duplication. To work effectively with such data, you must understand SQL joins and window functions. These two features allow you to combine data correctly and perform advanced analysis without losing important details.

SQL Joins

A JOIN in SQL is used to combine rows from two or more tables based on a related column. This relationship is usually created using:

A primary key (unique identifier in one table)
A foreign key (reference to that key in another table)

Joins are essential when working with relational databases because data is often split across multiple tables.

Importance of Joins

Combining related data from multiple tables
Maintaining relational integrity
Supporting one-to-many and many-to-many relationships
Building meaningful reports and analytics
Preventing unnecessary duplication of data

The type of join you use directly affects:

The number of rows returned
Whether NULL values appear
How business logic is interpreted

NB: Choosing the wrong join can lead to missing data, duplicated records, or incorrect analysis.

Types of SQL Joins

INNER JOIN

The INNER JOIN returns only the rows that have matching values in both tables.

Combines records based on a related column
Returns only matching rows
Excludes non-matching rows

INNER JOIN is used when:
You only need matched data
You want to exclude incomplete relationships

LEFT (OUTER) JOIN

The LEFT (OUTER) JOIN returns:

All rows from the left table
Matching rows from the right table
NULL values if no match exists

LEFT JOIN is used when:
You want all records from the main table
You want to identify missing matches
You need complete reporting from one side
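
The difference between the two joins is easiest to see side by side. In this sketch, the customers and orders tables are made up, and Carol has deliberately been given no orders. The last query uses the LEFT JOIN plus IS NULL pattern to identify the missing matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1), (102, 1), (103, 2);
""")

# INNER JOIN: only customers that have at least one order appear.
inner_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN: every customer appears; order_id is NULL where no order exists.
left_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN + IS NULL: customers with no orders at all.
no_orders = conn.execute("""
    SELECT c.customer_name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.order_id IS NULL
""").fetchall()

print(inner_rows)
print(left_rows)
print(no_orders)
```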

RIGHT (OUTER) JOIN

The RIGHT (OUTER) JOIN returns:

All rows from the right table
Matching rows from the left table
NULL where no match exists on the left

NB: RIGHT JOIN works like LEFT JOIN but from the opposite direction.

FULL (OUTER) JOIN

The FULL JOIN returns:

All rows from both tables
Matching records where possible
NULL values where no match exists

The FULL JOIN is used when:
Comparing two datasets
Identifying differences between systems
Performing reconciliation tasks

CROSS JOIN

A CROSS JOIN returns every possible combination of rows from the two tables, so it can produce very large results.
If Table A has 5 rows and Table B has 10 rows, the result has 5 × 10 = 50 rows.
It does not use a matching condition.

A CROSS JOIN is used to:

Generate combinations
Create calendar expansions
Test scenarios
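
A small sketch of generating combinations: three sizes crossed with two colors give 3 × 2 = 6 rows. The sizes and colors tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sizes (size TEXT);
    CREATE TABLE colors (color TEXT);
    INSERT INTO sizes VALUES ('S'), ('M'), ('L');
    INSERT INTO colors VALUES ('Red'), ('Blue');
""")

# No ON condition: every size is paired with every color.
combos = conn.execute("""
    SELECT s.size, c.color
    FROM sizes s
    CROSS JOIN colors c
    ORDER BY s.size, c.color
""").fetchall()

print(len(combos))
for combo in combos:
    print(combo)
```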

SELF JOIN

A self join joins a table to itself. Aliases are used to tell the two references to the same table apart.
Example:
Employee table:
| EmployeeID | Name | ManagerID |
To show each employee and their manager name, the table is joined to itself.

Self joins are useful for hierarchical data.
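
The employee and manager example might look like the sketch below. The column names and rows are assumptions. A LEFT JOIN is used so that the person at the top, who has no manager, still appears in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        manager_id  INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Grace',  NULL),
        (2, 'Hassan', 1),
        (3, 'Irene',  2);
""")

# e is the employee's row, m is the same table read again as the manager's row.
pairs = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.employee_id
    ORDER BY e.employee_id
""").fetchall()

print(pairs)
```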

NATURAL JOIN

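A NATURAL JOIN joins two tables on all columns that share the same name, without an explicit ON condition. Because the join columns are chosen implicitly, a renamed or newly added column can silently change the result.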

Performance Considerations for Joins

Joins can affect performance, especially in large databases.
Best practices:

Index join columns (primary and foreign keys)
Avoid unnecessary joins
Filter data early using WHERE
Understand execution plans
Be careful with joins that multiply rows unintentionally

Improper joins can cause:

Duplicate results
Data inflation
Slow query execution

Window Functions

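Window functions perform calculations across a set of related rows without collapsing them into a single result. They:

Keep every row
Add calculated values to each row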

Structure of a window function:

SELECT column_1,
       function() OVER (
           PARTITION BY column
           ORDER BY column
       ) AS output_column
FROM table_name;

1. OVER()

The OVER() clause defines how the window function operates and controls:

Partitioning
Ordering
Optional frame boundaries

2. PARTITION BY

The PARTITION BY clause divides rows into logical groups. If omitted, the entire dataset is treated as one group.

3. ORDER BY

ORDER BY defines the sequence of rows inside each partition.

It is essential for:

Ranking
Running totals
Time-based comparisons

If ORDER BY is omitted, the order in which rows are processed is undefined, so rankings and running totals are not reliable.

4. Frame Clause (ROWS vs RANGE)

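The frame clause defines the boundary of rows, relative to the current row, that the function operates on. It is commonly used for moving averages and cumulative calculations.
With ROWS, the frame is defined by physical row positions before and after the current row, while with RANGE, the frame is defined by a range of values in the ORDER BY column, so rows with equal ordering values enter the frame together.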

ROWS BETWEEN lower_bound AND upper_bound
RANGE BETWEEN lower_bound AND upper_bound
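
The difference only shows up when several rows share the same ORDER BY value. In the sketch below, two hypothetical sales fall on the same date. ROWS adds them one at a time, while RANGE treats them as peers and adds both at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100),
        ('2024-01-02', 60),
        ('2024-01-02', 60),
        ('2024-01-03', 30);
""")

totals = conn.execute("""
    SELECT sale_date,
           amount,
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total,
           SUM(amount) OVER (
               ORDER BY sale_date
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS range_total
    FROM daily_sales
    ORDER BY sale_date, rows_total
""").fetchall()

for row in totals:
    print(row)
```

On 2024-01-02 the ROWS totals are 160 and then 220, while RANGE reports 220 for both rows. In most databases RANGE is also the default frame when ORDER BY is present and no frame is written, which is worth remembering when a running total looks "flat" across ties.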

Types of SQL Window Functions

Window functions fall into three main categories.

1. Aggregate Window Functions

These include:

AVG() - Calculates the average over the window, e.g. for moving averages.
SUM() - Calculates the sum over the window, e.g. for running totals.
COUNT() - Calculates the number of items found in a group.
MIN() - Returns the minimum value.
MAX() - Returns the maximum value.

Some use cases of Aggregate window functions include:

Department totals
Running totals
Moving averages
Cumulative metrics

2. Ranking Window Functions

They are used to assign position or rank.

ROW_NUMBER() - Assigns a unique number to each row.
RANK() - Assigns rank with gaps when ties exist.
DENSE_RANK() - Similar to RANK() but does not skip numbers after ties, which makes it better for ranking reports where gaps are not desired.
PERCENT_RANK() - calculates the relative rank of a row within a group of rows.

Some use cases of Ranking window functions include:

Top N per group
Performance ranking
Leaderboards
Percentile analysis
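
The three main ranking functions differ only in how they treat ties, which a tiny example makes visible. The scores table is made up, and name is added as a tie-breaker for ROW_NUMBER() so that its output is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('Amina', 90), ('Brian', 90), ('Carol', 80);
""")

ranking = conn.execute("""
    SELECT name,
           score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in ranking:
    print(row)
```

Amina and Brian tie on 90. ROW_NUMBER() still numbers them 1 and 2, RANK() gives both 1 and then jumps to 3 for Carol, and DENSE_RANK() gives both 1 and continues with 2.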

3. Offset (Value) Window Functions

They are used to access data from other rows.

LAG() - Returns the value from the previous row and is used in time-based analysis.
LEAD() - Returns the value from the next row and is used in time-based analysis.
FIRST_VALUE() - Returns the first value in an ordered set of values within a partition.
LAST_VALUE() - Returns the last value in an ordered set of values within a partition.
NTH_VALUE() - Returns the value from the nth row of an ordered window frame. (Dividing rows into equal groups, which is useful in performance analysis and segmentation, is done by NTILE(), a ranking function.)

Some use cases of Offset window functions are:

Month-over-month growth
Time-series comparison
Trend detection
Sequential analysis
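
Month-over-month growth can be sketched with LAG() and LEAD() as follows. The monthly_sales table and its figures are hypothetical. The named WINDOW clause is used only to avoid repeating the same OVER definition and is supported by SQLite, PostgreSQL and MySQL 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (month TEXT, revenue INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 100), ('2024-02', 120), ('2024-03', 90);
""")

growth = conn.execute("""
    SELECT month,
           revenue,
           LAG(revenue)  OVER w AS prev_revenue,   -- previous month's value
           LEAD(revenue) OVER w AS next_revenue,   -- next month's value
           ROUND(100.0 * (revenue - LAG(revenue) OVER w)
                 / LAG(revenue) OVER w, 1) AS growth_pct
    FROM monthly_sales
    WINDOW w AS (ORDER BY month)
    ORDER BY month
""").fetchall()

for row in growth:
    print(row)
```

The first month has no previous row, so LAG() returns NULL there and the growth figure is NULL as well, which is usually preferable to reporting a misleading 0.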

Conclusion

SQL joins and window functions are core tools for designing efficient and powerful queries.
Joins allow you to combine data from multiple tables using defined relationships, while window functions provide an advanced analytical layer in SQL.