Forem: Robert Njuguna

Apache Kafka: A Beginner's Guide to Key Concepts

Robert Njuguna — Sun, 17 May 2026 12:34:30 +0000

Apache Kafka

A busy post office keeps delivering even when one worker is slow. Letters arrive at a constant rate, delivery staff pick them up and drop them off, and no one waits for anyone else.

Apache Kafka works the same way, but for data. It is a distributed platform for data streaming. Some applications send data to Kafka, and others read that same data from Kafka. Everything happens fast and in real time.

The problem Kafka actually solves

Before Kafka, applications typically talked to each other directly. App A sends data to App B, which in turn sends the data to App C. This seems easy at first, but as systems grow, all those direct connections become messy. If App C is down, App A has to wait, and if App B is down, the whole chain collapses.

This is where Kafka comes in. Instead of apps sending data directly to one another, Kafka acts as an intermediary: producers send data into Kafka, and consumers read that data from Kafka.

Key Concepts:

1. Events and Messages

An event is a record that something happened: a user clicked a button, a sensor recorded a temperature, or a payment was processed. In Kafka, events are also called messages or records.

Every message has 3 main parts:

A key, an optional identifier (e.g. userID)
A value, the actual data (e.g. {"user": "jane", "action": "purchase"})
A timestamp, when it happened

2. Topics

A topic is a named channel where messages live. A topic is like a folder on a computer, and the messages are the files inside that folder. If you create a topic named users, then each time a new user registers, the app that handled the registration sends a message to the users topic in Kafka.

Topics are append-only: new messages go at the end, and old messages remain. Messages do not disappear when they are read; they stay until the retention period expires (see Retention below).

A small example:

Topic: payments
├── msg 1: {"id": "a1", "amount": 50}
├── msg 2: {"id": "a2", "amount": 120}
└── msg 3: {"id": "a3", "amount": 30}

Kafka lets you create as many topics as you need: users, inventory, payments, email-sent, and so on.

3. Partitions

A topic can hold a large amount of data. To handle this, Kafka splits topics into partitions. Each partition is an ordered, numbered log of messages. For example, if payments has 3 partitions, Kafka spreads its messages across all 3 partitions.

Each message inside a partition gets a number called an offset. Offset 0 is the first message and offset 1 is the second. An offset never changes, which lets consumers track where they are.

Partition 0: [offset 0] [offset 1] [offset 2]
Partition 1: [offset 0] [offset 1]
Partition 2: [offset 0] [offset 1] [offset 2] [offset 3]

This matters because of parallelism: multiple consumers can read from different partitions at the same time. If a topic has 10 partitions, up to 10 consumers in the same group can read from it in parallel.

4. Producers

A producer is any application that sends messages to Kafka. When an e-commerce backend sends order data, it is a producer, and when a mobile app sends user activity, it is also a producer.

Producers pick which topic they write to. They can also either choose the partition a message goes to or let Kafka choose. When Kafka decides and the message has no key, it spreads messages across the partitions, for example one partition, then the next, and the next, then back to the first.

A producer can also influence the partition by passing a key argument in the send() call. Kafka hashes the key to pick the partition, so messages with the same key go to the same partition. For example:

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    # Serialize Python dicts to UTF-8 encoded JSON bytes
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# All messages with the key 'user-10' go to the same partition
producer.send(
    topic='payments',
    key=b'user-10',
    value={"amount": 50}
)

# Block until buffered messages have actually been sent
producer.flush()

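Here the key decides the partition: every message with the key user-10 lands in the same partition. To name an exact partition yourself, send() in kafka-python also accepts a partition argument.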
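To make the key-to-partition idea concrete, here is a minimal sketch in plain Python that needs no Kafka installation. It is a toy model, not how the client library is implemented: the names pick_partition, send and NUM_PARTITIONS are made up for illustration, and CRC32 stands in for the hash function that real Kafka clients use.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this toy topic

def pick_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition by hashing it (CRC32 here, as a stand-in)."""
    return zlib.crc32(key) % num_partitions

# Each partition is an append-only list; the list index plays the role of the offset.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def send(key: bytes, value: dict) -> tuple[int, int]:
    """Append a record to the partition chosen by its key; return (partition, offset)."""
    p = pick_partition(key)
    partitions[p].append(value)
    return p, len(partitions[p]) - 1

first = send(b"user-10", {"amount": 50})
second = send(b"user-10", {"amount": 75})
other = send(b"user-22", {"amount": 20})
print(first, second, other)
```

Both user-10 messages report the same partition number, and the second one gets the next offset in that partition.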

5. Consumers and Consumer Groups

A consumer reads messages from a topic. It tracks offsets so that it knows which messages have already been processed.

A consumer group is a team of consumers working together to read messages from a topic. Within a group, Kafka assigns each partition to exactly one consumer. For example, if a topic has 4 partitions and the group has 4 consumers, each consumer is assigned exactly one partition. If one consumer fails, Kafka automatically reassigns its partition to the remaining consumers, in a process called rebalancing.

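from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'payments',  # topic names are passed as positional arguments
    bootstrap_servers='localhost:9092',
    group_id='payments-processor',
    # Decode UTF-8 JSON bytes back into Python objects
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)

# Iterating over the consumer yields messages as they arrive
for message in consumer:
    print(f"Offset: {message.offset}, Data: {message.value}")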
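The partition assignment and rebalancing described above can be sketched in a few lines of plain Python. This is only a toy: the assign function and the consumer names c1 to c4 are hypothetical, and real Kafka uses configurable assignment strategies that may distribute partitions differently.

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers round-robin; each partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Four partitions, four consumers: one partition each.
before = assign(partitions, ["c1", "c2", "c3", "c4"])

# c4 crashes, so the group rebalances across the three survivors.
after = assign(partitions, ["c1", "c2", "c3"])

print(before)
print(after)
```

After the simulated crash, every partition still has an owner; one of the remaining consumers simply handles two partitions.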


6. Brokers and the Cluster

A broker is a single Kafka server. Brokers store the partitions and serve producers and consumers. A cluster is a group of brokers. In production, one typically runs at least 3 brokers. Each partition is copied, or replicated, onto several brokers for safety. If one broker fails, another broker has the data it needs to serve the clients.

Each partition has a leader (one of the brokers) and replicas on other brokers. By default, producers and consumers communicate with the leader. The replicas stay in sync with the leader. If the leader broker fails, one of the replica brokers automatically becomes the new leader.

Broker 1 (Leader for partition 0)
Broker 2 (Replica for partition 0, Leader for partition 1)
Broker 3 (Replica for partition 1)

7. Retention

Messages are kept for a set amount of time. The default is 7 days, after which messages are automatically deleted. This period can be changed on a per-topic basis. Retention is very powerful because it allows consumers to fall behind in processing and then catch up later. Another consumer could also replay all the retained events from the beginning of the topic. Consumers are free to process events in real time or to catch up later, as long as they do so within the retention period.

A Real Example of How Kafka Fits In

Customer places an order: the backend produces a message to the orders topic.
The kitchen app consumes from orders and starts preparing the food.
The delivery app consumes from orders and assigns a driver.
The billing app consumes from orders and charges the customer.

All three consumers work independently, so they never block one another. Each app reads with its own consumer group, so every app sees every order. If the delivery app crashes, it picks up where it left off when it resumes, since its offsets are saved.
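The "pick up where it left off" behaviour comes from each consumer group keeping its own saved offset. Below is a minimal plain-Python sketch of that idea. The orders list, the committed dictionary and the poll function are illustrative stand-ins, not Kafka APIs.

```python
# A toy 'orders' partition: the list index is the offset.
orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]

# One saved offset per consumer group (group names are hypothetical).
committed = {"kitchen": 0, "delivery": 0, "billing": 0}

def poll(group: str, max_records: int) -> list[dict]:
    """Return up to max_records unread messages for the group and save the new offset."""
    start = committed[group]
    batch = orders[start:start + max_records]
    committed[group] = start + len(batch)
    return batch

poll("kitchen", 3)               # kitchen reads every order
poll("billing", 3)               # billing reads every order too
poll("delivery", 1)              # delivery handles one order, then "crashes"
orders.append({"order_id": 4})   # a new order arrives while delivery is down
resumed = poll("delivery", 10)   # after restart it continues from its saved offset
print(resumed)
```

The delivery group resumes at order 2 and also receives order 4, while the kitchen and billing offsets are unaffected by the crash.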

Conclusion

Kafka may seem complicated at first, but it becomes simpler once you connect it to familiar objects. Topics are folders, and messages/events are files. Partitions are subfolders inside the big folder. Producers write these files and consumers read them. Brokers are the office buildings that physically hold the filing cabinets. If one building burns down, another building already has a backup copy of those files. Work continues without losing anything.

Apache Airflow for Beginners: DAGs, Tasks, Operators, and Scheduling

Robert Njuguna — Sun, 10 May 2026 14:42:38 +0000

When you need to run a series of scripts in a specific order, say you first extract the data, then transform/clean it, and finally load it into a database, you already understand the problem Airflow solves. Doing such a job manually every day would be tedious; Airflow automates it.

What is Apache Airflow

Apache Airflow is an open-source tool that was developed at Airbnb in 2014. It allows one to schedule, monitor, and manage workflows using Python code.

In short, instead of running scripts manually or creating messy cron jobs, you write Python code that defines the workflow. Airflow runs the workflow, tracks what failed, and lets you retry the failed tasks.

The Core Concept: What Is a DAG?

DAG stands for Directed Acyclic Graph. It just means:

Directed - tasks run in a defined order; task B can't run before task A.
Acyclic - there are no loops; after running task B you can't go back to task A.
Graph - the connected tasks can be drawn as a map of nodes and arrows.

A DAG is like a recipe: you can't frost a cake before you bake it. A specific order must be followed.

A Real World Example of a DAG - a daily data pipeline

extract_data → transform_data → load_to_warehouse
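The "directed" and "acyclic" properties can be checked with Python's standard graphlib module, without installing Airflow. This sketch only illustrates the graph idea and is not how Airflow itself is implemented; the pipeline dictionary reuses the task names from the example above.

```python
from graphlib import TopologicalSorter, CycleError

# Map each task to the set of tasks it depends on.
pipeline = {
    "extract_data": set(),
    "transform_data": {"extract_data"},
    "load_to_warehouse": {"transform_data"},
}

# Directed: a valid run order respects every dependency.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Acyclic: making the first task depend on the last one creates a loop,
# and no valid run order exists any more.
cyclic = dict(pipeline, extract_data={"load_to_warehouse"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```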

A basic DAG code

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("Extracting data...")

def transform():
    print("Transforming data...")

def load():
    print("Loading data to warehouse...")

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2026, 5, 10),
    # Note: newer Airflow releases name this argument `schedule`
    schedule_interval="@daily",
    catchup=False
) as dag:

    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract first, then transform, then load
    extract_task >> transform_task >> load_task

The ">>" operator sets the order in which the tasks run, so transform_task cannot run before extract_task.

What are Tasks

Every box in the DAG is a task: extract_task is a task, transform_task is another, and load_task is another. Airflow runs each task independently.

Each task has one of these statuses at runtime:

queued — waiting to run
running — currently executing
success — finished without errors
failed — something went wrong
skipped — intentionally bypassed
retrying — failed but trying again.

If, for instance, load_task fails, Airflow does not rerun extract_task and transform_task; it only retries load_task, which saves time.
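The retry behaviour can be imitated with a small plain-Python runner. This is a simplified sketch under stated assumptions, not Airflow's scheduler: run_pipeline and max_retries are hypothetical names, and the load function is rigged to fail once so the retry is visible.

```python
def run_pipeline(tasks, max_retries=2):
    """Run tasks in order; retry only the task that fails, never the finished ones."""
    attempts = {}
    for name, func in tasks:
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                func()
                break  # task succeeded, move on to the next one
            except RuntimeError:
                if attempt == max_retries + 1:
                    raise  # out of retries: the whole run fails

    return attempts

load_calls = {"count": 0}

def extract():
    pass

def transform():
    pass

def load():
    # Fail on the first call only, to simulate a temporary outage.
    load_calls["count"] += 1
    if load_calls["count"] == 1:
        raise RuntimeError("warehouse unavailable")

attempts = run_pipeline([("extract", extract), ("transform", transform), ("load", load)])
print(attempts)
```

Only load is attempted twice; extract and transform each run exactly once.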

What Are Operators?

An operator is the template that defines what a task actually does. Think of an operator as a worker who already knows how to do one specific type of job.

Airflow comes with many built-in operators:

1. PythonOperator

This runs any Python code:

from airflow.operators.python import PythonOperator

def my_function():
    print("Hello from python!")

task = PythonOperator(
    task_id = "run_python",
    python_callable = my_function
)

2. BashOperator

This runs any bash command or shell script.

from airflow.operators.bash import BashOperator

task = BashOperator(
    task_id = "run_bash",
    bash_command = "echo 'Pipeline started' && python3 scripts/extract.py"
)

3. EmailOperator

This sends an email, useful for reports.

from airflow.operators.email import EmailOperator

task = EmailOperator(
    task_id = "send_email",
    to = "exampleemail@gmail.com",
    subject = "Daily Report is ready",
    html_content = "<p>Your pipeline finished successfully.</p>"
)

4. PostgresOperator

Runs an SQL query against a Postgres database.

from airflow.providers.postgres.operators.postgres import PostgresOperator

task = PostgresOperator(
    task_id = "run_sql",
    postgres_conn_id = "my_postgres_connection",
    sql = "INSERT INTO reports SELECT * FROM staging WHERE date = '{{ ds }}';"
)

5. S3ToRedshiftOperator

This copies data from Amazon S3 directly into Redshift. No Python code is needed for the actual move.


from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

task = S3ToRedshiftOperator(
    task_id="load_to_redshift",
    s3_bucket="my-data-bucket",
    s3_key="data/2024/sales.csv",
    schema="public",
    table="sales",
    copy_options=["CSV", "IGNOREHEADER 1"]
)

What is Scheduling?

This is the part of the DAG where you define when and how often it runs.

Airflow supports two formats:

1. Preset shortcuts:

Preset	Meaning
@once	Run one time only
@hourly	Every hour
@daily	Once a day at midnight
@weekly	Once a week
@monthly	Once a month

2. Cron expressions:

"0 6 * * *"      → Every day at 6:00 AM
"0 6 * * 1"      → Every Monday at 6:00 AM
"*/15 * * * *"   → Every 15 minutes
"0 0 1 * *"      → First day of every month at midnight

Cron follows: Minute hour day month day-of-week
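To see how the five fields are read, here is a deliberately small cron matcher in plain Python. It is a sketch for learning, not the parser Airflow uses: it only understands "*", "*/n" and single numbers, and the function names are made up for this example.

```python
from datetime import datetime

def field_matches(field: str, value: int) -> bool:
    """Support '*', '*/n' and a single number; real cron also has lists and ranges."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)

def cron_matches(expr: str, when: datetime) -> bool:
    """Check whether a datetime satisfies a 5-field cron expression."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = when.isoweekday() % 7  # cron counts Sunday as 0, Monday as 1
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )

monday_6am = datetime(2026, 5, 11, 6, 0)  # 11 May 2026 is a Monday
print(cron_matches("0 6 * * *", monday_6am))                       # daily at 6:00
print(cron_matches("0 6 * * 1", monday_6am))                       # Mondays at 6:00
print(cron_matches("*/15 * * * *", datetime(2026, 5, 11, 9, 45)))  # every 15 minutes
print(cron_matches("0 0 1 * *", monday_6am))                       # 1st of month, midnight
```

The first three checks match and the last one does not, because 6:00 on the 11th is neither midnight nor the first day of the month.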

Example of running a pipeline every day at 7 AM

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id = "morning_pipeline",
    start_date = datetime(2026, 5, 10),
    schedule_interval = "0 7 * * *",
    catchup = False
) as dag:
    ...

Everything about multi-step pipelines becomes easier to run with Apache Airflow. We don't have to babysit scripts every day anymore; our workflow is written once in Python, and Airflow does the rest. A DAG is a "full workflow"; a task is "one step" in the workflow; an operator is "the worker" that performs the step; and scheduling is "how or when" you want the workflow to run.

OLAP vs OLTP: What's the Difference and Why Does It Matter?

Robert Njuguna — Sun, 10 May 2026 12:18:33 +0000

If you've ever worked with databases, you've probably heard these two terms (OLAP and OLTP). But what do they actually mean, and when do we use each one? Let's break it down with real examples.

The Core Idea

Think about two very different jobs:

A cashier at a store processes hundreds of small transactions per minute (scanning items, updating inventory, recording payments).
A store manager pulls up last quarter's sales report to decide what to stock next season.

These two people need completely different tools. That's exactly why OLTP and OLAP exist.

OLTP (Online Transaction Processing) = the cashier's system
OLAP (Online Analytical Processing) = the manager's reporting tool

What Is OLTP?

OLTP systems handle day-to-day operations. They are built to process many small, fast transactions: inserts, updates, and deletes happening in real time.

Real-world example: When you buy something on Jumia, the OLTP system:

Deducts the item from inventory
Creates a new order record
Charges your payment method
Triggers a confirmation email

All of this happens in milliseconds, and thousands of users do it at the same time. Examples of OLTP databases include PostgreSQL, MySQL, SQL Server, and Oracle.

What a typical OLTP query looks like:

-- Find a specific customer's order
SELECT * FROM orders
WHERE customer_id = 4821
AND order_date = '2024-11-15';

This query is fast because it fetches only a handful of rows using an index. No heavy lifting.

What Is OLAP?

OLAP systems handle analysis and reporting. Instead of processing one transaction at a time, they scan millions of rows to find patterns, trends, and summaries.

Real-world example: Netflix's data team uses OLAP to answer questions like:

Which shows had the most views in Q3?
What percentage of users in Africa watch on mobile?
How did churn rates change after a price increase?

These questions require scanning huge datasets — not updating a single record.

Examples of OLAP tools: Google BigQuery, Amazon Redshift, Snowflake

What a typical OLAP query looks like:

-- Total revenue by country for 2024
SELECT country, SUM(revenue) AS total_revenue
FROM sales_fact
JOIN date_dim
ON sales_fact.date_id = date_dim.id
WHERE date_dim.year = 2024
GROUP BY country
ORDER BY total_revenue DESC;

This query scans millions of rows across tables. It's slow compared to OLTP but it's built for exactly this kind of work.
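The two query styles can be tried side by side with Python's built-in sqlite3 module. This is only a small sketch: SQLite is not an OLAP engine, and the orders table and its four rows are invented for illustration. The point is the shape of each query, not the speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, country TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, country, revenue) VALUES (?, ?, ?)",
    [(4821, "KE", 50.0), (17, "KE", 120.0), (99, "UG", 30.0), (4821, "UG", 10.0)],
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# OLTP-style: a targeted lookup of one customer's rows, helped by the index.
customer_orders = conn.execute(
    "SELECT id, revenue FROM orders WHERE customer_id = ? ORDER BY id", (4821,)
).fetchall()

# OLAP-style: scan every row and aggregate.
revenue_by_country = conn.execute(
    "SELECT country, SUM(revenue) AS total FROM orders GROUP BY country ORDER BY total DESC"
).fetchall()

print(customer_orders)
print(revenue_by_country)
```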

Side-by-Side Comparison

Feature	OLTP	OLAP
Purpose	Run daily operations	Run analysis and reports
Query type	Simple, targeted (1–few rows)	Complex, aggregated (millions of rows)
Operations	INSERT, UPDATE, DELETE	SELECT with GROUP BY, aggregations
Data freshness	Real-time, always current	Historical, often hours/days old
Database size	Gigabytes	Terabytes to Petabytes
Users	Thousands of end users at once	Small number of analysts
Speed goal	Fast individual transactions	Fast large-scale reads
Schema style	Normalized (3NF)	Denormalized (star/snowflake schema)
Example tools	MySQL, PostgreSQL, SQL Server	BigQuery, Redshift, Snowflake
Example use case	Process a payment	Compare monthly revenue across regions

Why Schema Design Is Different

This part trips up a lot of beginners.
OLTP uses a normalized schema: data is split into many small tables to avoid duplication and make writes fast.

customers → id, name, email
orders → id, customer_id, date
order_items → id, order_id, product_id, quantity
products → id, name, price

OLAP uses a denormalized schema (usually a star schema): data is flattened into fewer, wider tables to make reads fast. Joins are expensive at scale, so OLAP systems reduce them.

sales_fact → sale_id, date, customer_name, product_name, country, revenue, quantity

Yes, there's repeated data in that table. That's intentional: reads are what matter here, not storage efficiency.

A Practical Scenario: E-Commerce Company

Imagine you run an online store. Here's how both systems work together:

OLTP handles:

Customer logs in → session created
Customer adds to cart → cart table updated
Customer checks out → order inserted, inventory decremented, payment recorded

OLAP handles:

Every night, data from OLTP is copied to a data warehouse (this is called ETL — Extract, Transform, Load)
Analysts query the warehouse: "Which products had the highest return rate last month?"
Marketing pulls a report: "Which ad campaign drove the most first-time purchases in Q4?"

You need both. They serve different people with different needs.

The Biggest Mistake People Make

Running heavy analytical queries directly on your OLTP database. This causes:

Slow performance for live users: your production database is now doing two jobs at once
Lock contention: long-running reads can block writes
Downtime risk: a bad analytical query can crash a production system

The fix is to separate them. Use OLTP for operations, copy the data to a warehouse, and run analysis there.

Quick Rule

Writing/updating → OLTP
Reading patterns across history → OLAP

ETL vs ELT: Which One Should You Use and Why?

Robert Njuguna — Tue, 14 Apr 2026 10:28:09 +0000

ETL and ELT: Which to use and why?

What Is ETL?

ETL stands for "Extract, Transform, Load." It describes moving data from one place to another. "Extract" means pulling raw data out of a source system, which may be a database, spreadsheet, API, or web application. "Transform" means cleaning and restructuring the data before it reaches its destination. "Load" means writing the cleaned data into a storage system such as a data warehouse.

Think of ETL as doing your laundry before packing a suitcase: clean first, then put away.

A retail company gathers sales data from five store branches. Each branch stores its data in a different format. The ETL process pulls in all of this data, fixes the formatting differences, removes duplicates, and then loads a single clean, consolidated table into the company's central database. Analysts can now build reports without wrestling with messy data.
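The retail example can be sketched in a few lines of Python, with an in-memory SQLite database standing in for the central database. Everything here is hypothetical: the two branch exports, their field names and the formatting problems are invented to show the extract, transform, load order.

```python
import sqlite3

# Extract: hypothetical exports from two branches with different formats.
branch_a = [{"sale_id": "A-1", "amount": "1,200.50", "date": "15/11/2024"}]
branch_b = [
    {"sale_id": "B-7", "amount": "300", "date": "2024-11-15"},
    {"sale_id": "B-7", "amount": "300", "date": "2024-11-15"},  # duplicate row
]

def transform(row: dict) -> tuple[str, float, str]:
    """Normalise the amount to a float and the date to ISO YYYY-MM-DD."""
    amount = float(row["amount"].replace(",", ""))
    date = row["date"]
    if "/" in date:  # convert DD/MM/YYYY to YYYY-MM-DD
        day, month, year = date.split("/")
        date = f"{year}-{month}-{day}"
    return row["sale_id"], amount, date

# Transform happens BEFORE loading; the set removes exact duplicates.
clean = sorted({transform(r) for r in branch_a + branch_b})

# Load: only the cleaned rows reach the database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_id TEXT PRIMARY KEY, amount REAL, sale_date TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
rows = warehouse.execute("SELECT * FROM sales ORDER BY sale_id").fetchall()
print(rows)
```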

ETL has been in use for decades. It became popular when storage was costly and companies could not afford to keep raw, unprocessed data. Transforming data before loading it also saved time later and kept warehouses clean.

What Is ELT?

ELT stands for "Extract, Load, Transform." The steps are the same as in ETL, except that the last two are swapped. Data is extracted from the source, loaded into the destination system in its raw form, and transformed inside that destination system.

Think of ELT as throwing all your clothes into the suitcase and sorting them out when you get to the hotel.

A tech startup collects millions of user-activity logs from its mobile app every day. The team loads all those raw logs into a cloud data warehouse such as BigQuery. Once the data is in, analysts write SQL queries to clean it and turn it into useful reports. The raw data stays available to anyone who needs it later.
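The same idea in code: the raw records are loaded untouched, and the cleaning is done afterwards with SQL inside the warehouse. This sketch uses an in-memory SQLite database as a stand-in for a cloud warehouse, and the raw_events table and its messy values are invented for illustration.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw events go in exactly as they arrived, as text.
warehouse.execute("CREATE TABLE raw_events (user_id TEXT, action TEXT, duration TEXT)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(" u1 ", "VIEW", "12"), ("u1", "view", "8"), ("u2", "Purchase", ""), ("u2", "view", "5")],
)

# Transform: cleaning happens later, inside the warehouse, with SQL.
warehouse.execute("""
    CREATE TABLE view_time AS
    SELECT TRIM(user_id) AS user_id,
           SUM(CAST(duration AS INTEGER)) AS total_seconds
    FROM raw_events
    WHERE LOWER(action) = 'view'
    GROUP BY TRIM(user_id)
""")

report = warehouse.execute("SELECT * FROM view_time ORDER BY user_id").fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(report, raw_count)
```

The raw table still holds all four original rows, so a different transformation can be written against it later.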

ELT became popular once cloud storage became very cheap and cloud data warehouses became very powerful. Platforms such as Snowflake, Google BigQuery, and Amazon Redshift can run large transformations on their own compute.

The major differences between ETL and ELT

1. The location of the transformation

In ETL, the transformation happens outside the destination system. A separate tool or server does the heavy lifting before the data arrives. In ELT, transformations happen inside the destination system once the data has been loaded.

This matters because it affects speed, cost, and flexibility.

2. Speed of Loading

ETL can be slower to load because the data has to be cleaned before it enters the warehouse. ELT loads much faster because raw data goes straight into storage with no pre-processing.

3. Data Storage

In ETL, only clean, transformed data is stored, and the raw version is usually discarded after transformation. In ELT, the raw data is kept, so the original information is not lost. Teams can later re-transform the same data with different rules.

4. Flexibility

ETL pipelines are rigid. When business rules change, the pipeline has to be rebuilt. ELT is more forgiving. Because the raw data remains intact, analysts can add new transformation queries without touching the pipeline itself.

5. Cost

ETL needs an additional transformation server or tool, which adds cost. ELT relies on the computing power of the cloud warehouse itself, which can be less expensive, depending on the scale of operations.

6. Skill Requirements

ETL often calls for engineers familiar with specific software, such as Informatica or Talend. ELT mostly calls for people who know SQL, a more common skill.

Feature	ETL	ELT
Order	Extract → Transform → Load	Extract → Load → Transform
Where Transform?	Outside warehouse	Inside warehouse
Raw Data Stored?	No	Yes
Speed to Load	Slower	Faster
Flexibility	Lower	Higher
Best For	On-premise systems	Cloud-based systems
Common Tools	Informatica, Talend, SSIS	dbt, BigQuery, Snowflake

ETL in real-world applications

1. Banking and Finance

Banks process transactions made at ATMs, in mobile applications, and at branch counters. All of this data arrives in different formats. An ETL pipeline standardizes the records, removes incomplete ones, and then loads them into the reporting database. Banks cannot afford to store bad data, because it directly affects financial reports and regulatory compliance.

2. Healthcare

Hospitals gather patient data in several separate systems: one holds lab results, another holds prescriptions, and another holds appointment history. ETL consolidates all of this into a single clean patient record. The transformation step is important because a wrongly formatted value in a medical record can have grave consequences.

3. Legacy System Migration

ETL is commonly used when a firm migrates from an old database system to a new one. The existing data is extracted, cleaned to fit the new system, and then loaded into the new database.

ELT use cases in the real world

1. E-commerce Analytics

An online store such as Jumia or Amazon can gather billions of clicks, searches, and purchases every day. All of this raw data is stored directly in a cloud warehouse. Data teams then write SQL queries to identify trends, such as which products are most often viewed before a purchase. The raw data remains available for future analysis.

2. Social Media Platforms

ELT is heavily used by platforms that track user behavior such as likes, shares, and watch time. The volume of data is too large to transform before loading. The practical approach is to load the raw data quickly and transform it later inside the warehouse.

3. Start-ups and emerging firms

Early-stage companies are not always sure what questions they will ask of their data in the future. ELT lets them store raw data and explore it later. When a new business question comes up, analysts write new transformation queries without rebuilding any pipeline.

Tools Used in ETL

Informatica PowerCenter is one of the most established ETL tools. It is popular with large businesses in the banking and insurance sectors.

Microsoft SSIS (SQL Server Integration Services) is built into the Microsoft ecosystem. Companies that already use SQL Server databases are likely to choose SSIS for their ETL workflows.

Talend is an ETL tool that has been offered in both free open-source and paid versions. It connects to hundreds of data sources and offers a visual interface for building transformation logic.

Apache NiFi is a system built to move data between systems. It handles data routing and transformation, and it scales to large data flows.

Pentaho is another free alternative that many mid-sized businesses use to build ETL pipelines without paying high licensing fees.

Tools Used in ELT

dbt (data build tool) is one of the most widely used tools for the transformation step in ELT. Data engineers write SQL models in dbt, and dbt runs those transformations directly in the data warehouse while also generating documentation and tracking data lineage.

Google BigQuery is a cloud data warehouse that can hold raw data and run SQL transformations at large scale. Many companies rely on BigQuery as the heart of their ELT setup.

Snowflake is another cloud-based warehouse, in which storage and compute are managed independently. This design can make it economical for ELT workloads.

Amazon Redshift is a cloud data warehouse by AWS. It integrates easily with other AWS services and is a strong option for teams already on the Amazon cloud.

Fivetran and Airbyte are tools that manage the extract and load stages of ELT. They connect to hundreds of data sources and automatically sync raw data to the warehouse. dbt then performs the transformation step.

Which to use and which to leave

Ask these questions before selecting an approach.

Is the amount of data extremely large?

High volumes favor ELT, since raw data can be loaded quickly and processed later using the warehouse's computing power.

Is storage cheap?

Cloud storage is cheap today, so ELT is a much better fit in the cloud. On-premise storage is more costly, so ETL's strategy of storing only clean data remains valid for older systems.

Is there a high rate of business rule change?

Frequent rule changes favor ELT. The raw data stays the same, and new transformations can be written without modifying the pipeline.

Do you have any concerns over data privacy?

ETL can mask or anonymize sensitive data before it ever enters the warehouse. This is critical in fields such as healthcare and finance, where raw personal data should not be stored without controls.

What are the skills of the team?

SQL-based teams can move faster with ELT and dbt. Teams with experience in specific ETL tools may find it easier to stay with the ETL model.

A real-life application of a combination of the two

Many real businesses adopt a hybrid model. A company may use ETL to strip personally identifiable information out of customer records before loading them. Once the sensitive fields are removed, the rest of the data is stored raw in the warehouse, and ELT handles all further transformations. This offers the best of both worlds: the data-privacy protection of ETL and the flexibility of ELT.
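The privacy step of such a hybrid pipeline might look like the sketch below. It is one possible approach, not a compliance recipe: the field names, mask_email and strip_pii are hypothetical, and a plain hash is pseudonymisation rather than strong anonymisation.

```python
import hashlib

def mask_email(email: str) -> str:
    """Replace an email with a stable SHA-256 digest so rows can still be grouped."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

def strip_pii(record: dict) -> dict:
    """ETL step: drop the name, pseudonymise the email, pass everything else through raw."""
    cleaned = {k: v for k, v in record.items() if k not in ("name", "email")}
    cleaned["customer_key"] = mask_email(record["email"])
    return cleaned

raw = {"name": "Jane Doe", "email": "Jane@Example.com", "basket": "3 items", "total": "120.00"}
safe = strip_pii(raw)
print(sorted(safe))
```

The remaining fields, such as basket and total, are left exactly as they arrived, ready for the ELT transformations that run later inside the warehouse.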

Final Thoughts

ETL and ELT solve the same problem in different ways. ETL standardizes data before storage and works best in structured settings with strict data rules. ELT stores data first and transforms it afterwards, which makes it a good fit for cloud platforms that handle large, fast-moving data. The right choice depends on the team's tools, the volume of data, the storage system, and how often business requirements change. Both methods are valid, and both are widely used in data engineering today.

SQL Joins and Window Functions: A Practical Guide.

Robert Njuguna — Fri, 06 Mar 2026 13:28:54 +0000

SQL is a language for managing and analyzing data stored in relational databases. Joins and window functions are among its most important tools for advanced querying.

SQL Joins: Combining Different Tables

A JOIN combines rows from two or more tables based on a column they have in common. Joins are essential when working with normalized databases, in which data is spread across several tables to minimize redundancy.

Types of Joins

INNER JOIN

Returns only the rows that have matching values in both tables.

Example: Selecting the names of employees and the name of the department each works in. Only employees that belong to a department are returned; employees with no department and departments with no employees are left out.

SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id;

LEFT JOIN (or LEFT OUTER JOIN)

Returns all rows from the left table and the matching rows from the right table. Where no match is found, NULLs are shown for the right table's columns.
Example: Listing all employees, including those not assigned to any department.

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.department_id = d.department_id;
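The difference between INNER JOIN and LEFT JOIN is easy to see by running both against the same tiny dataset. This sketch uses Python's built-in sqlite3 module; the two employees and two departments are invented, with Brian deliberately left without a department.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Finance'), (2, 'IT');
    INSERT INTO employees VALUES ('Amina', 1), ('Brian', NULL);
""")

# The same query, with only the join type swapped in.
query = """
    SELECT e.name, d.department_name
    FROM employees e
    {join} departments d ON e.department_id = d.department_id
    ORDER BY e.name
"""
inner_rows = db.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = db.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)
print(left_rows)
```

The inner join drops Brian, while the left join keeps him with None (SQL NULL) as his department name.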

RIGHT JOIN (or RIGHT OUTER JOIN)

Returns all rows from the right table and the matching rows from the left table.
Example: Selecting all departments, even those that have no employees.

SELECT e.name, d.department_name
FROM employees e
RIGHT JOIN departments d
ON e.department_id = d.department_id;

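FULL JOIN (or FULL OUTER JOIN)

Returns all rows from both tables: matched rows are combined, and unmatched rows from either side are padded with NULLs.

Example: Listing every employee and every department, whether or not they have a match.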

SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d
ON e.department_id = d.department_id;

CROSS JOIN

Returns the Cartesian product of two tables: each row of the first table is paired with each row of the second. For example, if table one has 5 rows and table two has 4 rows, the cross-joined result has 20 rows.

Example: Generating the combinations of employees and departments.

SELECT employees.name, departments.department_name
FROM employees
CROSS JOIN departments;

Note that a cross join doesn't need an ON clause.

Window Functions: Row-Level Calculations

Window functions let you perform calculations across a set of rows related to the current row, without collapsing those rows into one the way GROUP BY does.

Common Window Functions

ROW_NUMBER()

Assigns a unique sequential integer to rows within a partition.
Example: Numbering employees by salary, lowest first, within each department.

SELECT e.name, d.department_id, d.department_name, e.salary,
       ROW_NUMBER() OVER (PARTITION BY d.department_name ORDER BY e.salary) AS salary_rank
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id;

RANK() and DENSE_RANK()

RANK assigns a rank to each row; rows with equal values share a rank, and the ranks that follow are skipped. DENSE_RANK also gives tied rows the same rank but never skips ranks.

SELECT name, department_id, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;
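Running the three ranking functions over a tie makes the difference visible. This sketch uses Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The three employees are invented, with two tied on salary.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
db.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amina", 900), ("Brian", 900), ("Chebet", 700)],
)

rows = db.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS dense_rnk
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()

for row in rows:
    print(row)
```

Amina and Brian tie for first place. Chebet is then numbered 3 by ROW_NUMBER and RANK (rank 2 is skipped), but 2 by DENSE_RANK.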

SUM(), AVG(), MIN(), and MAX() as Window Functions

Perform aggregate calculations over a window of rows.
Example: Calculating the total salary expense per department while still showing individual employees.

SELECT name, department_id, salary, SUM(salary) OVER (PARTITION BY department_id) AS total_salary
FROM employees;
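The idea of attaching a per-department total to every row, instead of collapsing rows the way GROUP BY does, can be sketched in plain Python. The employee tuples below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (name, department_id, salary)
employees = [
    ("Amina", 1, 5000),
    ("Brian", 1, 4000),
    ("Carol", 2, 6000),
]

# First pass: total salary per department (the "window" is the department).
totals = defaultdict(int)
for _, department_id, salary in employees:
    totals[department_id] += salary

# Second pass: keep every employee row and attach the department total to it.
with_totals = [
    (name, department_id, salary, totals[department_id])
    for name, department_id, salary in employees
]

for row in with_totals:
    print(row)
```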

LEAD() and LAG()

Access values from the following (LEAD) or preceding (LAG) row within a partition.

Example: Comparing each employee's salary with the previous and the next salary when the rows are ordered by salary.

SELECT name, salary,
LAG(salary, 1) OVER (ORDER BY salary) AS previous_salary,
LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary
FROM employees;
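A rough Python equivalent of LAG and LEAD with an offset of 1 is shown below. The function name and salaries are hypothetical, and None stands in for the NULL that SQL returns at the edges of the window.

```python
def lag_lead(values):
    """Pair each value with the previous and next value in ascending order."""
    ordered = sorted(values)
    rows = []
    for i, value in enumerate(ordered):
        previous_value = ordered[i - 1] if i > 0 else None
        next_value = ordered[i + 1] if i < len(ordered) - 1 else None
        rows.append((value, previous_value, next_value))
    return rows


# Hypothetical salaries: (salary, previous_salary, next_salary)
rows = lag_lead([5000, 4000, 6000])
for row in rows:
    print(row)
```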

TIP

For ranking, running totals, moving averages, and comparisons between current and previous values, window functions are best suited to analytics where row-level detail must not be lost.

Combining Joins and Window Functions.

Joins and window functions tend to be used together when it comes to more intricate analytics. For instance, you could join employees with departments, then rank salaries within each department:

SELECT e.name, d.department_name, e.salary, RANK() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) AS salary_rank
FROM employees e
JOIN departments d ON e.department_id = d.department_id;

This query provides a ranked view of salaries for each department, combining relational data retrieval with advanced row-wise computation.

Conclusion

Joins and window functions are among the most useful tools in SQL. Joins make it easy to combine data from several tables. Window functions let you perform calculations across rows without losing any detail. Combining the two gives more powerful insights into your datasets, whether for ranking, totals, comparisons, or other analytics. Practicing these functions will strengthen your SQL queries, and mastering both joins and window functions is a crucial part of the work of any data professional.

How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI

Robert Njuguna — Mon, 09 Feb 2026 15:14:31 +0000

Data rarely arrives in an ideal state. In the real world it is incomplete, contradictory, duplicated, and scattered across various systems. Yet the business still wants straight answers: What is driving costs? Where are we losing revenue? How can performance improve? The analyst is expected to bridge that gap. Power BI offers the toolkit that lets analysts turn messy data into dependable information and, eventually, into action.

A Power BI analyst is the intermediary between source systems and decision-makers. By applying structured data cleaning, dimensional modelling, DAX aggregations, and purposeful visualization, the analyst turns raw inputs into trustworthy knowledge that drives decisions.

Power Query and the ETL Foundation.

The analysis life cycle begins in Power Query, where Extract, Transform, Load (ETL) standardizes the raw inputs.

Hospital and pharmacy data, for example, usually contain:

Dates recorded as text rather than as dates.
Different names for the same drug across systems.
Blank values and negative values where they make no sense (transaction costs, cost of medicine, or quantity of medicine given to patients).
Duplicate invoice numbers.

The analyst uses transformation steps to enforce data types, trim and clean text, remove duplicates, replace nulls, and derive new attributes. For example, a single Date column can be split into Year, Quarter, Month, and Day columns to support time-based analysis later.
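Power Query applies these steps through its own interface, but the logic can be sketched in plain Python to show what each step does. The records, field names, and the choice to treat impossible negatives as missing are all assumptions made for illustration.

```python
# Hypothetical pharmacy records with the problems described above.
raw_rows = [
    {"invoice": "INV-001", "drug": "  Paracetamol ", "quantity": 10},
    {"invoice": "INV-001", "drug": "Paracetamol", "quantity": 10},    # duplicate invoice
    {"invoice": "INV-002", "drug": "amoxicillin", "quantity": None},  # blank value
    {"invoice": "INV-003", "drug": "Ibuprofen", "quantity": -5},      # invalid negative
]


def clean(rows):
    """Trim text, standardize case, drop duplicate invoices, blank out negatives."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["invoice"] in seen:
            continue  # keep only the first row per invoice number
        seen.add(row["invoice"])
        quantity = row["quantity"]
        if quantity is not None and quantity < 0:
            quantity = None  # assumption: an impossible negative is treated as missing
        cleaned.append({
            "invoice": row["invoice"],
            "drug": row["drug"].strip().title(),
            "quantity": quantity,
        })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```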

In Power BI, date functions are commonly used to derive these columns from the Date column.

For instance:

YEAR([Date])

MONTH([Date])

DAY([Date])

Each function returns only the year, the month number, or the day of the month of the given date, which can then be stored in a separate column.
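The same split can be sketched in Python to show what the derived columns contain. The function name and the day/month/year text format are assumptions; real source systems may use a different format.

```python
from datetime import datetime


def split_date(text, fmt="%d/%m/%Y"):
    """Parse a text date and return its year, quarter, month, and day."""
    parsed = datetime.strptime(text.strip(), fmt).date()
    quarter = (parsed.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return {
        "Year": parsed.year,
        "Quarter": quarter,
        "Month": parsed.month,
        "Day": parsed.day,
    }


# A date stored as text, with stray spaces, in an assumed day/month/year format.
parts = split_date(" 09/02/2026 ")
print(parts)
```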

Star Schema and Dimensional Modeling.

Once the data is prepared, the analyst models it using dimensional modeling, with the star schema being the most common design.

Transactional records are held in a central fact table with columns such as:

Prescription ID
Patient ID
Product ID
Date
Quantity
Unit Price
Cost

Dimension tables surround the fact table and provide descriptive information:

Dim Patient - age, gender, type of insurance.
Dim Product - name of drug, brand name, therapeutic classification.
Dim Date - fiscal periods, weekdays, and months.
Dim Department - ward, facility, region.

The relationships are generally one-to-many, with filters flowing from the dimensions to the fact table. This structure reduces redundancy and makes aggregations predictable. Without it, totals can be double-counted, filters can fail, and performance suffers.
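To illustrate the one-to-many idea, here is a minimal Python sketch in which each fact row looks up its single matching dimension row before aggregating. The product IDs, drug names, and quantities are hypothetical.

```python
from collections import defaultdict

# Dimension table: one row per product (the "one" side of the relationship).
dim_product = {
    1: {"name": "Paracetamol", "class": "Analgesic"},
    2: {"name": "Ibuprofen", "class": "Analgesic"},
    3: {"name": "Amoxicillin", "class": "Antibiotic"},
}

# Fact table: many rows can point at the same product (the "many" side).
fact_prescriptions = [
    {"Product ID": 1, "Quantity": 10},
    {"Product ID": 1, "Quantity": 5},
    {"Product ID": 2, "Quantity": 3},
    {"Product ID": 3, "Quantity": 7},
]

# Aggregate a fact column by a descriptive attribute from the dimension.
quantity_by_class = defaultdict(int)
for row in fact_prescriptions:
    product = dim_product[row["Product ID"]]
    quantity_by_class[product["class"]] += row["Quantity"]

print(dict(quantity_by_class))
```

Because each fact row matches exactly one dimension row, every quantity is counted once, which is what keeps the totals predictable.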

DAX: Encoding Business Logic

With a sound model in place, DAX is used to define standardized metrics.

Instead of computing revenue in several different ways, the analyst defines it once as a measure.
For instance:

Total Revenue = SUMX(FactSales, FactSales[Quantity] * FactSales[Unit Price])

Profitability may be defined as

Gross Profit = [Total Revenue] - SUM(FactSales[Cost])

Since DAX measures are evaluated in the current filter context, analysts can immediately view performance by month, department, or type of drug without rewriting formulas.

This guarantees one version of the truth in the organization.
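The idea of defining a measure once and re-evaluating it under different filters can be sketched in Python. This is only an analogy for how measures behave, not DAX itself; the rows, the Department column, and the helper names are hypothetical, and gross profit is taken here to be revenue minus cost.

```python
# Hypothetical rows of a FactSales table.
fact_sales = [
    {"Department": "Pharmacy", "Quantity": 10, "Unit Price": 50.0, "Cost": 300.0},
    {"Department": "Pharmacy", "Quantity": 4, "Unit Price": 25.0, "Cost": 60.0},
    {"Department": "Ward", "Quantity": 2, "Unit Price": 100.0, "Cost": 150.0},
]


def total_revenue(rows):
    """Row-by-row Quantity * Unit Price, summed (what SUMX does over the table)."""
    return sum(row["Quantity"] * row["Unit Price"] for row in rows)


def gross_profit(rows):
    """Revenue minus total cost over the same set of rows."""
    return total_revenue(rows) - sum(row["Cost"] for row in rows)


def filtered(rows, **filters):
    """Keep only rows matching every filter, a stand-in for a slicer selection."""
    return [row for row in rows if all(row[k] == v for k, v in filters.items())]


all_profit = gross_profit(fact_sales)
pharmacy_profit = gross_profit(filtered(fact_sales, Department="Pharmacy"))
print(all_profit, pharmacy_profit)
```

The measure definitions never change; only the set of rows they are evaluated over does.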

Architecture and Visualization.

Good dashboards are designed, not just assembled. Analysts lay them out according to an information hierarchy: KPIs at the top, diagnostics in the middle, and details at the bottom.

For instance:

Cards show Total Revenue, Total Prescriptions, and Gross Margin.

Line charts show utilization trends on a daily or monthly basis.

Bar charts compare categories, for example product utilization per department or departments per county.

Color logic (e.g., red for decline, green for improvement) gives the user a quick way to read performance. Slicers for date, facility, or location (county) let the presenter control interactivity without overwhelming the audience.

The goal is quick comprehension and guided exploration.

From Insight to Operational Action.

A dashboard may show that some drugs are heavily stocked but rarely used. By drilling down, the analyst may find that demand declined after a change in treatment regimen.

That insight can trigger follow-up actions such as adjusting procurement quantities, renegotiating supplier contracts, or reallocating inventory between locations.

This is how analytics moves from reporting on the past to shaping future strategy.

Conclusion

Power BI allows an analyst to combine ETL, dimensional modeling, DAX computation, and visualization into one decision system. The layers build on each other: clean data produces good models, good models produce good measures, and good measures and visuals support good decisions. When properly implemented, dashboards are working tools that reduce uncertainty, improve productivity, and deliver measurable business results. This is how chaotic data is turned into action.

Schemas and Data Modeling in Power BI

Robert Njuguna — Tue, 03 Feb 2026 18:55:09 +0000

Data modeling is an important step in Power BI because it defines how data is organized, related, and analyzed. An effective data model improves report performance, ensures correct results, and makes it simpler to create meaningful visualizations for users. The major ideas in Power BI data modeling are schemas, fact and dimension tables, and relationships.

Star Schema

The star schema is the most commonly used and most recommended data model in Power BI. It has one central fact table and several dimension tables that link to it, forming a star.

The fact table holds quantifiable values such as sales amounts, patient counts, or crop yield.

The information stored in dimension tables is descriptive and includes dates, locations, products, hospitals, or the type of crops.

Each dimension table is related directly to the fact table through a one-to-many relationship. Star schemas are simple to understand and maintain, and they work well with Power BI's in-memory engine. Because fewer joins are needed, reports load faster and DAX calculations are more efficient.

Snowflake Schema

The snowflake schema is a more normalized variation of the star schema. In this model, dimension tables are split into several related tables. For example, a location dimension can be divided into country, region, and city tables.

Although snowflake schemas reduce data redundancy, they add more relationships and complexity. In Power BI, this can slow performance and make DAX formulas harder to write and debug. For this reason, snowflake schemas are generally avoided unless the dataset is large or already structured that way.

Fact and Dimension Tables

To model data well, it is important to understand the distinction between fact and dimension tables:

Fact tables

Consist mainly of numerical values (metrics or measures).
Have many rows.
Examples: admissions, sales transactions, and crop production records.

Dimension tables

Hold descriptive characteristics.
Have fewer rows.
Examples: date, hospital, department, crop type, county.

Measures in Power BI, such as totals, averages, and ratios, are usually computed from fact tables, whereas dimensions are used to slice and filter the data.

Relationships in Power BI

Relationships define how tables are connected. The most common type in Power BI is the one-to-many relationship.

Good practices for relationships include:

Using single-direction filtering (from dimension to fact).
Avoiding many-to-many relationships where possible.
Ensuring matching data types and clean keys.

Ambiguous or overly complex relationships may cause incorrect totals, unexpected filtering, and slow performance.

Significance of Good Data Modeling.

Good data modeling is important for the following reasons:

Performance: Well-designed models, in particular star schemas, save memory and improve report response time.
Precision: Clear relationships ensure that measures are calculated correctly and filters act as expected.
Usability: Clean models make report building easier because users are not confused.
Scalability: A well-structured data model can take on more data and more measures without significant redesign.

With practical data such as hospital records or agricultural data on Kenyan crops, good modeling lets analysts produce credible information to support decision-making.

Conclusion

Effective Power BI reporting depends on data modeling. Star schemas, well-defined fact and dimension tables, and straightforward relationships lead to better performance and more accurate analysis. The snowflake schema has its place, but Power BI works best with clean and intuitive models. Investing in proper data modeling is one of the surest ways to get quality dashboards and reliable business intelligence results.

Introduction to Excel For DATA Analysis

Robert Njuguna — Tue, 27 Jan 2026 17:02:17 +0000

Introduction

Microsoft Excel is one of the most popular data analytics tools among beginners. It is simple to operate, versatile, and strong enough to carry out basic data analysis without any programming skills.

In this article, I present MS Excel as a data analysis tool and show how it can be used with straightforward examples. This tutorial is meant for beginners who are new to Excel and want to know how it can help them analyse data.

Data Analysis

Data analysis is the process of collecting, cleaning, and analysing data to discover important information, trends, and patterns in order to make informed decisions.

Data Collection

Data collection is the process of obtaining primary or secondary data for analysis. Primary data is gathered in the field, while secondary data is already stored, for example in databases.

Once collected, the data is stored in a tool such as Excel for analysis. This is called data entry. For example, say you are an HR analyst and you wish to analyse employee data using Excel. You first have to obtain the employee data from the relevant databases and store it in an Excel file as shown below:

Data Cleaning

Data cleaning involves ridding data of inconsistencies that would otherwise harm the analysis. This can involve:

Data formatting
Getting rid of duplicates (Mainly done in the unique identifier Column)
Handling Outliers
Handling Blanks/Missing Values

Data Formatting

This is the process of correcting any formatting problems.
For instance, in the HR dataset, the dates do not currently fit in the Hire date column. See below:

This is fixed by widening the Hire date column: drag the right edge of the column header, or double-click it so the column fits its contents.

Number formatting also comes into play. For example, the Employee ID column in the HR dataset is initially stored in number format rather than text format.

How do you know this? In Excel, text is left-aligned by default, while dates and numbers are right-aligned by default.

To fix this, use the Format Cells option. Select the Employee_ID column, then go to:

Home > Font > click the dialog launcher arrow > Format Cells > Number > Text

Getting rid of duplicates

This is done by first applying conditional formatting to the unique identifier column to highlight the duplicates. Once the duplicates are visible, you can decide on a rule for dealing with them.

Select the unique identifier column, then go to:
Home > Styles > Conditional Formatting > Highlight Cells Rules > Duplicate Values
and then highlight with your preferred colour.

Once the duplicates are highlighted, the next step is applying your criteria. For instance, in the HR dataset we can choose to keep the oldest entry for each unique identifier (that is, keep the first ID entry and delete subsequent duplicates).
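The "keep the first entry, delete later duplicates" rule can be expressed in a few lines of Python, which may help when checking the result you get in Excel. The function name, column names, and employee rows are hypothetical.

```python
def keep_first(rows, key):
    """Keep the first row seen for each identifier and drop later duplicates."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


# Hypothetical HR rows; E002 appears twice.
hr_rows = [
    {"Employee_ID": "E001", "Name": "Amina"},
    {"Employee_ID": "E002", "Name": "Brian"},
    {"Employee_ID": "E002", "Name": "Brian K."},
    {"Employee_ID": "E003", "Name": "Carol"},
]

deduplicated = keep_first(hr_rows, "Employee_ID")
print([row["Name"] for row in deduplicated])
```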

Handling Blanks/Missing Values

Missing values can result from data entry problems or from data that is unavailable for a particular cell. Say, for example, you have a sales dataset that contains countries and their respective cities, and a particular country has no data in the city column. How do we handle this?

For text columns like this, we can fill the empty cells with "Unavailable" rather than guessing a city and adding incorrect values.

For numeric columns, we can fill empty cells with the average, median, or mode of the other values in that column. The choice depends on what the data represents, e.g., ratings (median) or salary (average or median).

This is done by first computing the chosen statistic for the column:

=AVERAGE(cell range)
=MEDIAN(cell range)
=MODE(cell range)
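The fill-in logic can be sketched in Python using the standard statistics module. The function name and the sample ratings and salaries are made up, and None stands in for an empty cell.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill_value = statistic(known)
    return [fill_value if v is None else v for v in values]


# Hypothetical columns with one blank each.
ratings = [4, None, 5, 3, 4]
salaries = [1000, None, 3000]

filled_ratings = fill_missing(ratings, "median")
filled_salaries = fill_missing(salaries, "mean")
print(filled_ratings, filled_salaries)
```

The statistic is computed from the known values only, so the blank itself never influences the number used to fill it.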

Handling Outliers

1. Identify Outliers

Boxplots (most common)
Z-scores (values beyond ±3 are often outliers)
IQR method

Lower bound = Q1 − 1.5 × IQR

Upper bound = Q3 + 1.5 × IQR
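As a small illustration of the z-score rule, the Python sketch below flags values more than three standard deviations from the mean. The prices are made up, and the sample standard deviation is used on the assumption that it matches what Excel's STDEV computes.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return how many sample standard deviations each value is from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical prices: nineteen ordinary values and one suspicious entry.
prices = [100, 105, 95, 110, 90] * 3 + [100, 105, 95, 110, 1000]

scores = z_scores(prices)
outliers = [p for p, z in zip(prices, scores) if abs(z) > 3]
print(outliers)
```

Note that in very small datasets a single extreme value inflates the standard deviation so much that its z-score may never reach 3, which is one reason the IQR method is often preferred for short lists.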

2. Verify the Outliers

Before removing anything, check:
Is it a data entry error? (extra zero, wrong unit)
Is it a valid but extreme value?

3. Remove Outliers

Use this only if the value is an error or irrelevant.
Delete the row
Filter values outside acceptable ranges

4. Transform the Data

Reduce the impact instead of removing:
Log transformation
Square root transformation

Data Analysis

This is the process of obtaining important information and insights from cleaned and transformed data.

1. Descriptive Statistics (Start Here)

This summarizes your data so you understand it.

In Excel:

Assume your cleaned prices are in Column B.

Mean (Average)

=AVERAGE(B:B)

Median

=MEDIAN(B:B)

Minimum & Maximum

=MIN(B:B)
=MAX(B:B)

Standard Deviation

=STDEV(B:B)

Interpretation example:
“The average product price is KSh 1,666, with most prices clustered around the mean, as indicated by a standard deviation of …”

2. Identify Outliers (IQR Method – Practical)

This shows whether any prices are unusually high or low.

Step-by-step:

Q1

= QUARTILE(B:B,1)

Q3

= QUARTILE(B:B,3)

IQR

= Q3-Q1

Lower Bound

=Q1-1.5*IQR

Upper Bound

=Q3+1.5*IQR

Any price below the lower bound or above the upper bound is an outlier.
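The same IQR steps can be reproduced in Python to cross-check a spreadsheet. This sketch uses statistics.quantiles with method="inclusive", on the assumption that it interpolates quartiles the same way as Excel's QUARTILE; the function name and prices are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return the (lower, upper) outlier bounds using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical product prices with one unusually high value.
prices = [1200, 1500, 1600, 1700, 1800, 2000, 9000]

lower, upper = iqr_bounds(prices)
outliers = [p for p in prices if p < lower or p > upper]
print(lower, upper, outliers)
```

Here Q1 is 1550 and Q3 is 1900, so the IQR is 350 and any price outside the range 1025 to 2425 is flagged.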

Pivot Tables and Visualizations

PIVOT TABLES

Pivot tables are used to summarize data in Excel. They come in handy when you want to group a dataset by a particular criterion or aggregate grouped data. For example, you may want to calculate the total revenue by region, or the average salary of employees by region. This is easily done using pivot tables.

To create one, select the entire data range, then go to Insert > PivotTable > From Table/Range.

This inserts the pivot table in a new worksheet or in an existing worksheet of your choice.

The next step is to summarize the data by the desired criteria, for example, average salary by department as shown below:
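What a pivot table computes for "average salary by department" can be sketched in Python: group the rows by one column, then aggregate another. The employee rows and column names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee rows.
employees = [
    {"Department": "Finance", "Salary": 5000},
    {"Department": "Finance", "Salary": 7000},
    {"Department": "IT", "Salary": 6000},
]

# Group salaries by department (the pivot table's row field).
salaries_by_department = defaultdict(list)
for row in employees:
    salaries_by_department[row["Department"]].append(row["Salary"])

# Aggregate each group (the pivot table's value field, set to Average).
average_salary = {
    department: mean(salaries)
    for department, salaries in salaries_by_department.items()
}
print(average_salary)
```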

Visualization

Charts can be built either from raw data or from a pivot table. From raw data, select one or more columns and insert a chart.

These are some of the charts that can be created using raw data in Excel:

Charts can also be created from pivot tables, by heading to:

PivotTable Analyze > PivotChart,

and selecting the desired chart.

Dashboards

Lastly, once the visualizations are complete, you can present the visuals and KPIs together on one panel. This is called a dashboard. The charts are arranged clearly and systematically to tell a story about the data. Dashboards are also important because all the required information about your data sits in one place, which makes presentation easier.

See an example below:

The above dashboard presents a clear visual story about the employee data.

Conclusion

To sum up, Microsoft Excel is an effective and easy-to-use data analysis tool. It enables one to gather, clean, analyze, and visualize information without having to know how to program. Excel helps users discover valuable insights and trends through data formatting, handling of missing values and duplicates, descriptive statistics, pivot tables, and charts. Dashboards further improve understanding by displaying critical information in a single clear view. Altogether, Excel is a great place to start for anyone who wishes to learn more about data analysis and how to use simple, practical tools to make informed decisions.

Understanding GIT: Version Control, Push and Pull Code, and Tracking changes

Robert Njuguna — Sun, 18 Jan 2026 10:15:09 +0000

1. This article tackles:

What is Version Control
How Git Tracks Changes
How to Push and Pull code

1.1 What is Version Control

Say you write an essay and end up with different versions, e.g., Final_doc_1.docx, final_doc_2.docx. This can quickly become messy.

Version control with Git solves this by recording each set of changes you commit in a single project history, so you do not need to keep separate copies of the file. Git helps in:

Saving each change as a commit
Letting one go back to previous versions
Seeing who changed any part of the document and when.

1.1.1 Saving each change as a commit

This is done through:

git add .
git commit -m "saved changes"

1.1.2 Letting one go back to previous versions

To see previous commits, use:

git log

This displays the commit log, giving the commit ID, the name of the author, and the date of each commit. For instance:

commit abc3def
Author: Robert
Date: Thu Jan 15 10:30:00 2026

To go back to a specific commit, we use:

git checkout <commit-id>

To return to the most recent commit on the main branch, we check out the branch itself:

git checkout main

1.1.3 Showing who changed what and when

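The author and date of each commit appear in the git log output shown above. For a more compact view of the history, use:

git log --oneline

This displays a short, easy-to-read summary of your project’s commit history, with one line per commit showing its abbreviated ID and message.

Next, to see what was changed, the command:

git show

displays the exact changes made in the most recent commit. Pass a commit ID (git show <commit-id>) to inspect an earlier one.

Last, but not least, the command:

git blame filename.txt

shows the line-by-line history of a file: who last changed each line, and in which commit.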

1.2 How Git Tracks Changes

Git does not automatically save every change you make. Instead, it tracks changes using three simple stages. Understanding these stages is the key to using Git correctly.

The Three Stages of Git

1.2.1 Working Directory

This is where you edit your files.

You write code
You delete or modify files
Git notices the changes but does not save them yet

To check changes, one uses:

git status

Git gives you a clear summary of what’s happening in your repository.
Example output:

On branch main
Changes not staged for commit:
  modified: index.html

Untracked files:
  script.js

1.2.2 Staging area

This is where you tell Git which changes you want to save. This is done through:

git add .

Basically, you are preparing a commit by telling Git which changes you wish to keep.

1.2.3 Repository (Commit History)

This is where Git saves the staged changes once you commit. The commit is made with:

git commit -m "New files added"

Once this is done, the staged changes are recorded in Git's history.

Git knows what has changed by comparing your current files with what it has already recorded. To display the changes in your working directory that have not been staged yet, use:

git diff

This displays the added or removed lines, or any modified code.

In simple terms, think of Git like a camera: the Working Directory is the scene you are setting up, the Staging Area is what you frame in the photo, and the Commit is the photo you save.
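For readers who think in code, the three stages can be mimicked with a toy Python class. This is only a mental model with made-up names; real Git stores its data very differently and tracks far more than this.

```python
class ToyRepo:
    """A toy model of Git's three stages, for illustration only."""

    def __init__(self):
        self.working = {}   # working directory: file name -> current contents
        self.staged = {}    # staging area: changes chosen with "git add"
        self.commits = []   # commit history: list of (message, snapshot) pairs

    def edit(self, name, contents):
        """Change a file in the working directory; nothing is saved yet."""
        self.working[name] = contents

    def add(self, name):
        """Stage the current version of one file for the next commit."""
        self.staged[name] = self.working[name]

    def commit(self, message):
        """Record the staged changes on top of the previous snapshot."""
        previous = self.commits[-1][1] if self.commits else {}
        snapshot = {**previous, **self.staged}
        self.commits.append((message, snapshot))
        self.staged = {}


repo = ToyRepo()
repo.edit("index.html", "<h1>Hello</h1>")
repo.edit("script.js", "console.log('hi')")
repo.add("index.html")          # only index.html is framed in the photo
repo.commit("New files added")

latest = repo.commits[-1][1]
print(sorted(latest))
```

The edit to script.js stays in the working directory and is not part of the commit, because it was never staged.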

1.3 How to Push and Pull Code in Git

When working with Git, your code exists in two places: the local repository (your computer) and the remote repository (for example, GitHub). Pushing and pulling keep these two locations in sync.

1.3.1 What is Push?

Once you have committed your changes, pushing uploads those commits to the remote repository (GitHub). This is done using the command:

git push

In short, this is like uploading your saved work.

1.3.2 What does "Pull" mean?

Pull downloads new commits from the remote repository into your local repository (your computer). This matters when you are collaborating, since others may have pushed changes since the last version you worked with.

This is done using the code:

git pull

In short, this is like downloading updates. If you don't pull before pushing, Git may reject your push, and when you do pull you may have to resolve a merge conflict. That’s why pulling first is a good habit.

Fun twist: the name Git comes from British slang for an unpleasant or foolish person, an odd name for a tool this helpful, since Git never forgets your committed changes.