Forem: Sharon M.

Window Functions in PostgreSQL using DBeaver

Sharon M. — Mon, 02 Mar 2026 19:35:05 +0000

Overview

Key Components of Window Functions
- OVER()
- PARTITION BY
- ORDER BY
- Frame Clause (Optional)
The Store Database
Demonstrating Window Functions Using the Store Tables
- Dedicated Window Functions
- Ranking Functions
  - ROW_NUMBER()
  - RANK()
  - DENSE_RANK()
  - NTILE(n)
- Offset (Navigation) Functions

In the previous article, we saw how joins allow us to combine data from multiple related tables. Using a small retail store database, we connected the departments, employees and sales tables to answer various questions about the business.

Joins are essential in relational databases: when information is stored across several tables, joins bring it together so that it can be analyzed as a whole.

However, many analytical questions require something slightly different. Sometimes we first combine tables using joins, and other times we work directly with a single table. In both cases, the goal is the same: to analyze how rows relate to other rows in the dataset.

For example, using our data we might want to answer questions like:

Which employee has the highest sales within each department?
What is the running total of sales over time?
How does each sale compare to the previous sale?
How can employees be ranked based on their performance?

These types of questions require calculations that look at multiple rows at the same time while still keeping every row visible in the result. This is where window functions become useful.

Window functions allow SQL to perform calculations across a group of related rows without collapsing the results into a single summary row. A traditional aggregate such as SUM() or AVG() combined with GROUP BY returns one row per group; a window function computes its value over the same kind of group but returns it alongside every original row.

In PostgreSQL, a function becomes a window function when it is used together with the OVER() clause. The OVER() clause defines the window, which is the set of rows the function will use during its calculation.
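As a rough illustration of the difference, the sketch below runs the same SUM() once with GROUP BY and once with OVER(). It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it is self-contained (this assumes SQLite 3.25 or newer, which added window functions). The three-row sales table is a cut-down, hypothetical version of the store data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (employee_id INT, amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 2000), (1, 3000), (2, 4000)],
)

# Aggregate with GROUP BY: one summary row per employee.
grouped = conn.execute(
    "SELECT employee_id, SUM(amount) FROM sales "
    "GROUP BY employee_id ORDER BY employee_id"
).fetchall()

# The same SUM() used as a window function: every sale row is kept,
# and each row also carries its employee's total.
windowed = conn.execute(
    "SELECT employee_id, amount, "
    "SUM(amount) OVER (PARTITION BY employee_id) AS employee_total "
    "FROM sales ORDER BY employee_id, amount"
).fetchall()

print(grouped)   # [(1, 5000), (2, 4000)]
print(windowed)  # [(1, 2000, 5000), (1, 3000, 5000), (2, 4000, 4000)]
```

The SQL strings themselves should run unchanged in PostgreSQL; only the Python wrapper is specific to this sketch.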

Window functions are commonly used to:

Rank rows within groups (assign ranking positions to rows based on values such as sales performance or salary).
Calculate running totals (compute cumulative values across ordered rows, such as tracking total sales over time).
Compare values between consecutive rows.
Divide data into performance groups (divide rows into buckets based on metrics like revenue or sales volume).
Analyze trends over time (examine how values change across an ordered sequence such as dates or transactions).

We will see these as we demonstrate each window function using the store dataset.

Key Components of Window Functions

Window functions rely on several components that determine how rows are grouped and ordered during a calculation. The most important element is the OVER() clause which may also include:

PARTITION BY
ORDER BY
Window frame clauses (Optional)

OVER()

As stated earlier, an aggregate function such as SUM() or a dedicated window function such as RANK() acts as a window function when it is used together with OVER(). OVER() defines the set of rows (the window) that the function should consider during the calculation.

General syntax:

function_name(expression)
OVER (
    PARTITION BY column
    ORDER BY column
)

PARTITION BY

PARTITION BY divides rows into groups (partitions). Each partition is processed independently by the window function.

Example:

PARTITION BY department_id

This ensures calculations are performed separately for each department.

ORDER BY

The ORDER BY inside the OVER() clause defines the order of rows within each partition.

Example:

ORDER BY sale_date

Ordering matters for calculations such as ranking, running totals and comparisons between rows, because their results depend on which rows come before or after the current one.

Frame Clause (Optional)

A window frame further limits which rows are used in a calculation within a partition. Frame clauses define which rows relative to the current row should be included in the calculation.

Example:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

This frame starts from the first row of the partition and continues up to the current row. It is commonly used when calculating cumulative totals.
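To make the effect of a frame concrete, here is a minimal sketch that computes two sums over the same ordering: one with the cumulative frame shown above and one with a narrower frame that only looks back one row. It runs on Python's built-in sqlite3 module as a stand-in for PostgreSQL (assuming SQLite 3.25+), and the three sales rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount INT)"
)
conn.executemany(
    "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
    [("2025-01-01", 2000), ("2025-01-02", 3000), ("2025-01-03", 2500)],
)

frames = conn.execute(
    """
    SELECT sale_date,
           amount,
           -- Frame: first row of the partition up to the current row.
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           -- Frame: only the previous row and the current row.
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS last_two_sales
    FROM sales
    ORDER BY sale_date
    """
).fetchall()

for row in frames:
    print(row)
# ('2025-01-01', 2000, 2000, 2000)
# ('2025-01-02', 3000, 5000, 5000)
# ('2025-01-03', 2500, 7500, 5500)
```

The two result columns differ only on the third row, which is the first row where the narrower frame leaves something out.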

The Store Database

To demonstrate window functions, we will continue using the same retail store dataset introduced in the previous article on joins.

The database contains three tables:

departments
Each department has multiple employees.
employees
Each employee belongs to one department.
sales
Employees generate sales that are recorded in this table.

Demonstrating Window Functions Using the Store Tables

In PostgreSQL, there is no completely separate window function library. Instead, several built-in functions can operate as window functions when they are used with the OVER() clause. These functions generally fall into two categories:

1. Dedicated Window Functions

Dedicated window functions are specialized analytical functions designed to operate across a set of rows related to the current row without collapsing the result into a single output row.

We have:

- Ranking Functions

Ranking functions assign a ranking value to each row within a partition. They include:

a. ROW_NUMBER()

Assigns a unique, sequential number to each row within a partition. The numbering starts at 1 and increases based on the order specified in the ORDER BY clause. If PARTITION BY is used, the numbering restarts for each partition.

This function is often used to:

Determine the order of events.
Retrieve the first or latest record in a group.
Rank rows uniquely even when values are identical.
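The "latest record in a group" use case can be sketched as follows: number each employee's sales from newest to oldest, then keep only the rows numbered 1. This is a minimal example on Python's built-in sqlite3 module (assuming SQLite 3.25+) with a few illustrative rows; the subquery alias ranked is a name chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (employee_id INT, sale_date TEXT, amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2025-01-01", 2000),
        (1, "2025-01-02", 3000),
        (2, "2025-01-01", 4000),
    ],
)

latest_sales = conn.execute(
    """
    SELECT employee_id, sale_date, amount
    FROM (
        SELECT employee_id, sale_date, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY employee_id
                   ORDER BY sale_date DESC   -- newest sale gets number 1
               ) AS rn
        FROM sales
    ) AS ranked
    WHERE rn = 1
    ORDER BY employee_id
    """
).fetchall()

print(latest_sales)  # [(1, '2025-01-02', 3000), (2, '2025-01-01', 4000)]
```

The filter has to sit in an outer query because a window function cannot be referenced in the WHERE clause of the same SELECT.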

Example:

Question: In what order did each employee record their sales transactions?

SELECT e.employee_name, s.sale_date, s.amount,
ROW_NUMBER() OVER (
        PARTITION BY e.employee_id
        ORDER BY s.sale_date
    ) AS sale_sequence
FROM employees e
JOIN sales s
ON e.employee_id = s.employee_id;

JOIN - The employees and sales tables are joined so that each sale can be associated with the employee who recorded it.

PARTITION BY - This divides the result set into partitions for each employee. Each employee’s sales are processed independently and the row numbering resets for each employee.

ORDER BY - This sorts the sales chronologically within each employee’s partition.

b. RANK()

Assigns a ranking value to each row based on the order defined in the ORDER BY clause. If two or more rows have the same value, they receive the same rank. The next rank number is then skipped, creating a gap in the ranking sequence. For example, if two rows share rank 1, the next row will receive rank 3 instead of 2.

This function is often used to:

Rank employees by performance.
Identify top-selling employees.
Determine position within a leaderboard.

Currently, there are no ties in my dataset. To demonstrate how RANK() behaves when ties occur, I inserted an additional sale for Daenerys:

INSERT INTO sales (employee_id, sale_date, amount)
VALUES (4, '2025-01-04', 4000);

Example:

Question: Which sales transactions generated the highest revenue?

SELECT 
    e.employee_name,
    s.amount,
    RANK() OVER (
        ORDER BY s.amount DESC
    ) AS sales_rank
FROM employees e
JOIN sales s
ON e.employee_id = s.employee_id;

c. DENSE_RANK()

Also assigns a rank to each row but without gaps in the sequence, even if there are ties. This function is useful when continuous ranking positions are required in reports or analyses.

Example:

Question: Which sales transactions generated the highest revenue (while maintaining a continuous ranking sequence)?

SELECT 
    e.employee_name,
    s.amount,
    DENSE_RANK() OVER (
        ORDER BY s.amount DESC
    ) AS sales_dense_rank
FROM employees e
JOIN sales s
ON e.employee_id = s.employee_id;

NOTE: The difference between RANK() and DENSE_RANK() becomes clear when two rows share the same value. In such cases, RANK() introduces gaps in the ranking sequence, while DENSE_RANK() maintains a continuous ranking order.
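A side-by-side run makes the note above easier to see. The sketch below applies RANK() and DENSE_RANK() to the same four amounts, two of which tie at 4000. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (assuming SQLite 3.25+), and the amounts are illustrative rather than the full store table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?)",
    [(4000,), (4000,), (3000,), (2500,)],
)

ranks = conn.execute(
    """
    SELECT amount,
           RANK()       OVER (ORDER BY amount DESC) AS sales_rank,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS sales_dense_rank
    FROM sales
    ORDER BY amount DESC
    """
).fetchall()

for row in ranks:
    print(row)
# (4000, 1, 1)
# (4000, 1, 1)
# (3000, 3, 2)   <- RANK() skips 2, DENSE_RANK() does not
# (2500, 4, 3)
```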

d. NTILE(n)

Divides rows into n roughly equal-sized groups (tiles or buckets). Then it returns the bucket number assigned to each row.

This function is often used in:

Grouping customers into spending levels.
Dividing employees into performance groups.
Creating quartiles (NTILE(4)), deciles (NTILE(10)) or percentiles for analysis.
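"Roughly equal-sized" matters when the row count does not divide evenly by n. In the sketch below, five illustrative totals are split into two buckets, and the extra row goes to the first bucket. It runs on Python's built-in sqlite3 module (assuming SQLite 3.25+); the totals table and its values are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (employee_name TEXT, total_sales INT)")
conn.executemany(
    "INSERT INTO totals VALUES (?, ?)",
    [
        ("Arya", 5000),
        ("Baelish", 4000),
        ("Daenerys", 2500),
        ("Cersei", 1500),
        ("Sansa", 1000),
    ],
)

buckets = conn.execute(
    """
    SELECT employee_name,
           NTILE(2) OVER (ORDER BY total_sales DESC) AS performance_group
    FROM totals
    ORDER BY total_sales DESC
    """
).fetchall()

print(buckets)
# [('Arya', 1), ('Baelish', 1), ('Daenerys', 1), ('Cersei', 2), ('Sansa', 2)]
```

With five rows and two buckets, the first bucket holds three rows and the second holds two.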

Example:

Question: How can employees be divided into performance groups based on their total sales?

SELECT e.employee_name,
    SUM(s.amount) AS total_sales,
    NTILE(2) OVER (
        ORDER BY SUM(s.amount) DESC
    ) AS performance_group
FROM employees e
JOIN sales s
ON e.employee_id = s.employee_id
GROUP BY e.employee_name;

- Offset (Navigation) Functions

In the next part we will look into navigation (offset) functions such as LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE() and NTH_VALUE().

I hope this article helps you to better understand how window functions are used in SQL.

Joins in PostgreSQL Using DBeaver

Sharon M. — Mon, 02 Mar 2026 19:25:45 +0000

In this article, I will walk through the process step by step:

1. Creating a Database and Schema
- Create the Database
- Create a Schema
- Set the Search Path
2. Creating the Tables
- Table 1: departments
- Table 2: employees
- Table 3: sales
3. Demonstrating Joins Using the Store Tables
- a) INNER JOIN
- b) LEFT JOIN
- c) RIGHT JOIN
- d) FULL JOIN
- e) NATURAL JOINS
- f) CROSS JOINS
- g) SELF JOIN
Inclusive vs Exclusive Joins

Suppose you manage a small retail store. The store has several departments, namely Electronics, Clothing, and Groceries. Each department has employees assigned to it, and those employees generate sales.

Now you want to evaluate performance.

You might ask:

Which employee belongs to which department?
How much revenue has each employee generated?
Are there employees who have not recorded any sales?

At this point, the data you need is not stored in a single table. It is distributed across several related tables. That is how relational databases are designed. Data is separated logically to reduce redundancy and maintain consistency.

To demonstrate how this data can be combined meaningfully, I will use PostgreSQL as the database management system. I am connecting to a locally installed PostgreSQL server using DBeaver, which serves as the SQL client for writing and executing queries.

1. Creating a Database and Schema

Create the Database

A database is the top-level container that stores schemas, tables, functions, and other database objects. This statement creates a new PostgreSQL database named demo.

CREATE DATABASE demo;

After creating the database, I connect to it in DBeaver so that all subsequent objects are created inside demo.

Create a Schema

A schema organizes database objects within a database. It acts as a namespace, preventing naming conflicts and grouping related tables logically. This statement creates a schema named joins_window.

CREATE SCHEMA joins_window;

In this case, the schema joins_window will contain all tables used in this demonstration.

Set the Search Path

By default, PostgreSQL searches for tables in the public schema. Setting the search path ensures that any tables created or queried will reference the joins_window schema automatically. Syntax:

SET search_path TO joins_window;

This keeps the work organized and avoids having to prefix every table with the schema name.

2. Creating the Tables

To demonstrate joins effectively, I created three related tables:

departments
employees
sales

Each table represents a distinct entity in the store system, and the relationships between them are enforced using primary and foreign keys.

Table 1: departments

Syntax:

CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(50)
);

INSERT INTO departments (department_name) VALUES
('Electronics'),
('Clothing'),
('Groceries');

The department_id column is defined as a primary key.

This guarantees that each department has a unique identifier.

The SERIAL keyword automatically generates sequential integer values for each new row.

Table 2: employees

Syntax:

CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    employee_name VARCHAR(50),
    department_id INT,
    salary NUMERIC,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);

INSERT INTO employees (employee_name, department_id, salary) VALUES
('Arya', 1, 50000),
('Baelish', 1, 60000),
('Cersei', 2, 40000),
('Daenerys', 2, 45000),
('Sansa', 3, 35000);

Here, employee_id is the primary key.

The department_id column is a foreign key referencing departments(department_id). This establishes a relationship between employees and departments.

The foreign key constraint enforces referential integrity. In practical terms, it prevents inserting an employee with a department that does not exist in the departments table.
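To see referential integrity in action, the sketch below tries to insert an employee into a department that does not exist. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. One difference to note: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, while PostgreSQL enforces them by default. The employee name Ghost and department 99 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute(
    "CREATE TABLE departments ("
    " department_id INTEGER PRIMARY KEY,"
    " department_name TEXT)"
)
conn.execute(
    "CREATE TABLE employees ("
    " employee_id INTEGER PRIMARY KEY,"
    " employee_name TEXT,"
    " department_id INT REFERENCES departments(department_id))"
)
conn.execute("INSERT INTO departments (department_name) VALUES ('Electronics')")

# Department 1 exists, so this insert succeeds.
conn.execute("INSERT INTO employees (employee_name, department_id) VALUES ('Arya', 1)")

# Department 99 does not exist, so the constraint rejects the row.
try:
    conn.execute(
        "INSERT INTO employees (employee_name, department_id) VALUES ('Ghost', 99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(rejected, employee_count)  # True 1
```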

This design models a one-to-many relationship:

One department can have many employees.
Each employee belongs to exactly one department.

Table 3: sales

Syntax:

CREATE TABLE sales (
    sale_id SERIAL PRIMARY KEY,
    employee_id INT,
    sale_date DATE,
    amount NUMERIC,
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);

INSERT INTO sales (employee_id, sale_date, amount) VALUES
(1, '2025-01-01', 2000),
(1, '2025-01-02', 3000),
(2, '2025-01-01', 4000),
(3, '2025-01-01', 1500),
(4, '2025-01-03', 2500),
(5, '2025-01-02', 1000);

Again, sale_id is the primary key.

The employee_id column is a foreign key referencing employees(employee_id). This ensures that every sale is linked to a valid employee.

This creates another one-to-many relationship:

One employee can record multiple sales.
Each sale belongs to one employee.

3. Demonstrating Joins Using the Store Tables

With the tables now defined and populated, we can begin examining how joins operate. The data we need is distributed across separate tables by design. Departments are stored independently from employees, and sales are stored independently from employees. This separation prevents redundancy and enforces data integrity. However, it also means that answering even simple analytical questions requires combining tables.

That combination is achieved using joins. Here is a list of the joins we will be looking into:

a) INNER JOIN - returns only matching rows (intersection of both tables)

b) LEFT JOIN - preserves all rows from the left table.

c) RIGHT JOIN - preserves all rows from the right table.

d) FULL JOIN - preserves all rows from both tables.

e) NATURAL JOIN - automatically matches columns with identical names.

f) CROSS JOIN - generates all possible row combinations (Cartesian product).

g) SELF JOIN - joins a table to itself to model hierarchical relationships.

In the sections that follow, some additional rows may be inserted to clearly demonstrate how different join types behave when matching data is present and when it is missing.

a) INNER JOIN

An INNER JOIN returns only the combinations of rows from the two tables for which the join condition evaluates to true. In other words, it keeps only matching records.

General Syntax:

SELECT table1.column1, table2.column2
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field;

The ON clause defines the relationship between the two tables. Without it, PostgreSQL would not know how the rows should be matched.

Example:

SELECT 
    e.employee_name,
    d.department_name,
    e.salary
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id;

PostgreSQL compares employees.department_id with departments.department_id. For every employee row, it searches for a department with the same department_id. When a match is found, the rows are combined into a single result row. If no match is found, that employee would not appear in the result. It returns only the intersection between the two tables.

b) LEFT JOIN

A LEFT JOIN guarantees that all rows from the left table appear in the result set. The key difference from an INNER JOIN is this:

If a matching row does not exist in the right table, the left table row is still preserved. The columns from the right table are filled with NULL.

General Syntax:

SELECT table1.column1, table2.column2
FROM table1
LEFT JOIN table2
ON table1.common_field = table2.common_field;

At the moment, every employee in our dataset has at least one sale. That would make the LEFT JOIN behave similarly to an INNER JOIN. To demonstrate the difference clearly, we insert an employee who has no associated sales:

INSERT INTO employees (employee_name, department_id, salary)
VALUES ('Jon', 1, 48000);

Now Jon exists in the employees table but has no corresponding row in the sales table. Let us see how the output will be.

Example:

Question: Which employees exist in the company, and what sales have they recorded, if any?

SELECT 
    e.employee_name,
    s.sale_date,
    s.amount
FROM employees e
LEFT JOIN sales s
ON e.employee_id = s.employee_id
ORDER BY e.employee_name;

PostgreSQL starts with the employees table (the left table). For each employee, it attempts to find matching rows in the sales table using employee_id. If matches are found, the rows are combined. If no match is found, the employee row is still returned.
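The contrast with an INNER JOIN can be sketched on a two-employee version of the data, where Arya has one sale and Jon has none. The example runs on Python's built-in sqlite3 module as a stand-in for PostgreSQL, where SQL NULL shows up as Python's None. The reduced tables are an assumption made to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT)")
conn.execute("CREATE TABLE sales (employee_id INT, amount INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Arya"), (2, "Jon")])
conn.execute("INSERT INTO sales VALUES (1, 2000)")

query = (
    "SELECT e.employee_name, s.amount "
    "FROM employees e {join} sales s "
    "ON e.employee_id = s.employee_id "
    "ORDER BY e.employee_name"
)

inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # [('Arya', 2000)]
print(left_rows)   # [('Arya', 2000), ('Jon', None)]
```

Jon disappears from the INNER JOIN result but is preserved by the LEFT JOIN with a NULL amount.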

c) RIGHT JOIN

A RIGHT JOIN is structurally similar to a LEFT JOIN, but where a LEFT JOIN guarantees all rows from the left table, a RIGHT JOIN guarantees all rows from the right table. If a matching row does not exist in the left table, the right table row is still returned and the columns from the left table are filled with NULL.

General Syntax:

SELECT table1.column1, table2.column2
FROM table1
RIGHT JOIN table2
ON table1.common_field = table2.common_field;

The table written after RIGHT JOIN is the one that will always be preserved (all rows from the right table).

Example:

Question: What sales were recorded, and who is associated with each one?

SELECT 
    e.employee_name,
    s.sale_date,
    s.amount
FROM employees e
RIGHT JOIN sales s
ON e.employee_id = s.employee_id;

PostgreSQL starts with the sales table (the right table). For each sale, it attempts to find a matching employee using employee_id. If a match is found, the rows are combined. If no match is found, the sale row is still returned.

When we performed a LEFT JOIN between employees and sales, Jon appeared with NULL values in the sales columns because he had no recorded sales. In this RIGHT JOIN, Jon does not appear at all because the right table is sales. Jon has no row in the sales table, therefore, there is nothing for the RIGHT JOIN to preserve on his behalf.

d) FULL JOIN

A FULL JOIN combines the qualities of both LEFT and RIGHT joins: no row from either table is excluded. If a row exists in one table but not in the other, it still appears in the result and the columns from the missing side are filled with NULL.

General Syntax:

SELECT table1.column1, table2.column2
FROM table1
FULL OUTER JOIN table2
ON table1.common_field = table2.common_field;

Example:

Question: Provide a complete view of employee records and sales transactions.

SELECT 
    e.employee_name,
    s.sale_date,
    s.amount
FROM employees e
FULL OUTER JOIN sales s
ON e.employee_id = s.employee_id
ORDER BY e.employee_name;

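PostgreSQL attempts to match employees to sales using employee_id. Where a match exists, the rows are combined. Where no match exists, the unmatched row from either table is still returned, with NULL in the columns of the other table.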

e) NATURAL JOINS

A NATURAL JOIN is a type of join that automatically matches columns with the same name in both tables. It is important to note:

NATURAL JOIN relies on the tables sharing column names. If they share none, PostgreSQL falls back to a cross product.
Warning: NATURAL JOIN can be dangerous if tables share multiple same-named columns. It joins on all shared column names, not just the one you intended.

In our schema, both employees and departments contain a column named department_id.
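The warning about shared column names can be reproduced with a small hypothetical variation of the schema. Suppose both tables had used a plain name column instead of employee_name and department_name. NATURAL JOIN would then match on department_id and name together, and return nothing. The sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the renamed columns are an assumption made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: both tables share department_id AND name.
conn.execute("CREATE TABLE departments (department_id INT, name TEXT)")
conn.execute("CREATE TABLE employees (employee_id INT, name TEXT, department_id INT)")
conn.execute("INSERT INTO departments VALUES (1, 'Electronics')")
conn.execute("INSERT INTO employees VALUES (1, 'Arya', 1)")

# NATURAL JOIN silently requires name = name as well, so nothing matches.
natural_rows = conn.execute(
    "SELECT employee_id FROM employees NATURAL JOIN departments"
).fetchall()

# An explicit ON clause joins only on the intended column.
explicit_rows = conn.execute(
    "SELECT e.name, d.name "
    "FROM employees e JOIN departments d "
    "ON e.department_id = d.department_id"
).fetchall()

print(natural_rows)   # []
print(explicit_rows)  # [('Arya', 'Electronics')]
```

This is the main reason many teams prefer an explicit ON or USING clause over NATURAL JOIN.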

General Syntax:

SELECT column_list
FROM table1
NATURAL JOIN table2;

Example:

SELECT 
    employee_name,
    department_name
FROM employees
NATURAL JOIN departments;

f) CROSS JOINS

A CROSS JOIN produces the Cartesian product of two tables: every row from the first table is combined with every row from the second table.

So, if table A contains m rows and table B contains n rows, the result of a CROSS JOIN will contain: m × n rows.
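The m × n rule is easy to check. With the six employees (including Jon) and three departments from this article, a CROSS JOIN should return 18 rows. The sketch below counts them using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The tables are reduced to a single name column for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT)")
conn.execute("CREATE TABLE departments (department_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [(name,) for name in ("Arya", "Baelish", "Cersei", "Daenerys", "Sansa", "Jon")],
)
conn.executemany(
    "INSERT INTO departments VALUES (?)",
    [(name,) for name in ("Electronics", "Clothing", "Groceries")],
)

m = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
n = conn.execute("SELECT COUNT(*) FROM departments").fetchone()[0]
pairs = conn.execute(
    "SELECT e.employee_name, d.department_name "
    "FROM employees e CROSS JOIN departments d"
).fetchall()

print(m, n, len(pairs))  # 6 3 18
```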

General Syntax:

SELECT column_list
FROM table1
CROSS JOIN table2;

Example:

Question: What are all possible employee–department assignment combinations?

SELECT 
    e.employee_name,
    d.department_name
FROM employees e
CROSS JOIN departments d;

g) SELF JOIN

A SELF JOIN occurs when a table is joined to itself. At first glance, that may seem unnecessary; I honestly thought the same thing when I came across this join. Why would a table need to join to itself? The answer lies in hierarchical relationships. Many real-world structures are recursive in nature:

An employee reports to another employee (Maybe a manager).
A product category contains subcategories.
A comment replies to another comment.

In each case, the relationship exists within the same table.

This is why we use a SELF JOIN: it allows us to treat the same table as two logical instances and define a relationship between them.

Now this is where we slightly modify the employees table to make it meaningful. We will introduce a reporting structure in the employees table by adding a manager_id column:

ALTER TABLE employees
ADD COLUMN manager_id INT;

The manager_id will reference another employee_id in the same table.

Next, we assign managers:

UPDATE employees SET manager_id = 2 WHERE employee_id = 1; -- Arya reports to Baelish
UPDATE employees SET manager_id = 2 WHERE employee_id = 3; -- Cersei reports to Baelish
UPDATE employees SET manager_id = 4 WHERE employee_id = 5; -- Sansa reports to Daenerys

General Syntax:

SELECT columns
FROM table_name t1
JOIN table_name t2
ON t1.related_column = t2.primary_key;

Example:

Question: For each employee, who is their assigned manager?

SELECT 
    e.employee_name AS employee,
    m.employee_name AS manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.employee_id;

The employees table is referenced twice: e represents the employee and m represents the manager. PostgreSQL matches e.manager_id to m.employee_id. Where a match exists, the employee is paired with their manager. If an employee has no assigned manager, the manager column appears as NULL.

Inclusive vs Exclusive Joins

Inclusive joins return matching rows along with unmatched rows from one or both tables.

Exclusive joins return only unmatched rows by filtering NULL values after an outer join.

Exclusive joins = Outer Join + NULL filter. These are not separate SQL keywords, but rather combinations of outer joins with filtering conditions using WHERE ... IS NULL.

This is used in analytics to:

Find customers without orders.
Find products never sold.
Find users who didn’t log in.
Find orphaned records. You can learn more about Inclusive vs Exclusive Joins here.
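The "outer join + NULL filter" pattern can be sketched with the store data: a LEFT JOIN from employees to sales, filtered to the rows where no sale was found, returns only the employees without sales. The example uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, with the tables cut down to two employees for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT)")
conn.execute("CREATE TABLE sales (employee_id INT, amount INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Arya"), (2, "Jon")])
conn.execute("INSERT INTO sales VALUES (1, 2000)")

employees_without_sales = conn.execute(
    """
    SELECT e.employee_name
    FROM employees e
    LEFT JOIN sales s
      ON e.employee_id = s.employee_id
    WHERE s.employee_id IS NULL   -- keep only the unmatched left rows
    """
).fetchall()

print(employees_without_sales)  # [('Jon',)]
```

The filter tests the join column from the right table, which can only be NULL here when the LEFT JOIN found no match.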

I hope this article helps you to better understand how joins are used in SQL.

Introduction to MS Excel for Data Analysis

Sharon M. — Sun, 15 Feb 2026 15:27:03 +0000

Most people think Excel is just for typing in data and doing quick totals. But when I use Excel for data analysis, I treat it like a workspace where I analyze raw data.

Microsoft Excel is still one of the most used tools in organizations. Even with newer platforms around, a lot of real analysis still starts in Excel. And honestly, I get why. It’s flexible, familiar, easily accessible and it lets you explore and interpret data step by step.

In this article, I’ll walk you through the flow I follow from the moment I receive a dataset to the point where I can confidently present insights.

1. Establishing the Analytical Objective

When I receive data, my first instinct is not to calculate but to understand what I’m solving.

Sometimes the question is obvious. Sometimes it isn’t. Either way, I ask myself what decision the analysis should support. Am I explaining why something changed over time? Am I comparing regions? Am I checking if we’re making profit or loss? Am I investigating a performance decline?

If the objective is not clearly defined, it is easy to calculate figures that appear meaningful but do not answer the core question. Defining the objective early ensures that each step taken in Excel aligns with a specific goal.

2. Evaluating the Raw Data Structure

After defining the objectives, I scroll through the data slowly, examining its structure rather than immediately performing calculations so as to determine what kind of data I’m dealing with.

I try to understand what each row represents. One row could be a transaction, a customer, an event, or a product record. Then I look at the columns to check whether the headers are clear, whether there are blank rows in the middle, and whether someone added totals inside the raw dataset.

For data to be suitable for analysis, it should follow a consistent structure. Each row should represent a single record and each column should represent one variable.

3. Data Preparation and Cleaning

This is the step many beginners might find challenging, but don’t skip it. The quality of your data matters, and this stage determines whether the final results will be reliable.

I check for duplicates because duplicates can make totals look larger than they should, and I always check whether a duplicate is legitimate or erroneous. I also identify missing values because blanks can break calculations or hide patterns.

I also confirm and change data types because Excel needs to understand what a date is and what a number is. If Excel stores a number as text, the formula might not behave the way I expect.

I also scan for values that don’t make sense. Things like negative quantities, strange dates, or percentages that look unrealistic. This stage isn’t exciting, and there is a lot more that goes into data cleaning in Excel.

Learn more about Data Cleaning with Excel.

4. Designing a Logical Workbook Structure

Once the raw data looks clean and usable, I separate my work into sheets. This is one habit I always keep.

I keep the raw data in one sheet, which I usually rename Original. Then I copy the cleaned data onto new sheets, each labeled according to the work done on it (e.g. calculations, summaries, pivot tables, dashboard).

When everything lives in one giant sheet, errors hide. It also becomes hard to explain what you did. When sheets are separated, you are able to easily navigate through your thought process because it is easy to keep track of what you have done.

5. Creating New Columns

After ensuring the data is clean and structured, I determine which additional columns (metrics) are required based on my objectives for this project.

Raw datasets often contain transactions but lack interpretive measures. So, I create derived metrics based on what I’m trying to find out: I may calculate revenue, profit, percentage margins, growth rates, or time-based groupings such as month or quarter, or even generate categories.

It is important to do the necessary calculations, focusing only on metrics that directly affect or support the analysis. This keeps the workbook manageable and also aligned with the objective.

Learn more about formulas and functions in Excel.

6. Summarizing Data and Identifying Patterns

Once derived metrics (new columns) are in place, I move from row-level calculations to pattern recognition.

PivotTables are usually my first choice for this because they allow grouping, aggregation and comparison across categories without rewriting formulas repeatedly. So, instead of focusing on individual records, I proceed to analyzing trends and relationships.

For example, you can compare regions, categories, months or IDs without writing long formulas.

7. Visual Interpretation of Analytical Results

Visual representation supports interpretation. That is, depending on the objective, I use line charts to show trends, column charts to compare categories or tables when precise values are required. I select visualizations based on clarity because charts should simplify understanding. Each chart you choose should clearly represent what you want to interpret. And different charts are fit for different representations.

Learn more about charts in Excel.

8. Scenario Evaluation and Sensitivity Analysis (Testing “What If” Scenarios)

In more advanced analyses, an analyst can evaluate how changes in assumptions affect outcomes. Excel allows input values to be adjusted while linked formulas recalculate results automatically. This makes it possible to test how changes in cost, pricing or volume influence performance.

Learn more about What-If Analysis with Data Tables in Excel.

9. The Last Step Is Always Presentation and Communication of Findings

The final stage involves presenting insights clearly. That is why it is important to create a clean summary that someone else can understand.

In order to do so, I create one final sheet in the workbook where I bring together the most important results in one place (The Dashboard). This sheet does not contain raw data or calculations. Instead, it displays key metrics, selected charts, summary figures that directly answer the original analytical question and slicers for interaction.

Excel is powerful, but it has some limitations.

Excel alone may not be enough when data becomes very large, when many people need access at the same time, or when reporting must refresh automatically. At that point, an analyst such as you or me may require complementary tools such as Power BI.

How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI

Sharon M. — Mon, 09 Feb 2026 19:09:55 +0000

Most organizations are not suffering from having too little data; they are actually drowning in too much of it. They have spreadsheets everywhere. Reports that don’t match. Systems that don’t talk to each other. Numbers that look fine until you try to explain them.

That’s where analysts come in and, in our case, the analyst is using Power BI to answer the question that most companies have:

“We have the data, but we have no idea what it’s telling us.”

Power BI gives analysts the tools to take messy, uncooperative data and turn it into something people can actually use. But that only works when every step in the process is done correctly.

From messy data to something you can interpret and report on: Data Shaping

Real-world data is rarely clean. Anyone who has opened an Excel file pulled from an operational system knows this immediately.

You might be presented with data that contains blank rows, duplicate records, or values saved with the wrong data type, for example dates saved as text. Different words may all refer to the same thing, for example “N/A”, “None”, and “Error”. County names may be written five different ways (Nairobi, nairobi, NAIROBI and so on), all treated as separate values.

If you build visuals on top of data like that, the charts might look impressive. The insights won’t be, and bad insights lead to bad decisions.

That’s why analysts start with Power Query.

Power Query Editor is where the unglamorous work happens. Removing duplicates, fixing data types, cleaning text, replacing errors, standardizing values. It’s not exciting work, but it’s essential. This step ensures that whatever comes next is built on data that is actually reliable.
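Power Query does this work through its editor, but the logic behind the steps is simple enough to sketch in a few lines of Python. The example below is not Power Query code. It is a rough illustration of the same ideas: standardizing text, treating placeholder words as missing, converting text to real dates, and removing duplicates. The column names, the placeholder list and the date format are all assumptions made for the sketch.

```python
from datetime import datetime

# Assumed set of words that all mean "no value" in the raw data.
PLACEHOLDERS = {"n/a", "none", "error", ""}


def clean_row(row):
    """Return a cleaned copy of one raw record (a dict of strings)."""
    county = row["county"].strip()
    raw_date = row["date"].strip()
    return {
        # Standardize casing so Nairobi / nairobi / NAIROBI become one value.
        "county": None if county.lower() in PLACEHOLDERS else county.title(),
        # Convert text to a real date; placeholders become missing values.
        "date": (
            None
            if raw_date.lower() in PLACEHOLDERS
            else datetime.strptime(raw_date, "%Y-%m-%d").date()
        ),
    }


raw = [
    {"county": "Nairobi", "date": "2025-01-01"},
    {"county": "nairobi ", "date": "2025-01-01"},
    {"county": "NAIROBI", "date": "N/A"},
]

cleaned = [clean_row(r) for r in raw]

# Remove exact duplicates, keeping one copy of each distinct row.
deduplicated = list({tuple(r.items()): r for r in cleaned}.values())

print(len(raw), len(deduplicated))  # 3 2
```

After cleaning, the first two rows turn out to be the same record, which only becomes visible once the county names are standardized.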

You can learn more about data cleaning in Power BI here.

Data Modelling: Why structure matters

Once the data is clean, the next question we should ask ourselves isn’t, “Which chart should I use?” It’s:

“How should this data be structured?”

This is where many beginners struggle, especially if they’re used to flat Excel tables. Power BI doesn’t really work with spreadsheets. It works with models.

Instead of cramming everything into one giant table, analysts separate data into two main types:

Fact tables — the numbers: sales, revenue, quantities
Dimension tables — the context: dates, products, locations, customers

This setup reflects how a business actually operates. Sales happen on a date, at a store, for a product, by a customer. Modelling data this way lets Power BI filter and calculate correctly without guessing.

Relationships within the data then connect these tables. When they’re set up properly, selecting a county, a date, or a product automatically filters the right records in the background.

When relationships are missing or wrong, everything starts to feel off. Totals stop making sense. Slicers don’t behave. Numbers refuse to change when they should.

And most of the time, that’s not a DAX problem. It’s a modelling one.

You can learn more about data modelling and schemas in Power BI on Microsoft.

DAX: where business questions turn into logic

Raw data doesn’t usually answer business questions by itself. You won’t find managers asking questions like:

“What’s in column F?”

They ask things like:

How much revenue did we make?
Are we actually profitable?
Is performance improving or getting worse?
Which areas are falling behind?

That’s where DAX comes in.

DAX lets analysts define business logic once and reuse it everywhere. Instead of hardcoding numbers into visuals, analysts create measures like total revenue, profit margin, averages, and trends. These measures automatically respond to filters and slicers.

The real strength of DAX isn’t memorizing functions. It’s understanding that once a rule is defined, it behaves consistently across charts, tables, and dashboards, so the output stays reliable.
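To make the "define once, reuse everywhere" idea concrete, here is a small Python sketch. It is not DAX; it only imitates how a measure is a single rule that gets re-evaluated for whatever rows the current filters leave visible. The sales rows, the names total_revenue and profit_margin, and the apply_filters helper are all made up for illustration.

```python
# Hypothetical sales rows; in Power BI these would live in a fact table.
sales = [
    {"county": "Nairobi", "year": 2024, "revenue": 500.0, "cost": 300.0},
    {"county": "Nairobi", "year": 2025, "revenue": 700.0, "cost": 400.0},
    {"county": "Kisumu", "year": 2025, "revenue": 300.0, "cost": 250.0},
]


def apply_filters(rows, **filters):
    """Keep only rows matching every filter, like slicer selections."""
    return [r for r in rows if all(r[col] == val for col, val in filters.items())]


# "Measures": each rule is written once and works on any filtered set of rows.
def total_revenue(rows):
    return sum(r["revenue"] for r in rows)


def profit_margin(rows):
    revenue = total_revenue(rows)
    cost = sum(r["cost"] for r in rows)
    # Return None when there is no revenue, to avoid dividing by zero.
    return (revenue - cost) / revenue if revenue else None


# The same measures evaluated under different filter selections.
print(total_revenue(sales))                                    # no filters
print(total_revenue(apply_filters(sales, county="Nairobi")))   # one slicer
print(profit_margin(apply_filters(sales, county="Nairobi", year=2025)))
```

The rule for total revenue is never rewritten. Only the set of rows it receives changes, which is roughly what happens when a user clicks a slicer.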

You can learn more about DAX functions here.

Choosing Visuals

Charts often get treated like decoration. Pick something colorful, add labels, move on.

That’s not how analysts think about visuals.

Every visual should answer a question. Are we:

Comparing categories? Bar or column charts do that well.
Looking at trends over time? Line charts make patterns obvious.
Showing proportions? Pie charts or treemaps work, but only when there are few categories.
Tracking performance? KPI cards work.

Using the wrong chart can quietly push people toward the wrong conclusion. A cluttered dashboard doesn’t feel confusing because the viewer isn’t smart. It feels confusing because the analyst didn’t make clear choices.

You don’t use every visual available. You choose the ones that make the message easier to understand.

You can learn more about choosing the right visuals here.

Dashboards aren’t reports

Dashboards differ from reports: they’re built for decisions, not deep exploration.

A good dashboard fits on a single screen. It highlights the most important numbers first, shows key trends, and makes it obvious where things are going well and where they aren’t.

You should be able to glance at it and immediately know whether attention is needed or not.

The charts you created can now be placed on your dashboard to support quick decisions. Most importantly, a dashboard answers a small number of critical questions.

So, if users have to scroll endlessly or guess what a visual means, the dashboard has already failed, no matter how accurate the data behind it is.

You can learn more about dashboards and reports here.

Turn Insights into Action

We did not do all of this just to produce nice-looking models and dashboards.

Insights have to lead somewhere.

A well-designed Power BI dashboard and report help decision-makers to:

spot underperforming areas early
catch trends before they become serious problems
use resources more efficiently
track progress against targets
ask better follow-up questions

At this point, your role as the analyst shifts. You’re no longer just building charts. You’re deciding what deserves attention. You’re making risks hard to ignore. You’re helping leaders act with information instead of reacting under pressure.

How It All Comes Together

The Power BI architecture reflects this exact workflow.

Data first comes from different sources such as Excel files, databases, and text files. Power Query handles the extract, transform, and load (ETL) process, cleaning and shaping the data. The cleaned data is stored in a structured model where relationships and calculations using DAX are applied. From there, visuals and dashboards are built and shared with business users.

Each step depends on the one before it.

If the data is messy, the model breaks.

If the model is weak, DAX results are unreliable.

If visuals are poorly chosen, decisions suffer.

But understanding this end-to-end process makes it clear that Power BI isn’t just a reporting tool. Used properly, it becomes a powerful way to turn data into insight and insight into action.

For anyone looking to deepen their understanding, I highly recommend the Microsoft Power BI documentation as a place to start.

Schemas and Data Modelling in Power BI

Sharon M. — Mon, 02 Feb 2026 19:08:38 +0000

When most people hear Power BI, they immediately think of dashboards, colorful charts, large numeric cards, clickable slicers, and polished visuals. These elements are what users interact with, so it is natural to assume they are what makes a report “good.”

In reality, however, dashboards are only the final layer. Long before any visual appears on the screen, critical decisions have already been made. These decisions determine whether a report is fast or slow, reliable or misleading, intuitive or frustrating. That earlier and often invisible stage is data modelling.

Data modelling in Power BI is the process of organizing data so that Power BI understands what the data represents and how different pieces of information relate to one another. It involves deciding:

which tables are needed,
what each table should contain,
how those tables should be connected.

When this structure is well designed, Power BI feels logical and predictable. When it is not, even simple questions can return confusing or incorrect results. In other words, the quality of a Power BI report is decided before the first chart is ever created.

What “Schema” Means in Power BI

In Power BI, a schema refers to the overall structure of the data model. This is not a theoretical concept; it is the actual layout you see in Model view, including the tables and the relationships between them.

A schema answers very practical questions:

What tables exist in the model?
Which tables store measurements, and which store descriptions?
When a user clicks a slicer, how does Power BI know which data to include?

Power BI does not “reason” about data in a human way. Instead, it follows the paths you define. The schema determines:

how filters move from one table to another,
how totals and averages are calculated,
and how fast visuals respond when users interact with the report.

Two schema patterns appear most frequently in Power BI models:
Star schema
Snowflake schema

Understanding the difference between these two explains why some Power BI models feel simple and trustworthy, while others feel fragile and unpredictable.

Fact Tables and Dimension Tables: Understanding the Roles of Tables

Most Power BI models are built using two types of tables. Understanding what each one does is the foundation of data modelling.

Fact Tables: Recording What Happened

A fact table records events. Each row represents something that actually occurred.

In a dataset such as Kenya crops data, a single row in the fact table might represent:

a specific crop,
grown in a specific county,
during a specific year or season,
with a measurable outcome such as yield in kilograms.

Because these events are recorded repeatedly over time, fact tables typically:

grow very large,
repeat the same crops or counties many times,
focus on numeric values that can be summed, averaged, or counted.

A fact table does not explain what a crop is or where a county is located. It simply records that something happened.

Dimension Tables: Giving Meaning to the Events

Dimension tables exist to describe and contextualize the facts. Instead of repeating names and descriptions in every row of the fact table, that information is stored once in separate tables, such as:

a Crop table that stores crop names and types,
a County table containing county names,
a Date table containing years or seasons.

Dimension tables typically:

change slowly compared to fact tables,
contain descriptive rather than numerical data,
are used to filter, group, and label results in reports.

When you select a county or crop in a slicer, Power BI relies on the dimension table to determine which rows in the fact table should be included. This separation is what makes analysis both efficient and accurate.

The Star Schema: A Structure That Matches How Power BI Thinks

The star schema is the most effective and widely recommended structure for Power BI models.

In a star schema:

one fact table sits at the center (for example, crop yield records),
each dimension table connects directly to that fact table (crop, county, date),
dimension tables do not connect to each other.

This structure aligns closely with how Power BI processes filters.

When you select a county in a slicer, Power BI:

Looks at the County table.
Identifies the selected county’s unique key.
Follows the relationship directly to the fact table.
Keeps only the matching rows.
Performs calculations using those rows.

Because each dimension connects straight to the fact table, filters move directly to the data being analyzed and Power BI does not need to pass through intermediary tables. As a result, calculations behave consistently.

This lets the structure itself handle much of the analytical logic, reducing the need for complex formulas later.
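The five filtering steps listed earlier can be sketched in plain Python. This is only an illustration of the idea, not how Power BI's engine is implemented; the table contents, the county_id key, and the function name are all made up for the example.

```python
# Dimension table: one row per county, each with a unique key (hypothetical data).
county_dim = [
    {"county_id": 1, "county_name": "Nakuru"},
    {"county_id": 2, "county_name": "Kiambu"},
]

# Fact table: one row per recorded event, pointing at the dimension by key.
yield_fact = [
    {"county_id": 1, "crop": "Maize", "year": 2023, "yield_kg": 1200},
    {"county_id": 1, "crop": "Beans", "year": 2023, "yield_kg": 400},
    {"county_id": 2, "crop": "Maize", "year": 2023, "yield_kg": 900},
    {"county_id": 1, "crop": "Maize", "year": 2024, "yield_kg": 1500},
]


def total_yield_for_county(selected_name):
    # Steps 1-2: look at the County table and find the selected county's key.
    keys = {c["county_id"] for c in county_dim if c["county_name"] == selected_name}
    # Steps 3-4: follow the relationship and keep only matching fact rows.
    matching = [f for f in yield_fact if f["county_id"] in keys]
    # Step 5: perform the calculation on those rows.
    return sum(f["yield_kg"] for f in matching)


print(total_yield_for_county("Nakuru"))
print(total_yield_for_county("Kiambu"))
```

There is exactly one hop between the slicer's table and the fact table, which is why the star layout is easy to reason about.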

Why the Star Schema Performs Better in Power BI

Power BI stores data in columns and is optimized for fast aggregation. It performs best when relationships are simple and unambiguous.

In a star schema, you will observe that:

Power BI follows one clear relationship path,
fewer joins are required to answer questions,
the model is easier to understand and debug.

As a result, reports load faster, slicers respond more smoothly and DAX formulas tend to be shorter and easier to reason about.

The Snowflake Schema: A bit more complex

A snowflake schema starts with the same idea as a star schema but splits descriptive information across multiple related tables.

For example, instead of storing all location details in a single County table, the data might be organized as:

a County table that stores county information,
a Region table that stores regional information,
a Country table that stores country information.

When a user selects a country, Power BI must follow a longer path before reaching the data:

Start at the Country table.
Move to the Region table.
Move to the County table.
Finally, reach the fact table.

Each additional step increases processing work for Power BI and increases the chance of errors if any relationship is incorrect.
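The longer path can be sketched in Python to show where the extra work comes from. As before, this is an illustration with made-up tables, keys, and values, not a description of Power BI's internals.

```python
# Hypothetical snowflaked location tables: Country -> Region -> County.
countries = [{"country_id": 1, "country_name": "Kenya"}]
regions = [
    {"region_id": 10, "region_name": "Rift Valley", "country_id": 1},
    {"region_id": 20, "region_name": "Central", "country_id": 1},
]
counties = [
    {"county_id": 100, "county_name": "Nakuru", "region_id": 10},
    {"county_id": 200, "county_name": "Kiambu", "region_id": 20},
]
yield_fact = [
    {"county_id": 100, "yield_kg": 1200},
    {"county_id": 200, "yield_kg": 900},
    {"county_id": 100, "yield_kg": 1500},
]


def total_yield_for_country(name):
    # Hop 1: start at the Country table.
    country_ids = {c["country_id"] for c in countries if c["country_name"] == name}
    # Hop 2: move to the Region table.
    region_ids = {r["region_id"] for r in regions if r["country_id"] in country_ids}
    # Hop 3: move to the County table.
    county_ids = {c["county_id"] for c in counties if c["region_id"] in region_ids}
    # Hop 4: finally reach the fact table.
    return sum(f["yield_kg"] for f in yield_fact if f["county_id"] in county_ids)


print(total_yield_for_country("Kenya"))
```

Every hop depends on the one before it. If a single region row pointed at the wrong country_id, its counties would quietly drop out of the total and no error would be raised.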

While snowflake schemas reduce duplicated data, they create challenges in Power BI: filters must travel through multiple tables and more relationships must be managed. As a result, it becomes harder to predict how calculations will behave.

For this reason, snowflake schemas are common in source systems but are often reshaped into star schemas for reporting.

Relationships: How Tables Actually Work Together

Relationships define how tables communicate and how filters flow.

When you select a county, crop, or year in a slicer, Power BI does not search the fact table directly. It looks at the dimension table, identifies the matching key, follows the relationship to the fact table, and filters the fact rows accordingly.

In a well-designed model:

each dimension table contains unique values (each crop or county appears once),
fact tables contain many related records linked to those values,
filters flow from dimension tables to the fact table.

This mirrors real-world logic: one county can have many crop records, and one crop can appear across many years.

Cardinality: Understanding “One” and “Many”

Cardinality describes how many rows in one table relate to rows in another.

One-to-Many means one row in a dimension table relates to many rows in the fact table.
One-to-One means one row matches exactly one row in another table (rare in reporting).
Many-to-Many means multiple rows relate to multiple rows (can cause duplicated totals if not handled carefully).

Note: Incorrect cardinality may still produce results, but those results may not represent reality.
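Here is a small Python sketch of how a key that is supposed to be unique, but is not, can inflate a total. It uses a plain row-by-row match rather than Power BI's actual relationship engine, and the tables and values are invented for the example.

```python
# Hypothetical fact rows: one yield record per county.
yield_fact = [
    {"county": "Nakuru", "yield_kg": 1000},
    {"county": "Kiambu", "yield_kg": 500},
]

# A lookup table that is supposed to have one row per county,
# but "Nakuru" appears twice (for example, after a careless append).
county_lookup = [
    {"county": "Nakuru", "region": "Rift Valley"},
    {"county": "Nakuru", "region": "Rift Valley"},
    {"county": "Kiambu", "region": "Central"},
]


def joined_total(facts, lookup):
    """Match every fact row to every lookup row with the same county, then sum."""
    total = 0
    for fact in facts:
        for row in lookup:
            if row["county"] == fact["county"]:
                total += fact["yield_kg"]
    return total


true_total = sum(f["yield_kg"] for f in yield_fact)
print(true_total)                                # total from the fact table alone
print(joined_total(yield_fact, county_lookup))   # total after matching on a non-unique key
```

The second number is larger than the first because the Nakuru record is counted once for each matching lookup row. Nothing crashes, which is exactly why this kind of problem is easy to miss.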

Why Good Data Modelling Matters

Data modelling affects every Power BI report in three key ways.

Performance

Simple structures reduce processing work, resulting in faster visuals and smoother interaction.

Accuracy

Correct relationships ensure each fact is counted once, preventing inflated totals and misleading averages.

Simplicity

Clear models make reports easier to build, understand, and maintain. Complex DAX is often a sign of a model that needs improvement.

Effective models typically:

separate measurements from descriptions,
use star schemas where possible,
define relationships clearly,
rely on the model to handle logic instead of forcing visuals to compensate.

When this foundation is solid, Power BI becomes easier to use and its results become easier to trust. Schemas and data modelling directly determine whether Power BI produces reliable insight or confusing results. By understanding fact and dimension tables, choosing appropriate schemas, and defining relationships carefully, analysts create reports that are fast, accurate, and understandable.

For more information, see Microsoft’s Power BI documentation.

Also, feel free to leave a comment sharing how you approach data modelling in your own Power BI projects. Discussion and different perspectives are always welcome.

Understanding Git Version Control, Push & Pull (A Beginner’s Guide)

Sharon M. — Sun, 18 Jan 2026 14:24:30 +0000

What is covered in this part

What version control, push, and pull mean
Version control (Git)
Push
Pull
The main stages of Git
A Practical Example (Part 4)

What Do Version Control, Push, and Pull Mean?

Version control (Git)

Version control is a system that records changes to your files over time.

In simple terms, it’s like having a save history for your project. With Git, you create checkpoints called commits. Each commit becomes part of your project history, allowing you to see what changed, know when it changed and go back to an earlier version if needed.
So if something breaks or you realize you preferred an older version, you can easily go back to it instead of starting over.

Here are some commands to keep in mind:

git status — Shows which files have changed and whether those changes have already been committed. Only committed changes can be pushed.
git add . — Tells Git to include all changed files in the next commit.
git commit -m "..." — Creates the checkpoint (commit) with a message.
git log — Displays a list of all commits. It displays who made the change, when, and the commit message.
git show or git show <commit-id> — Shows the actual lines that were added or removed. It is useful when you want to understand what changed.
git restore <filename> (e.g. git restore notes.txt) — Undoes uncommitted changes to a file by reverting it to the last saved version. It is useful if you made a mistake and want to undo it.
git checkout <commit-id> (e.g. git checkout a1b2c3d) — Lets you look at an older version of the project without deleting your work. It is useful when learning.
git checkout main — Lets you return to the latest version.

Push

Push means sending your saved commits from your computer (local repository) to GitHub (remote repository). When you push:

Your work and changes are uploaded to GitHub.
They are backed up online.
They become available for sharing or collaboration.

If you don’t push, your work remains only on your computer and isn’t visible or backed up on GitHub.

Here are some commands to keep in mind:

If it is your first time pushing a repository, you may need:
git push -u origin main (replace main with master if your branch is named master).
Use git branch (or git status) if you want to check which branch you are currently working on.
If your current branch has a different name and you want to rename it to main, use:
git branch -M main
To switch to an existing main branch instead, use git checkout main.

After your first push, you can just use git push.
You should also note that before pushing, changes must be saved as commits using git add and git commit. When you push, your local commits are uploaded to GitHub so they are stored online, backed up, and available for others to see.

Pull

Pull means downloading the latest commits from GitHub (remote repository) to your computer (local repository).
This is important when:

You work on multiple computers and need the same files everywhere.
You collaborate with others and want their latest changes, for example in group or team projects.
Updates were made directly on GitHub and you need them on your local machine.

Here are some commands to keep in mind:

git pull - Downloads the latest commits from GitHub and updates your local files.
git fetch - Downloads new commits from GitHub without changing your working files. It brings the information about what has changed into your local repository so you can review it. Unlike git pull, fetching does not automatically update your working files, which makes it useful when you want to review updates first or confirm that changes exist before applying them.
git merge - Combines the updates you fetched (with git fetch) into your current branch and files.

There are other commands, such as git pull --rebase, git restore, and git reset --hard, which will be covered later in the series.

What Actually Happens at Each Stage

When you are working with Git, your files move through a few simple stages. Once you understand these stages, many Git commands start to make much more sense.

Working Directory - Where changes live.

This is your project folder on your computer. Any time you open a file and make changes, those changes happen here first.
At this point, Git can detect that something has changed, but nothing has been saved yet.

Staging Area - How changes move.

When you run git add, you move specific changes into the staging area. This tells Git: “These are the changes I want to include in my next save.”
Don’t worry, you’re still in control: only the changes you add get staged.

Local Repository (Commits) - How Git stores history

When you run git commit, Git takes everything in the staging area and saves it as a permanent checkpoint in your local repository (that is, on your computer).
At this point, your work is safely saved on your computer, but it hasn’t been sent online yet.

Remote Repository (GitHub) - How local and remote copies relate

When you run git push, Git sends your local commits to GitHub.
You can now breathe because this is when your work is backed up online and becomes visible to others.

Pulling Updates from GitHub

When you run git pull, Git downloads the latest commits from GitHub and applies them to your local project.

The Git Workflow Cycle

After pulling, you’re back in the working directory, ready to continue editing files. From here, the same cycle repeats:

edit files → stage → commit → push → pull → repeat

Once you understand these stages, Git commands start to make a lot more sense.
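If it helps to see the stages as moving parts, here is a toy Python model of them. It is a deliberately simplified stand-in, not how Git is implemented, and it does not run any real Git commands. The ToyRepo class and its method names are invented for this sketch; they only mirror the commands described above.

```python
class ToyRepo:
    """A simplified stand-in for Git's stages; not how Git is implemented."""

    def __init__(self):
        self.working_dir = {}   # file name -> current contents
        self.staging = {}       # changes chosen for the next commit
        self.commits = []       # local history: list of (message, snapshot)
        self.remote = []        # commits that have reached "GitHub"

    def edit(self, name, text):
        # The change lives only in the project folder for now.
        self.working_dir[name] = text

    def add(self, name):
        # Like `git add <file>`: mark this change for the next commit.
        self.staging[name] = self.working_dir[name]

    def commit(self, message):
        # Like `git commit -m`: snapshot = previous snapshot + staged changes.
        previous = self.commits[-1][1] if self.commits else {}
        self.commits.append((message, {**previous, **self.staging}))
        self.staging = {}

    def push(self):
        # Like `git push`: upload commits the remote does not have yet.
        self.remote.extend(self.commits[len(self.remote):])


repo = ToyRepo()
repo.edit("notes.txt", "first draft")
repo.edit("todo.txt", "buy milk")
repo.add("notes.txt")                        # only notes.txt is staged
repo.commit("Add notes")
print(len(repo.commits), len(repo.remote))   # committed locally, not yet pushed
repo.push()
print(len(repo.commits), len(repo.remote))   # now the remote has the commit too
print("todo.txt" in repo.commits[-1][1])     # the unstaged file was not committed
```

Two things stand out in the model: a file that was edited but never added does not end up in the commit, and a commit does not reach the remote until it is pushed.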

A Practical Example: Tracking Changes and Pushing to GitHub and Pulling from GitHub (Pending)

Now that we understand what version control, push, and pull mean, the next step is a simple worked example showing how these commands work together in practice. That example is covered in Part 4 of the series, where we will also look into additional Git commands.

As always, feel free to comment below. I welcome feedback and discussion in the comments.

Connecting Git Bash to GitHub (SSH Key): A Beginner’s Guide

Sharon M. — Sun, 18 Jan 2026 08:29:11 +0000

What is covered in this part

Why connecting Git Bash to GitHub matters
How to connect Git Bash to GitHub using an SSH Key

Why connecting Git Bash to GitHub is important

In Part 1, we covered the basics of Git Bash and GitHub. Now, we’ll take the next step by connecting Git Bash to GitHub.

So why is connecting Git Bash to GitHub important? Connecting Git Bash to GitHub allows you to move your work from your computer to GitHub and back again. Without this connection, Git can only track changes on your own computer, which means your work stays local and can’t be shared or backed up online.

Once Git Bash is connected to GitHub, you can push your saved changes to GitHub, where your project is stored safely online.
You can also pull changes from GitHub, which is important when working on projects across multiple devices or with other people.

Yes, you read that right. Multiple people can work on the same project without overwriting each other’s work, because Git keeps a clear history of changes and GitHub shows who made each update and when.

Connecting Git Bash to GitHub

Step 1: Create a GitHub account

You can create your GitHub account here.

Step 2: Configure your Git identity (name and email)

Git needs to know who you are so that it can label your commits correctly.
Now open Git Bash and type in your terminal:
To set your name

git config --global user.name "yourname"

To set your email

git config --global user.email "youremail@example.com"

To display all Git settings that are currently configured on your computer

git config --global --list

You will see your user.name and user.email values displayed in the output.

Step 3: Generate a New SSH Key

You are probably wondering what an SSH key is.

Simply, an SSH key is a secure digital identity that allows Git Bash to connect to GitHub without typing your password every time.
And this is how we generate it:

ssh-keygen -t ed25519 -C "youremail@example.com"

When prompted:

Press Enter to accept the default file location.
Press Enter again to skip the passphrase (or add one; it is optional).

Example output:

Your public key has been saved in /c/Users/your-username/.ssh/id_ed25519.pub

You can also check whether you already have a key using:

ls ~/.ssh

Example output:

id_ed25519
id_ed25519.pub

Step 4: Start the SSH Agent

This will start a background helper that securely holds your key.

eval "$(ssh-agent -s)"

Example output:

Agent pid 293

Step 5: Add Your Key to the Agent

Essentially, this tells the SSH agent which private key it should use to authenticate you.

ssh-add ~/.ssh/id_ed25519

Example output:

Identity added: /c/Users/your-username/.ssh/id_ed25519

Step 6: Copy Your Public Key

At this point, you need to copy your public SSH key and add it to your GitHub account.

Method 1: Copy the key using Git Bash

Use this command to display your public key directly in the terminal so that you can copy it and add it to GitHub.

cat ~/.ssh/id_ed25519.pub

Example output:

ssh-ed25519 [YOUR_PUBLIC_KEY_HERE]

Method 2: Copy the key manually from the file (alternative)

You can also access the same public key manually using the file path that was generated in Step 3.

Navigate to the .ssh folder on your computer (for example: C:\Users\your-username\.ssh).
Open the .ssh folder by double-clicking it.
Inside the folder, locate the file named id_ed25519.pub.
Open this file using any text editor, such as Notepad or Visual Studio Code.
You will see a line of text that starts with ssh-ed25519 (refer to the example output above). This is your public key.
Copy the entire line.

After copying this public key, navigate to your GitHub account. Go to Settings → SSH and GPG keys, click New SSH key, paste the key, and save it.

Step 7: Test the connection

This command confirms whether Git Bash is now connected to GitHub.

ssh -T git@github.com

Example output:

Hi your-username! You've successfully authenticated...

👌Your GitHub and Git Bash are now connected.
Quick Tip: You only need to set up SSH once per computer and after this, GitHub authentication becomes seamless.

Feel free to engage with this post in the comment section. You can also check out Part 3 of the series.

How to Install Git Bash on Windows: Step-by-Step Guide for Beginners

Sharon M. — Sun, 18 Jan 2026 06:35:24 +0000

What You’ll Learn in This Part

Introduction
What Is Git Bash?
How to Download and Install Git Bash on Windows
How to verify your Git Bash installation

Introduction

Here is a little explanation to get us started:
GitHub and Git Bash are tools used to create, manage, and collaborate on digital projects—especially projects that involve code, data, or files that change over time.
These tools are among many others that are widely used by software developers, data analysts, data scientists, programmers, and even non-programmers. These tools provide a reliable way to track work, avoid losing progress, and collaborate with others.

What is Git Bash?

Git Bash is a tool you use to communicate with Git, the version control system that tracks changes in your files. It provides a simple terminal where you can:

Navigate folders
Create files
Record changes using clear, text-based commands.

GitHub, on the other hand, is an online platform where those recorded changes are stored, shared, and displayed. When you use Git Bash, you save your work locally as commits and then send them to GitHub, where your projects are safely stored online and can be accessed from anywhere.

Steps to download and set up Git Bash on Windows

Install Git on your PC. You can download it here.
Once the installer has downloaded, navigate to your Downloads folder on your PC (or click the download icon at the top of your browser, usually shown as a downward arrow). Locate and double-click the downloaded .exe file to run it. When you get the prompt “Do you want to allow this app to make changes to your device?”, click Yes.
Use the default settings throughout the installation process. Click Next until you reach Install, then click Install. After the installation is complete, you may choose to launch Git Bash, then click Finish.
After installing Git on Windows, if Git Bash does not launch automatically, press the Windows key on your keyboard, type Git Bash, and click Git Bash from the search results. You should see a terminal window with:

username@hostname MINGW64 /

How to verify your installation

In the Git Bash terminal, type:

git --version

If Git is installed correctly, you’ll see output similar to:

git version 2.x.x.windows.1

I hope this guide was helpful. Let me know in the comments how you set up your Git Bash to suit your needs. Next, we’ll look at how to connect Git Bash to GitHub, so head over to Part 2 of the series.