<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edmund Eryuba</title>
    <description>The latest articles on Forem by Edmund Eryuba (@edmund_eryuba).</description>
    <link>https://forem.com/edmund_eryuba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3709646%2F1933a2c9-5b22-43f1-afba-7b59f1701990.png</url>
      <title>Forem: Edmund Eryuba</title>
      <link>https://forem.com/edmund_eryuba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/edmund_eryuba"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Two Paradigms, One Goal</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:09:58 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/etl-vs-elt-two-paradigms-one-goal-12fc</link>
      <guid>https://forem.com/edmund_eryuba/etl-vs-elt-two-paradigms-one-goal-12fc</guid>
      <description>&lt;p&gt;What are the similarities, differences, benefits and use cases of ELT and ETL.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipelines at a glance
&lt;/h2&gt;

&lt;p&gt;ELT &lt;em&gt;(extract, load, transform)&lt;/em&gt; and ETL &lt;em&gt;(extract, transform, load)&lt;/em&gt; are both data integration processes that move raw data from a source system to a target database. The sources can be multiple disparate repositories or legacy systems whose data is transferred to a target location using ELT or ETL. Both approaches move data from source to destination, but where the &lt;em&gt;&lt;strong&gt;transformation&lt;/strong&gt;&lt;/em&gt; happens changes everything about cost, speed, flexibility, and the kind of analytics you can build.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ELT (extract, load, transform)?
&lt;/h2&gt;

&lt;p&gt;With ELT, unstructured data is extracted from a source system and loaded into the target system, to be transformed later as needed. The extracted data is immediately available to business intelligence systems, and no staging area is required. &lt;/p&gt;

&lt;p&gt;ELT leverages the data warehouse itself to perform basic transformations, such as data validation or removal of duplicate records. These transformations can run in near real time and handle large amounts of raw data. ELT is the newer of the two processes and its tooling has not yet matured to the level of its older sibling, ETL. Early ELT pipelines were built on hard-coded SQL scripts, which are more prone to coding errors than the more advanced methods used in ETL.&lt;/p&gt;
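&lt;p&gt;A minimal sketch of the ELT order of operations, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for a cloud warehouse (the table and column names are illustrative, not from any real pipeline):&lt;/p&gt;

```python
import sqlite3

# An in-memory SQLite database stands in for the cloud warehouse.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: raw records land in the destination untouched,
# duplicates and all.
warehouse.execute("CREATE TABLE raw_events (user_id INTEGER, action TEXT)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "login"), (1, "login"), (2, "purchase")],
)

# Transform: deduplication happens afterwards, inside the warehouse,
# using the warehouse's own SQL engine.
warehouse.execute(
    "CREATE TABLE clean_events AS SELECT DISTINCT user_id, action FROM raw_events"
)
rows = warehouse.execute("SELECT * FROM clean_events ORDER BY user_id").fetchall()
print(rows)  # [(1, 'login'), (2, 'purchase')]
```

&lt;p&gt;The point is that the raw rows reach the destination untouched; the cleanup is expressed in the destination's own SQL and can be rerun at any time.&lt;/p&gt;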

&lt;h2&gt;
  
  
  What is ETL (extract, transform, load)?
&lt;/h2&gt;

&lt;p&gt;With ETL, unstructured data is extracted from a source system, and specific data points and potential “keys” are identified before the data is loaded into the target system. &lt;/p&gt;

&lt;p&gt;In a traditional ETL scenario, the source data is extracted into a staging area, where it undergoes a transformation process that organizes and cleans all data types, and is then moved into the target system. This transformation makes the now-structured data compatible with the target data storage system. &lt;/p&gt;
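&lt;p&gt;A minimal sketch of the ETL ordering, with the cleaning done in a staging step before anything is loaded (the records and names are purely illustrative):&lt;/p&gt;

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
extracted = [
    {"user_id": "1", "action": " LOGIN "},
    {"user_id": "1", "action": " LOGIN "},
    {"user_id": "2", "action": "purchase"},
]

# Transform in staging: normalise types and text, then deduplicate,
# before anything touches the destination.
staged = {(int(r["user_id"]), r["action"].strip().lower()) for r in extracted}

# Load: only cleaned, structured rows ever reach the target system.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
target.executemany("INSERT INTO events VALUES (?, ?)", sorted(staged))
rows = target.execute("SELECT * FROM events").fetchall()
print(rows)  # [(1, 'login'), (2, 'purchase')]
```

&lt;p&gt;Note that the destination only ever sees the cleaned rows; if a cleaning rule later turns out to be wrong, the raw records must be re-extracted from the source.&lt;/p&gt;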

&lt;h3&gt;
  
  
  Where they share common ground
&lt;/h3&gt;

&lt;p&gt;Despite their differences, the two patterns share a substantial foundation; both solve the same fundamental problem of moving heterogeneous data into a single analytical store.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data integration:&lt;/strong&gt; Both consolidate data from multiple disparate sources into a unified destination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformation necessity:&lt;/strong&gt; Neither skips transformation; they only differ in when and where it happens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; Both require scheduling, dependency management, and error-handling tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality concerns:&lt;/strong&gt; Validation, deduplication, and schema enforcement are needed in both patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Lineage tracking, logging, and alerting are essential regardless of order.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where they diverge
&lt;/h3&gt;

&lt;p&gt;The key difference between ELT and ETL is the order of operations, which makes each process uniquely suited to different situations. They also differ in the data volumes and data types each can handle. &lt;/p&gt;

&lt;h4&gt;
  
  
  Performance and scalability
&lt;/h4&gt;

&lt;p&gt;ETL pipelines transform data on a dedicated server whose capacity is fixed; when data volumes spike, you hit a ceiling. ELT offloads transformation to cloud warehouses like BigQuery, Snowflake, or Redshift, which scale compute horizontally on demand. For organizations processing billions of rows, ELT's elastic compute model is a significant structural advantage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data freshness and raw access
&lt;/h4&gt;

&lt;p&gt;Because ELT loads raw data first, analysts retain access to the original source records. If a transformation rule is wrong, you can fix it and rerun without re-ingesting from the source. With ETL, if data was dropped or transformed before loading, it may be gone permanently, making reruns more expensive and re-extraction from source systems often necessary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compliance and sensitivity
&lt;/h4&gt;

&lt;p&gt;ETL's pre-load transformation gives security teams an easier lever: strip or mask personally identifiable information before it ever enters the warehouse. ELT stores raw, potentially sensitive data in the destination system, demanding robust row-level security, column masking, and access policies inside the warehouse itself. It is manageable, but a larger governance surface area.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of ELT and ETL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ELT Strengths
&lt;/h3&gt;

&lt;p&gt;The ELT approach enables faster implementation than the ETL process, though the data is messy when it first lands. Because transformation occurs after the load step, it cannot slow the migration itself down. Decoupling the transformation and load stages also means a coding error (or other failure in the transformation stage) does not halt the migration effort. Additionally, ELT avoids server scaling issues by using the processing power and capacity of the data warehouse to perform transformation at scale. ELT also works with cloud data warehouse solutions that support structured, unstructured, semi-structured and raw data types.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Strengths
&lt;/h3&gt;

&lt;p&gt;ETL takes longer to implement but results in cleaner data. The process is well suited to smaller target repositories that require less frequent updating, and it works with both cloud-based SaaS platforms and on-site data warehouses.&lt;br&gt;
There are also many open-source and commercial ETL tools whose capabilities and benefits include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comprehensive automation and ease-of-use functions that can automate the entire data flow and make recommendations on rules for the extract, transform and load process.&lt;/li&gt;
&lt;li&gt;A visual drag-and-drop interface used for specifying rules and data flows.&lt;/li&gt;
&lt;li&gt;Support for complex data management to assist with complex calculations, data integrations and string manipulation.&lt;/li&gt;
&lt;li&gt;Security and compliance features that encrypt sensitive data and are certified compliant with industry or government regulations, providing a secure way to encrypt, remove or mask specific data fields to protect clients’ privacy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ELT
&lt;/h3&gt;

&lt;p&gt;An ELT process is best used in high-volume data sets or real-time data use environments. Specific examples include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-volume environments:&lt;/strong&gt; Meteorological systems like weather services collect, collate and use large amounts of data on a regular basis. Businesses with large transaction volumes also fall into this category. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native data platforms:&lt;/strong&gt; Scalable warehouses such as Snowflake, Databricks, and BigQuery leverage microservices, containerization, and distributed storage, enabling modern and flexible analytics architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time ingestion systems:&lt;/strong&gt; Stock exchanges generate and use large amounts of data in real-time, where delays can be harmful. Additionally, large-scale distributors of materials and components need real-time access to current data for business intelligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ETL
&lt;/h3&gt;

&lt;p&gt;ETL is best used for synchronizing several data use environments and migrating data from legacy systems. The following are some specific examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need for data synchronization from several sources:&lt;/strong&gt; Companies that are merging may have many consumers, suppliers and partners in common, with the data stored in separate repositories and formatted differently. ETL transforms the data into a unified format before loading it into the target data location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updating and migrating data from legacy systems:&lt;/strong&gt; Legacy systems require the ETL process to transform their data into a format compatible with the structure of the new target database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The verdict
&lt;/h2&gt;

&lt;p&gt;ETL predates the cloud era and remains the right choice when the destination system cannot bear transformation workloads, or when sensitive data must be sanitized before storage. ELT has become the dominant pattern for modern data teams precisely because cloud warehouses turned transformation into a solved compute problem: cheap, fast, and version-controlled through SQL.&lt;/p&gt;

&lt;p&gt;In practice, many mature data platforms run both: ELT for the bulk of analytical pipelines, and targeted ETL steps where compliance or system constraints demand it. Understanding the tradeoffs of each and not treating one as universally superior is what separates thoughtful data architecture from following a trend.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>database</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Connecting Power BI to a SQL Database (PostgreSQL)</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Fri, 13 Mar 2026 17:16:20 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/connecting-power-bi-to-a-sql-database-postgresql-4lk1</link>
      <guid>https://forem.com/edmund_eryuba/connecting-power-bi-to-a-sql-database-postgresql-4lk1</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Power Query is a data transformation and data preparation engine. Power Query comes with a graphical interface for getting data from sources and a Power Query editor for applying transformations. Because the engine is available in many products and services, the destination where the data is stored depends on where Power Query is used. Using Power Query, you can perform the extract, transform, and load (ETL) processing of data.&lt;/p&gt;

&lt;p&gt;Microsoft Power BI is a comprehensive business intelligence platform that uses Power Query to ingest data, then adds modelling, visualization, and sharing capabilities. Power Query preps the data, while Power BI turns it into reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4zfdiwey7xt5bajqhav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4zfdiwey7xt5bajqhav.png" alt="Power Query" width="739" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;Maintaining dynamic database systems is critical in today’s digital landscape, especially considering the rate at which newer technologies emerge. PostgreSQL is &lt;strong&gt;expandable&lt;/strong&gt; and &lt;strong&gt;versatile&lt;/strong&gt;, so it can quickly support a variety of specialized use cases through a powerful extension ecosystem that covers everything from time-series data types to geospatial analytics.&lt;/p&gt;

&lt;p&gt;Its versatile and approachable design makes PostgreSQL a “&lt;em&gt;one-size-fits-all&lt;/em&gt;” solution for many enterprises looking for cost-effective and efficient ways to improve their database management systems.&lt;/p&gt;

&lt;p&gt;Built as an &lt;strong&gt;open-source&lt;/strong&gt; database solution, PostgreSQL is free from licensing restrictions, vendor lock-in, and the risk of over-deployment. &lt;/p&gt;

&lt;p&gt;Expert developers and commercial enterprises who understand the limitations of traditional database systems heavily support PostgreSQL. They work diligently to provide a battle-tested, best-of-breed relational database management system.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Power Query helps with data acquisition
&lt;/h3&gt;

&lt;p&gt;Business users reportedly spend up to 80 percent of their time on data preparation, which delays analysis and decision-making. Several challenges contribute to this situation, and Power Query helps address many of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enables &lt;strong&gt;wide-ranging connectivity&lt;/strong&gt; to data sources, including data of all sizes and shapes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; of experience, and parity of query capabilities over all data sources.&lt;/li&gt;
&lt;li&gt;Highly &lt;strong&gt;interactive&lt;/strong&gt; and &lt;strong&gt;intuitive&lt;/strong&gt; experience for rapidly and iteratively building queries over any data source, of any size.&lt;/li&gt;
&lt;li&gt;When using Power Query to access and transform data, you define a &lt;strong&gt;repeatable process&lt;/strong&gt; (query) that can be easily refreshed in the future to get up-to-date data. &lt;/li&gt;
&lt;li&gt;Power Query offers the ability to &lt;strong&gt;work against a subset of the entire data set&lt;/strong&gt; to define the required data transformations, allowing you to easily filter down and transform your data to a manageable size.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Power Query experiences
&lt;/h3&gt;

&lt;p&gt;The Power Query user experience is provided through the Power Query editor user interface. The goal of this interface is to help you apply the transformations you need simply by interacting with a user-friendly set of ribbons, menus, buttons, and other interactive components.&lt;/p&gt;

&lt;p&gt;The Power Query editor is the primary data preparation experience. In the editor, you can connect to a wide range of data sources and apply hundreds of different data transformations by previewing data and selecting transformations from the UI. These data transformation capabilities are common across all data sources, whatever the underlying data source limitations.&lt;/p&gt;

&lt;p&gt;When you create a new transformation step by interacting with the components of the Power Query interface, Power Query automatically creates the M code required to do the transformation so you don't need to write any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Power BI to a Local PostgreSQL Database
&lt;/h2&gt;

&lt;p&gt;Connecting Microsoft Power BI to an SQL database allows you to import data and build dashboards directly from your database tables. The exact steps depend slightly on the SQL engine (e.g., PostgreSQL, MySQL, or Microsoft SQL Server), but the workflow is mostly the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Get data in Power BI Desktop
&lt;/h3&gt;

&lt;p&gt;Launch Power BI Desktop. On the &lt;em&gt;Home screen&lt;/em&gt; you will see options for selecting a data source or starting with a blank report. Click the &lt;strong&gt;Blank report&lt;/strong&gt; option to be directed to the &lt;strong&gt;Home&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fyalt0t18usb0p1k8sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fyalt0t18usb0p1k8sx.png" alt="Blank Report" width="765" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Power BI Desktop, you can directly select an Excel worksheet, a Power BI semantic model, a SQL Server database, direct data entry, Dataverse data or a recently used data source without using the &lt;strong&gt;Get data&lt;/strong&gt; option. &lt;/p&gt;

&lt;p&gt;From the &lt;strong&gt;Data&lt;/strong&gt; ribbon, selecting &lt;strong&gt;Get Data&lt;/strong&gt; provides additional methods to select the desired connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhatzo8txhe55e529p5y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhatzo8txhe55e529p5y6.png" alt="Get" width="453" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the &lt;strong&gt;More...&lt;/strong&gt; option, which opens a &lt;strong&gt;Get Data&lt;/strong&gt; window containing a complete list of available connectors. &lt;/p&gt;

&lt;h3&gt;
  
  
  2. Connection settings
&lt;/h3&gt;

&lt;p&gt;After clicking &lt;strong&gt;Get Data&lt;/strong&gt;: choose &lt;strong&gt;Database&lt;/strong&gt; &amp;gt; select &lt;strong&gt;PostgreSQL database&lt;/strong&gt; &amp;gt; click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Power BI includes a built-in PostgreSQL connector that allows direct communication with PostgreSQL servers.&lt;/p&gt;

&lt;p&gt;Scroll to &lt;strong&gt;PostgreSQL database&lt;/strong&gt;, select the option and click &lt;strong&gt;Connect&lt;/strong&gt; to close the window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqte3w9kyujju6ck8fm25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqte3w9kyujju6ck8fm25.png" alt="Get Data" width="800" height="772"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;PostgreSQL database&lt;/strong&gt; dialog that appears, provide the name of the server and database.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsclxvfg3weru0k1tczj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsclxvfg3weru0k1tczj2.png" alt="PgSQL" width="593" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select either the &lt;strong&gt;Import&lt;/strong&gt; or &lt;strong&gt;DirectQuery&lt;/strong&gt; data connectivity mode: use Import for snapshots or DirectQuery for live data.&lt;/p&gt;

&lt;p&gt;Power BI allows you to optionally include a SQL query in the &lt;strong&gt;Advanced Options&lt;/strong&gt; section if you want to retrieve only specific data from the database.&lt;/p&gt;
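&lt;p&gt;For example, a query like the one below (the &lt;code&gt;orders&lt;/code&gt; table and its columns are hypothetical) could be pasted into &lt;strong&gt;Advanced Options&lt;/strong&gt; so that only the matching rows are imported. Here it is exercised against an in-memory SQLite database, purely to show its effect:&lt;/p&gt;

```python
import sqlite3

# Stand-in database; in Power BI the SELECT below would go into the
# Advanced Options box of the PostgreSQL connector dialog instead.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 120.0), (2, "east", 75.5), (3, "west", 30.0)],
)

# Only the filtered subset is retrieved, instead of the whole table.
query = "SELECT order_id, amount FROM orders WHERE region = 'west'"
rows = db.execute(query).fetchall()
print(rows)  # [(1, 120.0), (3, 30.0)]
```

&lt;p&gt;Filtering at the source like this keeps the imported model small, which matters for large tables.&lt;/p&gt;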

&lt;h3&gt;
  
  
  3. Authentication Credentials
&lt;/h3&gt;

&lt;p&gt;If you're connecting to this database for the first time, select the authentication type you want to use, and then enter your credentials. The authentication types available are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database (Username and password)&lt;/li&gt;
&lt;li&gt;Microsoft account (Microsoft Entra ID)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx20kpd0fdc07tyb0i9io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx20kpd0fdc07tyb0i9io.png" alt="Authenticate" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These credentials must match the PostgreSQL database user account.&lt;/p&gt;

&lt;p&gt;Once authenticated, Power BI establishes a connection to the PostgreSQL server and retrieves metadata about the available tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data preview
&lt;/h3&gt;

&lt;p&gt;The goal of the data preview stage is to provide you with a user-friendly way to preview and select your data. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Navigator&lt;/strong&gt; lists all tables in the database, allowing you to preview each table and select one or more of them. Either select &lt;strong&gt;Load&lt;/strong&gt; to load the data directly or &lt;strong&gt;Transform Data&lt;/strong&gt; to continue shaping it in the Power Query editor. &lt;/p&gt;

&lt;p&gt;The data preview pane on the right side of the window shows a preview of the data from the object you selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Power BI to a Cloud PostgreSQL Database (Aiven)
&lt;/h2&gt;

&lt;p&gt;Organizations often host databases in the cloud rather than on local machines. One example is Aiven, which provides managed PostgreSQL services.&lt;/p&gt;

&lt;p&gt;Connecting Power BI to a cloud PostgreSQL database is similar to connecting to a local database, but additional security settings are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Obtain Connection Details from Aiven
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Get connection details:&lt;/strong&gt; Log in to your Aiven web console and navigate to your PostgreSQL service to find the host, port, database name, and username/password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6gy7sz8agsxblvc2f8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6gy7sz8agsxblvc2f8j.png" alt="Aiven" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Download the SSL Certificate
&lt;/h3&gt;

&lt;p&gt;Most cloud database providers require encrypted connections for security reasons. Aiven provides an SSL certificate that ensures secure communication between Power BI and the database.&lt;/p&gt;

&lt;p&gt;From the Aiven console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to Connection Information&lt;/li&gt;
&lt;li&gt;Download the CA Certificate&lt;/li&gt;
&lt;li&gt;Save the certificate file on your computer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SSL encryption ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data transmitted over the internet cannot be intercepted&lt;/li&gt;
&lt;li&gt;Authentication of the database server&lt;/li&gt;
&lt;li&gt;Secure communication between the client and server&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Connect Using Power BI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Open Power BI:&lt;/strong&gt; Launch Power BI Desktop, click &lt;strong&gt;Get Data&lt;/strong&gt; on the Home ribbon, and select &lt;strong&gt;More...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmmf8vhnw56s7gdmfnkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmmf8vhnw56s7gdmfnkh.png" alt="More" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select connector:&lt;/strong&gt; Choose Database &amp;gt; PostgreSQL database and click Connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe8xf06a2hq2bxr8dhwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe8xf06a2hq2bxr8dhwi.png" alt="Datag" width="707" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enter Credentials:&lt;/strong&gt; Provide the server and database name, your Aiven username and password, and enable SSL if required. Power BI will use the certificate to verify the database server and establish a secure connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdzvk4ffv6wi9agw0ew2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdzvk4ffv6wi9agw0ew2.png" alt="pgsql db" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading Data and Creating Relationships
&lt;/h3&gt;

&lt;p&gt;After connecting to the database, the tables are loaded into Power BI’s data model.&lt;/p&gt;

&lt;p&gt;Power BI automatically detects relationships based on matching column names such as &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;product_id&lt;/code&gt;. &lt;br&gt;
However, relationships can also be created manually.&lt;/p&gt;

&lt;p&gt;These relationships form a data model, which defines how tables interact with each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Power BI is a powerful business intelligence platform that allows organizations to transform raw data into meaningful insights. By connecting Power BI to SQL databases such as PostgreSQL, companies can access large datasets, build interactive dashboards and make data-driven decisions.&lt;/p&gt;

&lt;p&gt;Connecting to a local PostgreSQL database involves selecting the PostgreSQL connector in Power BI, entering the server and database details, authenticating with credentials, and loading tables into the Power BI model. When connecting to cloud databases such as those hosted on Aiven, additional security measures such as SSL certificates ensure that the connection is encrypted and secure.&lt;/p&gt;

&lt;p&gt;Once the data is loaded, Power BI allows analysts to create relationships between tables, forming a structured data model that supports accurate analysis. Strong SQL skills further enhance a Power BI analyst’s ability to retrieve, filter, aggregate, and prepare data efficiently before building reports.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Designing Efficient Queries with SQL Joins and Window Functions</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Mon, 02 Mar 2026 11:08:12 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/designing-efficient-queries-with-sql-joins-and-window-functions-447k</link>
      <guid>https://forem.com/edmund_eryuba/designing-efficient-queries-with-sql-joins-and-window-functions-447k</guid>
      <description>&lt;p&gt;SQL(&lt;em&gt;Structured Query Language&lt;/em&gt;) is a powerful tool to search through large amounts of data and return specific information for analysis. Learning SQL is crucial for anyone aspiring to be a data analyst, data engineer, or data scientist, and helpful in many other fields such as web development or marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Joins
&lt;/h2&gt;

&lt;p&gt;Joins in SQL are commands used to combine rows from two or more tables based on a related column between those tables. They are predominantly used to extract data from tables that have one-to-many or many-to-many relationships between them.&lt;/p&gt;

&lt;p&gt;There are four main types of join that you need to understand. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(INNER) JOIN&lt;/li&gt;
&lt;li&gt;LEFT (OUTER) JOIN&lt;/li&gt;
&lt;li&gt;RIGHT (OUTER) JOIN&lt;/li&gt;
&lt;li&gt;FULL (OUTER) JOIN&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  INNER JOIN
&lt;/h3&gt;

&lt;p&gt;INNER JOIN is used to retrieve rows where matching values exist in both tables. It helps in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combining records based on a related column.&lt;/li&gt;
&lt;li&gt;Returning only matching rows from both tables.&lt;/li&gt;
&lt;li&gt;Excluding non-matching data from the result set.&lt;/li&gt;
&lt;li&gt;Ensuring accurate data relationships between tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Syntax:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu2197m57t1949wpkufx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu2197m57t1949wpkufx.png" alt="INNER Join" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  LEFT JOIN
&lt;/h3&gt;

&lt;p&gt;LEFT JOIN is used to retrieve all rows from the left table and matching rows from the right table. It helps in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returning all records from the left table.&lt;/li&gt;
&lt;li&gt;Showing matching data from the right table.&lt;/li&gt;
&lt;li&gt;Displaying NULL values where no match exists in the right table.&lt;/li&gt;
&lt;li&gt;Performing outer joins, also known as LEFT OUTER JOIN.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Syntax:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table
LEFT JOIN right_table
ON left_table.id = right_table.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez41dby5s8essarrtq8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez41dby5s8essarrtq8k.png" alt="Left Join" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RIGHT JOIN
&lt;/h3&gt;

&lt;p&gt;RIGHT JOIN is used to retrieve all rows from the right table and the matching rows from the left table. It helps in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returning all records from the right-side table.&lt;/li&gt;
&lt;li&gt;Showing matching data from the left-side table.&lt;/li&gt;
&lt;li&gt;Displaying NULL values where no match exists in the left table.&lt;/li&gt;
&lt;li&gt;Performing outer joins, also known as RIGHT OUTER JOIN.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Syntax:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table 
RIGHT JOIN right_table
ON left_table.id = right_table.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkya3bsmfp6lp4mj6pnvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkya3bsmfp6lp4mj6pnvk.png" alt="Right Join" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FULL JOIN
&lt;/h3&gt;

&lt;p&gt;FULL JOIN is used to combine the results of both LEFT JOIN and RIGHT JOIN. It helps in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returning all rows from both tables.&lt;/li&gt;
&lt;li&gt;Showing matching records from each table.&lt;/li&gt;
&lt;li&gt;Displaying NULL values where no match exists in either table.&lt;/li&gt;
&lt;li&gt;Providing complete data from both sides of the join.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Syntax:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table 
FULL JOIN right_table
ON left_table.id = right_table.id; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0k8dorr2pydcdwn4djb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0k8dorr2pydcdwn4djb.png" alt="Full Join" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
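&lt;p&gt;Not every engine implements FULL JOIN directly (MySQL lacks it, and SQLite only added it in version 3.39). A common portable emulation is a LEFT JOIN unioned with the right-side rows that found no match. A minimal sketch, with invented table names and data:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: emulate FULL JOIN as LEFT JOIN plus the right-only rows.
# Table/column names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_table  (id INTEGER, left_val TEXT);
CREATE TABLE right_table (id INTEGER, right_val TEXT);
INSERT INTO left_table  VALUES (1, 'a'), (2, 'b');
INSERT INTO right_table VALUES (2, 'y'), (3, 'z');
""")
rows = conn.execute("""
SELECT l.id, l.left_val, r.right_val
FROM left_table l LEFT JOIN right_table r ON l.id = r.id
UNION ALL
SELECT r.id, NULL, r.right_val          -- right rows with no left match
FROM right_table r LEFT JOIN left_table l ON l.id = r.id
WHERE l.id IS NULL
ORDER BY 1;
""").fetchall()
print(rows)  # [(1, 'a', None), (2, 'b', 'y'), (3, None, 'z')]
```

&lt;p&gt;The result contains all rows from both sides, with NULL filling the side that has no match, exactly as a native FULL JOIN would.&lt;/p&gt;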

&lt;h3&gt;
  
  
  Core Insights
&lt;/h3&gt;

&lt;p&gt;SQL joins are fundamental for relational data modeling, enabling the combination of rows from multiple tables based on defined relationships, typically via primary and foreign keys. &lt;/p&gt;

&lt;p&gt;Proper join selection directly affects result cardinality, null propagation, and business logic interpretation. Performance considerations include indexing join columns, minimizing unnecessary joins and understanding join order in execution plans. &lt;/p&gt;

&lt;p&gt;Key takeaways are that joins operationalize relational integrity, drive multi-table analytics and must be designed carefully to avoid duplication, unintended filtering or performance degradation especially in high-volume transactional or analytical databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;A window function in SQL is a type of function that performs a calculation across a specific set of rows (the 'window' in question), defined by an &lt;code&gt;OVER()&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;Window functions use values from one or multiple rows to return a value for each row, which makes them different from traditional aggregate functions, which return a single value for multiple rows.&lt;/p&gt;

&lt;p&gt;Like an aggregate function used with &lt;code&gt;GROUP BY&lt;/code&gt;, a window function performs calculations across multiple rows. Unlike aggregate functions, however, a window function does not collapse those rows into one single row.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funh439xdkbv5ut6xcww5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funh439xdkbv5ut6xcww5.png" alt="Window Functions" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key components of SQL window functions
&lt;/h3&gt;

&lt;p&gt;The syntax for window functions is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_1, column_2, column_3, function()
OVER (PARTITION BY partition_expression ORDER BY order_expression) AS output_column_name
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this syntax:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;SELECT&lt;/code&gt; clause defines the columns you want to select from the &lt;code&gt;table_name&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;function()&lt;/code&gt; is the window function you want to use.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;OVER&lt;/code&gt; clause defines the partitioning and ordering of rows in the window.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;PARTITION BY&lt;/code&gt; clause divides rows into partitions based on the specified &lt;code&gt;partition_expression&lt;/code&gt;; if not specified, the result set will be treated as a single partition.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ORDER BY&lt;/code&gt; clause uses the specified &lt;code&gt;order_expression&lt;/code&gt; to define the order in which rows will be processed within each partition; if not specified, rows will be processed in an undefined order.&lt;/li&gt;
&lt;li&gt;Finally, &lt;code&gt;output_column_name&lt;/code&gt; is the name of your output column.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the key SQL window function components. One more thing worth noting: window functions are evaluated after the &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, and &lt;code&gt;HAVING&lt;/code&gt; clauses have been processed. As a result, you can reference a window function in the &lt;code&gt;ORDER BY&lt;/code&gt; clause, but not in &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, or &lt;code&gt;HAVING&lt;/code&gt;; to filter on a window function's result, wrap the query in a subquery or CTE.&lt;/p&gt;
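&lt;p&gt;This evaluation order matters in practice: a "top earner per department" query cannot filter on a window function directly and must wrap it in a CTE or subquery. A minimal sketch, with an invented &lt;code&gt;employees&lt;/code&gt; table, run against SQLite:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: because window functions run after WHERE, filtering on
# one requires a subquery or CTE. The employees data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Ada', 'eng', 120), ('Bo', 'eng', 100), ('Cy', 'sales', 90);
""")
top_per_dept = conn.execute("""
WITH ranked AS (
  SELECT name, dept,
         ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT name, dept FROM ranked WHERE rn = 1 ORDER BY dept;
""").fetchall()
print(top_per_dept)  # [('Ada', 'eng'), ('Cy', 'sales')]
```

&lt;p&gt;Writing &lt;code&gt;WHERE rn = 1&lt;/code&gt; directly in the inner query would be rejected, because &lt;code&gt;WHERE&lt;/code&gt; runs before the window function is computed.&lt;/p&gt;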

&lt;h3&gt;
  
  
  The OVER() clause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;OVER()&lt;/code&gt; clause in SQL is essentially the core of window functions. It determines the partitioning and ordering of a rowset before the associated window function is applied. &lt;br&gt;
The &lt;code&gt;OVER()&lt;/code&gt; clause can be applied with functions to compute aggregated values such as moving averages, running totals, cumulative aggregates, or top N per group results.&lt;/p&gt;
&lt;h3&gt;
  
  
  The PARTITION BY clause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PARTITION BY&lt;/code&gt; clause is used to partition the rows of a table into groups. This comes in handy when dealing with large datasets that need to be split into smaller parts, which are easier to manage. &lt;br&gt;
&lt;code&gt;PARTITION BY&lt;/code&gt; is always used inside the &lt;code&gt;OVER()&lt;/code&gt; clause; if it is omitted, the entire table is treated as a single partition.&lt;/p&gt;
&lt;h3&gt;
  
  
  The ORDER BY clause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ORDER BY&lt;/code&gt; clause determines the order of rows within a partition; if it is omitted, the order is undefined. &lt;br&gt;
For instance, with ranking functions, &lt;code&gt;ORDER BY&lt;/code&gt; specifies the order in which ranks are assigned to rows.&lt;/p&gt;
&lt;h3&gt;
  
  
  Frame Specification
&lt;/h3&gt;

&lt;p&gt;In the same &lt;code&gt;OVER()&lt;/code&gt; clause, you can specify the upper and lower bounds of a window frame using one of the two subclauses, &lt;code&gt;ROWS&lt;/code&gt; or &lt;code&gt;RANGE&lt;/code&gt;. The basic syntax for both of these subclauses is essentially the same:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ROWS BETWEEN lower_bound AND upper_bound&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RANGE BETWEEN lower_bound AND upper_bound&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And in some cases, they might even return the same result. However, there's an important difference.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;ROWS&lt;/code&gt; subclause, the frame is defined by beginning and ending row positions. Offsets are differences in row numbers from the current row number.&lt;/p&gt;

&lt;p&gt;As opposed to that, in the &lt;code&gt;RANGE&lt;/code&gt; subclause, the frame is defined by a value range. Offsets are differences in row values from the current row value.&lt;/p&gt;
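&lt;p&gt;The difference only shows up when the window's &lt;code&gt;ORDER BY&lt;/code&gt; column contains ties. A minimal sketch with invented data, run against SQLite: the two &lt;code&gt;ord = 2&lt;/code&gt; rows are peers, so the &lt;code&gt;RANGE&lt;/code&gt; frame includes both of them for either row, while &lt;code&gt;ROWS&lt;/code&gt; stops at the current physical row:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: with duplicate ORDER BY values (the two ord=2 rows are
# "peers"), RANGE extends the frame to all peers; ROWS stops at the
# current physical row. The data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ord INTEGER, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 20), (2, 30), (3, 40)])
rows = conn.execute("""
SELECT ord, val,
  SUM(val) OVER (ORDER BY ord
    ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
  SUM(val) OVER (ORDER BY ord
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM t ORDER BY ord, val;
""").fetchall()
# RANGE gives both ord=2 peers the same running total (60); with ROWS the
# tied rows' totals depend on their physical order within the tie.
range_sums = [r[3] for r in rows]
print(range_sums)  # [10, 60, 60, 100]
```

&lt;p&gt;Note that the &lt;code&gt;ROWS&lt;/code&gt; running sum is not deterministic for tied rows, which is itself a useful reminder to order windows by a unique key when exact row positions matter.&lt;/p&gt;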
&lt;h2&gt;
  
  
  Types of SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;Window functions are commonly divided into three main types: aggregate, ranking, and value (offset) functions. Let's have a brief overview of each. (The function names below follow SQL Server's naming; most other engines provide close equivalents.)&lt;/p&gt;
&lt;h3&gt;
  
  
  Aggregate Window Functions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AVG()&lt;/code&gt;: returns the average of the values in a group, ignoring null values.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX()&lt;/code&gt;: returns the maximum value in the expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN()&lt;/code&gt;: returns the minimum value in the expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM()&lt;/code&gt;: returns the sum of all the values, or only the DISTINCT values, in the expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COUNT()&lt;/code&gt;: returns the number of items found in a group.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;STDEV()&lt;/code&gt;: returns the statistical standard deviation of all values in the specified expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;STDEVP()&lt;/code&gt;: returns the statistical standard deviation for the population for all values in the specified expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VAR()&lt;/code&gt;: returns the statistical variance of all values in the specified expression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VARP()&lt;/code&gt;: returns the statistical variance for the population for all values in the specified expression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sample query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary,
  SUM(salary) OVER (PARTITION BY dept) AS dept_total,
  AVG(salary) OVER (PARTITION BY dept) AS dept_avg
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ranking Window Functions
&lt;/h3&gt;

&lt;p&gt;Used to assign rank or position within partitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER()&lt;/code&gt;: assigns a unique sequential integer to rows within a partition of a result set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RANK()&lt;/code&gt;: assigns a unique rank to each row within a partition with gaps in the ranking sequence when there are ties.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DENSE_RANK()&lt;/code&gt;: assigns a unique rank to each row within a partition without gaps in the ranking sequence when there are ties.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PERCENT_RANK()&lt;/code&gt;: calculates the relative rank of a row within a group of rows.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NTILE()&lt;/code&gt;: distributes rows in an ordered partition into a specified number of approximately equal groups.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sample query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
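&lt;p&gt;The practical difference between the ranking functions is how they treat ties. A minimal sketch with an invented &lt;code&gt;scores&lt;/code&gt; table, run against SQLite, where 'b' and 'c' tie at 40:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: RANK vs DENSE_RANK on a tie. The scores data is invented;
# 'b' and 'c' tie at 40.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [('a', 50), ('b', 40), ('c', 40), ('d', 30)])
rows = conn.execute("""
SELECT name, score,
  RANK()       OVER (ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores ORDER BY score DESC, name;
""").fetchall()
# RANK leaves a gap after the tie; DENSE_RANK does not.
print([r[2] for r in rows])  # [1, 2, 2, 4]
print([r[3] for r in rows])  # [1, 2, 2, 3]
```

&lt;p&gt;&lt;code&gt;ROW_NUMBER()&lt;/code&gt; would instead assign 1 through 4, breaking the tie arbitrarily unless the window's &lt;code&gt;ORDER BY&lt;/code&gt; includes a unique tiebreaker column.&lt;/p&gt;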



&lt;h3&gt;
  
  
Offset (Value) Window Functions
&lt;/h3&gt;

&lt;p&gt;Used to access data from other rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LAG()&lt;/code&gt;: retrieves values from rows that precede the current row in the result set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LEAD()&lt;/code&gt;: retrieves values from rows that follow the current row in the result set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FIRST_VALUE()&lt;/code&gt;: returns the first value in an ordered set of values within a partition.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LAST_VALUE()&lt;/code&gt;: returns the last value in an ordered set of values within a partition.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NTH_VALUE()&lt;/code&gt;: returns the value of the nth row in the ordered set of values.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CUME_DIST()&lt;/code&gt;: returns the cumulative distribution of a value in a group of values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sample Query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT date, revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_month,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) AS change
FROM monthly_sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
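&lt;p&gt;One classic pitfall with &lt;code&gt;LAST_VALUE()&lt;/code&gt; is worth a runnable sketch: when the window has an &lt;code&gt;ORDER BY&lt;/code&gt;, the default frame ends at the current row (and its peers), so &lt;code&gt;LAST_VALUE()&lt;/code&gt; simply returns the current row's value. Extending the frame to &lt;code&gt;UNBOUNDED FOLLOWING&lt;/code&gt; gives the partition-wide last value. The table and data below are invented:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: the default frame with ORDER BY stops at the current row,
# so LAST_VALUE returns the current value; an explicit frame spanning the
# whole partition returns the true last value. Data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (d INTEGER, v INTEGER)")
conn.executemany("INSERT INTO s VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
rows = conn.execute("""
SELECT d,
  LAST_VALUE(v) OVER (ORDER BY d) AS default_frame,
  LAST_VALUE(v) OVER (ORDER BY d
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
FROM s ORDER BY d;
""").fetchall()
print(rows)  # [(1, 10, 30), (2, 20, 30), (3, 30, 30)]
```

&lt;p&gt;The same frame consideration applies to &lt;code&gt;NTH_VALUE()&lt;/code&gt;; &lt;code&gt;FIRST_VALUE()&lt;/code&gt; is unaffected because the default frame always starts at the beginning of the partition.&lt;/p&gt;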



&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;SQL window functions provide a powerful analytical layer within standard SQL, enabling complex calculations across related rows while preserving row-level granularity. Unlike &lt;code&gt;GROUP BY&lt;/code&gt;, they do not collapse result sets, which makes them ideal for scenarios requiring both detail and aggregate insight in the same query. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;OVER()&lt;/code&gt; clause is central, with &lt;code&gt;PARTITION BY&lt;/code&gt; defining logical groups, &lt;code&gt;ORDER BY&lt;/code&gt; controlling calculation sequence, and optional frame specifications (&lt;code&gt;ROWS&lt;/code&gt; or &lt;code&gt;RANGE&lt;/code&gt;) refining scope. &lt;/p&gt;

&lt;p&gt;Key functional categories include aggregate window functions for running totals and moving averages, ranking functions such as &lt;code&gt;ROW_NUMBER()&lt;/code&gt; and &lt;code&gt;RANK()&lt;/code&gt; for ordered comparisons and offset functions like &lt;code&gt;LAG()&lt;/code&gt; and &lt;code&gt;LEAD()&lt;/code&gt; for time-series or sequential analysis. &lt;/p&gt;

&lt;p&gt;When used correctly, window functions significantly reduce query complexity, eliminate the need for self-joins in many analytical patterns and improve expressiveness in reporting and business intelligence workloads.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Turning Data into Insight: An Analyst’s Guide to Power BI</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Sun, 08 Feb 2026 18:44:08 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/turning-data-into-insight-an-analysts-guide-to-power-bi-3lc1</link>
      <guid>https://forem.com/edmund_eryuba/turning-data-into-insight-an-analysts-guide-to-power-bi-3lc1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The reality of messy business data
&lt;/h2&gt;

&lt;p&gt;In most organizations, data rarely arrives in a clean, analysis-ready format. Analysts typically receive information from multiple sources: spreadsheets maintained by business teams, exports from transactional systems, cloud applications, and enterprise platforms such as ERPs or CRMs. These datasets often contain inconsistent formats, missing values, duplicate records, and unclear naming conventions.&lt;/p&gt;

&lt;p&gt;Working directly with such data leads to unreliable metrics, incorrect aggregations and ultimately poor business decisions. This is where Power BI plays a critical role. Power BI is not just a visualization tool, it is an analytical platform that allows analysts to clean, model, and interpret data before presenting it in a form that decision-makers can trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  From raw data to business action: The analyst workflow
&lt;/h2&gt;

&lt;p&gt;A typical analytical workflow in Power BI follows a logical sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; raw data from multiple sources, e.g. imports from Excel, databases, or online services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean&lt;/strong&gt; and &lt;strong&gt;transform&lt;/strong&gt; the data using Power Query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt; the data into a meaningful structure.&lt;/li&gt;
&lt;li&gt;Create business &lt;strong&gt;logic&lt;/strong&gt; using DAX.
&lt;/li&gt;
&lt;li&gt;Design &lt;strong&gt;dashboards&lt;/strong&gt; that communicate insight.&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;decisions&lt;/strong&gt; and &lt;strong&gt;actions&lt;/strong&gt; by stakeholders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step builds on the previous one. If any stage is poorly executed, the final insight becomes misleading, regardless of how attractive the dashboard looks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning and transforming data with Power Query
&lt;/h2&gt;

&lt;p&gt;Data cleaning is the foundation of all reliable analytics. Common data quality issues include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Columns stored in the wrong data type.&lt;/li&gt;
&lt;li&gt;Missing or null values.&lt;/li&gt;
&lt;li&gt;Duplicate customer or transaction records.&lt;/li&gt;
&lt;li&gt;Inconsistent naming and coding systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues directly affect calculations. For example, a null freight value treated as blank instead of zero will distort average shipping costs. Duplicate customer records inflate revenue totals. Incorrect data types prevent time-based analysis entirely.&lt;/p&gt;

&lt;p&gt;Power Query provides a transformation layer where analysts can reshape data without altering the original source. This ensures reproducibility and auditability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Transformation Principles
&lt;/h2&gt;

&lt;p&gt;There are several key principles that should guide an analyst in their approach to data transformation: &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Remove what is not needed
&lt;/h3&gt;

&lt;p&gt;Unnecessary columns increase model size, memory usage, and cognitive complexity. Every column should justify its existence in a business question.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Standardize naming
&lt;/h3&gt;

&lt;p&gt;Column and table names should reflect business language, not system codes.&lt;br&gt;
For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Cust_ID → Customer ID&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vSalesTbl → Sales&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This improves both usability and long-term maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Handle missing and invalid values
&lt;/h3&gt;

&lt;p&gt;Nulls, errors, and placeholders must be explicitly addressed. Analysts must decide whether missing values represent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero&lt;/li&gt;
&lt;li&gt;Unknown&lt;/li&gt;
&lt;li&gt;Not applicable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each choice has analytical consequences.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Remove duplicates strategically
&lt;/h3&gt;

&lt;p&gt;Duplicates should be removed only when they represent the same real-world entity. Otherwise, analysts risk deleting legitimate records.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building meaningful data models
&lt;/h2&gt;

&lt;p&gt;Most analytical errors in Power BI do not come from DAX formulas or charts. They come from poor data models.&lt;/p&gt;

&lt;p&gt;A strong model reflects how the business actually operates. This typically follows a star schema:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact tables&lt;/strong&gt;: transactions (Sales, Orders, Payments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension tables&lt;/strong&gt;: descriptive attributes (Date, Product, Customer, Region)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correct aggregations.&lt;/li&gt;
&lt;li&gt;Predictable filter behavior.&lt;/li&gt;
&lt;li&gt;High performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without proper modeling, even simple metrics like “Total Sales by Region” can produce incorrect results due to ambiguous relationships or double counting.&lt;/p&gt;
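&lt;p&gt;The star-schema idea can be sketched outside Power BI as well: the same fact-to-dimension join underlies a metric like "Total Sales by Region" in SQL. A minimal sketch against SQLite, with invented table names and data:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch: a tiny star schema where the region metric comes from
# joining the fact table to a customer dimension. Names/data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, amount INTEGER);
INSERT INTO dim_customer VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales   VALUES (1, 100), (1, 50), (2, 200);
""")
totals = conn.execute("""
SELECT d.region, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key
GROUP BY d.region ORDER BY d.region;
""").fetchall()
print(totals)  # [('East', 150), ('West', 200)]
```

&lt;p&gt;Because each fact row points to exactly one customer, the aggregation is unambiguous; a many-to-many or duplicated key on the dimension side is precisely what produces the double counting described above.&lt;/p&gt;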

&lt;h2&gt;
  
  
  Creating business logic with DAX
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DAX&lt;/strong&gt; (Data Analysis Expressions) is a library of functions and operators that can be combined to build formulas and expressions in Power BI, Analysis Services, and Power Pivot in Excel data models. It enables dynamic, context-aware analysis that goes beyond traditional spreadsheet formulas.&lt;/p&gt;

&lt;p&gt;Examples of business logic encoded in DAX:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What counts as “Revenue”?&lt;/li&gt;
&lt;li&gt;How is “Customer Retention” defined?&lt;/li&gt;
&lt;li&gt;What is the official “Profit Margin” formula?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These definitions must be centralized and reusable. Measures become the organization’s single source of analytical truth.&lt;/p&gt;

&lt;p&gt;DAX uses a formula syntax similar to Excel but extends it with advanced functions designed specifically for tabular data models in Power BI. It allows users to create measures, calculated columns and calculated tables to perform dynamic and context-aware calculations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measures vs Calculated Columns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calculated columns&lt;/strong&gt;: A calculated column is a column that you add to an existing table (in the model designer) and then create a DAX formula that defines the column's values. They operate row by row and are stored in memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measures&lt;/strong&gt;: A measure is evaluated dynamically at query time, so its results change with the report's filter context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating Measures for Advanced Calculations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Measures are a core component of DAX used for calculations on aggregated data.&lt;/li&gt;
&lt;li&gt;They are evaluated at query time, not stored in the data model.&lt;/li&gt;
&lt;li&gt;Measures respond dynamically to filters, slicers, and report context.&lt;/li&gt;
&lt;li&gt;Commonly used aggregations in measures include &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVERAGE&lt;/code&gt;, and &lt;code&gt;COUNT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;DAX supports both implicit and explicit measures.&lt;/li&gt;
&lt;li&gt;Using correct data types is essential for accurate measure calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most analytical metrics, measures are preferred, because they respond to filters, slicers, and user interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Context: The Core of Correct Analytics
&lt;/h3&gt;

&lt;p&gt;Context is one of the most important concepts in DAX because it determines how and where a formula is evaluated. It is what makes DAX calculations dynamic: the same formula can return different results depending on the row, cell, or filters applied in a report. &lt;/p&gt;

&lt;p&gt;Without understanding context, it becomes difficult to build accurate measures, optimize performance, or troubleshoot unexpected results.&lt;/p&gt;

&lt;p&gt;There are three main types of context in DAX:&lt;/p&gt;

&lt;h3&gt;
  
  
  Row Context
&lt;/h3&gt;

&lt;p&gt;Refers to the &lt;em&gt;current row&lt;/em&gt; being evaluated. It is most commonly seen in calculated columns, where the formula is applied row by row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter Context
&lt;/h3&gt;

&lt;p&gt;It is the set of filters applied to the data. These filters can come from slicers and visuals in the report, or they can be explicitly defined inside a DAX formula.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Context
&lt;/h3&gt;

&lt;p&gt;Created by the layout of the report itself.&lt;/p&gt;

&lt;p&gt;If analysts misunderstand context, they produce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong totals.&lt;/li&gt;
&lt;li&gt;Misleading KPIs.&lt;/li&gt;
&lt;li&gt;Inconsistent executive reports.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, context is the foundation of how DAX works. It controls what data a formula can “see” and therefore directly affects the result of every calculation. Mastering row, query, and filter context is essential for building reliable, high-performing, and truly dynamic analytical models in Power BI and other tabular environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing dashboards that communicate insight
&lt;/h2&gt;

&lt;p&gt;Designing interactive dashboards helps businesses make data-driven decisions. A dashboard is not a collection of charts. It is a decision interface.&lt;/p&gt;

&lt;p&gt;It is essential to design professional reports that focus on optimizing layouts for different audiences, and leveraging Power BI’s interactive features.&lt;/p&gt;

&lt;p&gt;Good dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highlight trends and deviations.&lt;/li&gt;
&lt;li&gt;Compare performance against targets.&lt;/li&gt;
&lt;li&gt;Expose anomalies and risks.&lt;/li&gt;
&lt;li&gt;Support follow-up questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show too many metrics.&lt;/li&gt;
&lt;li&gt;Focus on visuals over meaning.&lt;/li&gt;
&lt;li&gt;Require explanation to interpret.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7n1810hjmrvo1upaz5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7n1810hjmrvo1upaz5m.png" alt="Sample Dashboard Data" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning Dashboards into Business Decisions
&lt;/h2&gt;

&lt;p&gt;This is the most important step, and the most neglected.&lt;/p&gt;

&lt;p&gt;Dashboards should answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which regions are underperforming?&lt;/li&gt;
&lt;li&gt;Which products drive the most margin?&lt;/li&gt;
&lt;li&gt;Where is customer churn increasing?&lt;/li&gt;
&lt;li&gt;What happens if we change pricing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real business actions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reallocating marketing budgets.&lt;/li&gt;
&lt;li&gt;Optimizing inventory levels.&lt;/li&gt;
&lt;li&gt;Identifying operational bottlenecks.&lt;/li&gt;
&lt;li&gt;Redesigning sales strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If no decision changes because of a dashboard, then the analysis has failed to capture the key business indicators.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls that undermine analytical value
&lt;/h2&gt;

&lt;p&gt;Even experienced analysts fall into these traps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating Power BI as a visualization tool instead of a modeling tool.&lt;/li&gt;
&lt;li&gt;Writing complex DAX on top of poor data models.&lt;/li&gt;
&lt;li&gt;Using calculated columns instead of measures.&lt;/li&gt;
&lt;li&gt;Ignoring filter propagation and relationship direction.&lt;/li&gt;
&lt;li&gt;Optimizing visuals before validating metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues lead to highly polished dashboards with fundamentally wrong numbers, the worst possible outcome in analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Power BI provides an integrated analytical environment where data preparation, semantic modeling, calculation logic, and visualization are combined into a single workflow.&lt;/p&gt;

&lt;p&gt;The analytical value of the platform does not emerge from individual components such as Power Query, DAX, or reports in isolation, but from how these components are systematically designed and aligned with business requirements.&lt;/p&gt;

&lt;p&gt;Effective use of Power BI requires analysts to impose structure on raw data, define consistent relationships, implement reusable calculation logic through measures and ensure that visual outputs reflect correct filter and evaluation contexts. &lt;/p&gt;

&lt;p&gt;When these layers are properly engineered, Power BI supports reliable aggregation, scalable analytical models, and consistent interpretation of metrics across the organization, enabling stakeholders to base operational and strategic decisions on a shared and technically sound analytical foundation.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datamodelling</category>
      <category>powerbi</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>Data Modelling for High Performance and Accurate Analytics in Power BI</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Sun, 01 Feb 2026 09:16:40 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/data-modelling-for-high-performance-and-accurate-analytics-in-power-bi-5988</link>
      <guid>https://forem.com/edmund_eryuba/data-modelling-for-high-performance-and-accurate-analytics-in-power-bi-5988</guid>
      <description>&lt;p&gt;This article explores data modelling in Power BI with a focus on different schema types and explains how proper modelling enhances performance and ensures accurate reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Modelling?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data modelling&lt;/em&gt;&lt;/strong&gt; is one of the most critical steps in building effective business intelligence (BI) solutions. In Power BI, data modelling refers to the process of structuring data into related tables, defining relationships and creating a logical framework that supports efficient querying, accurate calculations and meaningful reporting.&lt;/p&gt;

&lt;p&gt;A well-designed data model is not just about organizing tables; it directly impacts report performance, usability, scalability, and the correctness of insights derived from data. Poor data modelling leads to slow reports, incorrect aggregations, complex DAX expressions, and ultimately unreliable business decisions.&lt;/p&gt;

&lt;p&gt;In Power BI, data modelling involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifying business entities (facts and dimensions)&lt;/li&gt;
&lt;li&gt;Structuring tables logically&lt;/li&gt;
&lt;li&gt;Defining relationships between tables&lt;/li&gt;
&lt;li&gt;Setting cardinality and filter direction&lt;/li&gt;
&lt;li&gt;Creating calculated columns and measures&lt;/li&gt;
&lt;li&gt;Ensuring data granularity and consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a Schema in Power BI?
&lt;/h2&gt;

&lt;p&gt;In Power BI, a schema refers to the structure and organization of data within a data model. Schemas define how data is connected and related within the model, influencing the efficiency and performance of data queries and reports. Understanding schemas requires modelers to classify their model tables as either dimension or fact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fact tables
&lt;/h3&gt;

&lt;p&gt;Fact tables store quantitative, transactional data such as sales orders, quantities sold, revenue, and profit. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns. The dimension key columns determine the dimensionality of a fact table, while the dimension key values determine the granularity of a fact table.&lt;/p&gt;

&lt;p&gt;Example of a fact table: Consider a simple sales analytics scenario in Power BI.&lt;/p&gt;

&lt;p&gt;This table stores transactional (measurable) data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SalesID&lt;/th&gt;
&lt;th&gt;DateKey&lt;/th&gt;
&lt;th&gt;ProductKey&lt;/th&gt;
&lt;th&gt;CustomerKey&lt;/th&gt;
&lt;th&gt;Quantity&lt;/th&gt;
&lt;th&gt;SalesAmount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;20240101&lt;/td&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;301&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;20240101&lt;/td&gt;
&lt;td&gt;502&lt;/td&gt;
&lt;td&gt;302&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1003&lt;/td&gt;
&lt;td&gt;20240102&lt;/td&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1004&lt;/td&gt;
&lt;td&gt;20240103&lt;/td&gt;
&lt;td&gt;503&lt;/td&gt;
&lt;td&gt;301&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contains foreign keys to dimensions.&lt;/li&gt;
&lt;li&gt;Contains numeric measures.&lt;/li&gt;
&lt;li&gt;Has many rows (high volume).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dimension tables
&lt;/h3&gt;

&lt;p&gt;Dimension tables describe the business entities being modelled. Entities can include products, people, places, and concepts, including time itself. A dimension table contains a key column (or columns) that acts as a unique identifier, plus descriptive columns that support filtering and grouping your data.&lt;/p&gt;

&lt;p&gt;This table provides descriptive attributes about products.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ProductKey&lt;/th&gt;
&lt;th&gt;ProductName&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Brand&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;Laptop&lt;/td&gt;
&lt;td&gt;Electronics&lt;/td&gt;
&lt;td&gt;Dell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;502&lt;/td&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;td&gt;Accessories&lt;/td&gt;
&lt;td&gt;Sony&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;503&lt;/td&gt;
&lt;td&gt;Mouse&lt;/td&gt;
&lt;td&gt;Accessories&lt;/td&gt;
&lt;td&gt;Logitech&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
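&lt;p&gt;To make the fact/dimension split concrete, here is a minimal Python sketch (a stand-in for what Power BI's engine does, not Power BI code) that answers "sales by category" by joining the two example tables above on ProductKey:&lt;/p&gt;

```python
# Fact table rows from the example above: dimension keys plus measures
fact_sales = [
    {"SalesID": 1001, "DateKey": 20240101, "ProductKey": 501, "CustomerKey": 301, "Quantity": 2, "SalesAmount": 200},
    {"SalesID": 1002, "DateKey": 20240101, "ProductKey": 502, "CustomerKey": 302, "Quantity": 1, "SalesAmount": 150},
    {"SalesID": 1003, "DateKey": 20240102, "ProductKey": 501, "CustomerKey": 303, "Quantity": 3, "SalesAmount": 300},
    {"SalesID": 1004, "DateKey": 20240103, "ProductKey": 503, "CustomerKey": 301, "Quantity": 1, "SalesAmount": 120},
]

# Product dimension: one row per key, descriptive attributes only
dim_product = {
    501: {"ProductName": "Laptop", "Category": "Electronics", "Brand": "Dell"},
    502: {"ProductName": "Headphones", "Category": "Accessories", "Brand": "Sony"},
    503: {"ProductName": "Mouse", "Category": "Accessories", "Brand": "Logitech"},
}

# A typical star-schema query: look up the dimension attribute for each
# fact row, then sum the measure per attribute value
sales_by_category = {}
for row in fact_sales:
    category = dim_product[row["ProductKey"]]["Category"]
    sales_by_category[category] = sales_by_category.get(category, 0) + row["SalesAmount"]

print(sales_by_category)  # {'Electronics': 500, 'Accessories': 270}
```

&lt;p&gt;Power BI performs the same roll-up automatically once a relationship links ProductKey in both tables.&lt;/p&gt;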

&lt;p&gt;&lt;strong&gt;Types of Schemas in Power BI:&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Star Schema
&lt;/h2&gt;

&lt;p&gt;The star schema consists of a central fact table connected directly to multiple dimension tables, resembling a star.&lt;br&gt;
The central fact table contains quantitative data (e.g., sales), while the dimension tables hold descriptive attributes related to the facts (e.g., Employee, Date, Territory). Dimension tables are not connected to each other.&lt;/p&gt;

&lt;p&gt;Star schemas are ideal for straightforward reporting and querying. They are efficient for read-heavy operations, making them suitable for dashboards and summary reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc7rgra129a65q3exmao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc7rgra129a65q3exmao.png" alt="Power BI Star Schema" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;The snowflake schema is a normalized version of the star schema. In this design, dimension tables are further divided into related tables, producing a more complex structure.&lt;br&gt;
Normalization eliminates redundancy by splitting dimension tables into multiple related tables, yielding a branching structure that resembles a snowflake.&lt;/p&gt;

&lt;p&gt;Snowflake schemas are used in scenarios requiring detailed data models and efficient storage. They are beneficial when dealing with large datasets where data redundancy needs to be minimized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cubxjxf50x74bwpfflg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cubxjxf50x74bwpfflg.png" alt="Snowflake Schema" width="520" height="296"&gt;&lt;/a&gt;&lt;/p&gt;
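&lt;p&gt;A minimal Python sketch (illustrative table and key names, not Power BI code) of what snowflaking the product dimension from the earlier example looks like:&lt;/p&gt;

```python
# Star-style product dimension: the Category text repeats on every row
dim_product_star = {
    501: {"ProductName": "Laptop",     "Category": "Electronics"},
    502: {"ProductName": "Headphones", "Category": "Accessories"},
    503: {"ProductName": "Mouse",      "Category": "Accessories"},
}

# Snowflaked: the repeating attribute moves into its own table,
# and each product row keeps only a key pointing at it
dim_category = {1: "Electronics", 2: "Accessories"}
dim_product = {
    501: {"ProductName": "Laptop",     "CategoryKey": 1},
    502: {"ProductName": "Headphones", "CategoryKey": 2},
    503: {"ProductName": "Mouse",      "CategoryKey": 2},
}

# Redundancy is gone, but reading the category now costs an extra lookup (join)
for key, product in dim_product.items():
    category = dim_category[product["CategoryKey"]]
    assert category == dim_product_star[key]["Category"]
```

&lt;p&gt;The trade-off shown here is exactly the one described above: less repeated data, more joins at query time.&lt;/p&gt;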

&lt;h2&gt;
  
  
  3. Galaxy Schema (Fact Constellation Schema)
&lt;/h2&gt;

&lt;p&gt;The galaxy schema consists of multiple fact tables linked to shared dimension tables, creating an interconnected model that enables the analysis of different business processes within a single data model.&lt;/p&gt;

&lt;p&gt;Galaxy schemas are suitable for large-scale enterprise environments where multiple related business processes need to be analysed. They support complex queries and detailed reporting across various domains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxna99qc2m9ca576v23q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxna99qc2m9ca576v23q.png" alt="Galaxies Schema" width="617" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing schemas in Power BI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Creating a Star Schema
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Set Up Fact and Dimension Tables&lt;/em&gt;: Identify and create the central fact table and surrounding dimension tables.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Link Tables&lt;/em&gt;: Establish relationships between the fact table and dimension tables using foreign keys.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Optimize for Performance&lt;/em&gt;: Index key columns and use efficient data types to enhance query performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  b. Implementing a Snowflake Schema
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Normalize Dimension Tables&lt;/em&gt;: Split dimension tables into related sub-tables to reduce redundancy.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Create Relationships&lt;/em&gt;: Define relationships between sub-tables and the main dimension tables, ensuring referential integrity.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Optimize Storage&lt;/em&gt;: Use appropriate storage and indexing strategies to manage complex joins efficiently.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  c. Setting Up a Galaxy Schema
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Identify Fact Tables&lt;/em&gt;: Determine the various fact tables needed for different business processes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Share Dimension Tables&lt;/em&gt;: Create shared dimension tables to link multiple fact tables.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Ensure Efficient Querying&lt;/em&gt;: Design the schema to support complex queries and optimize performance through indexing and data partitioning.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Role of data modelling in performance and reliable reporting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Query and Engine Performance
&lt;/h3&gt;

&lt;p&gt;Query and engine performance refers to how efficiently Power BI processes data when users interact with reports. Power BI’s VertiPaq engine performs best when data is organised using a star schema, where fact tables store numerical data and dimension tables store descriptive attributes. &lt;/p&gt;

&lt;p&gt;This structure improves compression, reduces the number of joins required during query execution, and simplifies DAX calculations. As a result, reports load faster, visuals respond more quickly, and overall system performance improves.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Memory and Scalability
&lt;/h3&gt;

&lt;p&gt;Memory and scalability describe the ability of a Power BI model to handle large and growing datasets. Proper data modelling controls dataset size by reducing column cardinality and removing unnecessary fields. Low-cardinality columns compress efficiently, while column pruning helps minimise memory usage and refresh time. &lt;/p&gt;

&lt;p&gt;By structuring data into lean fact and dimension tables rather than wide flat tables, Power BI models become more scalable and capable of supporting high data volumes without performance degradation.&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Correct Aggregations and Metrics
&lt;/h3&gt;

&lt;p&gt;Correct aggregations ensure that reported values accurately reflect business operations. This depends on defining clear data granularity and using proper relationship structures. Each fact table must represent a consistent level of detail, and filters should flow from dimensions to facts. &lt;/p&gt;

&lt;p&gt;Poor modelling can result in double counting, ambiguous totals, or misleading KPIs. A well-designed model prevents these issues by enforcing one-to-many relationships and maintaining logical data structure.&lt;/p&gt;
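&lt;p&gt;A small Python sketch (illustrative, not Power BI code) shows how a duplicated dimension key breaks the one-to-many assumption and double-counts a measure:&lt;/p&gt;

```python
fact_sales = [
    {"ProductKey": 501, "SalesAmount": 200},
    {"ProductKey": 501, "SalesAmount": 300},
]

# One-to-many (correct): the dimension has exactly one row per key
good_dim = [{"ProductKey": 501, "Brand": "Dell"}]

# Broken: a duplicated key makes the join many-to-many
bad_dim = [
    {"ProductKey": 501, "Brand": "Dell"},
    {"ProductKey": 501, "Brand": "Dell US"},
]

def joined_total(fact, dim):
    # Sum the measure over every fact/dimension row pair that matches on the key
    return sum(
        f["SalesAmount"]
        for f in fact
        for d in dim
        if f["ProductKey"] == d["ProductKey"]
    )

print(joined_total(fact_sales, good_dim))  # 500: each sale counted once
print(joined_total(fact_sales, bad_dim))   # 1000: every sale counted twice
```

&lt;p&gt;This is why deduplicating dimension keys and enforcing one-to-many relationships matters for correct totals.&lt;/p&gt;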

&lt;h3&gt;
  
  
  d. Filter Propagation and User Trust
&lt;/h3&gt;

&lt;p&gt;Filter propagation determines how slicers and filters affect report visuals. In a properly modelled system, filters behave predictably and consistently across all visuals, allowing users to explore data intuitively. &lt;/p&gt;

&lt;p&gt;When modelling is poor, filters may behave inconsistently, leading to confusing or contradictory results. Reliable filter behavior builds user trust and ensures that insights derived from reports are credible and easy to interpret.&lt;/p&gt;
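&lt;p&gt;A minimal Python sketch (illustrative, not Power BI code) of dimension-to-fact filter propagation:&lt;/p&gt;

```python
fact_sales = [
    {"ProductKey": 501, "SalesAmount": 200},
    {"ProductKey": 502, "SalesAmount": 150},
    {"ProductKey": 501, "SalesAmount": 300},
    {"ProductKey": 503, "SalesAmount": 120},
]
dim_product = {501: "Dell", 502: "Sony", 503: "Logitech"}

# A slicer selection lands on the dimension table...
selected_keys = {key for key, brand in dim_product.items() if brand == "Sony"}

# ...and propagates along the one-to-many relationship to the fact table:
# only fact rows whose key survives the dimension filter are aggregated
total = sum(f["SalesAmount"] for f in fact_sales if f["ProductKey"] in selected_keys)
print(total)  # 150
```

&lt;p&gt;With a sound model, every visual applies the same filter the same way, which is what makes slicer behavior predictable.&lt;/p&gt;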

&lt;h3&gt;
  
  
  e. Maintainability and Governance
&lt;/h3&gt;

&lt;p&gt;Maintainability refers to how easy the model is to manage and extend over time. A strong data model supports reusable measures, consistent dimensions, and standard business definitions across reports. This creates a single source of truth for the organisation, reduces duplication of logic, and simplifies governance. &lt;/p&gt;

&lt;p&gt;As a result, the reporting environment becomes easier to maintain, more consistent, and more reliable for long-term decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Understanding different schemas in Power BI is crucial for designing efficient data models. Each schema has unique advantages: the star schema is ideal for straightforward reporting and querying, offering simplicity and ease of use; the snowflake schema provides detailed, normalized structures, reducing redundancy and optimizing storage; and the galaxy schema supports complex, large-scale data models with multiple fact tables sharing dimension tables. &lt;/p&gt;

&lt;p&gt;Choosing the right schema improves query performance, data storage efficiency, and data refresh operations. By mastering these schemas, you can create robust and scalable data models, enabling your organization to make data-driven decisions effectively.&lt;/p&gt;

&lt;p&gt;Experiment with different schema designs based on your data needs and continue refining your skills to become a Power BI expert.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>powerfuldevs</category>
      <category>datastructures</category>
      <category>powerbi</category>
    </item>
    <item>
      <title>Linux for Data Engineers: From Terminal to Text Editing</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Sun, 25 Jan 2026 16:13:21 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/introduction-to-linux-for-data-engineers-299o</link>
      <guid>https://forem.com/edmund_eryuba/introduction-to-linux-for-data-engineers-299o</guid>
      <description>&lt;p&gt;Linux is an &lt;strong&gt;open-source&lt;/strong&gt; operating system that is based on the Unix operating system. It was created by Linus Torvalds in 1991.&lt;br&gt;
Open-source means that the source code of the operating system is available to the public. This allows anyone to modify the original code, customize it, and distribute the new operating system to potential users.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why should you learn about Linux?
&lt;/h2&gt;

&lt;p&gt;In today's data center landscape, Linux and Microsoft Windows stand out as the primary contenders, with Linux having a major share.&lt;/p&gt;

&lt;p&gt;Here are several compelling reasons to learn Linux:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given the prevalence of Linux &lt;strong&gt;hosting&lt;/strong&gt;, there is a high chance that your application will be hosted on Linux. So, learning Linux as a data engineer or developer becomes increasingly valuable.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;cloud computing&lt;/strong&gt; becoming the norm, chances are high that your cloud instances will rely on Linux.&lt;/li&gt;
&lt;li&gt;Linux serves as the &lt;strong&gt;foundation&lt;/strong&gt; for many operating systems for the Internet of Things (IoT) and mobile applications.&lt;/li&gt;
&lt;li&gt;Linux is built for &lt;strong&gt;automation&lt;/strong&gt;, which is central to data engineering. Linux enables repeatability, fault tolerance and observability of the entire workflow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What is a Linux Kernel?
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;kernel&lt;/strong&gt; is the central component of an operating system that manages the computer and its hardware operations. It handles memory operations and CPU time.&lt;/p&gt;

&lt;p&gt;The kernel acts as a bridge between applications and the hardware-level data processing using inter-process communication and system calls.&lt;br&gt;
The kernel loads into memory first when an operating system starts and remains there until the system shuts down. It is responsible for tasks like disk management, task management, and memory management.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is a Linux distribution?
&lt;/h2&gt;

&lt;p&gt;The Linux kernel is reused and configured differently across distributions. You can further combine different utilities and software to create a completely new operating system.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Linux distribution&lt;/strong&gt; or &lt;strong&gt;distro&lt;/strong&gt; is a version of the Linux operating system that includes the Linux kernel, system utilities, and other software. Being open source, a Linux distribution is a collaborative effort involving multiple independent open-source development communities.&lt;/p&gt;

&lt;p&gt;Today, there are thousands of Linux distributions to choose from, offering differing goals and criteria for selecting and supporting the software provided by their distribution.&lt;/p&gt;

&lt;p&gt;Distributions vary from one to the other, but they generally have several common characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A distribution consists of a Linux kernel.&lt;/li&gt;
&lt;li&gt;It supports user space programs.&lt;/li&gt;
&lt;li&gt;A distribution may be small and single-purpose or include thousands of open-source programs.&lt;/li&gt;
&lt;li&gt;Some means of installing and updating the distribution and its components should be provided.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some popular Linux distributions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://ubuntu.com/" rel="noopener noreferrer"&gt;Ubuntu&lt;/a&gt;: One of the most widely used and popular Linux distributions. It is user-friendly and recommended for beginners.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://linuxmint.com/" rel="noopener noreferrer"&gt;Linux Mint&lt;/a&gt;: Based on Ubuntu, Linux Mint provides a user-friendly experience with a focus on multimedia support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.archlinux.org/" rel="noopener noreferrer"&gt;Arch Linux&lt;/a&gt;: Popular among experienced users, Arch is a lightweight and flexible distribution aimed at users who prefer a DIY approach.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://manjaro.org/" rel="noopener noreferrer"&gt;Manjaro&lt;/a&gt;: Based on Arch Linux, Manjaro provides a user-friendly experience with pre-installed software and easy system management tools.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kali.org/" rel="noopener noreferrer"&gt;Kali Linux&lt;/a&gt;: Kali Linux provides a comprehensive suite of security tools and is mostly focused on cybersecurity and hacking.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  How to install and access Linux
&lt;/h2&gt;

&lt;p&gt;There are various ways to install and access Linux, including on a Windows machine. This section explores these methods in detail.&lt;/p&gt;
&lt;h3&gt;
  
  
  Install Linux as the primary OS
&lt;/h3&gt;

&lt;p&gt;Installing Linux as the primary OS is the most efficient way to use Linux, as you can use the full power of your machine.&lt;br&gt;
We'll focus on installing Ubuntu, one of the most popular Linux distributions. Linux has numerous other distributions suited to specific use cases, which you can explore based on your preference. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1&lt;/strong&gt; – Download the &lt;a href="https://ubuntu.com/download" rel="noopener noreferrer"&gt;Ubuntu&lt;/a&gt; iso file. Make sure to select a stable release that is labelled "LTS". LTS stands for Long Term Support which means you can get free security and maintenance updates for a long time (usually 5 years).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2&lt;/strong&gt; – Create a bootable pen drive: Several tools, such as Rufus or balenaEtcher, can create one. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3&lt;/strong&gt; – Boot from the pen drive: Once your bootable pen drive is ready, insert it and boot from it. The boot menu key varies by manufacturer; look it up for your laptop model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4&lt;/strong&gt; – Follow the prompts: Once the boot process starts, select "Try or Install Ubuntu".
The installation will take some time. Once the GUI appears, select your language and keyboard layout and continue. Enter your name and login credentials; remember them, as you will need them to log in to your system and access full privileges. Wait for the installation to complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5&lt;/strong&gt; – Restart: Click on restart now and remove the pen drive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 6&lt;/strong&gt; – Login: Login with the credentials you entered earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there you go! Now you can install apps and customize your desktop.&lt;/p&gt;
&lt;h3&gt;
  
  
  Accessing the terminal
&lt;/h3&gt;

&lt;p&gt;An important part is learning about the &lt;strong&gt;terminal&lt;/strong&gt;, where you'll run all the commands and see the magic happen. You can search for the terminal by pressing the Super ("Windows") key and typing "terminal". &lt;br&gt;
The shortcut for opening the terminal is &lt;code&gt;ctrl + alt + t&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also open the terminal from inside a folder. Right click where you are and click on "Open in Terminal". This will open the terminal in the same path.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use Linux on a Windows machine
&lt;/h2&gt;

&lt;p&gt;Sometimes you might need to run both Linux and Windows side by side. Luckily, there are some ways you can get the best of both worlds without getting different computers for each operating system.&lt;br&gt;
This section explores a few ways to use Linux on a Windows machine. &lt;/p&gt;
&lt;h3&gt;
  
  
  Option 1: "Dual-boot" Linux + Windows
&lt;/h3&gt;

&lt;p&gt;With dual boot, you can install Linux alongside Windows on your computer, allowing you to choose which operating system to use at startup.&lt;/p&gt;

&lt;p&gt;This requires partitioning your hard drive and installing Linux on a separate partition. With this approach, you can only use one operating system at a time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 2: Use Windows Subsystem for Linux (WSL)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Windows Subsystem for Linux&lt;/strong&gt; provides a compatibility layer that lets you run Linux binary executables natively on Windows.&lt;/p&gt;

&lt;p&gt;Using WSL has some advantages. The setup for WSL is simple and not time-consuming. It is lightweight compared to virtual machines (VMs) where you have to allocate resources from the host machine. You don't need to install any ISO or virtual disc image for Linux machines which tend to be heavy files. You can use Windows and Linux side by side.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to install WSL2
&lt;/h3&gt;

&lt;p&gt;First, enable the Windows Subsystem for Linux option in settings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to Start. Search for "Turn Windows features on or off."&lt;/li&gt;
&lt;li&gt;Check the option "Windows Subsystem for Linux" if it isn't already.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9207kkwiwapvxitirtmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9207kkwiwapvxitirtmo.png" alt="Windows features" width="549" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next, open Command Prompt as an administrator.&lt;/li&gt;
&lt;li&gt;Run the command below:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wsl --install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: By default, Ubuntu will be installed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once installation is complete, restart your Windows machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once installation of Ubuntu is complete, you'll be prompted to enter your username and password.&lt;br&gt;
And, that's it! You are ready to use Ubuntu.&lt;br&gt;
Launch Ubuntu by searching from the start menu.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 3: Use a Virtual Machine (VM)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;virtual machine&lt;/strong&gt; (VM) is a software emulation of a physical computer system. It allows you to run multiple operating systems and applications on a single physical machine simultaneously.&lt;/p&gt;

&lt;p&gt;You can use virtualization software such as &lt;strong&gt;Oracle VirtualBox&lt;/strong&gt; or &lt;strong&gt;VMware&lt;/strong&gt; to create a virtual machine running Linux within your Windows environment. This allows you to run Linux as a guest operating system alongside Windows.&lt;/p&gt;

&lt;p&gt;VM software provides options to allocate and manage hardware resources for each VM, including CPU cores, memory, disk space, and network bandwidth. You can adjust these allocations based on the requirements of the guest operating systems and applications.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 4: Use a Browser-based Solution
&lt;/h3&gt;

&lt;p&gt;Browser-based solutions are particularly useful for quick testing, learning, or accessing Linux environments from devices that don't have Linux installed.&lt;br&gt;
You can either use online code editors or web-based terminals to access Linux. Note that you usually don't have full administration privileges in these cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online code editors:&lt;/strong&gt; They offer editors with built-in Linux terminals. While their primary purpose is coding, you can also utilize the Linux terminal to execute commands and perform tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://replit.com/" rel="noopener noreferrer"&gt;Replit&lt;/a&gt; is an example of an online code editor, where you can write your code and access the Linux shell at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web-based Linux terminals:&lt;/strong&gt; Online Linux terminals allow you to access a Linux command-line interface directly from your browser. These terminals provide a web-based interface to a Linux shell, enabling you to execute commands and work with Linux utilities.&lt;br&gt;
One such example is &lt;a href="https://jslinux.org/" rel="noopener noreferrer"&gt;JSLinux&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 5: Use a Cloud-based Solution
&lt;/h3&gt;

&lt;p&gt;Instead of running Linux directly on your Windows machine, you can consider using cloud-based Linux environments or &lt;strong&gt;virtual private servers&lt;/strong&gt; (VPS) to access and work with Linux remotely.&lt;/p&gt;

&lt;p&gt;Services like &lt;strong&gt;Amazon EC2&lt;/strong&gt;, &lt;strong&gt;Microsoft Azure&lt;/strong&gt;, or &lt;strong&gt;DigitalOcean&lt;/strong&gt; provide Linux instances that you can connect to from your Windows computer. Note that some of these services offer free tiers, but they are not usually free in the long run.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction to Bash Shell and System Commands
&lt;/h2&gt;

&lt;p&gt;The Linux command line is provided by a program called the &lt;strong&gt;shell&lt;/strong&gt;. Over the years, many shell variants have evolved, each offering different features.&lt;/p&gt;

&lt;p&gt;Different users can be configured to use different shells, but most users prefer to stick with the default. The default shell for many Linux distros is the GNU Bourne-Again Shell (&lt;code&gt;bash&lt;/code&gt;), the successor to the Bourne shell (&lt;code&gt;sh&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;To find out your current shell, open your terminal and enter the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo $SHELL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Command breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;echo&lt;/code&gt; command is used to print on the terminal.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;$SHELL&lt;/code&gt; is a special variable that holds the name of the current shell.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my setup, the output is &lt;code&gt;/bin/bash&lt;/code&gt;. This means that I am using the bash shell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo82oq9d55m3q5dvhhndb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo82oq9d55m3q5dvhhndb.png" alt="bin bash" width="800" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bash is very powerful as it can simplify certain operations that are hard to accomplish efficiently with a GUI (or Graphical User Interface). Remember that most servers do not have a GUI, and it is best to learn to use the powers of a command line interface (CLI).&lt;/p&gt;

&lt;h3&gt;
  
  
  Terminal vs Shell
&lt;/h3&gt;

&lt;p&gt;The terms &lt;strong&gt;terminal&lt;/strong&gt; and &lt;strong&gt;shell&lt;/strong&gt; are often used interchangeably, but they refer to different parts of the command-line interface.&lt;br&gt;
The terminal is the interface you use to interact with the shell. The shell is the command interpreter that processes and executes your commands.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is a prompt?
&lt;/h3&gt;

&lt;p&gt;When a shell is used interactively, it displays a &lt;code&gt;$&lt;/code&gt; when it is waiting for a command from the user. This is called the &lt;strong&gt;shell prompt&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[username@host ~]$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the shell is running as root, the prompt changes to &lt;code&gt;#&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@host ~]#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Command Structure
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;command&lt;/strong&gt; is a program that performs a specific operation. Once you have access to the shell, you can enter any command after the &lt;code&gt;$&lt;/code&gt; sign and see the output on the terminal.&lt;br&gt;
Generally, Linux commands follow this syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;command [options] [arguments]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the breakdown of the above syntax:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;command&lt;/code&gt;: This is the name of the command you want to execute. &lt;code&gt;ls&lt;/code&gt; (list), &lt;code&gt;cp&lt;/code&gt; (copy), and &lt;code&gt;rm&lt;/code&gt; (remove) are common Linux commands.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[options]&lt;/code&gt;: Options, or flags, often preceded by a hyphen (-) or double hyphen (--), modify the behavior of the command. They can change how the command operates. For example, &lt;code&gt;ls -a&lt;/code&gt; uses the &lt;code&gt;-a&lt;/code&gt; option to display hidden files in the current directory.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[arguments]&lt;/code&gt;: Arguments are the inputs for the commands that require one. These could be filenames, user names, or other data that the command will act upon. For example, in the command &lt;code&gt;cat access.log&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt; is the command and &lt;code&gt;access.log&lt;/code&gt; is the input. As a result, the &lt;code&gt;cat&lt;/code&gt; command displays the contents of the &lt;code&gt;access.log&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;
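&lt;p&gt;The pattern is easiest to see with &lt;code&gt;ls&lt;/code&gt; itself:&lt;/p&gt;

```shell
# command only: list the current directory
ls

# command + option: long listing, including hidden files
ls -la

# command + option + argument: long listing of /etc
ls -l /etc

# long (double-hyphen) options work the same way
ls --all /etc
```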

&lt;p&gt;Options and arguments are not required for all commands. Some commands can be run without any options or arguments, while others might require one or both to function correctly. You can always refer to the command's manual to check the options and arguments it supports. You can view a command's manual using the &lt;code&gt;man&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;You can access the manual page for ls with &lt;code&gt;man ls&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5mlvngd6qv50bqlq4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd5mlvngd6qv50bqlq4p.png" alt="man ls" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Manual pages are a great and quick way to access the documentation. I highly recommend going through man pages for the commands that you use the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Files From the Command line
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Linux File-system Hierarchy
&lt;/h3&gt;

&lt;p&gt;All files in Linux are stored in a file system, which follows an inverted tree structure: the root directory sits at the top and everything else branches below it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/&lt;/code&gt; is the root directory and the starting point of the file system. The root directory contains all other directories and files on the system. The &lt;code&gt;/&lt;/code&gt; character also serves as a directory separator between path names. For example, &lt;code&gt;/home/alice&lt;/code&gt; forms a complete path.&lt;br&gt;
You can learn more about the file system using the &lt;code&gt;man hier&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw22ay6a6ydn6vnkawq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw22ay6a6ydn6vnkawq3.png" alt="man hier output" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Navigating the Linux File-system
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;absolute path&lt;/strong&gt; is the full path from the root directory to the file or directory. It always starts with a &lt;code&gt;/&lt;/code&gt;. For example, &lt;code&gt;/home/john/documents&lt;/code&gt;.&lt;br&gt;
The &lt;strong&gt;relative path&lt;/strong&gt;, on the other hand, is the path from the current directory to the destination file or directory. It does not start with a &lt;code&gt;/&lt;/code&gt;. For example, &lt;code&gt;documents/work/project&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Locating your current directory:&lt;/strong&gt; You can locate your current directory in the Linux file system using the &lt;code&gt;pwd&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32rl61a89cakdxv491ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32rl61a89cakdxv491ln.png" alt="pwd command" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changing directories:&lt;/strong&gt; The command to change directories is &lt;code&gt;cd&lt;/code&gt; and it stands for &lt;strong&gt;&lt;em&gt;change directory&lt;/em&gt;&lt;/strong&gt;. You can use the &lt;code&gt;cd&lt;/code&gt; command to navigate to a different directory.&lt;/p&gt;

&lt;p&gt;Some other commonly used &lt;code&gt;cd&lt;/code&gt; shortcuts are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cd ..&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go back one directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cd ../..&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go back two directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cd&lt;/code&gt; or &lt;code&gt;cd ~&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Go to the home directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cd -&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go to the previous path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
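&lt;p&gt;Here is a short, hypothetical session showing these shortcuts in action (the directories are just examples):&lt;/p&gt;

```shell
cd /var/log      # jump to an absolute path
cd ..            # go up one level; we are now in /var
pwd              # prints /var
cd ~             # go to your home directory
cd -             # go back to the previous path (/var)
```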
&lt;h2&gt;
  
  
  Managing Files and Directories
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating new directories:&lt;/strong&gt; You can create an empty directory using the &lt;code&gt;mkdir&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# creates an empty directory named "foo" in the current folder
mkdir foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create directories recursively using the &lt;code&gt;-p&lt;/code&gt; option.&lt;/p&gt;
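&lt;p&gt;For example, the following creates a nested directory tree in one step (the directory names are illustrative). With &lt;code&gt;-p&lt;/code&gt;, &lt;code&gt;mkdir&lt;/code&gt; creates any missing parent directories and does not error if the directory already exists:&lt;/p&gt;

```shell
# creates projects/, projects/data/, and projects/data/raw/ as needed
mkdir -p projects/data/raw

# running it again is safe: -p suppresses the "File exists" error
mkdir -p projects/data/raw
```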

&lt;p&gt;&lt;strong&gt;Creating new files:&lt;/strong&gt; The &lt;code&gt;touch&lt;/code&gt; command creates an empty file. You can use it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# creates empty file "file.txt" in the current folder
touch file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file names can be chained together if you want to create multiple files in a single command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# creates empty files "file1.txt", "file2.txt", and "file3.txt" in the current folder

touch file1.txt file2.txt file3.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Removing files and directories:&lt;/strong&gt; You can use the &lt;code&gt;rm&lt;/code&gt; command to remove files and, with the &lt;code&gt;-r&lt;/code&gt; option, directories along with their contents. The &lt;code&gt;rmdir&lt;/code&gt; command removes an empty directory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rm file.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes the file file.txt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rm -r directory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes the directory directory and its contents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rm -f file.txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes the file file.txt without prompting for confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rmdir directory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes an empty directory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
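&lt;p&gt;A short sketch contrasting the two (the directory names are examples): &lt;code&gt;rmdir&lt;/code&gt; only succeeds on empty directories, while &lt;code&gt;rm -r&lt;/code&gt; removes a directory and everything inside it.&lt;/p&gt;

```shell
mkdir empty_dir
rmdir empty_dir            # works: the directory is empty

mkdir full_dir
touch full_dir/notes.txt
rm -r full_dir             # rmdir would fail here; rm -r removes the contents too
```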

&lt;p&gt;&lt;strong&gt;Copying files using the cp command:&lt;/strong&gt; To copy files in Linux, use the &lt;code&gt;cp&lt;/code&gt; command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syntax to copy files: &lt;code&gt;cp source_file destination&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following command copies a file named file1.txt to the directory /home/adam/logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp file1.txt /home/adam/logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cp&lt;/code&gt; command can also duplicate a file under a new name.&lt;br&gt;
The following command copies a file named file1.txt to a new file named file2.txt in the same folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp file1.txt file2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Moving and renaming files and folders:&lt;/strong&gt; The &lt;code&gt;mv&lt;/code&gt; command is used to move and rename files and folders from one directory to another.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syntax to move files: &lt;code&gt;mv source_file destination_directory&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Moves a file named file1.txt to a directory named backup

mv file1.txt backup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To move a directory and its contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mv dir1/ backup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renaming files and folders in Linux is also done with the &lt;code&gt;mv&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;Syntax to rename files: &lt;code&gt;mv old_name new_name&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Renames a file from file1.txt to file2.txt

mv file1.txt file2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Locating Files and Folders:&lt;/strong&gt; The &lt;code&gt;find&lt;/code&gt; command lets you efficiently search for files, folders, and character and block devices.&lt;br&gt;
Below is the basic syntax of the &lt;code&gt;find&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find /path/ -type f -name file-to-search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/path&lt;/code&gt; is the path where the file is expected to be found. This is the starting point for the search. The path can also be &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;.&lt;/code&gt;, which represent the root and current directory, respectively.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-type&lt;/code&gt; specifies the type of file to search for. It can be any of the below:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;f&lt;/code&gt; – &lt;strong&gt;Regular file&lt;/strong&gt; such as text files, images, and hidden files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;d&lt;/code&gt; – &lt;strong&gt;Directory&lt;/strong&gt;. These are the folders under consideration.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;l&lt;/code&gt; – &lt;strong&gt;Symbolic link&lt;/strong&gt;. Symbolic links point to files and are similar to shortcuts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;c&lt;/code&gt; – &lt;strong&gt;Character devices&lt;/strong&gt;. Files that are used to access character devices are called character device files. Drivers communicate with character devices by sending and receiving single characters (bytes, octets). Examples include keyboards, sound cards, and the mouse.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; – &lt;strong&gt;Block devices&lt;/strong&gt;. Files that are used to access block devices are called block device files. Drivers communicate with block devices by sending and receiving entire blocks of data. Examples include USB drives and CD-ROMs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-name&lt;/code&gt; is the name of the file that you want to search for.&lt;/li&gt;
&lt;/ul&gt;
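&lt;p&gt;Putting this together, here is a small, self-contained example (the file and directory names are illustrative):&lt;/p&gt;

```shell
# create a small example tree to search
mkdir -p logs/backup
touch logs/app.log logs/error.log

# find regular files under logs/ whose names end in .log
find logs -type f -name "*.log"

# find directories named "backup" under the current directory
find . -type d -name backup
```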

&lt;h2&gt;
  
  
  Basic Commands for Viewing Files
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Displaying file contents:&lt;/strong&gt; The &lt;code&gt;cat&lt;/code&gt; command in Linux is used to display the contents of a file. &lt;/p&gt;

&lt;p&gt;Here is the basic syntax of the cat command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat [options] [file]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to view the contents of a file named file.txt, you can use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display all the contents of the file on the terminal at once.&lt;/p&gt;
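&lt;p&gt;&lt;code&gt;cat&lt;/code&gt; also accepts options; for example, the widely supported &lt;code&gt;-n&lt;/code&gt; option numbers the output lines (the file name and contents here are just an example):&lt;/p&gt;

```shell
# create an example file, then print it with line numbers
printf 'first line\nsecond line\n' > file.txt
cat -n file.txt
```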

&lt;h3&gt;
  
  
  Viewing text files interactively using &lt;code&gt;less&lt;/code&gt; and &lt;code&gt;more&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;While &lt;code&gt;cat&lt;/code&gt; displays the entire file at once, &lt;code&gt;less&lt;/code&gt; and &lt;code&gt;more&lt;/code&gt; allow you to view the contents of a file interactively. This is useful when you want to scroll through a large file or search for specific content.&lt;/p&gt;

&lt;p&gt;The syntax of the &lt;code&gt;less&lt;/code&gt; command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;less [options] [file]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;more&lt;/code&gt; command is similar to &lt;code&gt;less&lt;/code&gt; but has fewer features. It is used to display the contents of a file one screen at a time.&lt;br&gt;
The syntax of the more command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;more [options] [file]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Essentials of Text Editing in Linux
&lt;/h2&gt;

&lt;p&gt;Editing text from the command line is one of the most crucial skills in Linux. In this section, you will learn how to use two popular text editors in Linux: Vim and Nano. They are safe choices to learn, as both are present on most Linux distributions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mastering Vim: Introductory Guide to Vim
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introduction to Vim
&lt;/h3&gt;

&lt;p&gt;Vim is a popular text-editing tool for the command line: it is powerful, customizable, and fast. Vim has two variations: &lt;strong&gt;Vim&lt;/strong&gt; (vim) and &lt;strong&gt;Vim tiny&lt;/strong&gt; (vi). Vim tiny is a smaller version that lacks some of Vim's features.&lt;/p&gt;

&lt;p&gt;Here are some reasons why you should consider learning Vim:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most servers are accessed via a CLI, so in system administration, you don't necessarily have the luxury of a GUI. But Vim will always be there.&lt;/li&gt;
&lt;li&gt;Vim uses a keyboard-centric approach: it is designed to be used without a mouse, which can significantly speed up editing once you have learned the keyboard shortcuts and can make it faster than GUI tools for many tasks.&lt;/li&gt;
&lt;li&gt;Vim is suitable for everyone, beginners and advanced users alike. Vim supports complex string searches, search highlighting, and much more. Through plugins, Vim provides extended capabilities to developers and system admins that include code completion, syntax highlighting, file management, version control, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The three Vim modes
&lt;/h2&gt;

&lt;p&gt;You need to know the three operating modes of Vim and how to switch between them, because keystrokes behave differently in each mode. The three modes are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Command mode.&lt;/li&gt;
&lt;li&gt;Edit mode.&lt;/li&gt;
&lt;li&gt;Visual mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Command Mode
&lt;/h3&gt;

&lt;p&gt;When you start Vim, you land in command mode by default. This mode allows you to access the other modes: to switch to another mode, you must be in command mode first.&lt;/p&gt;
&lt;h3&gt;
  
  
  Edit Mode
&lt;/h3&gt;

&lt;p&gt;This mode (also called insert mode) allows you to make changes to the file. To enter it, press &lt;code&gt;i&lt;/code&gt; while in command mode. &lt;/p&gt;
&lt;h3&gt;
  
  
  Visual mode
&lt;/h3&gt;

&lt;p&gt;This mode allows you to work on a single character, a block of text, or lines of text. Let's break it down into simple steps. Remember, use the below combinations when in command mode.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Shift + V&lt;/code&gt; → Select multiple lines.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Ctrl + V&lt;/code&gt; → Block mode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;v&lt;/code&gt; → Character mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Visual mode comes in handy when you need to copy and paste or edit lines in bulk.&lt;/p&gt;
&lt;h3&gt;
  
  
  Extended command mode
&lt;/h3&gt;

&lt;p&gt;The extended command mode, entered by pressing &lt;code&gt;:&lt;/code&gt; while in command mode, allows you to perform advanced operations like searching, setting line numbers, and highlighting text. We'll cover extended mode in the next section.&lt;/p&gt;
&lt;h2&gt;
  
  
  Shortcuts in Vim: Making Editing Faster
&lt;/h2&gt;

&lt;p&gt;Note: All these shortcuts work in the command mode only.&lt;/p&gt;
&lt;h3&gt;
  
  
  Basic Navigation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;h&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move left&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;j&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;k&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;l&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the beginning of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the end of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the beginning of the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;G&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the end of the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move half-page down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+u&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move half-page up&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Editing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;i&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enter insert mode before the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enter insert mode at the beginning of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enter insert mode after the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enter insert mode at the end of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open a new line below the current line and enter insert mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;O&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open a new line above the current line and enter insert mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete the character under the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete the current line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;yy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yank (copy) the current line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Paste below the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;P&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Paste above the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Searching and Replacing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search forward for a pattern, jumping to its next occurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search backward for a pattern, jumping to its previous occurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repeat the last search in the same direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;N&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repeat the last search in the opposite direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:%s/old/new/g&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Replace all occurrences of &lt;code&gt;old&lt;/code&gt; with &lt;code&gt;new&lt;/code&gt; in the file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Exiting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:w&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Save the file but don't exit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:q&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quit Vim (fails if there are unsaved changes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:wq&lt;/code&gt; or &lt;code&gt;:x&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Save and quit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:q!&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quit without saving&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Multiple Windows
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:split&lt;/code&gt; or &lt;code&gt;:sp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Split the window horizontally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:vsplit&lt;/code&gt; or &lt;code&gt;:vsp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Split the window vertically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+w followed by h/j/k/l&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Navigate between split windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Mastering Nano
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Getting started with Nano: The user-friendly text editor
&lt;/h3&gt;

&lt;p&gt;Nano is a user-friendly text editor that is easy to use and perfect for beginners. It is pre-installed on most Linux distributions.&lt;br&gt;
To start a new file in Nano, run the command by itself (the file is created when you save):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start editing an existing file with Nano, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano filename
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  List of key bindings in Nano
&lt;/h2&gt;

&lt;h3&gt;
  
  
  General
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+X&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exit Nano (prompting to save if changes are made)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+O&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Save the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+R&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read a file into the current file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+G&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Display the help text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Editing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cut the current line and store it in the cutbuffer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+U&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Paste the contents of the cutbuffer into the current line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alt+6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Copy the current line and store it in the cutbuffer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+J&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Justify the current paragraph&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Navigation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+A&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the beginning of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+E&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to the end of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+C&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Display the current line number and file information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+_ (Ctrl+Shift+-)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Go to a specific line (and optionally, column) number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+Y&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scroll up one page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+V&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scroll down one page&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Search and Replace
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+W&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search for a string (then Enter to search again)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alt+W&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repeat the last search but in the opposite direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+\&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search and replace&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Miscellaneous
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+T&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Invoke the spell checker, if available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+D&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete the character under the cursor (does not cut it)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl+L&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Refresh (redraw) the current screen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alt+U&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Undo the last operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alt+E&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Redo the last undone operation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This article introduced Linux from both a conceptual and practical perspective, covering its core components, common distributions, and different ways to access it. We explored essential command-line skills, including file system navigation, system commands, and text editing using Vim and Nano.&lt;/p&gt;

&lt;p&gt;For data engineers, Linux is a critical platform because most data systems and cloud infrastructures run on it. Mastery of Linux enables efficient automation, system management, troubleshooting, and deployment of data pipelines. As a result, Linux is not just a supporting skill, but a foundational requirement for working effectively in modern data engineering environments.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>An Introduction to Git: Concepts, Commands, and Workflows</title>
      <dc:creator>Edmund Eryuba</dc:creator>
      <pubDate>Sat, 17 Jan 2026 18:35:17 +0000</pubDate>
      <link>https://forem.com/edmund_eryuba/an-introduction-to-git-concepts-commands-and-workflows-1je3</link>
      <guid>https://forem.com/edmund_eryuba/an-introduction-to-git-concepts-commands-and-workflows-1je3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdio90ub4gq7v6ao12t2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdio90ub4gq7v6ao12t2.png" title="Basic Git Workflow" alt="Github workflow diagram" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Git?
&lt;/h2&gt;

&lt;p&gt;Git is a &lt;em&gt;distributed&lt;/em&gt; &lt;strong&gt;version control system&lt;/strong&gt; designed to track changes in source code and manage collaboration among developers. It enables individuals and teams to maintain a complete history of a project, revert to previous versions when needed, and work on the same codebase without conflicts. Because each developer has a full copy of the repository, Git offers high performance, reliability, and the ability to work offline. Git is the core technology behind popular platforms such as GitHub, GitLab, and Bitbucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Git Bash
&lt;/h2&gt;

&lt;p&gt;Git Bash is a &lt;strong&gt;command-line interface&lt;/strong&gt; that provides a Unix-like shell environment for using Git on Windows systems. It allows users to run Git commands in a terminal similar to those found on Linux and macOS, making cross-platform development easier. In addition to Git commands, Git Bash supports basic shell operations such as navigating directories, creating files, and managing folders, which are commonly used during development workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initializing and Managing a Repository
&lt;/h2&gt;

&lt;p&gt;To begin tracking a project with Git, a repository is initialized using the &lt;code&gt;git init&lt;/code&gt; command. This creates a hidden &lt;code&gt;.git&lt;/code&gt; directory that stores all version history and configuration data. Files are then added to the staging area with &lt;code&gt;git add .&lt;/code&gt;, which prepares changes for inclusion in a commit. A commit is created using &lt;code&gt;git commit -m "message"&lt;/code&gt;, capturing a snapshot of the staged changes along with a descriptive message. The &lt;code&gt;git status&lt;/code&gt; command is used regularly to view the current state of files in the repository.&lt;/p&gt;
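&lt;p&gt;The steps above can be sketched as a single session (the folder name, file name, identity values, and commit message are all examples):&lt;/p&gt;

```shell
mkdir demo-repo && cd demo-repo      # an example project folder
git init                             # creates the hidden .git directory

# commits need an author identity; set it once per repo (or globally)
git config user.email "you@example.com"
git config user.name "Your Name"

touch notes.txt                      # an example file to track
git add .                            # stage all changes
git commit -m "Initial commit"       # snapshot the staged changes
git status                           # view the current state of the repository
```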

&lt;h2&gt;
  
  
  Working with Remote Repositories
&lt;/h2&gt;

&lt;p&gt;Remote repositories enable collaboration by allowing developers to share code through platforms such as GitHub. The &lt;code&gt;git clone&lt;/code&gt; command creates a local copy of an existing remote repository. Once changes are made locally, they can be uploaded to the remote repository using &lt;code&gt;git push&lt;/code&gt;. To retrieve updates made by others, the &lt;code&gt;git pull&lt;/code&gt; command is used, which fetches and merges changes into the local branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Branching and Collaboration Basics
&lt;/h2&gt;

&lt;p&gt;Branching allows developers to work on new features or fixes independently without affecting the main codebase. The &lt;code&gt;git branch&lt;/code&gt; command lists available branches, while &lt;code&gt;git checkout&lt;/code&gt; or &lt;code&gt;git switch&lt;/code&gt; is used to move between them. After completing work on a branch, changes can be merged back into the main branch. Through these features, Git and Git Bash provide a structured and efficient approach to version control and team collaboration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started with the Git Workflow
&lt;/h2&gt;

&lt;p&gt;To get started, it's important to know the basics of how Git works. You may choose to do the actual work in a terminal, in an app like GitHub Desktop, or on GitHub.com. Below is the basic Git terminology, aligned with how it is used on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository (Repo)
&lt;/h2&gt;

&lt;p&gt;On GitHub, a repository is an online-hosted version of your project. It stores your code, commit history, issues, pull requests, and documentation. Most collaboration happens around the GitHub repository, which acts as the central reference point for all contributors. A local repository is connected to GitHub using a remote named &lt;code&gt;origin&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git remote add origin https://github.com/username/repository.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Commit
&lt;/h2&gt;

&lt;p&gt;A commit is a recorded change to the repository. On GitHub, commits are visible in the Commits tab, where collaborators can review what changed and who made the change. Each commit pushed to GitHub becomes part of the shared project history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git commit -m "Add login validation"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Branch
&lt;/h2&gt;

&lt;p&gt;In GitHub, branches are heavily used to isolate work. The default branch is usually &lt;code&gt;main&lt;/code&gt;, which represents production-ready code. Feature development and bug fixes are done in separate branches, which are later merged into &lt;code&gt;main&lt;/code&gt; through pull requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git branch feature-auth
git switch feature-auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Push
&lt;/h2&gt;

&lt;p&gt;Pushing sends your local commits to GitHub, making them visible to collaborators. Until you push, your commits exist only on your local machine. This step is essential for teamwork and backup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push origin feature-auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pull
&lt;/h2&gt;

&lt;p&gt;Pulling retrieves updates from GitHub and merges them into your local branch. This ensures you are working with the latest version of the project, especially when multiple contributors are involved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git pull origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pull Request (PR)
&lt;/h2&gt;

&lt;p&gt;A pull request is a GitHub feature used to propose merging one branch into another, usually into main. It allows team members to review code, leave comments, request changes, and approve work before it becomes part of the main codebase. Pull requests are central to GitHub-based collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Merge
&lt;/h2&gt;

&lt;p&gt;Merging on GitHub usually happens through a pull request rather than directly via the command line. Once approved, GitHub merges the feature branch into the target branch and records the merge in the project history.&lt;/p&gt;
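&lt;p&gt;GitHub's default "Create a merge commit" button corresponds roughly to a non-fast-forward merge on the command line. A self-contained sketch, with a hypothetical &lt;code&gt;feature-auth&lt;/code&gt; branch:&lt;/p&gt;

```shell
# --no-ff forces a merge commit even when a plain fast-forward is possible,
# which is what GitHub records when a pull request is merged
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo User"
git checkout -q -b main
echo "base" > f.txt
git add f.txt
git commit -q -m "Base commit"

git checkout -q -b feature-auth        # hypothetical feature branch
echo "feature" >> f.txt
git commit -q -am "Feature work"

git checkout -q main
git merge --no-ff -q feature-auth -m "Merge branch 'feature-auth'"
```

The resulting merge commit has two parents, so the history shows both that the feature branch existed and exactly when it landed on &lt;code&gt;main&lt;/code&gt;.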

&lt;h2&gt;
  
  
  Fork
&lt;/h2&gt;

&lt;p&gt;A fork is a GitHub-specific feature that creates a personal copy of someone else’s repository under your account. Forks are common in open-source projects where contributors do not have direct access to the main repository. Changes made in a fork are submitted back using a pull request.&lt;/p&gt;
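&lt;p&gt;The fork model involves three repositories: the original project (conventionally the &lt;code&gt;upstream&lt;/code&gt; remote), your fork (the &lt;code&gt;origin&lt;/code&gt; remote), and your local clone. A sketch that simulates all three with local bare repositories, since forking itself is a GitHub web action; the paths stand in for GitHub URLs:&lt;/p&gt;

```shell
# Simulating the fork model with three local repositories:
# "upstream" = the original project, "fork" = your copy, "local" = your clone
work=$(mktemp -d)
git init --bare "$work/upstream.git"

# Seed the original project with one commit
git clone -q "$work/upstream.git" "$work/seed" 2> /dev/null
( cd "$work/seed" \
  && git config user.email "demo@example.com" \
  && git config user.name "Demo User" \
  && echo "v1" > f.txt \
  && git add f.txt \
  && git commit -q -m "Base commit" \
  && git push -q -u origin HEAD )

# "Fork": a complete copy of the original that you control
git clone -q --bare "$work/upstream.git" "$work/fork.git"

# Clone your fork, then register the original project as a second remote
git clone -q "$work/fork.git" "$work/local"
cd "$work/local"
git remote add upstream "$work/upstream.git"
git fetch -q upstream                  # keep in sync with the original project
```

With &lt;code&gt;upstream&lt;/code&gt; registered, you can periodically fetch and merge its changes into your fork, and submit your own work back through a pull request on GitHub.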

&lt;h2&gt;
  
  
  Collaboration workflow
&lt;/h2&gt;

&lt;p&gt;A typical GitHub collaboration workflow involves cloning the repository, creating a branch, committing changes, pushing the branch to GitHub, opening a pull request, undergoing code review, and merging the changes. GitHub enhances Git by adding visibility, discussion, and project management features around this workflow.&lt;/p&gt;




&lt;p&gt;When using GitHub, beginners should always pull before starting work, create one branch per feature or fix, push changes frequently, use pull requests instead of direct merges, and write clear commit messages. This approach keeps the repository clean, traceable, and easy to collaborate on.&lt;/p&gt;

</description>
      <category>github</category>
      <category>git</category>
      <category>dataengineering</category>
      <category>githubactions</category>
    </item>
  </channel>
</rss>
