<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Mwai</title>
    <description>The latest articles on Forem by Michael Mwai (@michaelmwai).</description>
    <link>https://forem.com/michaelmwai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3732855%2Fc0861f82-1003-4d46-b181-c633d8c74dbe.jpg</url>
      <title>Forem: Michael Mwai</title>
      <link>https://forem.com/michaelmwai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michaelmwai"/>
    <language>en</language>
    <item>
      <title>Extract Transform Load vs Extract Load Transform (ETL vs ELT)</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:57:04 +0000</pubDate>
      <link>https://forem.com/michaelmwai/extract-transform-load-vs-extract-load-transform-etl-vs-elt-ohe</link>
      <guid>https://forem.com/michaelmwai/extract-transform-load-vs-extract-load-transform-etl-vs-elt-ohe</guid>
      <description>&lt;p&gt;A data pipeline is the set of tools, steps and processes that automates the movement of harvested data to a destination system, where the data is either analysed or stored. &lt;br&gt;
A pipeline can be broken down into three broad processes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Extraction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loading&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Extraction
&lt;/h2&gt;

&lt;p&gt;This is always the initial stage of any data pipeline. It is at this stage that data is gathered and harvested from one or more sources. There are numerous sources of data, and these include, but are not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Files - Files in formats such as CSV, Excel sheets and PDFs can contain raw data to be extracted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Web scraping - Data can be sourced by scraping one or more websites that hold the relevant data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Databases - Organisations store data in databases, and this stored data can serve as the raw input for a pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Transformation
&lt;/h2&gt;

&lt;p&gt;Raw data from the extraction stage is usually messy. Messy in this context can mean many things, depending on how the intended data is meant to look: the data may be in an unusable format (e.g. numerical data written out in words), may use units the user does not prefer (e.g. imperial instead of metric units), or may be full of errors (such as spelling mistakes), among other problems. It is at this stage that the data is manipulated and transformed into the state the user desires.&lt;br&gt;
Several processes can be performed in this stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data wrangling - The process of converting data from its raw form to a tidy format that has structure. It involves actions like converting names of countries to their abbreviations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data cleaning - The process of removing data that does not meet preset, user-defined rules, to ensure the data is accurate, consistent and relevant. An example would be removing records of people aged 60 and below if the scope of an analysis is the elderly. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data transformation - The process of converting raw data from one format and structure to another so that it is easy to use, integrate or analyze.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
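&lt;p&gt;As a minimal sketch, the three activities above can be illustrated in plain Python. The field names, rules and sample values below are invented for illustration:&lt;/p&gt;

```python
# Illustrative raw records; the names, ages and rules are assumptions.
raw = [
    {"name": "alice", "country": "United States", "age": 72, "height_in": 65},
    {"name": "bob", "country": "Kenya", "age": 45, "height_in": 70},
    {"name": "carol", "country": "Kenya", "age": 68, "height_in": 62},
]

# 1. Wrangling: impose structure, e.g. convert country names to abbreviations.
abbrev = {"United States": "US", "Kenya": "KE"}
for row in raw:
    row["country"] = abbrev.get(row["country"], row["country"])

# 2. Cleaning: keep only records that satisfy a preset user-defined rule
#    (here, an analysis scoped to people older than 60).
cleaned = [row for row in raw if row["age"] > 60]

# 3. Transformation: convert units, e.g. imperial inches to metric centimetres.
for row in cleaned:
    row["height_cm"] = round(row["height_in"] * 2.54, 1)

print(cleaned)
```

&lt;p&gt;Real pipelines would perform the same steps with dedicated tooling, but the nature of each step is the same.&lt;/p&gt;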

&lt;h2&gt;
  
  
  Loading
&lt;/h2&gt;

&lt;p&gt;This is the stage where data, whether transformed or still raw, is loaded into a system. The system can be for storage, as with data warehouses and databases, or for analysis, as with Power BI.&lt;/p&gt;

&lt;p&gt;It is the order of the transformation and loading stages after extraction that determines whether a pipeline is ETL or ELT. In ETL, transformation comes before loading; in ELT, loading precedes transformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt;: Extract → Transform → Load&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt;: Extract → Load → Transform&lt;/p&gt;
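&lt;p&gt;The difference in ordering can be sketched in a few lines of Python. The extract, transform and load functions here are stand-ins, not a real pipeline framework:&lt;/p&gt;

```python
def extract():
    # Stand-in for harvesting raw records from a source.
    return ["  raw value  ", "ANOTHER raw VALUE"]

def transform(records):
    # Stand-in for cleaning: trim whitespace and normalise case.
    return [r.strip().lower() for r in records]

warehouse = []  # stand-in destination system

def load(records):
    warehouse.extend(records)

# ETL: transform before the data reaches the destination.
warehouse.clear()
load(transform(extract()))
etl_result = list(warehouse)

# ELT: load the raw data first, transform inside the destination.
# The raw copy stays in the warehouse for future re-transformation.
warehouse.clear()
load(extract())
elt_result = transform(warehouse)

print(etl_result, elt_result)
```

&lt;p&gt;Both orderings end with the same transformed data; the ELT version simply keeps the raw copy in the destination, where it can be re-transformed later.&lt;/p&gt;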

&lt;h2&gt;
  
  
  ETL
&lt;/h2&gt;

&lt;p&gt;The ETL methodology has been around for many years and was long the standard way of operating a data pipeline. This approach was largely driven by the historically high price of storage. With storage a scarce resource, organizations could not afford to keep data they considered junk; only useful data could be stored, so data had to be wrangled, cleaned and transformed before storage. Tools that use this approach are substantial in number and have been refined over the years into reliable, robust products. Examples include Talend and Informatica. &lt;/p&gt;

&lt;h2&gt;
  
  
  ELT
&lt;/h2&gt;

&lt;p&gt;With the boom of the internet in the 2000s, the web became an avenue for providing services, and the amount of data handled by organizations grew exponentially. Data became even more important for some of these organizations because it could be used to derive insights that potentially increase revenue. Data quickly became the new gold, and organizations needed more of it. Luckily, by around 2010 storage had become so inexpensive that the advantage of discarding what was considered junk (but could be useful in future) was no longer compelling. With storage justifiably cheap, the remaining problem was scaling resources to handle the large amounts of data pouring in. We were in luck again, because the rise of the cloud solved this challenge: with cloud services, operations and resources like storage could be scaled easily, flexibly and efficiently. Consequently, organizations no longer needed to transform data beforehand, hence the adoption of ELT.&lt;/p&gt;

&lt;p&gt;So, when and in what situation does ETL make more sense than ELT and vice versa? To get there, we must understand the functional differences between ETL and ELT. The differences are as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw data is lost after transformation&lt;/td&gt;
&lt;td&gt;Raw data is always preserved for re-transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduces storage cost because raw data is not saved&lt;/td&gt;
&lt;td&gt;Higher storage costs because raw data takes more space in storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Older methodology and therefore has a lot of robust tools&lt;/td&gt;
&lt;td&gt;New methodology and therefore needs powerful destination systems that can handle transformation - which are few in the market&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slower iteration because re-runs require full re-extraction&lt;/td&gt;
&lt;td&gt;Allows iterating on transforms without re-ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stronger compliance by filtering sensitive data early&lt;/td&gt;
&lt;td&gt;Sensitive data reaches the warehouse unfiltered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes break pipelines easily&lt;/td&gt;
&lt;td&gt;Schema changes rarely break ingestion because raw data is loaded as-is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does not scale with cloud warehouse compute; external processing layers hit capacity ceilings as data volumes grow&lt;/td&gt;
&lt;td&gt;Scales easily with cloud warehouse compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Use cases for ETL and ELT
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;When working with regulated sensitive data, e.g. financial information or HIPAA-regulated health records. Sensitive fields must be masked or dropped before the data enters storage&lt;/td&gt;
&lt;td&gt;When you need to preserve raw data for auditing, debugging, or future re-modelling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When the cost of storing raw data outweighs the benefit of preserving it&lt;/td&gt;
&lt;td&gt;When you want analysts and data teams to iterate on business logic without re-ingesting data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When your destination is a legacy relational database with limited compute (MySQL, PostgreSQL, on-premise data warehouses)&lt;/td&gt;
&lt;td&gt;When you are on a modern cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks) with scalable compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When data volume is modest and transformations are stable, well understood, and unlikely to change often&lt;/td&gt;
&lt;td&gt;When data volumes are large and growing, and you favour ingestion speed over transformation speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The choice between ETL and ELT is not straightforward. It is a function of organizational needs, cost, the SOPs (standard operating procedures) of an organization or institution, the regulations governing the data handled, the destination of the data, and more. The merits and demerits of each approach have to be weighed, and the one with the most benefits to the user or organization is picked. Organizations have also adapted by switching between ELT and ETL when conditions necessitate it, ensuring flexibility and compliance in their operations.&lt;br&gt;
It is worth noting that the modern approach is ELT: the industry has shifted decisively toward it for most modern data workloads, since these are implemented on the cloud. ETL, however, will be around for a long time because it has its unique use cases, and there is room for both paradigms to coexist.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Connect PowerBI to a local and cloud-hosted postgreSQL database</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Sat, 21 Mar 2026 13:27:33 +0000</pubDate>
      <link>https://forem.com/michaelmwai/how-to-connect-powerbi-to-a-local-and-cloud-hosted-postgresql-database-38e</link>
      <guid>https://forem.com/michaelmwai/how-to-connect-powerbi-to-a-local-and-cloud-hosted-postgresql-database-38e</guid>
      <description>&lt;p&gt;Power BI is a business analytics platform developed and maintained by Microsoft. It has gained traction for its simplicity and has increasingly become the go-to software for business analytics at many small, medium and large businesses. Many, if not most, of these businesses have their data stored in databases, because databases ensure the data is secure, has structure, and can be stored and retrieved at any time. Analytics platforms like Power BI therefore have to provide means through which analysts can access the company data housed in those databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting PowerBI to a local postgres database
&lt;/h2&gt;

&lt;p&gt;The following steps outline how to connect PowerBI to a local postgres database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Launch PowerBI on your computer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a Blank report and from the Home tab click on '&lt;strong&gt;Get Data&lt;/strong&gt;'.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;More...&lt;/strong&gt; option on the Common Data Sources Wizard that appears&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r607bm8t4hmohnj0tqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r607bm8t4hmohnj0tqz.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the &lt;strong&gt;Get Data&lt;/strong&gt; wizard box scroll down and locate &lt;strong&gt;PostgreSQL database&lt;/strong&gt; option and select it. Click on the green &lt;strong&gt;Connect&lt;/strong&gt; button at the bottom of the wizard.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30lmsn5t80ra0977vqo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30lmsn5t80ra0977vqo2.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The PostgreSQL database wizard appears with &lt;strong&gt;server&lt;/strong&gt; and &lt;strong&gt;database&lt;/strong&gt; fields to be filled. Since the postgres database is locally hosted, fill the server field with &lt;strong&gt;localhost&lt;/strong&gt;. You should also supply the &lt;strong&gt;port number&lt;/strong&gt; for this connection; the &lt;strong&gt;server name&lt;/strong&gt; and &lt;strong&gt;port number&lt;/strong&gt; are separated by a &lt;strong&gt;colon (:)&lt;/strong&gt;. By convention, the port number for PostgreSQL is 5432, and it is the default if no port number is supplied. If 5432 is unavailable, it is standard to use 5433. The server field will therefore be filled with &lt;strong&gt;localhost:5432&lt;/strong&gt;&lt;br&gt;
Proceed to fill the Database field with the name of the database you want to connect to.&lt;br&gt;
Proceed to choose the &lt;strong&gt;Data Connectivity Mode&lt;/strong&gt;. &lt;strong&gt;Import Mode&lt;/strong&gt; should be your go-to connectivity mode if you want to do full-scale data analysis, since it imports the data tables into Power BI, allowing you to perform EDA on the data.&lt;br&gt;
Click &lt;strong&gt;OK&lt;/strong&gt; to connect. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vyubf5y1g306xvjrhfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vyubf5y1g306xvjrhfe.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;Navigator&lt;/strong&gt; wizard box appears upon successful connection, showing your database on the left pane with the tables it holds under it. Select the tables you wish to use by checking the box next to each. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34kfvu14289gg575pcks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34kfvu14289gg575pcks.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Transform&lt;/strong&gt; if the data in the tables needs any transformation before use, otherwise, click &lt;strong&gt;Load&lt;/strong&gt; to load all tables into your model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
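&lt;p&gt;Before typing &lt;strong&gt;localhost:5432&lt;/strong&gt; into Power BI, it can save time to confirm that the postgres server is actually listening. A minimal Python sketch (the host and port mirror the values used above; adjust them to your setup):&lt;/p&gt;

```python
import socket

# Values mirror the tutorial above; change them to match your setup.
host, port = "localhost", 5432

def postgres_reachable(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

server_field = f"{host}:{port}"  # the value to type into Power BI's Server field
print(server_field, "reachable:", postgres_reachable(host, port))
```

&lt;p&gt;If this prints False, fix the database service before troubleshooting Power BI itself.&lt;/p&gt;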

&lt;h2&gt;
  
  
  Connecting to a Postgres Database Hosted on the Cloud
&lt;/h2&gt;

&lt;p&gt;For this demonstration, I will show how to connect to a postgres database hosted on Aiven. Aiven is an open source data platform that offers a variety of services, including databases like MySQL and PostgreSQL.&lt;br&gt;
Ensure you have set up your postgres database on Aiven before taking the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to your Aiven account&lt;/li&gt;
&lt;li&gt;After logging in, on the menu on the left, select &lt;strong&gt;services&lt;/strong&gt; to show the services under your account.&lt;/li&gt;
&lt;li&gt;Locate the postgres service and ensure it is running. If not, click the &lt;strong&gt;Action&lt;/strong&gt; (&lt;strong&gt;'...'&lt;/strong&gt;) button at the far right of the service to restart it. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotw6qpu0m4763oiwq7nr.png" alt=" " width="800" height="449"&gt;
&lt;/li&gt;
&lt;li&gt;Click on your postgres service to view its connection details, such as host, database name, port number, user and password.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttjixcrzwoqd2ljw9m0t.png" alt=" " width="800" height="449"&gt;
&lt;/li&gt;
&lt;li&gt;Open a blank Power BI report, click &lt;strong&gt;Get Data&lt;/strong&gt; from the &lt;strong&gt;Home&lt;/strong&gt; tab, then the &lt;strong&gt;More...&lt;/strong&gt; option, then &lt;strong&gt;PostgreSQL database&lt;/strong&gt;, and click Connect. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09rvm4612ramrqvgefkk.png" alt=" " width="800" height="449"&gt; &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijnzyhtdj8r5gsxjurte.png" alt=" " width="800" height="449"&gt;
&lt;/li&gt;
&lt;li&gt;Populate the &lt;strong&gt;Server&lt;/strong&gt; and &lt;strong&gt;Database&lt;/strong&gt; fields with the information from the Connection details from Aiven. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgrzn392f3ka3d1k8wjs.png" alt=" " width="800" height="449"&gt;
Click &lt;strong&gt;OK&lt;/strong&gt;, provide the &lt;strong&gt;Username&lt;/strong&gt; and &lt;strong&gt;Password&lt;/strong&gt;, and click Connect.&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyfbt4qbptdyajiqwc36.png" alt=" " width="800" height="449"&gt;
&lt;/li&gt;
&lt;li&gt;Select the tables to add to your Data Model and click &lt;strong&gt;Load&lt;/strong&gt; to load the tables or &lt;strong&gt;Transform&lt;/strong&gt; to transform the data before loading for use.&lt;/li&gt;
&lt;/ol&gt;
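&lt;p&gt;Aiven also exposes the connection details as a single service URI. As a rough sketch, that URI can be split into the values Power BI asks for; the URI below is a made-up example, not real credentials, and real Aiven hostnames and ports will differ:&lt;/p&gt;

```python
from urllib.parse import urlsplit

# A made-up Aiven-style service URI; not real credentials.
uri = "postgres://avnadmin:secret@pg-demo-project.aivencloud.com:12345/defaultdb?sslmode=require"
parts = urlsplit(uri)

server_field = f"{parts.hostname}:{parts.port}"  # goes into Power BI's Server field
database_field = parts.path.lstrip("/")          # goes into Power BI's Database field
username, password = parts.username, parts.password

print(server_field, database_field, username)
```

&lt;p&gt;The username and password from the URI are the credentials you supply when Power BI prompts for database authentication.&lt;/p&gt;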

</description>
      <category>database</category>
      <category>microsoft</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Mon, 16 Mar 2026 16:53:31 +0000</pubDate>
      <link>https://forem.com/michaelmwai/-3f6</link>
      <guid>https://forem.com/michaelmwai/-3f6</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/the_nortern_dev" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3630167%2F2e206d7e-04d3-484b-8a73-1f98d17a0e1a.png" alt="the_nortern_dev"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/the_nortern_dev/i-think-a-lot-of-developers-are-quietly-grieving-the-old-internet-3d8" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;I Think a Lot of Developers Are Quietly Grieving the Old Internet&lt;/h2&gt;
      &lt;h3&gt;NorthernDev ・ Mar 16&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#discuss&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#culture&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>discuss</category>
      <category>webdev</category>
      <category>programming</category>
      <category>culture</category>
    </item>
    <item>
      <title>How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Mon, 09 Feb 2026 06:40:34 +0000</pubDate>
      <link>https://forem.com/michaelmwai/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-2lkb</link>
      <guid>https://forem.com/michaelmwai/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-2lkb</guid>
      <description>&lt;p&gt;With the importance of data in business being realized, the need and demand for adequate data has been rising exponentially. Businesses will try to gather the data they need directly from the source (the people) to increase its accuracy, but can also get such data from third parties. The required information reaches businesses in many forms, and part of a data analyst's job is making sure that information is organized in a way that's conducive to analysis. Power BI is one of the go-to software tools used to handle business data for analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Data
&lt;/h3&gt;

&lt;p&gt;Power BI offers analysts a variety of data connectors, making it easy to pull and merge data from diverse sources like SQL databases, Excel workbooks, CSV files, and even PDFs. This makes it easy to source and combine data from different sources without leaving Power BI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Cleaning and Transformation
&lt;/h3&gt;

&lt;p&gt;Power BI uses Power Query to transform data that is already loaded or about to be loaded. Power Query gives the business analyst the power to transform raw data into a desirable format that is usable in the analysis stage. These transformations can include actions like removing duplicate data, adding or replacing missing values, correcting existing data, and removing outliers. Parts of the data can also be changed into a desirable structure, e.g. changing the date format from MM-DD-YYYY to DD-MM-YYYY.&lt;/p&gt;
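&lt;p&gt;As a small illustration of one transformation mentioned above, here is how the MM-DD-YYYY to DD-MM-YYYY change could be sketched in Python (the sample dates are invented); Power Query performs the equivalent step through its own interface:&lt;/p&gt;

```python
from datetime import datetime

# Invented sample dates in MM-DD-YYYY form.
dates = ["02-09-2026", "12-31-2025"]

# Parse with the old pattern, then re-serialize with the new one.
reformatted = [
    datetime.strptime(d, "%m-%d-%Y").strftime("%d-%m-%Y") for d in dates
]
print(reformatted)  # ['09-02-2026', '31-12-2025']
```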

&lt;h3&gt;
  
  
  Data Modelling
&lt;/h3&gt;

&lt;p&gt;This stage allows an analyst to set the general layout of the business data, allowing tables to be organized as a star schema or snowflake schema. The various tables are identified as fact or dimension tables.&lt;br&gt;
The relationships between the tables are also managed in this section, ensuring data across the multiple tables is accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  DAX (Data Analysis Expressions)
&lt;/h3&gt;

&lt;p&gt;The DAX language was created specifically for working with data models through formulas and expressions. It is these expressions, formulas and functions that perform analysis operations on the data: aggregation, logical analysis, time and date analysis, statistical analysis, filter functions and financial calculations are all used to evaluate the data. &lt;br&gt;
DAX uses different operators whose evaluation order is determined by a predetermined precedence. The operators, from highest precedence to lowest, are: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;() - Parentheses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FN() - Scalar functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IN - Inclusive OR list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;^ - Exponentiation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sign - unary plus/minus, e.g. -1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;*, / - Multiplication/division&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;+, - - Addition/subtraction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;amp; - Text concatenation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;=, ==, &amp;lt;&amp;gt;, &amp;lt;, &amp;gt;, &amp;lt;=, &amp;gt;= - Comparison operators&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NOT - Logical negation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;amp;&amp;amp; - Logical AND&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;|| - Logical OR&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Visualizations
&lt;/h3&gt;

&lt;p&gt;It is easier for people to comprehend data presented graphically than numerically. For this reason, Power BI offers an assortment of data visualization tools, like charts, graphs, cards and filters, that are used to summarize the insights drawn from the data. Different types of graphs serve different roles; for example, a line chart shows a trend while a bar graph compares categorical data. &lt;br&gt;
The final visualization is the dashboard, which shows the most important data summaries in the form of charts, graphs and KPIs to inform the decision-making process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reporting
&lt;/h3&gt;

&lt;p&gt;Power BI allows analysts to generate and publish reports that detail their findings and are shareable with the decision makers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Power BI is a powerful end-to-end analytics tool that allows an analyst to get data, clean and transform it, analyse it, visualize insights and publish a report, all within the same application.&lt;br&gt;
Well-communicated, data-backed insights empower decision-making teams to move away from guesswork and take decisive action based on empirical findings.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>data</category>
      <category>datascience</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Data Modelling and Scheming in PowerBI</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Mon, 02 Feb 2026 11:14:06 +0000</pubDate>
      <link>https://forem.com/michaelmwai/data-modelling-and-scheming-in-powerbi-1ma4</link>
      <guid>https://forem.com/michaelmwai/data-modelling-and-scheming-in-powerbi-1ma4</guid>
      <description>&lt;p&gt;Power BI is a data analytics and data visualization tool used to generate insights that inform business decisions. Since the data is the source of these insights, it has to be accurate, consistent and relevant in order to generate quality results usable in the decision-making process. Data modelling and data scheming are concepts used to ensure data quality is maintained, allowing analysis and visualization to be done quickly and efficiently. Data modelling is the process of structuring your data in tables, defining the relationships between the tables, and defining calculations done on the data in the tables in the form of expressions and formulas. Data scheming, on the other hand, is the process of coming up with an architectural design pattern for your data tables, to ensure analysis of data sourced from the different tables is seamless and efficient.&lt;/p&gt;

&lt;p&gt;A robust data model by current standards has two types of tables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A fact table&lt;/li&gt;
&lt;li&gt;A dimension table&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fact table is the primary and central table in a data model. It is used to store quantitative, measurable data like transactions and orders. &lt;br&gt;
A dimension table holds descriptive information about the data, giving it more context. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Relationships&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Relationships are conceptual ideas that indicate how data in different tables is related or connected. This is achieved through the use of a primary and a foreign key: the primary key lives in a dimension table, while the foreign key lives in the fact table. &lt;br&gt;
There are several types of relationships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One-to-Many: This is the most common relationship, where one record in the dimension table relates to multiple records in the fact table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One-to-One: Where one record in the dimension table relates to one record in the fact table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Many-to-Many: Where more than one record in a table is related to more than one record in another table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
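&lt;p&gt;A one-to-many relationship can be sketched in plain Python to show how a key connects the two kinds of tables. The table contents below are invented for the example:&lt;/p&gt;

```python
# Invented sample tables for the illustration.
dim_products = [  # dimension table: primary key is product_id
    {"product_id": 1, "name": "Laptop"},
    {"product_id": 2, "name": "Phone"},
]
fact_orders = [  # fact table: foreign key is product_id
    {"order_id": 101, "product_id": 1, "amount": 1200},
    {"order_id": 102, "product_id": 1, "amount": 1150},
    {"order_id": 103, "product_id": 2, "amount": 800},
]

# Resolve each order's product name through the relationship.
by_id = {p["product_id"]: p for p in dim_products}
joined = [
    {**order, "product": by_id[order["product_id"]]["name"]}
    for order in fact_orders
]

# One-to-many: the single "Laptop" dimension row relates to two fact rows.
laptop_orders = [o for o in joined if o["product"] == "Laptop"]
print(len(laptop_orders))
```

&lt;p&gt;Power BI's model view performs this resolution automatically once the relationship between the key columns is defined.&lt;/p&gt;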

&lt;h2&gt;
  
  
  &lt;em&gt;Schemas&lt;/em&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Star Schema&lt;br&gt;
This schema has a central table serving as the fact table, with the dimension tables radiating from it. This organization forms a star-like shape, hence the name star schema. &lt;br&gt;
The star schema’s strength lies in its simplicity and performance. Performance is helped by the few joins needed to aggregate data from multiple tables. For these reasons, the star schema is the most widely used and recommended schema in Power BI.   &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Snowflake Schema&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The snowflake schema extends the star schema by introducing more dimension tables. Instead of a single product dimension, you can break it into separate tables for products, categories, and subcategories. &lt;br&gt;
Although this reduces data redundancy and mirrors traditional database design principles, it comes with trade-offs in Power BI: queries are slower because the engine must navigate additional relationship chains and perform more joins. It is also harder to manage, since it complicates DAX calculations and makes models more difficult for business users to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Why Modeling Quality Matters&lt;/em&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Performance: Poor data modelling not only slows down reports but can also produce incorrect results, thereby reducing productivity. On the other hand, a well-designed model minimizes the number of joins required for calculations, significantly speeding up report generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accuracy and Reliability: A well-structured data model can also make automation a seamless process: well-defined relationships allow Power BI to efficiently handle daily, hourly, or scheduled data refreshes, preventing the "timeout errors" common in un-modeled, flat, or chaotic datasets. Clear relationships and distinct fact/dimension tables also prevent calculation errors and ensure that filters apply correctly. Ambiguous relationships or a flat, denormalized table can lead to incorrect aggregations and misleading insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Usability and Maintainability: A logical and intuitive data model is easier for report developers to understand and build upon. It simplifies DAX calculations and reduces the likelihood of introducing errors. It also makes the model easier to maintain and extend as business requirements evolve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: A well-designed schema can handle growing data volumes without a proportional drop in performance. This is vital as organizations collect more and more data over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, poor data modelling and schema design can reduce productivity through slow calculations, inaccurate results and delayed reports.&lt;br&gt;&lt;br&gt;
Since data is business specific, it is prudent to organize data tables into clear relationships that mirror the actual operations of the business, improving conciseness and reducing ambiguity.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Introduction to Linux for Data Engineers, Including Practical Use of Vi and Nano with Examples</title>
      <dc:creator>Michael Mwai</dc:creator>
      <pubDate>Mon, 26 Jan 2026 12:12:46 +0000</pubDate>
      <link>https://forem.com/michaelmwai/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-1f25</link>
      <guid>https://forem.com/michaelmwai/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-1f25</guid>
      <description>&lt;p&gt;It is an undeniable fact that linux powers the biggest portion of the world’s computing infrastructure; and there is enough statistical data to back this up. This is attributed to linux’s reputation of being stable and secure.  In this era where data is the new gold mine, a lot of this data flows through these systems that are powered by linux operating system. It is therefore paramount for any data engineer to have a good grasp of linux OS, its concepts and its most basic commands to ensure they are productive while working with these systems. These basic commands include but not limited to:&lt;br&gt;
ls  - List files and folders in a directory&lt;br&gt;
cd – Change directry&lt;br&gt;
pwd – Show the current working directory&lt;br&gt;
man – Used to look up any command in the manual&lt;br&gt;
mv – Used to move or rename files&lt;br&gt;
cp – Used to make a copy of a file.&lt;br&gt;
chmod – Used to change the file permissions&lt;br&gt;
chown – Used to change the ownership of a file&lt;/p&gt;
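&lt;p&gt;As a quick illustration, a few of these commands can be driven from a Python script, assuming a POSIX system with ls, cp, mv and chmod on the PATH (the file names are made up for the example):&lt;/p&gt;

```python
import os
import subprocess
import tempfile

# Work in a throwaway directory so nothing on the system is touched.
work = tempfile.mkdtemp()
with open(os.path.join(work, "notes.txt"), "w") as f:
    f.write("hello\n")

# ls: list the files in the directory.
out = subprocess.run(["ls", work], capture_output=True, text=True).stdout
print(out.strip())  # notes.txt

# cp makes a copy; mv then renames that copy.
subprocess.run(["cp", f"{work}/notes.txt", f"{work}/backup.txt"], check=True)
subprocess.run(["mv", f"{work}/backup.txt", f"{work}/archive.txt"], check=True)

# chmod 444: make the archived copy read-only for everyone.
subprocess.run(["chmod", "444", f"{work}/archive.txt"], check=True)
print(sorted(os.listdir(work)))  # ['archive.txt', 'notes.txt']
```

&lt;p&gt;In day-to-day work you would type these commands straight into the shell; wrapping them in a script only becomes useful once a pipeline has to repeat them.&lt;/p&gt;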

&lt;p&gt;It is important to be aware of the flags/options that go with these commands for the purpose of allowing the commands to behave differently or to enable additional functionality.&lt;/p&gt;

&lt;p&gt;Linux Text Editors&lt;/p&gt;

&lt;p&gt;a. nano&lt;br&gt;
nano is a simple, beginner-friendly text editor.&lt;br&gt;
To open a file in nano, type nano followed by the filename and press Enter.&lt;br&gt;
Useful commands in nano are:&lt;br&gt;
Ctrl + C – Show cursor position in the editor&lt;br&gt;
Ctrl + X – Exit the editor&lt;br&gt;
Ctrl + G – Display the help screen&lt;br&gt;
Ctrl + O – Write the buffer to a file&lt;/p&gt;

&lt;p&gt;E.g.&lt;br&gt;
nano configs.txt → Opens the file configs.txt if it exists. If it doesn’t exist, nano creates it and opens it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk3kulrlcltkdw9kpjjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk3kulrlcltkdw9kpjjp.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom6566lflsvtj8rhn0mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom6566lflsvtj8rhn0mi.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. Vi&lt;br&gt;
Vi provides 3 modes, and keys perform different actions depending on which mode the editor is in. The modes are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Command mode
It is the default mode. In this mode, every keystroke is interpreted as an editor command that can modify the file’s contents.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioqs2lpsass9uvks1juo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioqs2lpsass9uvks1juo.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sj5vo80l0xocreiqe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sj5vo80l0xocreiqe3.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Insert mode
You press “i” to switch to insert mode from command mode; an “INSERT” indicator appears at the bottom of the screen.
Insert mode is used to enter (insert) text into a file.
To exit insert mode, press the Esc key to return to command mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxir5ycjgeq6no12m1bbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxir5ycjgeq6no12m1bbz.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Line mode
You type : to switch to line mode from command mode.
In this mode, you enter commands that perform actions like writing the file contents to disk or exiting the editor.
To leave line mode, press the Esc key, or press Enter to execute the command and return to command mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2dopluij15yynglminl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2dopluij15yynglminl.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Linux mastery is a must-have skill for a data engineer because Linux is the foundational operating system for the majority of data infrastructure, cloud computing platforms, and production environments.&lt;br&gt;
Text editors like vi and nano are essential utilities in Linux for tasks such as creating or modifying system configuration files, writing scripts and developing source code, hence the need to know how to use and navigate them.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>cli</category>
      <category>dataengineering</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
