<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: PETER AMORO</title>
    <description>The latest articles on Forem by PETER AMORO (@peter_amoro_fccf1d029a084).</description>
    <link>https://forem.com/peter_amoro_fccf1d029a084</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3710977%2F3bfe556e-e994-4625-89ec-67c0e364756e.png</url>
      <title>Forem: PETER AMORO</title>
      <link>https://forem.com/peter_amoro_fccf1d029a084</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/peter_amoro_fccf1d029a084"/>
    <language>en</language>
    <item>
      <title>ETL Vs ELT</title>
      <dc:creator>PETER AMORO</dc:creator>
      <pubDate>Mon, 13 Apr 2026 07:22:41 +0000</pubDate>
      <link>https://forem.com/peter_amoro_fccf1d029a084/etl-vs-elt-2hn6</link>
      <guid>https://forem.com/peter_amoro_fccf1d029a084/etl-vs-elt-2hn6</guid>
      <description>&lt;p&gt;&lt;strong&gt;ABSTRACT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data engineering has existed in some form since companies first started dealing with data to produce predictive analysis, descriptive analytics, and reports that give meaningful insight. This article examines the process of sourcing data (extraction), the transformation of data into a structured, cleaned form, and the loading of data into a destination such as a database or data warehouse. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will take you through the two main processes of the data engineering lifecycle, &lt;em&gt;ETL&lt;/em&gt; and &lt;em&gt;ELT&lt;/em&gt;, and the steps involved in each of them. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;ETL (EXTRACTION, TRANSFORMATION, LOADING)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bmhbjqyop18uqd6zs1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bmhbjqyop18uqd6zs1f.png" alt="Image showing extraction, loading and transformation" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Extraction&lt;/em&gt;: This involves getting data from various sources like APIs, IoT devices (IoT short for Internet of Things — these are devices that collect data through sensors, like the fingerprint scanners on doors that collect fingerprints and store them in a central system), databases, flat files like CSVs, and web scraping. The goal of extraction is simply to pull raw data from wherever it lives and bring it into a staging area — think of it like gathering all your ingredients from the fridge before you start cooking. The data at this stage is messy, unorganized, and not yet ready to be used.&lt;br&gt;
&lt;em&gt;Transformation&lt;/em&gt;: This is the most critical step in ETL. Once the raw data has been extracted, it is processed and cleaned before being loaded into the destination. Transformation can include:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cleaning&lt;/em&gt; — removing duplicates, fixing typos, handling missing values&lt;br&gt;
&lt;em&gt;Formatting&lt;/em&gt; — converting date formats, standardizing column names, changing data types&lt;br&gt;
&lt;em&gt;Filtering&lt;/em&gt; — removing data that is not relevant to the business use case&lt;br&gt;
&lt;em&gt;Aggregating&lt;/em&gt; — summarizing data, for example calculating total sales per month&lt;br&gt;
&lt;em&gt;Joining&lt;/em&gt; — combining data from multiple sources into one unified dataset&lt;/p&gt;

&lt;p&gt;This transformation happens in an intermediate processing layer or staging environment, outside the final destination database.&lt;br&gt;
&lt;em&gt;Loading&lt;/em&gt;: After the data has been cleaned and transformed, it is loaded into the destination — usually a data warehouse like Amazon Redshift, Google BigQuery, or Snowflake. Because the data is already clean and structured at this point, it is immediately ready for analysts and business intelligence tools to query and use.&lt;/p&gt;
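&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the sample rows, table name, and cleaning rules are made up for the example.&lt;/p&gt;

```python
import csv, io, sqlite3

# Extract: pull raw rows from a source (an in-memory CSV standing in for a flat file).
raw = "name,salary\nAlice,50000\nalice,50000\nBob,\nCarol,70000\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean in a staging area BEFORE loading -- normalize names,
# drop rows with missing salaries, and remove duplicates.
staged = {}
for r in rows:
    name = r["name"].strip().title()
    if r["salary"]:
        staged[name] = int(r["salary"])  # the dict key de-duplicates by name

# Load: write the already-clean data into the destination warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
warehouse.executemany("INSERT INTO employees VALUES (?, ?)", staged.items())

print(warehouse.execute("SELECT COUNT(*) FROM employees").fetchone()[0])  # 2 clean rows
```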

&lt;p&gt;&lt;strong&gt;ELT (EXTRACTION, LOADING, TRANSFORMATION)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9kdhkimgnw6epzkazap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9kdhkimgnw6epzkazap.png" alt="Image showing extraction, transformation and loading" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ELT follows the same three steps as ETL but in a different order — the transformation happens after the data has already been loaded into the destination.&lt;br&gt;
&lt;em&gt;Extraction&lt;/em&gt;: Same as in ETL — raw data is pulled from various sources such as APIs, databases, IoT devices, and flat files.&lt;br&gt;
&lt;em&gt;Loading&lt;/em&gt;: The raw, unprocessed data is loaded directly into the destination — typically a modern cloud data warehouse. These modern warehouses are powerful enough to store and process huge volumes of raw data without needing it to be cleaned first.&lt;br&gt;
&lt;em&gt;Transformation&lt;/em&gt;: Once the raw data is sitting inside the data warehouse, transformations are performed there using tools like dbt (data build tool). This approach takes advantage of the processing power of the warehouse itself rather than relying on a separate transformation layer.&lt;/p&gt;
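&lt;p&gt;The reversed order can be sketched the same way in Python. In this minimal illustration (the table and column names are invented), the raw records land first and the cleaning SQL runs inside the warehouse, which is the layer a tool like dbt would manage.&lt;/p&gt;

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw, uncleaned records go straight into a landing table.
warehouse.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("nairobi", "100"), ("Nairobi", "250"), ("mombasa", None), ("Mombasa", "80")],
)

# Transform: run SQL inside the warehouse itself, producing a clean,
# aggregated model from the raw table.
warehouse.execute("""
    CREATE TABLE sales_by_region AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS INTEGER)) AS total
    FROM raw_sales
    WHERE amount IS NOT NULL
    GROUP BY UPPER(region)
""")

for row in warehouse.execute("SELECT region, total FROM sales_by_region ORDER BY region"):
    print(row)  # ('MOMBASA', 80) then ('NAIROBI', 350)
```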

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Which One Should You Use?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The answer depends on your use case:&lt;/p&gt;

&lt;p&gt;Use ETL when you are dealing with sensitive data that must be cleaned and masked before it enters your system, or when working with older, on-premise databases that cannot handle raw data at scale.&lt;br&gt;
Use ELT when you are working with large volumes of data, using a modern cloud data warehouse, and need flexibility to transform data in multiple ways for different teams.&lt;/p&gt;

&lt;p&gt;In modern data engineering, ELT has become the more popular approach because cloud warehouses have become powerful and affordable enough to handle transformations internally, making the process faster and more flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;CONCLUSION&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Both ETL and ELT are fundamental processes in the data engineering lifecycle. They both serve the same ultimate goal — getting clean, reliable data into the hands of the people who need it. The key difference lies in where and when the transformation happens. As a beginner in data engineering, understanding these two processes gives you a solid foundation for working with data pipelines, warehouses, and analytics workflows.&lt;/p&gt;

</description>
      <category>etl</category>
      <category>elt</category>
    </item>
    <item>
      <title>--- title: Understanding SQL Joins and Window Functions published: false tags: sql, database, programming, beginners ---</title>
      <dc:creator>PETER AMORO</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:19:24 +0000</pubDate>
      <link>https://forem.com/peter_amoro_fccf1d029a084/-title-understanding-sql-joins-and-window-functions-published-false-tags-sql-database-3d7a</link>
      <guid>https://forem.com/peter_amoro_fccf1d029a084/-title-understanding-sql-joins-and-window-functions-published-false-tags-sql-database-3d7a</guid>
      <description>&lt;h1&gt;
  
  
  Understanding SQL Joins and Window Functions
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;Relational databases store data across multiple tables to maintain organization and reduce redundancy. Retrieving meaningful insights from these tables often requires combining data and performing analytical operations. Two powerful SQL features that help accomplish this are &lt;strong&gt;joins&lt;/strong&gt; and &lt;strong&gt;window functions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Joins allow developers to combine data from different tables based on relationships between columns. Window functions, on the other hand, enable advanced calculations across sets of rows without collapsing the dataset into grouped results. This article introduces the key concepts behind SQL joins and window functions and demonstrates how they can be used in real-world queries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Structured Query Language (SQL) is the standard language used to interact with relational databases. In most real-world systems, data is not stored in a single table but distributed across multiple related tables.&lt;/p&gt;

&lt;p&gt;For example, employee information might exist in one table while department information exists in another. To analyze this data effectively, developers must combine related tables and perform calculations across multiple rows.&lt;/p&gt;

&lt;p&gt;SQL provides two powerful mechanisms for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Joins&lt;/strong&gt;, which allow data from multiple tables to be combined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window functions&lt;/strong&gt;, which allow analytical calculations while keeping the original dataset intact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these features is essential for writing efficient SQL queries and performing advanced data analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  SQL Joins
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;join&lt;/strong&gt; is used to combine rows from two or more tables based on a related column.&lt;/p&gt;

&lt;p&gt;Joins are commonly used when information is stored across different tables but needs to be viewed or analyzed together.&lt;/p&gt;

&lt;h3&gt;
  
  
  INNER JOIN
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;INNER JOIN&lt;/strong&gt; returns only the rows where matching values exist in both tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, only employees whose department IDs match entries in the departments table will appear in the result.&lt;/p&gt;




&lt;h3&gt;
  
  
  LEFT JOIN
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;LEFT JOIN&lt;/strong&gt; returns all rows from the left table and the matching rows from the right table. If no match exists, the result will contain &lt;code&gt;NULL&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This type of join is useful when you want to ensure that all records from the main table are included, even if related data is missing.&lt;/p&gt;
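&lt;p&gt;The &lt;code&gt;NULL&lt;/code&gt; behaviour can be checked with a small in-memory database. The sketch below uses Python's built-in sqlite3 module with made-up sample rows; Python shows SQL &lt;code&gt;NULL&lt;/code&gt; as &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    CREATE TABLE departments (department_id INTEGER, department_name TEXT);
    INSERT INTO employees VALUES ('Ann', 1), ('Ben', 99);  -- Ben's department does not exist
    INSERT INTO departments VALUES (1, 'Engineering');
""")

query = """
    SELECT employees.name, departments.department_name
    FROM employees
    LEFT JOIN departments
    ON employees.department_id = departments.department_id
    ORDER BY employees.name
"""
print(db.execute(query).fetchall())
# [('Ann', 'Engineering'), ('Ben', None)] -- the unmatched row survives with NULL
```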




&lt;h3&gt;
  
  
  RIGHT JOIN
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;RIGHT JOIN&lt;/strong&gt; returns all rows from the right table and the matching rows from the left table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that all records from the right table appear in the result.&lt;/p&gt;




&lt;h3&gt;
  
  
  FULL JOIN
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;FULL JOIN&lt;/strong&gt; returns all rows from both tables. When no match exists between rows, the missing values are represented as &lt;code&gt;NULL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This join is useful when you want to view all available records from both tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;While joins allow us to combine tables, &lt;strong&gt;window functions allow us to perform calculations across rows without grouping the data into a single result.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike &lt;code&gt;GROUP BY&lt;/code&gt;, which aggregates rows into one output per group, window functions maintain the original rows while adding calculated values.&lt;/p&gt;
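&lt;p&gt;A quick way to see this difference is to run both forms against the same toy table. The sketch below uses Python's sqlite3 module (window functions require SQLite 3.25 or newer); the rows are invented for the example.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (name TEXT, department_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES ('Ann', 1, 100), ('Ben', 1, 200), ('Cara', 2, 300);
""")

# GROUP BY collapses each department into a single aggregated row...
grouped = db.execute(
    "SELECT department_id, AVG(salary) FROM employees GROUP BY department_id"
).fetchall()
print(len(grouped))  # 2 rows: one per department

# ...while a window function keeps all three rows and attaches the aggregate to each.
windowed = db.execute(
    "SELECT name, salary, AVG(salary) OVER (PARTITION BY department_id) FROM employees"
).fetchall()
print(len(windowed))  # 3 rows: the original detail is preserved
```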

&lt;p&gt;The general syntax for a window function is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Ranking Example
&lt;/h3&gt;

&lt;p&gt;Window functions are often used to rank data. The &lt;code&gt;RANK()&lt;/code&gt; function assigns a rank to each row based on a specified order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;salary_rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59b9hr784s676oz8vjtd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59b9hr784s676oz8vjtd.png" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query ranks employees based on their salary from highest to lowest.&lt;/p&gt;




&lt;h3&gt;
  
  
  Using PARTITION BY
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PARTITION BY&lt;/code&gt; divides rows into groups and applies the window function separately within each group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;department_rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, employees are ranked within their own departments rather than across the entire table.&lt;/p&gt;




&lt;h3&gt;
  
  
  Running Totals
&lt;/h3&gt;

&lt;p&gt;Window functions can also calculate cumulative values, such as running totals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query calculates a running total of salaries as the rows progress.&lt;/p&gt;
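&lt;p&gt;The same running-total query can be executed end to end with Python's sqlite3 module. The three sample rows are invented so the cumulative sums are easy to check by hand.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (employee_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES (1, 100), (2, 200), (3, 300);
""")

# Each row's running_total is the sum of all salaries up to and including that row.
rows = db.execute("""
    SELECT employee_id, salary,
           SUM(salary) OVER (ORDER BY employee_id) AS running_total
    FROM employees
""").fetchall()
print(rows)  # [(1, 100, 100), (2, 200, 300), (3, 300, 600)]
```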




&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;Both joins and window functions are widely used in real-world database systems. Joins enable developers to retrieve related data from multiple tables, which is essential for reporting and application development.&lt;/p&gt;

&lt;p&gt;Window functions provide powerful analytical capabilities that allow developers and analysts to perform calculations such as ranking, running totals, and comparisons without losing row-level detail.&lt;/p&gt;

&lt;p&gt;Together, these SQL features make it possible to build complex queries that provide deeper insights into structured datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SQL joins and window functions are essential tools for working with relational databases. Joins allow data from multiple tables to be combined based on relationships, while window functions enable advanced analytical calculations across rows.&lt;/p&gt;

&lt;p&gt;By understanding and applying these concepts, developers and data professionals can write more efficient queries and unlock meaningful insights from their data.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI: A Case Study of Kenya Crop Data</title>
      <dc:creator>PETER AMORO</dc:creator>
      <pubDate>Mon, 09 Feb 2026 13:58:49 +0000</pubDate>
      <link>https://forem.com/peter_amoro_fccf1d029a084/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-a-case-study-of-59n4</link>
      <guid>https://forem.com/peter_amoro_fccf1d029a084/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-a-case-study-of-59n4</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;Data-driven decision-making depends on the ability to transform raw, unstructured data into meaningful insights. In practice, datasets are often messy, incomplete, and difficult to interpret without proper analytical tools. This article examines how analysts use Power BI to translate messy agricultural data into actionable insights, using a Kenya crops dataset as a case study. Through data cleaning, modeling, DAX calculations, and dashboard design, Power BI enables analysts to convert complex datasets into interactive visual reports that support informed decision-making in agriculture and policy planning.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Modern data analysis goes beyond collecting information; it focuses on extracting insights that can guide action. In sectors such as agriculture, where data influences food security, economic planning, and sustainability, accurate analysis is especially critical. However, agricultural datasets are often inconsistent, poorly structured, and difficult to analyze in their raw form.&lt;/p&gt;

&lt;p&gt;During practical training with Power BI, a Kenya crops dataset was used to demonstrate how analysts transform messy data into meaningful dashboards. This article explores the analytical process, highlighting the role of data cleaning, DAX (Data Analysis Expressions), and visualization in converting raw agricultural data into actionable insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Methodology
&lt;/h2&gt;

&lt;p&gt;This is the raw Kenya crop data as an Excel spreadsheet, before it is loaded into Power BI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp3fe7rj618iax6m6nd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp3fe7rj618iax6m6nd2.png" alt=" " width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Data Preparation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F514446d6ly9ef9oqus1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F514446d6ly9ef9oqus1c.png" alt=" " width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The initial dataset contained information on crop types, production levels, regions, and time periods in Kenya. Before analysis, the data required cleaning and transformation. Power Query in Power BI was used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove duplicate and missing records &lt;/li&gt;
&lt;li&gt;Standardize crop and region names
&lt;/li&gt;
&lt;li&gt;Rename columns for clarity
&lt;/li&gt;
&lt;li&gt;Ensure numerical fields were correctly formatted
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These steps ensured data consistency and reliability, forming a strong foundation for further analysis.&lt;/p&gt;




&lt;h3&gt;
  
  
  2.2 Data Modeling and DAX
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzssi2sdj1llnducfgvtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzssi2sdj1llnducfgvtw.png" alt=" " width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After cleaning, the dataset was modeled within Power BI to support analytical calculations. DAX was used to create measures such as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusyf7ygwvs552dc3h5rt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusyf7ygwvs552dc3h5rt.png" alt=" " width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Cost of crop production by region &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbbowzc4ptua1nqn1gyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbbowzc4ptua1nqn1gyw.png" alt=" " width="800" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Profit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaf27kki001nax335tvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaf27kki001nax335tvr.png" alt=" " width="644" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Revenue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjhk2hzzluwl894teo9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjhk2hzzluwl894teo9c.png" alt=" " width="587" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Yield&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0prggh3uelz768thas5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0prggh3uelz768thas5.png" alt=" " width="541" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DAX enabled dynamic calculations that adjusted automatically when filters and slicers were applied, allowing deeper exploration of the data.&lt;/p&gt;
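&lt;p&gt;The measures in the screenshots above follow a common DAX pattern. As an illustration only (the table and column names here are assumptions, since the actual definitions are visible only in the images), the measures might look like:&lt;/p&gt;

```dax
Total Revenue = SUM ( Crops[Revenue] )
Total Cost    = SUM ( Crops[Cost] )
Total Profit  = [Total Revenue] - [Total Cost]
Total Yield   = SUM ( Crops[Yield] )
```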




&lt;h3&gt;
  
  
  2.3 Dashboard Development
&lt;/h3&gt;

&lt;p&gt;Interactive dashboards were designed to present insights visually. Bar charts, line charts, and geographic maps were used to display trends, regional comparisons, and crop distributions. Slicers allowed users to filter data by crop type, region, or year, making the dashboard accessible to both technical and non-technical users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8sl9vruq2tibswwoky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8sl9vruq2tibswwoky.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Discussion
&lt;/h2&gt;

&lt;p&gt;The Kenya crops Power BI dashboard demonstrated how structured analysis turns raw data into insight. Patterns in crop production across regions became immediately visible, enabling comparison and trend identification. Analysts could quickly identify high-performing regions, declining production trends, or dominant crops.&lt;/p&gt;

&lt;p&gt;This process highlights the analyst’s role as a translator between data and decision-makers. Rather than presenting raw tables, analysts use DAX and dashboards to communicate insights clearly and efficiently. In agriculture, such insights can inform policy decisions, resource allocation, and long-term planning.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Conclusion
&lt;/h2&gt;

&lt;p&gt;Power BI provides analysts with the tools needed to transform messy datasets into actionable intelligence. Through careful data preparation, logical DAX calculations, and effective dashboard design, raw agricultural data can be converted into meaningful insights.&lt;/p&gt;

&lt;p&gt;The Kenya crops dataset demonstrated how Power BI supports data-driven decision-making in agriculture. By translating complex data into clear visuals and metrics, analysts enable stakeholders to make informed decisions that impact food security, economic development, and sustainability. Ultimately, the value of Power BI lies not in visualization alone, but in its ability to bridge the gap between data and action.&lt;/p&gt;




</description>
      <category>powerbi</category>
      <category>data</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting Started with Linux for Data Engineers: Working with Servers, Permissions and Nano Editor.</title>
      <dc:creator>PETER AMORO</dc:creator>
      <pubDate>Mon, 26 Jan 2026 07:17:07 +0000</pubDate>
      <link>https://forem.com/peter_amoro_fccf1d029a084/an-introduction-to-linux-for-data-engineers-using-the-nano-editor-with-practical-examples-39a2</link>
      <guid>https://forem.com/peter_amoro_fccf1d029a084/an-introduction-to-linux-for-data-engineers-using-the-nano-editor-with-practical-examples-39a2</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;In class, I learnt that Linux is the dominant operating system in modern data engineering environments. Most data pipelines, databases and cloud-based data platforms are deployed on Linux servers. This article introduces fundamental Linux concepts for data engineers, presents essential Linux commands in the order they are typically used, and demonstrates how the &lt;strong&gt;nano&lt;/strong&gt; text editor can be applied in practical data engineering tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Data engineering involves building, maintaining, and optimizing data pipelines that operate in production environments. Linux plays a central role in this ecosystem due to its stability, performance, and extensive command-line tooling. As a result, data engineers are expected to interact with Linux systems directly through the terminal.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why Data Engineers Use Linux
&lt;/h2&gt;

&lt;p&gt;Linux is widely used in data engineering for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most databases and distributed data platforms are optimized for Linux&lt;/li&gt;
&lt;li&gt;Powerful command-line tools enable efficient automation&lt;/li&gt;
&lt;li&gt;Remote access simplifies server and cloud infrastructure management&lt;/li&gt;
&lt;li&gt;Strong support for scripting and scheduling data pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these advantages, Linux proficiency is considered a foundational skill for data engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Server Access Using SSH
&lt;/h2&gt;

&lt;p&gt;In real-world environments, data engineers access servers remotely using &lt;strong&gt;SSH (Secure Shell)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobhl0hcfsbgjj3181zr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobhl0hcfsbgjj3181zr2.png" alt=" " width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  While logged in as root, create a new user:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj0rc4pg23pvtu5cwv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj0rc4pg23pvtu5cwv7.png" alt=" " width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Switch from the root user to the newly created user:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtol2hh0t7ax76a198du.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtol2hh0t7ax76a198du.png" alt=" " width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Confirm the active user and working directory:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kr60uql1821lxl7rm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kr60uql1821lxl7rm1.png" alt=" " width="659" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  To create directories, I make my user account a superuser
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbsffa26q5txexs0q7f2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbsffa26q5txexs0q7f2.png" alt=" " width="737" height="33"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the user is now sudo-enabled
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzx7wncbtdygomzcdbfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzx7wncbtdygomzcdbfl.png" alt=" " width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Essential Linux Commands in a Typical Workflow
&lt;/h2&gt;

&lt;p&gt;Linux commands are usually executed in a logical sequence during real-world data engineering tasks. The commands below are ordered according to a typical workflow.&lt;/p&gt;
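&lt;p&gt;Before walking through the individual steps, the typical sequence can be sketched end to end; the directory and file names below mirror the examples that follow:&lt;/p&gt;

```shell
pwd            # confirm the current working directory
mkdir example  # create a project directory
ls             # list files to confirm it exists
cd example     # move into the directory
touch etl.py   # create an empty script file
ls -l          # verify the file was created
```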

&lt;h3&gt;
  
  
  Checking the Current Directory
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4ufjxk1zuus0psxhu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4ufjxk1zuus0psxhu4.png" alt=" " width="645" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Directories
&lt;/h3&gt;

&lt;p&gt;A directory for organizing project files is created using mkdir:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57mcyf6mo2eniguja5xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57mcyf6mo2eniguja5xl.png" alt=" " width="636" height="27"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Listing Files and Directories
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fett6udidbeudtb4rtwpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fett6udidbeudtb4rtwpi.png" alt=" " width="624" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Navigating into the Directory
&lt;/h3&gt;

&lt;p&gt;The newly created directory is accessed using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qdcxdv7ojqezpco244n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qdcxdv7ojqezpco244n.png" alt=" " width="745" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Files
&lt;/h3&gt;

&lt;p&gt;An empty file is created using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;touch etl.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2vboxvzqzkdi95st6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2vboxvzqzkdi95st6k.png" alt=" " width="746" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Downloading Data
&lt;/h3&gt;

&lt;p&gt;A dataset is downloaded from the internet using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjb5f4ovbpqonkzmg1pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjb5f4ovbpqonkzmg1pf.png" alt=" " width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Viewing File Contents
&lt;/h3&gt;

&lt;p&gt;The contents of a file are displayed using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cat iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjur97ymx4cre7dib4091.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjur97ymx4cre7dib4091.png" alt=" " width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Viewing Large Files
&lt;/h3&gt;

&lt;p&gt;Large files are viewed page by page using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;more iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0tyqysmotzg8vrkithc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0tyqysmotzg8vrkithc.png" alt=" " width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Previewing File Data
&lt;/h3&gt;

&lt;p&gt;The first and last lines of a file are viewed using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;head iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxqo4qpbib1hcg9a3fa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxqo4qpbib1hcg9a3fa2.png" alt=" " width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tail iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjazxee4fuxmiiquqyndc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjazxee4fuxmiiquqyndc.png" alt=" " width="779" height="290"&gt;&lt;/a&gt;&lt;/p&gt;
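&lt;p&gt;By default, &lt;code&gt;head&lt;/code&gt; and &lt;code&gt;tail&lt;/code&gt; show ten lines; the &lt;code&gt;-n&lt;/code&gt; flag adjusts the count. A small sketch on a throwaway file (&lt;code&gt;sample.txt&lt;/code&gt; is created just for the demo):&lt;/p&gt;

```shell
printf 'a\nb\nc\nd\ne\n' > sample.txt  # build a five-line sample file
head -n 2 sample.txt                   # prints the first two lines: a, b
tail -n 2 sample.txt                   # prints the last two lines: d, e
```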

&lt;h3&gt;
  
  
  Searching Within Files
&lt;/h3&gt;

&lt;p&gt;Specific patterns within files are searched using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;grep petal_width iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6g1eo1810vkx7936dqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6g1eo1810vkx7936dqc.png" alt=" " width="800" height="54"&gt;&lt;/a&gt;&lt;/p&gt;
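&lt;p&gt;&lt;code&gt;grep&lt;/code&gt; also supports flags such as &lt;code&gt;-c&lt;/code&gt; (count matching lines) and &lt;code&gt;-i&lt;/code&gt; (ignore case). A quick sketch on a throwaway file (&lt;code&gt;species.txt&lt;/code&gt; is created just for the demo):&lt;/p&gt;

```shell
printf 'setosa\nversicolor\nSetosa\n' > species.txt  # small sample file
grep -c setosa species.txt    # counts lines matching "setosa" exactly: 1
grep -ic setosa species.txt   # case-insensitive count includes "Setosa": 2
```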

&lt;h3&gt;
  
  
  Appending Output to a File
&lt;/h3&gt;

&lt;p&gt;Text is appended to a file using output redirection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo ETL job completed successfully &amp;gt;&amp;gt; iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgh7q2lejce2c2wj0cvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgh7q2lejce2c2wj0cvp.png" alt=" " width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;
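&lt;p&gt;The distinction between the two redirection operators matters in pipeline logging: a single &lt;code&gt;&amp;gt;&lt;/code&gt; overwrites the file while &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; appends to it. A sketch using a hypothetical log file:&lt;/p&gt;

```shell
echo "run 1" > pipeline.log    # > creates or overwrites the file
echo "run 2" >> pipeline.log   # >> appends a new line
wc -l pipeline.log             # the file now has two lines
```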

&lt;h2&gt;
  
  
  5. Using the Nano Editor
&lt;/h2&gt;

&lt;p&gt;Nano is a lightweight and beginner-friendly text editor available on most Linux systems. It allows users to edit files directly from the terminal without requiring complex commands or modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opening a File with Nano
&lt;/h3&gt;

&lt;p&gt;Files such as Python scripts, SQL files, or configuration files are opened using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzczisrxt5xqh1chi1c9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzczisrxt5xqh1chi1c9m.png" alt=" " width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Saving and Exiting Nano
&lt;/h3&gt;

&lt;p&gt;Changes are saved with &lt;strong&gt;Ctrl+O&lt;/strong&gt; (Write Out), and the editor is exited safely with &lt;strong&gt;Ctrl+X&lt;/strong&gt; after editing is complete. Nano displays all available shortcuts at the bottom of the screen, making it easy to use for beginners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahmamq4rsebmlt13bq58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahmamq4rsebmlt13bq58.png" alt=" " width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;This assignment demonstrates foundational Linux skills required for data engineering tasks. By securely accessing a remote server using SSH, managing user accounts, organizing files with Linux commands, and editing scripts using the nano editor, data engineers gain practical experience working in real-world Linux environments. Mastery of these skills provides a strong foundation for working with production data systems.&lt;/p&gt;




</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>nanoeditor</category>
      <category>luxdevhq</category>
    </item>
    <item>
      <title>Learning Git &amp; GitHub as a Data Engineering Student at LuxDevHQ</title>
      <dc:creator>PETER AMORO</dc:creator>
      <pubDate>Sat, 17 Jan 2026 15:12:26 +0000</pubDate>
      <link>https://forem.com/peter_amoro_fccf1d029a084/learning-git-github-as-a-data-engineering-student-at-luxdevhq-4co2</link>
      <guid>https://forem.com/peter_amoro_fccf1d029a084/learning-git-github-as-a-data-engineering-student-at-luxdevhq-4co2</guid>
      <description>&lt;p&gt;As a student who recently started learning &lt;strong&gt;Data Engineering&lt;/strong&gt;, one of the first tools we were introduced to was &lt;strong&gt;Git and GitHub&lt;/strong&gt; it is one of the most important tools for anyone working with code and data and planning to collaborate with my fellow students at LuxDevHQ&lt;/p&gt;

&lt;p&gt;This article explains in a simple way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What Git and GitHub are&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to set up Git Bash&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to connect Git to GitHub&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to push and pull code&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How version control works&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Git?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; is a &lt;strong&gt;version control system&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It tracks every change you make to your files, so you can revisit each version of your code.&lt;/p&gt;

&lt;p&gt;This is very important in &lt;strong&gt;Data Engineering&lt;/strong&gt; because we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with data
&lt;/li&gt;
&lt;li&gt;Change code often
&lt;/li&gt;
&lt;li&gt;Work on long projects
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git makes sure nothing gets lost.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;What is GitHub?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; is a website where Git projects are stored online.&lt;/p&gt;

&lt;p&gt;It allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Back up your work
&lt;/li&gt;
&lt;li&gt;Share your projects
&lt;/li&gt;
&lt;li&gt;Work with other people
&lt;/li&gt;
&lt;li&gt;Show your work to employers
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git works on your computer.&lt;br&gt;&lt;br&gt;
GitHub stores your Git projects on the internet.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Installing Git Bash&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my case, I use a MacBook, so rather than installing Git Bash I install Git directly through the terminal.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;On macOS&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open Terminal and type&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git &lt;span class="nt"&gt;--version&lt;/span&gt;



&lt;span class="o"&gt;![&lt;/span&gt; &lt;span class="o"&gt;](&lt;/span&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sjljwj0u29l9t7jhenq8.png&lt;span class="o"&gt;)&lt;/span&gt;



&lt;span class="c"&gt;# Connecting to a GitHub Account&lt;/span&gt;

&lt;span class="c"&gt;## Step 1: Login or Create a GitHub account., since i have an account i just login &lt;/span&gt;



&lt;span class="c"&gt;## Step 2: Configure Git with your details&lt;/span&gt;

Run the following commands:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git config --global user.name "Your Name"&lt;br&gt;
git config --global user.email "&lt;a href="mailto:youremail@example.com"&gt;youremail@example.com&lt;/a&gt;"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Confirm the setup:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git config --list&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oisj0pdxrst6raqbw6zp.png)


# Connecting Git to GitHub Using SSH

Using **SSH** allows you to push and pull code **without entering your password every time**.

## Generate an SSH key

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
ssh-keygen -t ed25519 -C "&lt;a href="mailto:youremail@example.com"&gt;youremail@example.com&lt;/a&gt;"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Press **Enter** for all prompts.

## Start the SSH agent and add the key

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
eval "$(ssh-agent -s)"&lt;br&gt;
ssh-add ~/.ssh/id_ed25519&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Copy the SSH key

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
pbcopy &amp;lt; ~/.ssh/id_ed25519.pub&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Add the SSH key to GitHub

1. Go to **GitHub Settings**
2. Select **SSH and GPG keys**
3. Click **New SSH key**
4. Paste the key and save

Test the connection:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
ssh -T &lt;a href="mailto:git@github.com"&gt;git@github.com&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
A success message confirms the connection.

![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk86o2r75v3b6cizf9db.png)


# Understanding Version Control

**Version control** is a system that:

* Tracks **file changes**
* Saves a **history of your project**
* Allows you to **restore previous versions**
* Enables **safe collaboration**

Git works like a **timeline**, recording every meaningful change.

# Creating a Local Git Repository

## Create a project folder

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
mkdir luxDev&lt;br&gt;
cd luxdev&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;



## Initialize Git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git init&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j7iait5xdeax62en3tbo.png)

Git now tracks this folder.

# Tracking Changes with Git

## Check file status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git status&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ibpj5heslws8dfe5rzol.png)



## Add files to the staging area

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git add .&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Commit changes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git commit -m "Initial commit"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
A **commit** is a saved snapshot of your project.

# Pushing Code to GitHub

## Create a repository on GitHub

* Click **New Repository**
* Do **not** add a README file
* Copy the **SSH repository link**

![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fb673jj5ont71p5lhrsi.png)


## Connect the local project to GitHub

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git remote add origin &lt;a href="mailto:git@github.com"&gt;git@github.com&lt;/a&gt;:username/repository-name.git&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Push code to GitHub

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git branch -M main&lt;br&gt;
git push -u origin main&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Your code is now available on **GitHub**.

# Pulling Code from GitHub

To download updates from GitHub:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git pull origin main&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This is useful when:

* Working on **multiple devices**
* Collaborating with **other developers**
* Updating **shared projects**

# Viewing Project History

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
git log&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


This command shows **previous commits**, authors, and dates.




&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
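&lt;p&gt;The full &lt;code&gt;git log&lt;/code&gt; output can be condensed with &lt;code&gt;--oneline&lt;/code&gt;. A self-contained sketch; the repository name &lt;code&gt;demo&lt;/code&gt; and the identity values are placeholders for this example only:&lt;/p&gt;

```shell
git init -q demo      # create a throwaway repository
cd demo
# commit with a temporary identity so the example works anywhere
git -c user.name="tmp" -c user.email="tmp@example.com" \
    commit --allow-empty -m "Initial commit" -q
git log --oneline     # one line per commit: abbreviated hash + message
```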

</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>git</category>
      <category>github</category>
    </item>
  </channel>
</rss>
