<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Cliffe Okoth</title>
    <description>The latest articles on Forem by Cliffe Okoth (@cliffe_okoth).</description>
    <link>https://forem.com/cliffe_okoth</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3190513%2Fbf57608e-16ba-4b44-9550-71288e9e6f37.jpg</url>
      <title>Forem: Cliffe Okoth</title>
      <link>https://forem.com/cliffe_okoth</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cliffe_okoth"/>
    <language>en</language>
    <item>
      <title>Where Does Your Data Live? Decoding the Modern Data Ecosystem</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Sun, 03 May 2026 01:37:22 +0000</pubDate>
      <link>https://forem.com/cliffe_okoth/where-does-your-data-live-decoding-the-modern-data-ecosystem-2hlg</link>
      <guid>https://forem.com/cliffe_okoth/where-does-your-data-live-decoding-the-modern-data-ecosystem-2hlg</guid>
      <description>&lt;p&gt;If you are stepping into the world of data engineering or analytics, you have likely been hit with a wave of storage buzzwords like &lt;em&gt;data lake&lt;/em&gt; and &lt;em&gt;data warehouse&lt;/em&gt;. In this article, we will demystify these terms so you can understand exactly where your data belongs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;Imagine you just launched a business. You need a system to record daily operations: every time a customer buys a product, updates their password or submits a support ticket. This is the job of a standard &lt;strong&gt;Database&lt;/strong&gt;.&lt;br&gt;
A database is a collection of structured or unstructured data stored in a computer system and managed by a Database Management System (DBMS). &lt;br&gt;
Databases are most useful for small, atomic transactions and typically contain only the most up-to-date information. Common types include (a short sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relational (SQL) Databases&lt;/strong&gt; for structured data stored in tables with fixed rows and columns. Examples include PostgreSQL and MySQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-relational (NoSQL) Databases&lt;/strong&gt; for unstructured or semi-structured data such as JSON (JavaScript Object Notation) documents. Examples include MongoDB.&lt;/li&gt;
&lt;/ul&gt;
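
&lt;p&gt;To make the distinction concrete, here is a minimal Python sketch (the customer record and values are made up) that stores the same data as a relational row using the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module and as a flexible, NoSQL-style JSON document:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sqlite3

# Relational: a fixed schema of rows and columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Ada", "ada@example.com"))
print(conn.execute("SELECT * FROM customers").fetchall())

# Document (NoSQL-style): a nested JSON document with no fixed schema
customer_doc = {
    "id": 1,
    "name": "Ada",
    "contacts": {"email": "ada@example.com"},
    "orders": [{"sku": "A-100", "qty": 2}],
}
print(json.dumps(customer_doc))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;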

&lt;p&gt;Databases have the following core features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ACID Properties:&lt;/strong&gt; To guarantee absolute data integrity during transactions, databases adhere strictly to the ACID framework (see the sketch after this feature list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomicity:&lt;/strong&gt; Database transactions are treated as a single, "all-or-nothing" unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; Data must seamlessly transition from one valid state to another without breaking user-defined rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation:&lt;/strong&gt; Multiple transactions can happen concurrently without interfering with one another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability:&lt;/strong&gt; Once a transaction is complete, the changes are permanent and survive even if the system crashes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query Language:&lt;/strong&gt; Databases allow users to interact directly with the system using specific languages, most commonly SQL (Structured Query Language). This enables developers and analysts to easily retrieve, filter, aggregate or update information.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; Think of this like the index at the back of a textbook. Instead of forcing the system to scan an entire table, indexes act as structural shortcuts that allow the database to locate specific data instantly.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normalization:&lt;/strong&gt; This is the design practice of breaking down large datasets into smaller, interconnected tables. It eliminates duplicate information, reduces redundancy and keeps the database organized and efficient.  &lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Backup and Recovery:&lt;/strong&gt; To safeguard against hardware failures, software bugs or unexpected downtime, databases come equipped with robust mechanisms to safely back up and restore data.  &lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Data Modelling:&lt;/strong&gt; Designing a database requires a clear structural blueprint. This process moves through three phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual modelling&lt;/strong&gt; maps out the high-level data relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical modelling&lt;/strong&gt; adds technical detail such as attributes, keys and data types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical modelling&lt;/strong&gt; translates that design into the actual working database schema. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
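
&lt;p&gt;To get a feel for atomicity and indexing in practice, here is a minimal sketch using the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module (the accounts table and values are hypothetical): the transfer either commits as a whole or rolls back, and the index is a shortcut for lookups on the email column.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "ada@example.com", 100.0), (2, "bob@example.com", 50.0)])

# Indexing: a structural shortcut for a frequently queried column
conn.execute("CREATE INDEX idx_accounts_email ON accounts (email)")

# Atomicity: both updates succeed together, or neither is applied
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    print("Transfer failed, no partial changes were kept")

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;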

&lt;p&gt;&lt;strong&gt;Use cases for databases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databases excel in scenarios that require real-time data handling and high transaction volumes. &lt;br&gt;
Key use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Transaction Processing:&lt;/strong&gt; Databases are built to execute immediate operations, such as processing payments at a retail point-of-sale (POS) system or handling financial transfers in banking.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customer Relationship Management (CRM):&lt;/strong&gt; They allow CRM platforms to manage real-time customer orders, interactions and support tickets.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Resource Planning (ERP):&lt;/strong&gt; Databases power the day-to-day operational software of businesses, managing records for everything from employee payroll to live inventory management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databases are perfect for storing records in real-time, but what happens when you want to compare current sales to those from five years ago? &lt;br&gt;
Running a massive historical query could cripple your business's active, database-dependent operations. &lt;br&gt;
To remedy this, you need a separate storage system dedicated to historical data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzwbqu92p3nlc3ni3jyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzwbqu92p3nlc3ni3jyq.png" alt="db_api" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouse
&lt;/h2&gt;

&lt;p&gt;To solve the historical reporting problem, a data warehouse is used. Instead of handling real-time transactions, it stores massive amounts of structured, historical data from multiple sources to help organizations spot long-term trends and make data-driven decisions. &lt;br&gt;
It is usually denormalized to prioritize read operations ahead of write operations. These are the key features of a data warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Centralized Data:&lt;/strong&gt; Data warehouses consolidate information from multiple systems to give analysts a comprehensive, high-level view of the organization's data.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-Variant Data:&lt;/strong&gt; Data warehouses retain historical records, allowing businesses to analyze past performance, compare specific time periods, and identify long-term trends.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Denormalized Architecture:&lt;/strong&gt; Data is deliberately structured with fewer tables to minimize complex relationships, which drastically speeds up read performance and simplifies heavy analytical queries.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggregated Data:&lt;/strong&gt; Information is frequently summarized at various levels of detail, enabling analysts to quickly pull high-level overviews or drill down into granular metrics when necessary.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query Optimization:&lt;/strong&gt; To process massive analytical workloads efficiently, warehouses utilize advanced performance techniques such as indexing, data segmentation and materialized views.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BI Integration:&lt;/strong&gt; Data warehouses natively support and connect with Business Intelligence (BI) platforms to power interactive dashboards, robust reporting and data visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use cases for data warehouses
&lt;/h3&gt;

&lt;p&gt;Data warehouses are better suited for use cases that involve the analysis and reporting of large datasets. These use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Business Intelligence (BI):&lt;/strong&gt; Data warehouses consolidate large volumes of historical data, which is ideal for analytics, reporting and forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trend analysis and reporting:&lt;/strong&gt; Data warehouses are ideal for generating business reports, dashboards and exploring patterns over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictive analytics and data mining:&lt;/strong&gt; Data warehouses support advanced analytics that help businesses make data-driven decisions, such as predicting customer behavior or market trends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of data warehouses include: Amazon Redshift, Google BigQuery, Snowflake.&lt;/p&gt;
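
&lt;p&gt;To illustrate the kind of time-variant, aggregated query a warehouse is built for, here is a minimal sketch that uses &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for a real warehouse engine (the sales rows are made up); a production warehouse would run essentially the same GROUP BY over years of historical data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2021-03-14", "east", 120.0),
    ("2021-07-02", "west", 80.0),
    ("2025-01-20", "east", 200.0),
    ("2025-02-11", "west", 150.0),
])

# Time-variant, aggregated view: total sales per year
query = """
    SELECT strftime('%Y', sale_date) AS year, SUM(amount) AS total_sales
    FROM sales
    GROUP BY year
    ORDER BY year
"""
for year, total in conn.execute(query):
    print(year, total)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;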

&lt;p&gt;Data warehouses are incredibly organized, but this rigid structure is a double-edged sword. While it guarantees clean, structured data, it leaves you with a problem: where do you put millions of messy, unstructured website click logs or raw JSON files?&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lake
&lt;/h2&gt;

&lt;p&gt;When data is too large or unstructured for a data warehouse, it gets dumped into a data lake. Here, data from disparate sources is stored in its original, raw format. &lt;br&gt;
Due to its storage flexibility, it acts as a playground for data scientists who train machine learning models on the data before it is fully structured. Like data warehouses, data lakes are not intended to satisfy the transaction and concurrency needs of an application. &lt;br&gt;
Key features of a data lake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support for diverse formats:&lt;/strong&gt; Handles data in formats like JSON and Parquet, accommodating a wide range of use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time analytics readiness:&lt;/strong&gt; Ideal for machine learning and advanced data science workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Horizontal scalability:&lt;/strong&gt; Uses cost-efficient storage solutions such as Amazon S3 or Azure Blob Storage, allowing seamless growth with increasing data volumes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of data lakes include: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage.&lt;/p&gt;
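
&lt;p&gt;As an illustration of landing raw data in an object-store-backed lake, here is a minimal sketch using &lt;code&gt;boto3&lt;/code&gt;; the bucket name and key layout are hypothetical, and it assumes AWS credentials are already configured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import boto3

# A raw click event, kept exactly as produced (structure is applied later, on read)
event = {"user_id": 1001, "page": "/checkout", "ts": "2025-02-11T10:15:00Z"}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-company-data-lake",                      # hypothetical bucket
    Key="raw/clickstream/2025/02/11/event-0001.json",   # partitioned by date
    Body=json.dumps(event).encode("utf-8"),
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;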

&lt;p&gt;As your hypothetical company grows, your Data Warehouse becomes massive. Now the Marketing team is complaining that it takes them too long to find the specific campaign metrics they need among all the finance, HR and engineering data.&lt;/p&gt;

&lt;p&gt;Enter the &lt;strong&gt;Data Mart&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdhiod411ll60u1xhig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdhiod411ll60u1xhig.png" alt="Warehouse to mart" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Mart
&lt;/h2&gt;

&lt;p&gt;A data mart is a specialized, smaller-scale database designed to serve the specific needs of a single business unit such as marketing or finance. Its primary goal is to filter an organization's massive data pool into a highly focused, manageable repository for quick access.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Data Marts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are three main types of data marts, categorized by how they source their information and their relationship to a central data warehouse:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependent Data Marts:&lt;/strong&gt; These are directly partitioned from an enterprise's central data warehouse. Using this top-down approach, the data mart extracts a specific, predefined subset of the primary data whenever a department needs to run an analysis (see the sketch after this list).  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independent Data Marts:&lt;/strong&gt; These operate as fully standalone repositories without relying on a central data warehouse. Teams extract, process and store data directly from various internal or external sources.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Data Marts:&lt;/strong&gt; As the name implies, these blend the two approaches by pulling information from both an existing data warehouse and external operational systems. This provides the speed and structured interface of a top-down approach while maintaining the flexible integration of an independent setup.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
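
&lt;p&gt;A dependent data mart can be as simple as a filtered, department-specific table carved out of the warehouse. Here is a minimal sketch using &lt;code&gt;sqlite3&lt;/code&gt; (the warehouse table and columns are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_events (department TEXT, campaign TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO warehouse_events VALUES (?, ?, ?)", [
    ("marketing", "spring_sale", 540),
    ("finance", "q1_close", 12),
    ("marketing", "newsletter", 310),
])

# Dependent data mart: a focused subset of the central warehouse
conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT campaign, clicks
    FROM warehouse_events
    WHERE department = 'marketing'
""")
print(conn.execute("SELECT * FROM marketing_mart").fetchall())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;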

&lt;p&gt;Historically, companies had to maintain both a Data Lake (for raw, cheap machine learning storage) and a Data Warehouse (for fast, structured BI reporting). Moving data between the two was challenging and expensive. Recently, a new architecture emerged to bridge this gap: the &lt;strong&gt;Data Lakehouse&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lakehouse
&lt;/h2&gt;

&lt;p&gt;A data lakehouse is a modern hybrid architecture that combines the massive, cost-effective storage of a data lake with the robust data management capabilities of a warehouse. By bridging the gap between raw data storage and high-speed analytics, a lakehouse can simultaneously support unstructured machine learning workloads and structured Business Intelligence workflows.  &lt;/p&gt;

&lt;p&gt;Key Features of a Data Lakehouse (a short sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ACID Compliance:&lt;/strong&gt; Unlike traditional data lakes, lakehouses guarantee reliable transactions to maintain strict data consistency and integrity.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Schemas:&lt;/strong&gt; They support both "schema-on-write" and "schema-on-read". This gives engineers flexibility when ingesting raw data, while still providing a rigid, reliable structure when analysts need to query it.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native BI Integration:&lt;/strong&gt; Lakehouses connect seamlessly with popular Business Intelligence platforms like Tableau, Power BI, and Looker, making it easy for decision-makers to visualize their data directly from the source.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
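
&lt;p&gt;As a rough sketch of how a table format brings warehouse-style guarantees to lake storage, the snippet below uses the open-source &lt;code&gt;deltalake&lt;/code&gt; Python package to write and read a Delta table; the path and data are made up, and it assumes the &lt;code&gt;deltalake&lt;/code&gt; and &lt;code&gt;pandas&lt;/code&gt; packages are installed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Data lands as a Delta table: Parquet files plus a transaction log
events = pd.DataFrame({"user_id": [1, 2], "amount": [30.5, 28.2]})
write_deltalake("/tmp/lakehouse/events", events, mode="append")  # ACID append

# Analysts read the same table back with a reliable, versioned schema
table = DeltaTable("/tmp/lakehouse/events")
print(table.to_pandas())
print(table.version())  # the transaction log tracks every committed change
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;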

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
There is no single "best" data storage solution, only the right tool for the job. In fact, a robust modern data ecosystem usually relies on these systems working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Database&lt;/strong&gt; captures the live sale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Lake&lt;/strong&gt; stores the messy, raw website logs of how the customer found you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Warehouse&lt;/strong&gt; analyzes five years of those sales trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Mart&lt;/strong&gt; gives the marketing team instant access to only the metrics they care about.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>The Blueprint for Modern Data Orchestration</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Sat, 02 May 2026 00:40:22 +0000</pubDate>
      <link>https://forem.com/cliffe_okoth/a-beginners-guide-to-apache-airflow-3-4mp9</link>
      <guid>https://forem.com/cliffe_okoth/a-beginners-guide-to-apache-airflow-3-4mp9</guid>
      <description>&lt;p&gt;If the terms &lt;em&gt;orchestration&lt;/em&gt; or &lt;em&gt;Apache Airflow&lt;/em&gt; sound like intimidating data jargon, this article will help you cut through the noise and understand the basics.&lt;br&gt;
So, &lt;em&gt;what exactly is &lt;strong&gt;data orchestration&lt;/strong&gt;?&lt;/em&gt;&lt;br&gt;
In DataOps (Data Operations), it is the underlying system that manages data workflows (such as ETL pipelines) to ensure tasks run at the right time and in the correct sequence.&lt;br&gt;
For example, if data transformation depends on extraction, orchestration makes sure the extraction process runs to completion first.&lt;br&gt;
&lt;em&gt;What is a &lt;strong&gt;DAG&lt;/strong&gt;?&lt;/em&gt; A DAG is a model that contains all the &lt;em&gt;tasks&lt;/em&gt; to be run. DAG stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directed&lt;/strong&gt; meaning dependencies have a defined direction, so tasks flow from upstream to downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic&lt;/strong&gt; meaning it has no circular dependencies — extraction cannot depend on transformation if transformation depends on extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph&lt;/strong&gt; meaning a collection of tasks (nodes) connected by dependencies (edges).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;What is a Task?&lt;/em&gt; This is a step in a DAG that describes a single unit of work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvpaevabfpliavzzopo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvpaevabfpliavzzopo.png" alt="DAG Diagram" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of the DAG as an orchestra conductor and the tasks as the instruments. &lt;br&gt;
To bring this orchestration to life, tools like Apache Airflow are used to define, schedule and monitor batch-oriented pipelines. &lt;br&gt;
An Airflow instance contains the following main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Scheduler&lt;/strong&gt; submits tasks to the executor and triggers scheduled workflows.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;DAG processor&lt;/strong&gt; reads DAG files and organizes them in the metadata database.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Webserver&lt;/strong&gt; is the Airflow User Interface for inspecting, triggering and debugging the behaviour of DAGs and tasks.&lt;/li&gt;
&lt;li&gt;A dedicated folder of &lt;strong&gt;DAG files&lt;/strong&gt;, which is read by the scheduler to figure out which tasks to run and when to run them.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Metadata Database&lt;/strong&gt; stores the state of tasks, DAGs and variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point you might be asking yourself, &lt;em&gt;Why not just use cron jobs?&lt;/em&gt; Well, think of cron jobs as an alarm clock and Airflow as a project manager. Cron just runs your script at a certain time with no regard for the task's dependencies. &lt;br&gt;
&lt;em&gt;Say you schedule &lt;code&gt;extract.py&lt;/code&gt; for 12:00 AM and &lt;code&gt;transform.py&lt;/code&gt; for 1:30 AM. If extraction runs long and is still going at 1:30 AM, Cron will blindly trigger the transformation anyway, leading to corrupted data or a crash.&lt;/em&gt; &lt;br&gt;
Airflow, acting as a project manager, understands this dependency; it waits for extraction to finish and will automatically retry the task if it times out or fails.&lt;br&gt;
To make sense of this jargon, below is an example of a simple DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.standard.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.standard.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt; 

&lt;span class="c1"&gt;# Step 1: Define your Python functions 
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Your logic here
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Set default arguments
&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# don't wait for previous DAG runs
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;# retry once if it fails
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Create DAG object
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;template_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# unique DAG identifier
&lt;/span&gt;    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# default args defined above
&lt;/span&gt;    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Template for new DAGs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="c1"&gt;# DAG description
&lt;/span&gt;    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# frequency of execution (you could use cron expressions for granularity)
&lt;/span&gt;    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# don't run for previous dates
&lt;/span&gt;    &lt;span class="n"&gt;max_active_runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;                   &lt;span class="c1"&gt;# run one instance at a time
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Define tasks
&lt;/span&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# unique task identifier
&lt;/span&gt;    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Python function to be executed
&lt;/span&gt;    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bash_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Set dependencies
&lt;/span&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the example above, we use Python to declare tasks and their dependencies. These instructions are then interpreted by the orchestration engine and executed in the order their dependencies define. This is what data engineers refer to as &lt;em&gt;Workflow As Code&lt;/em&gt;. &lt;br&gt;
The DAG above is defined using &lt;strong&gt;traditional operators&lt;/strong&gt; such as &lt;code&gt;PythonOperator&lt;/code&gt; and &lt;code&gt;BashOperator&lt;/code&gt;.&lt;br&gt;
However, this is not the only method; Airflow has a built-in &lt;strong&gt;TaskFlow API&lt;/strong&gt; that defines DAGs using Python decorators, which makes it easier to pass data between tasks. &lt;br&gt;
Here is an example of a simple ETL pipeline using TaskFlow API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pendulum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the DAG using the @dag decorator
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;taskflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_etl_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Extract: Task returns a dictionary 
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;data_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30.5, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 28.2, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 31.1}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Transform: Receives data directly from the upstream task
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;total_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Load: Final task to "load" or print the data
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading data: Total value is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Define dependencies by calling the functions
&lt;/span&gt;    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the DAG
&lt;/span&gt;&lt;span class="nf"&gt;taskflow_etl_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;How can you tell if your DAG runs?&lt;/em&gt; Use the &lt;code&gt;airflow dags list&lt;/code&gt; command to check if it's been parsed by the scheduler. &lt;br&gt;
If not, use &lt;code&gt;airflow dags list-import-errors&lt;/code&gt; to check for syntax errors. Alternatively, you could check the user interface at &lt;code&gt;localhost:8080&lt;/code&gt;. &lt;br&gt;
To avoid configuration errors, follow this step-by-step guide to installation and setup:&lt;br&gt;
&lt;a href="https://www.notion.so/Setting-up-Airflow-3-0-353cf979b71d80ffb7c6cd6f9d8fdc64?source=copy_link" rel="noopener noreferrer"&gt;Step by step guide on how to Install and Setup Apache Airflow&lt;/a&gt;&lt;/p&gt;
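
&lt;p&gt;In recent Airflow versions (2.5+), you can also run a DAG file directly for quick local debugging; here is a minimal sketch, assuming the &lt;code&gt;dag&lt;/code&gt; object from the first example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Appended at the bottom of the DAG file.
# dag.test() runs every task in-process, without a scheduler or workers.
if __name__ == "__main__":
    dag.test()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;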

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;As your workflows grow in complexity, adhering to a few core principles will save you from scheduling nightmares and data corruption. Let's look at some of them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Idempotency:&lt;/strong&gt; A task should produce the exact same outcome whether it is run once, twice or a hundred times for the same execution date.&lt;br&gt;
&lt;strong&gt;2. Atomicity:&lt;/strong&gt; Each task should perform one defined operation. This ensures modularity. If the transformation phase fails, you only need to retry that specific task instead of re-fetching all your raw data from the source. &lt;em&gt;See diagram below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lxwgng66ln9ec2l9on2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lxwgng66ln9ec2l9on2.png" alt="Atomicity diagram" width="800" height="549"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;Left - monolith | Right - modular&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Encapsulation:&lt;/strong&gt; Only define the DAG structure at the top level. If you put heavy data processing, API calls or database queries in the global scope of your file, the scheduler will execute that code every single time it parses the file, which can slow down or even crash your Airflow instance.&lt;/p&gt;
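
&lt;p&gt;Here is a minimal sketch of that principle, contrasting code the scheduler would execute on every parse with code that only runs when the task itself runs (the API URL is hypothetical and the example assumes the &lt;code&gt;requests&lt;/code&gt; package is installed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests
from airflow.decorators import dag, task
from pendulum import datetime

# BAD: this request would run every time the scheduler parses the file
# orders = requests.get("https://api.example.com/orders").json()

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def encapsulated_pipeline():

    @task()
    def fetch_orders():
        # GOOD: the heavy work happens only when this task instance runs
        return requests.get("https://api.example.com/orders", timeout=30).json()

    fetch_orders()

encapsulated_pipeline()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;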

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To sum everything up, Apache Airflow might seem intimidating at first, but at its core, it is simply a tool designed to bring order to chaos. By embracing orchestration, you transform isolated, manually run scripts into reliable, automated data pipelines. To recap the key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Orchestration&lt;/strong&gt; is essential to data pipelines: it ensures your data tasks run in the right sequence and at the right time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAGs are the blueprint:&lt;/strong&gt; they provide a map of your tasks and dependencies, ensuring no task runs out of order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airflow does the heavy lifting&lt;/strong&gt; by handling the logistics of executing and monitoring your tasks so you can focus on the logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow as Code:&lt;/strong&gt; Whether you use traditional operators or the modern, Pythonic TaskFlow API, you have the flexibility to define complex pipelines.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>dataops</category>
      <category>ai</category>
    </item>
    <item>
      <title>What is the difference between ETL and ELT?</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Fri, 10 Apr 2026 23:21:50 +0000</pubDate>
      <link>https://forem.com/cliffe_okoth/what-is-the-difference-between-etl-and-etl-3ok4</link>
      <guid>https://forem.com/cliffe_okoth/what-is-the-difference-between-etl-and-etl-3ok4</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Say you have data in a dozen different places, and you need it all in one spot, fully cleaned and ready for analysis. That is the core goal of &lt;em&gt;data integration&lt;/em&gt;. To get the job done, data engineers rely on two primary data pipeline architectures: &lt;strong&gt;ETL&lt;/strong&gt; (Extract, Transform, Load) and its modern alternative, &lt;strong&gt;ELT&lt;/strong&gt; (Extract, Load, Transform). While both move data from source to storage, the timing of how they process that data changes everything. Let's break down how they work.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL
&lt;/h2&gt;

&lt;p&gt;ETL (Extract, Transform, Load) is a data integration process that &lt;strong&gt;extracts&lt;/strong&gt; raw data from one or multiple sources, &lt;strong&gt;transforms&lt;/strong&gt; this data into a usable format, then &lt;strong&gt;loads&lt;/strong&gt; the resultant data into a database where end-users can access it. &lt;br&gt;
What do these three processes entail? (A short Python sketch follows the breakdown below.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; This is the first step of the process. It involves extracting data from target sources that range from structured sources like databases (SQL, NoSQL), to semi-structured data (JSON, XML), to unstructured data (emails, flat files). 
It is crucial in this step to gather data without altering its original format, since it is processed in the next stage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transform:&lt;/strong&gt; In this step, data gets cleansed and restructured to meet operational needs. Data is usually not loaded directly into the data destination; it is first loaded into a &lt;em&gt;staging&lt;/em&gt; database (a layer between the raw data and the clean data). This ensures a quick rollback in case something goes wrong in the pipeline. Common transformations include:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Filtering:&lt;/strong&gt; Removing irrelevant data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Sorting:&lt;/strong&gt; Organizing data into a required order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Aggregating:&lt;/strong&gt; Summarizing data to provide meaningful insights (e.g. average sales, total sales).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Load:&lt;/strong&gt; This is the final process where transformed data is uploaded to a target database where end-users can access it. Depending on the use case, there are two types of loading methods:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full Load:&lt;/strong&gt; All data is loaded into the target system, often used during the initial population of the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Load:&lt;/strong&gt; Only new or updated data is loaded, making this method more efficient for ongoing data updates.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
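
&lt;p&gt;Putting the three steps together, here is a minimal, self-contained ETL sketch in Python: it extracts raw JSON records, transforms them in memory (filtering and aggregating), and loads the result into a SQLite table standing in for the target warehouse (the records are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sqlite3

# Extract: pull raw records from a source (a JSON string standing in for an API)
raw = ('[{"order_id": 1, "amount": 30.5, "status": "paid"},'
       ' {"order_id": 2, "amount": 12.0, "status": "cancelled"},'
       ' {"order_id": 3, "amount": 28.2, "status": "paid"}]')
records = json.loads(raw)

# Transform: filter out cancelled orders and aggregate before loading
paid = [r for r in records if r["status"] == "paid"]
total = sum(r["amount"] for r in paid)

# Load: write the cleaned, summarized result into the target database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (orders INTEGER, revenue REAL)")
conn.execute("INSERT INTO daily_sales VALUES (?, ?)", (len(paid), total))
print(conn.execute("SELECT * FROM daily_sales").fetchone())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;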

&lt;p&gt;&lt;em&gt;So, how does ETL work?&lt;/em&gt; Think of a modern ETL pipeline as a factory assembly line. The system doesn't wait to gather all the raw materials before starting production. Instead, it multitasks—extracting new data while simultaneously cleaning the previous batch and loading the finished product. How fast this assembly line moves depends entirely on the business's needs, generally falling into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing pipelines:&lt;/strong&gt; This is the most popular method, where data is extracted, transformed and loaded on a recurring schedule. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time processing pipelines:&lt;/strong&gt; This method depends on streaming sources for data, with transformations performed using a real-time processing engine like &lt;strong&gt;Spark&lt;/strong&gt;. Unlike batch processing, which is scheduled, this method occurs in real time, e.g. fraud detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft85r7j5efl9mvbvkamqm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft85r7j5efl9mvbvkamqm.jpeg" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of ETL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data quality:&lt;/strong&gt; Data quality and consistency are often improved in ETL processes through cleansing and transformation steps.&lt;br&gt;
&lt;strong&gt;Data governance:&lt;/strong&gt; ETL can help enforce data governance policies by ensuring that data is transformed and loaded into the target system in a consistent and compliant manner.&lt;br&gt;
&lt;strong&gt;Legacy systems:&lt;/strong&gt; ETL is often used to integrate data from legacy systems that may not be compatible with modern data architectures.&lt;br&gt;
&lt;strong&gt;Complex transformations:&lt;/strong&gt; ETL tools often provide a wide range of transformation capabilities, making them suitable for complex data manipulation tasks.&lt;br&gt;
&lt;strong&gt;Enhanced Decision-Making:&lt;/strong&gt; ETL helps businesses derive actionable insights, enabling better forecasting, resource allocation and strategic planning.&lt;br&gt;
&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; Automating the data pipeline through ETL speeds up data processing, allowing organizations to make real-time decisions based on the most current data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges of ETL
&lt;/h3&gt;

&lt;p&gt;While ETL is essential, building and maintaining reliable data pipelines has become one of the more challenging parts of data engineering. These are some of the issues that plague this process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited Reusability:&lt;/strong&gt; A pipeline built in one environment often cannot be reused in another, even if the underlying code is very similar, meaning data engineers are often the bottleneck, tasked with reinventing the wheel every time. &lt;br&gt;
&lt;strong&gt;Data Quality:&lt;/strong&gt; Managing data quality in increasingly complex pipeline architectures is difficult. Bad data is often allowed to flow through a pipeline undetected, devaluing the entire data set. To maintain quality and ensure reliable insights, data engineers are required to write extensive custom code to implement quality checks and validation at every step of the pipeline. &lt;br&gt;
&lt;strong&gt;Scaling Inefficiencies:&lt;/strong&gt; As pipelines grow in scale and complexity, companies face increased operational load managing them, which makes data reliability incredibly difficult to maintain. Data processing infrastructure has to be set up, scaled, restarted, patched and updated - which translates to increased time and cost. &lt;br&gt;
&lt;strong&gt;Silent Failures:&lt;/strong&gt; Pipeline failures are difficult to identify and even more difficult to solve due to lack of visibility and tooling. &lt;/p&gt;

&lt;p&gt;Regardless of these challenges, ETL is a crucial process for data-driven businesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solutions for common ETL Challenges
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data Quality Management:&lt;/strong&gt; Use data validation and cleansing tools, along with automated checks, to ensure accurate and relevant data during the ETL process.&lt;br&gt;
&lt;strong&gt;Optimization Techniques:&lt;/strong&gt; Overcome performance bottlenecks by making tasks parallel, using batch processing and leveraging cloud solutions for better processing power and storage.&lt;br&gt;
&lt;strong&gt;Scalable ETL Systems:&lt;/strong&gt; Modern cloud-based ETL tools offer scalability, automation and efficient handling of growing data volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real world use cases
&lt;/h3&gt;

&lt;p&gt;These are some of the ways ETL is used in the real world:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensor Data Integration:&lt;/strong&gt; Gathering raw, continuous data from multiple IoT sensors, filtering out anomalies, and moving the clean data to a single point where it can be analyzed for equipment maintenance.&lt;br&gt;
&lt;strong&gt;Cloud Migration:&lt;/strong&gt; Moving legacy data from an on-premise (client-managed) warehouse, transforming its structure to match modern schemas, and loading it into the new cloud platform.&lt;br&gt;
&lt;strong&gt;Marketing Data Integration:&lt;/strong&gt; Collecting campaign data from various distinct sources (like Facebook Ads, Google Ads, and email platforms), standardizing currency and date formats and preparing it for analysis before loading it into a final reporting destination.&lt;br&gt;
&lt;strong&gt;Database Replication:&lt;/strong&gt; Continuously extracting data from multiple operational databases, transforming it to a unified schema and replicating it into a central data warehouse for reporting.&lt;/p&gt;

&lt;p&gt;These are some of the tools you could use for ETL:&lt;br&gt;
&lt;em&gt;Open-source tools&lt;/em&gt;: &lt;strong&gt;Apache NiFi&lt;/strong&gt;.&lt;br&gt;
&lt;em&gt;Commercial ETL tools&lt;/em&gt;: &lt;strong&gt;Informatica&lt;/strong&gt; and &lt;strong&gt;Microsoft SSIS&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELT
&lt;/h2&gt;

&lt;p&gt;ELT stands for "Extract, Load, Transform." In this process, the transformation of data occurs after it is loaded into storage. That means there's no need for data staging. &lt;/p&gt;

&lt;p&gt;The ELT process does not differ much from ETL, &lt;strong&gt;transformation&lt;/strong&gt; just comes &lt;strong&gt;after&lt;/strong&gt; data &lt;strong&gt;loading&lt;/strong&gt;. &lt;/p&gt;
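
&lt;p&gt;For contrast, here is the same pipeline reorganized as a minimal ELT sketch: the raw records are loaded into the target system untouched, and the transformation is expressed afterwards as SQL running inside that system (again with made-up records and SQLite standing in for a cloud warehouse):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sqlite3

# Extract: the same raw records as before
raw = ('[{"order_id": 1, "amount": 30.5, "status": "paid"},'
       ' {"order_id": 2, "amount": 12.0, "status": "cancelled"},'
       ' {"order_id": 3, "amount": 28.2, "status": "paid"}]')
records = json.loads(raw)

# Load: land the raw, untransformed rows straight into the target system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (:order_id, :amount, :status)", records)

# Transform: done later, inside the warehouse, using its own SQL engine
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT COUNT(*) AS orders, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'paid'
""")
print(conn.execute("SELECT * FROM daily_sales").fetchone())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;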

&lt;p&gt;&lt;em&gt;So, why would you choose ELT?&lt;/em&gt;&lt;br&gt;
The ELT approach offers several potential advantages, particularly in environments dealing with large data volumes and diverse data types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; The ELT process allows you to store new, unstructured data with ease, without transformation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Cloud warehouses enable quick data transformation due to their processing power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost efficiency:&lt;/strong&gt; Using the computing power of a cloud data warehouse for transformations can sometimes be more cost-effective than maintaining separate infrastructure, especially when the data warehouse offers optimized processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges of ELT
&lt;/h2&gt;

&lt;p&gt;ELT also comes with a fair amount of challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data governance and security:&lt;/strong&gt; Loading raw data, which might contain sensitive user information, into a data lake or data warehouse requires robust data governance and compliance measures. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Management:&lt;/strong&gt; If raw data loaded into a data lake isn't properly catalogued, the data lake can turn into a "data swamp" where data is hard to find, trust or use effectively. A strong data management strategy is crucial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality:&lt;/strong&gt; Since transformations occur later in the process, ensuring data quality might require dedicated steps post-loading. Monitoring and validating data within the target system becomes important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real world use cases
&lt;/h2&gt;

&lt;p&gt;This is how ELT can be used in the real world:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile Lending Applications:&lt;/strong&gt; Ingesting massive volumes of raw, unstructured user and transaction data from a mobile lending app directly into a data lake, then using the warehouse's computing power to transform specific segments of that data to train machine learning algorithms for credit scoring.&lt;br&gt;
&lt;strong&gt;Event Analytics:&lt;/strong&gt; Dumping massive volumes of raw website clickstream data or server logs directly into a cloud data warehouse as soon as they are generated. Transformations are only applied later when data analysts need to query specific user behaviors or run a security audit.&lt;br&gt;
&lt;strong&gt;Rapid Storing of Unstructured Data:&lt;/strong&gt; Loading new, completely unstructured data (like raw text, audio files, or social media feeds) directly into storage, providing immediate access to all raw information whenever it is needed for future  analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELT Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open-source tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ELT Platforms:&lt;/strong&gt; Airbyte&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrators:&lt;/strong&gt; Apache Airflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformation Framework:&lt;/strong&gt; data build tool (dbt)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Commercial tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ELT Platforms:&lt;/strong&gt; Matillion, Hevo Data, Weld&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connectors:&lt;/strong&gt; Fivetran&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Replication:&lt;/strong&gt; Stitch&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ETL vs. ELT
&lt;/h2&gt;

&lt;p&gt;The choice between ETL and ELT depends on several factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data complexity:&lt;/strong&gt; ETL is often used for complex transformations that require specialized tools and expertise.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skills and resources:&lt;/strong&gt; ETL requires specialized skills and resources for building and maintaining transformation pipelines. ELT may be easier to implement because it leverages the resources of cloud data warehouses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data volume:&lt;/strong&gt; ELT is generally better suited for large volumes of data because it leverages the processing power of cloud data warehouses for transformations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Target system:&lt;/strong&gt; ELT is best suited for cloud-based data warehouses and data lakes that have the processing power to handle transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To cap this off, in modern data engineering, transforming raw data into actionable insights requires robust data integration pipelines. The two dominant approaches for moving and preparing this data are ETL and ELT.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL (Extract, Transform, Load):&lt;/strong&gt; This traditional approach extracts raw data, cleans and structures it within an intermediate staging area, and finally loads it into a target database or data warehouse.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enforcing strict data quality, ensuring regulatory compliance/governance, and executing highly complex transformations—often used with legacy systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs:&lt;/strong&gt; Can suffer from scaling inefficiencies, rigid maintenance requirements, and processing bottlenecks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;ELT (Extract, Load, Transform):&lt;/strong&gt; This modern approach extracts raw data and loads it directly into a data lake or cloud data warehouse without prior staging. Transformations are performed post-load, leveraging the massive computational power of the destination system.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Handling massive data volumes, quickly ingesting unstructured data and minimizing latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs:&lt;/strong&gt; Requires robust security measures to protect sensitive raw data and strict cataloging to prevent the data lake from degrading into an unmanageable mess.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In conclusion, the choice between the two processes depends heavily on one's specific needs. ETL remains the standard for complex transformations where data quality must be guaranteed prior to storage. Conversely, ELT has emerged as the preferred choice for modern, cloud-based environments dealing with massive, diverse datasets where speed and flexibility are the top priorities.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
