<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mwirigi Eric</title>
    <description>The latest articles on Forem by Mwirigi Eric (@emwirigi).</description>
    <link>https://forem.com/emwirigi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3239782%2F299fa1e4-2f4a-4a29-97e3-29fdb85e8149.png</url>
      <title>Forem: Mwirigi Eric</title>
      <link>https://forem.com/emwirigi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/emwirigi"/>
    <language>en</language>
    <item>
      <title>Change Data Capture (CDC) in Data Engineering: Concepts, Tools and Real-world Implementation Strategies</title>
      <dc:creator>Mwirigi Eric</dc:creator>
      <pubDate>Thu, 18 Sep 2025 13:12:08 +0000</pubDate>
      <link>https://forem.com/emwirigi/change-data-capture-cdc-in-data-engineering-concepts-tools-and-real-world-implementation-feg</link>
      <guid>https://forem.com/emwirigi/change-data-capture-cdc-in-data-engineering-concepts-tools-and-real-world-implementation-feg</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Change Data Capture?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.confluent.io/learn/change-data-capture/" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt;, Change Data Capture is the process of tracking all changes occurring in data sources (such as databases and data warehouses) and capturing them in destination systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why CDC Matters in Data Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CDC eliminates bulk-load updates by enabling incremental loading or real-time streaming of data changes to target destinations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It enables log-based efficiency by capturing changes directly from transaction logs, reducing system resource usage and keeping performance overhead low.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It facilitates real-time data movement, enabling zero-downtime database migrations and ensuring that up-to-date data is available for real-time analytics and reporting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CDC enables seamless synchronization of data across systems, which is crucial for time-sensitive decisions in high-velocity data environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How Does CDC Work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have a grasp of what CDC is and why it matters, let's explore how it actually works.&lt;/p&gt;

&lt;p&gt;Firstly, CDC can be initiated through two approaches, i.e. &lt;strong&gt;Push&lt;/strong&gt; and &lt;strong&gt;Pull&lt;/strong&gt;. In the &lt;strong&gt;Push&lt;/strong&gt; approach, a source system pushes data changes to one or more target systems, whereas in the &lt;strong&gt;Pull&lt;/strong&gt; approach, the target system regularly polls a source system and "pulls" all identified changes.&lt;/p&gt;

&lt;p&gt;Therefore, CDC works by identifying and recording change events happening in various data sources (such as databases), and transferring these changes from the source system in real-time/near real-time to a target system such as a data warehouse or a streaming platform e.g. Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDC Implementation Methods/Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CDC implementations can be accomplished through the following methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log-based CDC:&lt;/strong&gt; In this method, a CDC application processes the changes recorded in the database transaction logs, and shares the updates with other systems. This method is suitable for real-time data synchronization since it offers low latency and high accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trigger-based CDC:&lt;/strong&gt; In this method, triggers are executed once specific modifications occur in a database, and the changed data is then stored in a change/shadow table. The method is simple to implement; however, it burdens source systems since triggers are activated each time a transaction happens in the source table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-based CDC:&lt;/strong&gt; Changes are identified by polling a timestamp column in the source database, and the updates are delivered to target system(s). Time-based CDC is easy to implement; however, it puts additional load on the source system when timestamp polling occurs frequently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
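
&lt;p&gt;The time-based pattern above can be sketched in a few lines of Python. This is a minimal illustration, assuming the source table exposes an &lt;code&gt;updated_at&lt;/code&gt; timestamp column; the row data and function name are hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timezone

def poll_changes(rows, last_synced_at):
    """Time-based CDC sketch: return rows whose 'updated_at' is newer than
    the last sync watermark, plus the new watermark to persist."""
    changed = [r for r in rows if r["updated_at"] > last_synced_at]
    new_watermark = max((r["updated_at"] for r in changed), default=last_synced_at)
    return changed, new_watermark

# Illustrative rows standing in for a source table with a timestamp column.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 2, tzinfo=timezone.utc)
rows = [{"id": 1, "updated_at": t0}, {"id": 2, "updated_at": t1}]
changed, watermark = poll_changes(rows, last_synced_at=t0)  # picks up row 2 only
```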

&lt;p&gt;&lt;strong&gt;CDC Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CDC tool automates the tracking and replication of changes across systems by detecting data modifications (&lt;em&gt;updates, deletions or insertions&lt;/em&gt;) and replicating them to a target database or system.&lt;/p&gt;

&lt;p&gt;There are several tools that can be adopted for CDC implementation, such as Rivery, Hevo Data, Debezium and Oracle GoldenGate, each with different capabilities. This article will focus on &lt;strong&gt;Debezium&lt;/strong&gt;, and how it can be used for CDC integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Understanding Debezium&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;From the official documentation, &lt;a href="https://debezium.io/documentation/reference/stable/index.html" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; is a set of distributed services to capture changes in databases so that applications can see those changes and respond to them. It records all row-level changes within each database table in a &lt;em&gt;change event stream&lt;/em&gt;, and applications simply read these streams to see the change events in the same order in which they occurred. &lt;/p&gt;

&lt;p&gt;Debezium is built on top of &lt;strong&gt;Apache Kafka&lt;/strong&gt;, and provides a set of &lt;strong&gt;Kafka Connect&lt;/strong&gt; compatible connectors, which record the history of data changes in a Database Management System (DBMS) by detecting the changes as they occur and streaming a record of each change event into a Kafka topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;CDC Architecture with Kafka and Debezium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Debezium is commonly deployed by means of Apache Kafka Connect, which is a framework and runtime for implementing and operating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Source connectors, such as Debezium, that send records to Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sink connectors that propagate records from Kafka topics to other systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CDC Architecture is illustrated in the following figure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felx2vfnfzct5kauab7zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felx2vfnfzct5kauab7zt.png" alt="CDC architecture" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, when a new record is added to a database, the Debezium source connector detects and records the change, and pushes it to a Kafka topic. The sink connector then streams this record to a target system like a data warehouse, where it can be consumed by an application or service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;How CDC is Implemented Using Debezium and Apache Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Debezium captures changes from a database, it follows this workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Establishment:&lt;/strong&gt; Debezium connects to the database and positions itself in the transaction log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initial Snapshot:&lt;/strong&gt; For new connectors, Debezium typically performs an initial snapshot of the database to capture the current state before processing incremental changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Change Capture:&lt;/strong&gt; As database transactions occur, Debezium reads the transaction log and converts changes into events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event publishing:&lt;/strong&gt; Change events are published to Kafka topics, typically one topic per table by default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Management:&lt;/strong&gt; If used with Schema Registry, event schemas are registered and validated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumption:&lt;/strong&gt; Applications or sink connectors consume the change events from Kafka topics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; &lt;strong&gt;How to Implement PostgreSQL CDC Using the Debezium PostgreSQL Connector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Considered Architecture:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL database with logical replication enabled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Kafka for message streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka connect with the Debezium PostgreSQL connector.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Enable Logical Replication in PostgreSQL:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In the PostgreSQL configuration file (&lt;code&gt;postgresql.conf&lt;/code&gt;), update the following settings to appropriate values, and restart PostgreSQL to apply the changes.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wal_level = logical
max_wal_senders = 1
max_replication_slots = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Thereafter, grant the necessary replication permissions to the database user as follows:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;psql -U postgres -c "ALTER USER myuser WITH REPLICATION;"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Install Debezium PostgreSQL Connector:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download and install the &lt;a href="https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.7.0.Final/" rel="noopener noreferrer"&gt;Debezium PostgreSQL connector plugin&lt;/a&gt; in your Kafka Connect setup. Ensure Kafka Connect is properly configured and running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Configuring the Connector:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a JSON configuration file for the Debezium PostgreSQL connector, specifying connection details, replication settings, and the databases and tables to monitor, as follows:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    {
    "name": "postgres-connector",
    "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "myuser",
    "database.password": "mypassword",
    "database.dbname": "mydb",
    "database.server.name": "server1",
    "table.include.list": "public.users",
    "plugin.name": "pgoutput"
      }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Register/Deploy the Connector to Start Monitoring the Database:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Kafka Connect’s REST API running on port 8083 to deploy the connector configuration, as follows:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Content-Type: application/json" --data @connector-config.json http://localhost:8083/connectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5: Generating and Observing Change Events:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a 'customers' table in the database as follows:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   psql -U myuser -d mydb -c "CREATE TABLE customers (id SERIAL PRIMARY KEY, name            VARCHAR(255), email VARCHAR(255));"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Insert customer details in the table&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql -U myuser -d mydb -c "INSERT INTO customers(name,email) VALUES ('Tyler', 'tyler@example.com');"
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View the change event using Kafka's console consumer by running the following command:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic server1.public.customers --from-beginning
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The change event structure will appear as follows:&lt;br&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "schema":{...},
    "payload":{
        "before":null,
        "after":{"id":1, "name":"Tyler", "email":"tyler@example.com"},
        "op":"c",  // 'c' indicates create (insert)
        "ts_ms":1620000000001
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The event includes both "before" and "after" states. For an insert, "before" is null since the row didn't exist previously. The "op" field indicates the operation type, and "ts_ms" provides a timestamp.&lt;/p&gt;
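
&lt;p&gt;A downstream consumer typically inspects the &lt;code&gt;op&lt;/code&gt; field to decide how to apply an event. The following Python sketch applies a change event of the shape shown above to an in-memory "table"; the dictionary state and function name are illustrative, not part of the Debezium API.&lt;/p&gt;

```python
import json

def apply_change_event(state, raw_event):
    """Apply one Debezium-style change event to an in-memory key/value
    'table'. Handles create ('c'), update ('u'), snapshot read ('r')
    and delete ('d') operations."""
    payload = json.loads(raw_event)["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):       # upsert the 'after' image of the row
        row = payload["after"]
        state[row["id"]] = row
    elif op == "d":                  # remove the row using its 'before' image
        state.pop(payload["before"]["id"], None)
    return state

event = json.dumps({
    "payload": {
        "before": None,
        "after": {"id": 1, "name": "Tyler", "email": "tyler@example.com"},
        "op": "c",
        "ts_ms": 1620000000001,
    }
})
table = apply_change_event({}, event)  # table now holds Tyler's row under id 1
```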

&lt;p&gt;Updates and deletions of rows can be executed and their events viewed in a similar manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges and Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whilst the adoption of CDC plays a crucial role in an organization's data management process, it comes with several challenges that necessitate careful consideration during implementation. Some of these challenges include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Schema Evolution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Over time, databases undergo changes such as column additions, deletions and renamings, which can disrupt CDC workflows if not handled properly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In production, this challenge can be mitigated by adopting schema registries, which validate schemas for compatibility when changes occur.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Late Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In CDC, late data represents data that arrives after a batch or time window has passed. To handle this, several measures are adopted:

&lt;ul&gt;
&lt;li&gt;Adding source-based commit timestamps to messages;&lt;/li&gt;
&lt;li&gt;Defining a cut-off timestamp;&lt;/li&gt;
&lt;li&gt;Using a holding table for delayed messages.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
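
&lt;p&gt;The holding-table idea can be sketched as follows, assuming each message carries a source-based commit timestamp and that a watermark marks the latest finalized window; the names and thresholds are illustrative.&lt;/p&gt;

```python
def route_event(event, watermark_ms, main, holding):
    """Late-data routing sketch: events whose source commit timestamp falls
    before the already-finalized watermark arrived late and go to a holding
    table for later reconciliation; the rest join the current batch."""
    (holding if event["commit_ts_ms"] < watermark_ms else main).append(event)

main, holding = [], []
for ev in [{"id": 1, "commit_ts_ms": 1_700},   # on time
           {"id": 2, "commit_ts_ms": 900}]:    # late arrival
    route_event(ev, watermark_ms=1_000, main=main, holding=holding)
```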

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;Fault Tolerance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In CDC, fault tolerance can be ensured through the adoption of tools that support retries and error handling for failed events. Additionally, enabling persistence in message brokers like Kafka provides durability, so that no events are lost during system outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; &lt;strong&gt;Event Ordering:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Without proper event-ordering mechanisms in CDC, data inconsistencies and inaccuracies are bound to occur in target systems, since events are split and processed in a distributed manner. To address this, the following measures can be adopted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using a partitioned messaging system like Kafka, which guarantees per-partition ordering.&lt;/li&gt;
&lt;li&gt;Implementing merge logic in the target system that is both sorted and idempotent.&lt;/li&gt;
&lt;li&gt;Processing data in partition-aware batches in the target system.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;/ul&gt;
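
&lt;p&gt;The sorted, idempotent merge logic mentioned above can be sketched like this: a last-write-wins upsert keyed on the record key, ordered by source timestamp, so replays and out-of-order delivery converge to the same final state. The event shape is illustrative.&lt;/p&gt;

```python
def merge_events(current, events):
    """Sorted, idempotent merge: apply change events in source-timestamp
    order and skip anything older than what is already applied, so
    re-processing the same events leaves the state unchanged."""
    state = dict(current)
    for ev in sorted(events, key=lambda e: e["ts_ms"]):
        key = ev["key"]
        if key in state and state[key]["ts_ms"] >= ev["ts_ms"]:
            continue  # already applied (or superseded): idempotency
        if ev["op"] == "d":
            state.pop(key, None)
        else:
            state[key] = ev
    return state

events = [
    {"key": "u1", "op": "u", "ts_ms": 200, "value": "new"},
    {"key": "u1", "op": "c", "ts_ms": 100, "value": "old"},  # delivered out of order
]
state = merge_events({}, events)  # sorting ensures the ts_ms=200 update wins
```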

</description>
      <category>cdc</category>
      <category>debezium</category>
      <category>kafka</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Mwirigi Eric</dc:creator>
      <pubDate>Mon, 08 Sep 2025 15:16:03 +0000</pubDate>
      <link>https://forem.com/emwirigi/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-c5j</link>
      <guid>https://forem.com/emwirigi/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-c5j</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Kafka is a crucial component in modern data engineering, and a good understanding of its core concepts, architecture and applications is essential for any data engineer in today's data-driven world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Apache Kafka?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;official Kafka website&lt;/a&gt;, Apache Kafka is defined as a distributed event streaming platform. But what is event streaming? From the &lt;a href="https://kafka.apache.org/documentation/" rel="noopener noreferrer"&gt;Kafka documentation&lt;/a&gt;, &lt;strong&gt;event streaming&lt;/strong&gt; is the practice of capturing data in real-time from various sources such as databases, sensors and cloud services.&lt;/p&gt;

&lt;p&gt;Thus, simply put, Apache Kafka provides a platform that handles and processes real-time data, whereby the platform runs as a cluster of one or more nodes, making it scalable and fault-tolerant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka Core Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Kafka Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brokers&lt;/strong&gt; - A set of servers that run Kafka and form the storage layer; they store and serve requested data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZooKeeper&lt;/strong&gt; - A centralized service for managing and coordinating Kafka brokers. In a cluster, it tracks broker membership, topic configuration, leader election and metadata related to Kafka topics, brokers and consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KRaft&lt;/strong&gt; - KRaft is a protocol that replaces the ZooKeeper-based metadata management system. With this protocol, Kafka brokers manage metadata internally without the need for an external system like ZooKeeper.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Topics, Partitions and Offsets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topics:&lt;/strong&gt; These are logical channels to which events are published, similar to a database table. Topics are multi-producer and multi-subscriber, and for scalability purposes are split into partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitions:&lt;/strong&gt; When a Kafka topic is created, it is divided into one or more partitions, each functioning as an independent, ordered log that holds a subset of the topic's data, allowing for efficient parallel data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offset:&lt;/strong&gt; In Kafka, each record in a partition is assigned a unique identifier (offset). During writing, each new message written to a topic is appended to the log and Kafka assigns it the next sequential offset. When reading, consumers can specify an offset from which to start reading messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;Producers:&lt;/strong&gt; Kafka producers are client applications that publish (write) data to Kafka topics, whereby the data is sent to the appropriate topic and partition based on either &lt;strong&gt;key-based partitioning&lt;/strong&gt; (&lt;em&gt;which uses a consistent hashing mechanism on the key of the message, ensuring that all messages with the same key go to the same partition&lt;/em&gt;) or &lt;strong&gt;round-robin partitioning&lt;/strong&gt; (&lt;em&gt;in which Kafka distributes messages across partitions in a rotating manner&lt;/em&gt;).&lt;/p&gt;
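
&lt;p&gt;Key-based partitioning boils down to a stable hash of the key modulo the partition count, so the same key always maps to the same partition. A minimal sketch; note that Kafka's default partitioner actually uses murmur2, and CRC32 stands in here purely for illustration.&lt;/p&gt;

```python
import zlib

def partition_for(key, num_partitions):
    """Key-based partitioning sketch: a stable hash of the key modulo the
    partition count, so all messages with the same key land on the same
    partition. (Kafka's default partitioner uses murmur2, not CRC32.)"""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("user-42", 6)
p2 = partition_for("user-42", 6)  # same key, same partition
```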

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acknowledgement Modes (acks)&lt;/strong&gt; - A Kafka producer can choose to receive acknowledgement/confirmation of data writes in 3 modes, namely:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;acks=0:&lt;/strong&gt; The producer sends data without waiting for an acknowledgement, hence it is prone to data loss if data is sent to a broker that is down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;acks=1:&lt;/strong&gt; The producer sends data and the partition leader confirms receipt. If the acknowledgement is not received, the producer retries sending the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;acks=all:&lt;/strong&gt; In this scenario, both the leader and the in-sync replicas are required to acknowledge receipt of messages.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; &lt;strong&gt;Consumers:&lt;/strong&gt; Kafka consumers are client applications that subscribe to Kafka topics and process the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consumer groups:&lt;/strong&gt; consist of client applications (consumers) responsible for reading messages from a topic across multiple partitions through polling, allowing consumers to process events from a topic in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer group offset:&lt;/strong&gt; For each topic and partition, the consumer group stores a number that represents the latest consumed record from Kafka. Hence, a consumer can keep track of where it is in a topic. Consumer offsets can be managed by the following strategies:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-commit&lt;/strong&gt; - the &lt;code&gt;enable.auto.commit = true&lt;/code&gt; property enables Kafka to commit the consumer offset to the Kafka cluster periodically, as defined by &lt;code&gt;auto.commit.interval.ms&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual commit&lt;/strong&gt; - manual offset commit, defined by &lt;code&gt;enable.auto.commit = false&lt;/code&gt;, allows control over when offsets are recorded.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
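
&lt;p&gt;The difference manual commit makes can be illustrated with a small simulation: by committing only after a record is processed, a restart re-delivers unprocessed records rather than skipping them. The record shape and function are illustrative, not the Kafka client API.&lt;/p&gt;

```python
def consume(records, process, committed_offset=-1):
    """Manual-commit sketch: 'commit' an offset only after the record has
    been processed, so a crash before the commit re-delivers the record
    (at-least-once) instead of silently skipping it."""
    for record in records:
        if record["offset"] <= committed_offset:
            continue  # already processed in a previous run
        process(record)
        committed_offset = record["offset"]  # commit after processing
    return committed_offset

seen = []
records = [{"offset": i, "value": f"v{i}"} for i in range(3)]
last = consume(records, seen.append)              # processes offsets 0, 1, 2
last_again = consume(records, seen.append, last)  # restart: nothing new to do
```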

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt; &lt;strong&gt;Kafka Message Delivery and Processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message delivery semantics:&lt;/strong&gt; Kafka's semantic guarantees refer to how the broker, producer and consumer agree to share messages. According to the &lt;a href="https://docs.confluent.io/kafka/design/delivery-semantics" rel="noopener noreferrer"&gt;Kafka documentation&lt;/a&gt;, messages can be shared in the following ways:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At-most-once&lt;/strong&gt; - Messages are delivered at most once, and if there is a system failure, the messages are lost and are not re-delivered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At-least-once&lt;/strong&gt; - Messages are delivered one or more times, and in case of a failure there is no loss, though duplicates are possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once&lt;/strong&gt; - This is the preferred mode: each message is delivered exactly once, with no loss and no message read twice, even if some part of the system fails.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
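
&lt;p&gt;In practice, at-least-once delivery combined with an idempotent consumer yields effectively exactly-once results. A small sketch, where the message IDs and dedup set are illustrative:&lt;/p&gt;

```python
def deliver(messages, processed_ids, outputs):
    """At-least-once delivery can duplicate messages; tracking processed IDs
    makes the consumer idempotent, giving effectively exactly-once output."""
    for msg in messages:
        if msg["id"] in processed_ids:
            continue  # duplicate redelivery, already handled
        outputs.append(msg["value"])
        processed_ids.add(msg["id"])

outputs, processed = [], set()
deliver([{"id": 1, "value": "a"},
         {"id": 1, "value": "a"},   # redelivered duplicate
         {"id": 2, "value": "b"}],
        processed, outputs)
```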

&lt;p&gt;&lt;strong&gt;6.&lt;/strong&gt; &lt;strong&gt;Retention Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Kafka, retention refers to the policy that determines how long Kafka keeps data/records in a topic before they become eligible for deletion. The following retention policies are provided by Kafka:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time-based Retention:&lt;/strong&gt; Records are retained for a specified duration, e.g. 7 days, after which the oldest records are deleted. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size-based Retention:&lt;/strong&gt; Records are retained until the total size of log segments reaches a specified limit, e.g. 1GB, after which the oldest records are deleted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Compaction:&lt;/strong&gt; In log compaction, Kafka retains only the latest value for each key in a topic, discarding older updates. This mode is useful for use cases where a complete and up-to-date view of a dataset is needed, for example when maintaining the latest state of a user profile.&lt;/li&gt;
&lt;/ul&gt;
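
&lt;p&gt;Log compaction can be illustrated in a few lines of Python: replay the log oldest-to-newest, keep the latest value per key, and treat a &lt;code&gt;None&lt;/code&gt; value as a tombstone. This is a simplified model of the idea, not Kafka's actual implementation.&lt;/p&gt;

```python
def compact(log):
    """Log-compaction sketch: keep only the latest record per key; a record
    with value None acts as a tombstone that deletes the key."""
    latest = {}
    for key, value in log:  # log is ordered oldest -> newest
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2"), ("user2", None)]
compacted = compact(log)  # user1 keeps its newest value, user2 is deleted
```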

&lt;p&gt;&lt;strong&gt;7.&lt;/strong&gt; &lt;strong&gt;Back Pressure and Flow Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In distributed systems, back pressure is a mechanism for regulating data flow in order to prevent overloads to parts of the system. Simply put, back pressure acts like a traffic control system, regulating the speed and volume of data flow to maintain optimal performance and reliability across the entire system.&lt;/p&gt;

&lt;p&gt;During operations, a system sometimes experiences &lt;strong&gt;consumer lags&lt;/strong&gt; &lt;em&gt;(delay in time it takes a message to move from a producer to a consumer)&lt;/em&gt;, therefore it is essential to perform &lt;strong&gt;Consumer Lag monitoring&lt;/strong&gt;, so that slow consumers can be identified quickly and remedial action taken.&lt;/p&gt;

&lt;p&gt;To undertake consumer lag monitoring, several methods can be used, among them the &lt;strong&gt;Offset Explorer tool&lt;/strong&gt; and &lt;strong&gt;managed Kafka monitoring services&lt;/strong&gt; such as Amazon MSK's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8.&lt;/strong&gt; &lt;strong&gt;Serialization and Deserialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serialization converts data objects to binary format suitable for transmission or storage. Deserialization involves converting the binary data back to its original object form. The following schema formats are adopted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JSON Schema:&lt;/strong&gt; With JSON Schema, the producer converts an application's data object into a JSON string and serializes it into bytes before sending. On receiving, the consumer retrieves the JSON Schema using the extracted schema ID and deserializes the data back into the application's data object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avro Schema:&lt;/strong&gt; Avro uses the &lt;strong&gt;KafkaAvroSerializer&lt;/strong&gt; to convert data into binary format. On receiving, the consumer uses the Kafka Avro deserializer to extract the schema ID, queries the schema registry to retrieve the Avro schema, and uses it to convert the binary data back into the data object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protobuf Schema:&lt;/strong&gt; The producer application first creates a Protobuf message, and the &lt;strong&gt;KafkaProtobufSerializer&lt;/strong&gt; converts the message into binary format. On receiving the binary message, a consumer uses the &lt;strong&gt;KafkaProtobufDeserializer&lt;/strong&gt;, which extracts the schema ID, retrieves the schema and uses it to deserialize the binary data into a structured Protobuf message object. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
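
&lt;p&gt;All three formats share the same framing when used with Confluent's Schema Registry serializers: a magic byte, a 4-byte big-endian schema ID, then the encoded payload. The following is a simplified Python model using JSON as the payload encoding; the real serializers also talk to the registry to fetch schemas by ID.&lt;/p&gt;

```python
import json
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker

def serialize(obj, schema_id):
    """Frame a payload the way Confluent serializers do: magic byte,
    4-byte big-endian schema ID, then the encoded payload."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + json.dumps(obj).encode("utf-8")

def deserialize(data):
    """Split the frame back into (schema_id, object); a real consumer would
    use the schema ID to fetch the schema from the registry."""
    magic, schema_id = struct.unpack(">bI", data[:5])
    assert magic == MAGIC_BYTE, "not a schema-registry framed message"
    return schema_id, json.loads(data[5:].decode("utf-8"))

frame = serialize({"id": 1, "name": "Tyler"}, schema_id=7)
schema_id, obj = deserialize(frame)  # round-trips back to the original object
```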

&lt;p&gt;&lt;strong&gt;9.&lt;/strong&gt; &lt;strong&gt;Replication and Fault Tolerance&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Replication involves maintaining a copy of every topic partition on multiple brokers, ensuring that the data is accessible even in the event of broker failure, and thus providing fault tolerance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leader replica:&lt;/strong&gt; A leader replica receives all write requests for a partition. All produce and consume requests go through the leader, for consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follower replica:&lt;/strong&gt; Followers replicate data from the leader by fetching log segments and applying them to their local logs. If the current leader fails, a follower can take over as leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-Sync Replicas (ISR):&lt;/strong&gt; These are the subset of replicas synchronized with the leader; Kafka guarantees durability by ensuring that messages are only committed once they are replicated to all in-sync replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;10.&lt;/strong&gt; &lt;strong&gt;Kafka Connect&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Kafka Connect serves as a centralized data hub for simple data integration between databases, key-value stores, file systems and more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Connectors:&lt;/strong&gt; Pull/ingest data from external sources such as databases and message queues into Kafka topics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sink Connectors:&lt;/strong&gt; Deliver data from Kafka topics to external systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;11.&lt;/strong&gt; &lt;strong&gt;Kafka Streams&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Streams API:&lt;/strong&gt; This is a powerful, lightweight library for building real-time, scalable and fault-tolerant stream processing applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless versus Stateful Operations:&lt;/strong&gt; Stateless operations transform individual records in the input streams independently, whereas Stateful operations maintain and update state based on the records processed so far, enabling operations like aggregations and joins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windowing:&lt;/strong&gt; The windowing concept involves processing and aggregating data streams in pre-determined time frames, making windowed joins and aggregations possible.&lt;/li&gt;
&lt;/ul&gt;
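
&lt;p&gt;Tumbling (fixed, non-overlapping) windows are the simplest windowing flavor. The following Python sketch computes a windowed count analogous to what the Streams API produces; the event shape and names are illustrative.&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_ms):
    """Tumbling-window aggregation sketch: bucket events into fixed,
    non-overlapping windows by timestamp and count events per (window, key)."""
    counts = defaultdict(int)
    for ev in events:
        window_start = (ev["ts_ms"] // window_size_ms) * window_size_ms
        counts[(window_start, ev["key"])] += 1
    return dict(counts)

events = [
    {"key": "clicks", "ts_ms": 100},
    {"key": "clicks", "ts_ms": 900},
    {"key": "clicks", "ts_ms": 1_200},  # falls into the next window
]
counts = tumbling_window_counts(events, window_size_ms=1_000)
```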

&lt;p&gt;&lt;strong&gt;12.&lt;/strong&gt; &lt;strong&gt;ksqlDB&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;ksqlDB enables the querying, reading, writing and processing of data in real-time and at scale using SQL-like syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;13.&lt;/strong&gt; &lt;strong&gt;Transactions and Idempotency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka transactions ensure atomicity by allowing producers to group multiple write operations into a single transaction. If the transaction commits successfully, all the writes are visible to consumers; if it aborts, none of the writes are visible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-Once Semantics:&lt;/strong&gt; Ensures that each message is processed exactly once, even in the event of failure. This is achieved through idempotent producers and transactional consumers configured with isolation levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation Levels:&lt;/strong&gt; &lt;strong&gt;read-uncommitted&lt;/strong&gt; allows consumers to see all records including aborted transactions, whereas &lt;strong&gt;read-committed&lt;/strong&gt; allows consumers to only see records from committed transactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;14.&lt;/strong&gt; &lt;strong&gt;Security in Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSL/TLS:&lt;/strong&gt; The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols provide encryption and authentication for data in transit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SASL:&lt;/strong&gt; The Simple Authentication and Security Layer (SASL) provides a framework for adding authentication and data security services to connection-based protocols.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;15.&lt;/strong&gt; &lt;strong&gt;Operations and Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consumer Lag Monitoring:&lt;/strong&gt; Consumer lags can be monitored by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consumer group script&lt;/strong&gt; - the kafka-consumer-groups script exposes key details about a consumer group's performance, listing each partition's current offset, log-end offset and lag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using Burrow&lt;/strong&gt; - Burrow is an open-source monitoring tool that monitors committed consumer offsets and generates reports.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Under-replicated Partitions:&lt;/strong&gt; This Kafka broker metric can reveal various problems, from broker failures to resource exhaustion; a common response to fluctuations is running a preferred replica election.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throughput and Latency:&lt;/strong&gt; Throughput is the number of messages that can be processed in a given amount of time, whereas latency is the time a message takes to travel from producer to consumer.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
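&lt;p&gt;The lag figure these tools report is simply the log-end offset minus the group's committed offset, per partition. A small Python sketch with made-up offsets:&lt;/p&gt;

```python
# Sketch: computing consumer lag per partition, the same figure the
# kafka-consumer-groups script reports. Offsets below are illustrative.

def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus committed consumer offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

log_end = {0: 1500, 1: 980, 2: 2100}    # latest offset written per partition
committed = {0: 1500, 1: 900, 2: 1750}  # last offset committed by the group

lags = consumer_lag(log_end, committed)
print(lags)  # a lag that keeps growing signals consumers falling behind
```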

&lt;p&gt;&lt;strong&gt;16.&lt;/strong&gt; &lt;strong&gt;Scaling Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition Count Tuning:&lt;/strong&gt; Involves determining and adjusting the optimal number of partitions for a given Kafka topic, while considering throughput and parallelism, resource utilization and broker limitations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adding Brokers:&lt;/strong&gt; Kafka's architecture allows broker nodes to be added to a cluster as data processing needs grow. This horizontal scalability offers increased throughput, fault tolerance and the ability to handle large data volumes. Guidelines for adding brokers can be found here: &lt;a href="https://docs.confluent.io/platform/current/get-started/tutorial-multi-broker.html" rel="noopener noreferrer"&gt;setting up a multi-broker kafka cluster&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rebalancing Partitions:&lt;/strong&gt; When a consumer joins or leaves a consumer group, Kafka automatically reassigns partitions among the consumers in that group to ensure an even distribution of workload.&lt;/li&gt;
&lt;/ul&gt;
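&lt;p&gt;The effect of a rebalance can be sketched in plain Python by distributing partitions round-robin among whichever consumers are currently in the group. This is a simplification for illustration only; Kafka's built-in assignors (range, round-robin, sticky) are more sophisticated, and the consumer names here are made up:&lt;/p&gt;

```python
# Sketch: round-robin style partition assignment across a consumer group.

def assign_partitions(partitions, consumers):
    """Spread partitions across consumers as evenly as possible."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions, three consumers: each consumer gets two partitions.
before = assign_partitions(range(6), ["c1", "c2", "c3"])
# One consumer leaves; a rebalance redistributes over the remaining two.
after = assign_partitions(range(6), ["c1", "c2"])
print(before)
print(after)
```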

&lt;p&gt;&lt;strong&gt;17.&lt;/strong&gt; &lt;strong&gt;Performance Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batching:&lt;/strong&gt; In batching, a Kafka producer groups multiple messages destined for the same partition into a single batch. &lt;strong&gt;batch.size&lt;/strong&gt; defines the maximum size in bytes of a batch, whereas &lt;strong&gt;linger.ms&lt;/strong&gt; specifies the maximum time in milliseconds the producer will wait to accumulate messages for a batch before sending it, even if the batch size has not been reached.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compression:&lt;/strong&gt; Before sending a batch of messages to the Kafka brokers, the producer can compress it by setting &lt;strong&gt;compression.type&lt;/strong&gt;, which specifies the compression algorithm to use, e.g. gzip, zstd, or snappy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page Cache Usage:&lt;/strong&gt; The page cache is a transparent buffer maintained by the OS that keeps recently accessed file data in memory. During writes, data goes to the page cache first and the OS flushes these pages to disk asynchronously. During reads, consumer requests are served from the page cache, with the OS handling the prefetching of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk and Network Considerations:&lt;/strong&gt; On the disk side, SSDs and NVMe (non-volatile memory express) drives offer faster read/write speeds. On the network side, high-bandwidth, low-latency infrastructure and the distribution of partitions and topics across multiple brokers are essential.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
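&lt;p&gt;The interplay between &lt;strong&gt;batch.size&lt;/strong&gt; and &lt;strong&gt;linger.ms&lt;/strong&gt; can be illustrated with a simplified Python simulation. The thresholds below are illustrative (not Kafka defaults), and unlike a real producer, which uses a background sender thread, this sketch only checks the linger deadline when a message arrives:&lt;/p&gt;

```python
import time

# Sketch: a batch is sent when it reaches max_bytes OR when linger_ms
# elapses since the first message was appended, whichever comes first.

class BatchingProducer:
    def __init__(self, max_bytes=64, linger_ms=100):
        self.max_bytes = max_bytes        # analogous to batch.size
        self.linger_ms = linger_ms        # analogous to linger.ms
        self.batch, self.batch_bytes = [], 0
        self.sent = []                    # each entry = one network request
        self.first_append = None

    def send(self, msg: bytes):
        now = time.monotonic()
        if self.first_append is None:
            self.first_append = now
        self.batch.append(msg)
        self.batch_bytes += len(msg)
        linger_expired = (now - self.first_append) * 1000 >= self.linger_ms
        if self.batch_bytes >= self.max_bytes or linger_expired:
            self.flush()

    def flush(self):
        if self.batch:
            self.sent.append(list(self.batch))
            self.batch, self.batch_bytes = [], 0
            self.first_append = None

producer = BatchingProducer(max_bytes=10)
for m in [b"aaaa", b"bbbb", b"cccc"]:  # 12 bytes total crosses the threshold
    producer.send(m)
producer.flush()
print(len(producer.sent))  # all three messages went out in one request
```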

&lt;p&gt;&lt;strong&gt;Apache Kafka Real-world Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Netflix&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Data streaming:&lt;/strong&gt; Netflix leverages Apache Kafka's capabilities to handle real-time data streams that include user interactions, content consumption, system logs and operational metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices communication:&lt;/strong&gt; In Netflix’s microservices architecture, Kafka is used to enable asynchronous communication between services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log aggregation and Monitoring:&lt;/strong&gt; Kafka is instrumental in aggregating logs from the different microservices and components within Netflix’s architecture, enabling centralized logging and real-time monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Uber&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Pricing:&lt;/strong&gt; Uber adopts Kafka in its dynamic pricing pipeline, which adjusts prices based on factors like driver availability, location, and weather to manage supply and demand.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apachekafka</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Docker and Docker Compose: A Beginner's Guide</title>
      <dc:creator>Mwirigi Eric</dc:creator>
      <pubDate>Mon, 25 Aug 2025 13:49:43 +0000</pubDate>
      <link>https://forem.com/emwirigi/docker-and-docker-compose-a-beginners-guide-1lgd</link>
      <guid>https://forem.com/emwirigi/docker-and-docker-compose-a-beginners-guide-1lgd</guid>
      <description>&lt;p&gt;This guide tries to simplify the intricacies of docker and docker compose with the aim of enabling a solid grasp of the fundamentals for beginners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Docker?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker is a platform/tool that simplifies the development, running and shipping of applications.&lt;/li&gt;
&lt;li&gt;Docker achieves this through OS-level virtualization: containers share the kernel of the computer on which Docker is installed and running, while remaining isolated from one another.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What does Docker actually do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker provides the ability to package and run an application in an isolated environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why do we need Docker in Developing Applications?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker allows you to separate your applications from your infrastructure, thus enabling fast deployment of the applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving deeper into Docker, you can download it from the following link, which also provides guidelines for installing and running Docker on various operating systems: &lt;a href="https://www.docker.com/get-started" rel="noopener noreferrer"&gt;Download Docker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Key Words/Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Engine/Daemon&lt;/strong&gt; - Listens for requests from docker client(s), manages containers and coordinates docker operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Client&lt;/strong&gt; - Is a command line interface (CLI) through which users execute commands to manage their images and containers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker Registry&lt;/strong&gt; - Stores container images. An example is  &lt;a href="https://hub.docker.com" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt; - A text file that contains instructions describing how to build a docker image. Typically, a Dockerfile contains the following instructions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;FROM&lt;/em&gt;&lt;/strong&gt; &amp;lt;&lt;strong&gt;&lt;em&gt;image&lt;/em&gt;&lt;/strong&gt;&amp;gt;: specifies the base image to build from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;WORKDIR&lt;/em&gt;&lt;/strong&gt; &amp;lt;&lt;strong&gt;&lt;em&gt;path&lt;/em&gt;&lt;/strong&gt;&amp;gt;: specifies the working directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;COPY&lt;/em&gt;&lt;/strong&gt; &amp;lt;&lt;strong&gt;&lt;em&gt;host-path&lt;/em&gt;&lt;/strong&gt;&amp;gt; &amp;lt;&lt;strong&gt;&lt;em&gt;image-path&lt;/em&gt;&lt;/strong&gt;&amp;gt;: Instructs the builder to copy files from the host and place them into the container image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;RUN&lt;/em&gt;&lt;/strong&gt; &amp;lt;&lt;strong&gt;&lt;em&gt;command&lt;/em&gt;&lt;/strong&gt;&amp;gt;: Instructs the builder to run the specified command.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;CMD&lt;/em&gt;&lt;/strong&gt; &amp;lt;&lt;strong&gt;&lt;em&gt;command&lt;/em&gt;&lt;/strong&gt;&amp;gt;, &amp;lt;&lt;strong&gt;&lt;em&gt;arg1&lt;/em&gt;&lt;/strong&gt;&amp;gt;: Sets the default command run by containers using this image.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;A typical Dockerfile will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    FROM python:3.10

    WORKDIR /app

    COPY requirements.txt .

    RUN pip install -r requirements.txt

    COPY . .

    CMD ["python", "main.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image(s)&lt;/strong&gt; - An image is a blueprint/template for creating a container, i.e. it contains the instructions for creating a container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container(s)&lt;/strong&gt; - A container is an instance of an image. It contains everything required to run a particular application, including code, libraries, system tools etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To build the image and run a container from the Terminal, run the following commands:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t demo .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;where -t denotes the tag, which sets the name given to your image, i.e. demo&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name demo1 -p 5000:5000 demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;--name sets the name of the container to run, and -p maps a container port to a port on localhost (host:container)&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Other common commands include:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;docker ps&lt;/code&gt; - lists running containers&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker images&lt;/code&gt; - lists local images&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker stop &amp;lt;id&amp;gt;&lt;/code&gt; - stops a running container specified by its id.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker rm &amp;lt;id&amp;gt;&lt;/code&gt; - removes a container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Docker Compose?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker Compose is a tool for defining and running multi-container applications, with all the containers managed in a single &lt;code&gt;docker-compose.yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what's the difference between Docker and Docker Compose?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker builds, runs and manages individual containers (containers package applications with their dependencies), whereas Docker Compose simplifies the management of multi-container applications using a single &lt;code&gt;docker-compose.yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components of  Docker Compose YAML File&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version&lt;/strong&gt; - Specifies the version of the docker compose file format. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Services&lt;/strong&gt; - Specifies the individual containers or services making up an application. Each service may have a given docker image, environment variables, volumes, networks, ports to expose, and more options.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Volumes&lt;/strong&gt; - Defines shared volumes that could be mounted into services to persist data. Volumes are crucial in storing data that survives container restarts or must be shared between many containers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Networks&lt;/strong&gt; - Defines custom networks for your services to communicate over. A network creates an isolated communications channel, and each service can have zero or more networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
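&lt;p&gt;Putting these components together, a minimal &lt;code&gt;docker-compose.yaml&lt;/code&gt; might look like the following. The service names, images, ports and credentials here are purely illustrative; the &lt;code&gt;web&lt;/code&gt; service assumes a Dockerfile like the one shown earlier sits in the same directory:&lt;/p&gt;

```yaml
version: "3.8"

services:
  web:                       # built from the Dockerfile in the current directory
    build: .
    ports:
      - "5000:5000"          # host:container port mapping
    environment:
      - DATABASE_URL=postgresql://app:secret@db:5432/appdb
    depends_on:
      - db
    networks:
      - backend

  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=app
      - POSTGRES_PASSWORD=secret
      - POSTGRES_DB=appdb
    volumes:
      - db-data:/var/lib/postgresql/data   # data survives container restarts
    networks:
      - backend

volumes:
  db-data:

networks:
  backend:
```

&lt;p&gt;Running &lt;code&gt;docker compose up&lt;/code&gt; in the same directory starts both services on a shared network, with the database reachable from &lt;code&gt;web&lt;/code&gt; by its service name &lt;code&gt;db&lt;/code&gt;.&lt;/p&gt;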

</description>
      <category>docker</category>
      <category>dockercompose</category>
    </item>
    <item>
      <title>A Recap of Data Engineering Concepts</title>
      <dc:creator>Mwirigi Eric</dc:creator>
      <pubDate>Mon, 11 Aug 2025 13:38:42 +0000</pubDate>
      <link>https://forem.com/emwirigi/a-recap-of-data-engineering-concepts-325</link>
      <guid>https://forem.com/emwirigi/a-recap-of-data-engineering-concepts-325</guid>
      <description>&lt;p&gt;As a data Engineer, an in-depth comprehension of the core concepts applied in the field, plays a pivotal role in carrying out daily tasks as well as career progression in the field.&lt;/p&gt;

&lt;p&gt;In this article, we explore the concepts and aim to expound their meanings, applications and relevance in the field of Data Engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt;  &lt;strong&gt;Batch versus Streaming Ingestion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data Ingestion&lt;/em&gt;&lt;/strong&gt; refers to the process of collecting data from various sources and moving it to a target destination, either in batches or in real-time. Primarily, the methods adopted for data ingestion are batch and streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Batch Ingestion&lt;/em&gt;&lt;/strong&gt; involves the collection and processing of data in chunks, where the process can either run on a schedule or be triggered automatically.&lt;/p&gt;

&lt;p&gt;The method is effective for resource-intensive jobs and repetitive tasks, and is ideal for applications such as data warehousing and ETL processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Streaming Ingestion&lt;/em&gt;&lt;/strong&gt; involves the collection and processing of data as it is received, i.e. in real-time, from source to target.&lt;/p&gt;

&lt;p&gt;The method is effective for applications that require real-time data processing such as network traffic, fraud detection and mobile money services such as Mpesa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt;  &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a data integration pattern that captures only the changes made to data (through insert, update and delete statements), and represents these changes as a list called a CDC feed.&lt;/p&gt;

&lt;p&gt;The CDC approach focuses on incremental changes, hence ensuring that target systems access the most current data at all times.&lt;/p&gt;

&lt;p&gt;CDC is usually implemented through four methods namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log-based method, which reads a database's transaction logs.&lt;/li&gt;
&lt;li&gt;Trigger-based method, which uses database triggers attached to source table events.&lt;/li&gt;
&lt;li&gt;Time-based method, which relies on dedicated columns that record the last modified time for each record.&lt;/li&gt;
&lt;li&gt;Polling-based method, which checks for changes based on a timestamp or version column.&lt;/li&gt;
&lt;/ul&gt;
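&lt;p&gt;As a minimal sketch of the time-based method, the following Python snippet polls an in-memory "table" for rows modified after a watermark, then advances the watermark for the next pass (the table and timestamps are illustrative):&lt;/p&gt;

```python
from datetime import datetime

# Sketch: one time-based CDC pass over an illustrative source table.

rows = [
    {"id": 1, "name": "Alice", "last_modified": datetime(2025, 1, 1)},
    {"id": 2, "name": "Bob",   "last_modified": datetime(2025, 1, 5)},
    {"id": 3, "name": "Carol", "last_modified": datetime(2025, 1, 9)},
]

def capture_changes(table, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    changed = [r for r in table if r["last_modified"] > watermark]
    new_watermark = max((r["last_modified"] for r in changed), default=watermark)
    return changed, new_watermark

# Only rows modified after the previous run's watermark enter the CDC feed.
feed, watermark = capture_changes(rows, datetime(2025, 1, 3))
print([r["id"] for r in feed])
```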

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt;  &lt;strong&gt;Idempotency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Idempotency is the property of an operation producing the same result regardless of the number of times it is run. This ensures the repeatability and predictability of data processes, thus ensuring data consistency across distributed systems, as well as effective handling of failures and retries.&lt;/p&gt;
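&lt;p&gt;A minimal illustration: an upsert keyed on a primary key is idempotent, because re-running the same batch leaves the target in exactly the same state (the records below are made up):&lt;/p&gt;

```python
# Sketch: an idempotent load step. Upserting by primary key means a retry
# after a failure produces the same target state as the first run.

def upsert(target: dict, batch: list) -> dict:
    for record in batch:
        target[record["id"]] = record  # insert or overwrite by primary key
    return target

batch = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
store = {}
upsert(store, batch)
first_run = dict(store)

upsert(store, batch)        # a retry of the same batch...
assert store == first_run   # ...leaves the state unchanged
print(len(store))
```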

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt;  &lt;strong&gt;OLTP versus OLAP Data Processing Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Online Transaction Processing (OLTP)&lt;/em&gt;&lt;/strong&gt; systems are designed for real-time transaction management for concurrent users, focusing on data consistency and fast retrieval of records. Examples of OLTP use cases include ATM withdrawals, online banking and e-commerce purchases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Online Analytical Processing (OLAP)&lt;/em&gt;&lt;/strong&gt; systems are designed to analyze aggregated historical data from various sources, for the purposes of data analysis, reporting, and business intelligence. A real-world use case of an OLAP system is the Netflix movie recommendation system.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt;  &lt;strong&gt;Columnar versus Row-based Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data storage and management adopts two formats, i.e. columnar and row-based storage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Columnar Storage&lt;/em&gt;&lt;/strong&gt; organizes data by columns, whereby each column is stored separately, allowing the system to read/write specific columns independently. Common use cases include data warehouses such as Google BigQuery and Amazon Redshift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The advantages of columnar storage include faster aggregations, efficient data compression and efficient analytical queries (optimized for read-heavy workloads).&lt;/p&gt;

&lt;p&gt;The disadvantages include inefficiency for transactional operations, and complexity in implementation and management.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row-based Storage&lt;/strong&gt; organizes data by rows, where each row stores a complete record. This format is used in relational databases such as PostgreSQL and MySQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The advantages of row-based storage include simple data access and efficiency for transactional operations.&lt;/p&gt;

&lt;p&gt;On the other hand, the row-based format is inefficient for analytical queries and might use more space if not optimized properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.&lt;/strong&gt;  &lt;strong&gt;Partitioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Partitioning involves dividing data into smaller segments/partitions, whereby each partition contains a subset of the entire dataset.&lt;/p&gt;

&lt;p&gt;Partitioning data helps in improving query performance by limiting data retrieval to only the relevant data, thus reducing the servers’ workload and accelerating data processing.&lt;/p&gt;

&lt;p&gt;Data can be partitioned using the following methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Horizontal Partitioning (Row-based)&lt;/em&gt;&lt;/strong&gt; – Splits tables by rows, thus each partition has the same columns but different records, whereby the partitioning is based on a partition key.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Vertical Partitioning (Column-based)&lt;/em&gt;&lt;/strong&gt; – Splits tables by columns allowing the reading/writing of columns independently.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Functional Partitioning&lt;/em&gt;&lt;/strong&gt; – Partitions data according to operational requirements, with each partition containing data specific to a particular function.&lt;/li&gt;
&lt;/ul&gt;
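&lt;p&gt;A minimal Python sketch of horizontal partitioning, using region as the partition key (the rows and key choice are illustrative; in practice the key might be a date or a hash of an ID):&lt;/p&gt;

```python
# Sketch: horizontal (row-based) partitioning on a partition key.

def partition_rows(rows, key):
    """Group rows into partitions by the value of the partition key."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[key], []).append(row)
    return partitions

orders = [
    {"order_id": 1, "region": "EU", "total": 40},
    {"order_id": 2, "region": "US", "total": 75},
    {"order_id": 3, "region": "EU", "total": 15},
]

by_region = partition_rows(orders, "region")
# A query filtered to one region now only touches that partition.
print(sorted(by_region))
```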

&lt;p&gt;&lt;strong&gt;7.&lt;/strong&gt;  &lt;strong&gt;ETL versus ELT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Extract, Transform and Load (ETL) and Extract, Load and Transform (ELT) methods are the two common approaches for data integration, with their major difference being &lt;strong&gt;the order of operations&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Extract, Transform and Load (ETL)&lt;/em&gt;&lt;/strong&gt; – transforms data (on a separate processing server) before loading it into a target system such as a data warehouse. A real-world example of an ETL use case is e-commerce, where ETL allows businesses to gain insights into customer behavior and preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Extract, Load and Transform (ELT)&lt;/em&gt;&lt;/strong&gt; - performs data transformations directly within the data warehouse itself. It allows raw data to be sent directly to the data warehouse, eliminating the need for staging processes. A real-world example of an ELT use case is cloud data warehouses such as Snowflake and Amazon Redshift.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8.&lt;/strong&gt;  &lt;strong&gt;CAP Theorem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CAP theorem (Brewer’s theorem) states that the three desirable properties of &lt;strong&gt;&lt;em&gt;Consistency&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;Availability&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Partition Tolerance&lt;/em&gt;&lt;/strong&gt; cannot all be concurrently guaranteed in a distributed data system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Consistency -&lt;/em&gt;&lt;/strong&gt; Means that all clients see the same data at the same time, no matter which node they connect to.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Availability -&lt;/em&gt;&lt;/strong&gt; Means that any client making a request for data gets a response, even if one or more nodes are down. &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Partition Tolerance -&lt;/em&gt;&lt;/strong&gt; Means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, in a distributed system only two of the properties can be achieved simultaneously, which guides developers in prioritizing the properties that best suit their needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9.&lt;/strong&gt;  &lt;strong&gt;Windowing in Streaming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In streaming, windowing is a way of grouping events into a set of time-based collections or windows, in which the windows are specified based on time intervals or number of records.&lt;/p&gt;

&lt;p&gt;Windowing allows stream processing applications to break down continuous data streams into manageable chunks for processing and analysis.&lt;/p&gt;
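&lt;p&gt;A tumbling (fixed, non-overlapping) window is the simplest case. The sketch below buckets epoch-second event timestamps into 60-second windows and counts events per window (the event times and window size are made up):&lt;/p&gt;

```python
from collections import defaultdict

# Sketch: tumbling-window counts over a stream of event timestamps.

def tumbling_window_counts(events, window_seconds=60):
    counts = defaultdict(int)
    for ts in events:
        window_start = ts - (ts % window_seconds)  # start of the bucket this event falls in
        counts[window_start] += 1
    return dict(counts)

event_times = [0, 10, 59, 60, 61, 125]  # seconds since some epoch
print(tumbling_window_counts(event_times))
```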

&lt;p&gt;&lt;strong&gt;10.&lt;/strong&gt;  &lt;strong&gt;DAGs and Workflow Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Directed Acyclic Graph (DAG) is a conceptual representation of tasks, whose order is represented by a graph in which nodes are linked by one-way connections that do not form any cycles.&lt;/p&gt;

&lt;p&gt;DAGs represent tasks as nodes and dependencies as edges, thereby enforcing a logical execution order, ensuring that tasks are executed sequentially based on their dependencies.&lt;/p&gt;

&lt;p&gt;In data engineering, DAGs are used to orchestrate ETL processes as well as manage complex data workflows that involve multiple tasks and dependencies, such as a machine learning workflow.&lt;/p&gt;
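&lt;p&gt;A minimal sketch using Python's standard-library &lt;code&gt;graphlib&lt;/code&gt; shows how a DAG's edges determine a valid execution order (the task names are illustrative):&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Sketch: a tiny ETL DAG as {task: set of upstream dependencies}.
dag = {
    "extract":   set(),
    "clean":     {"extract"},
    "transform": {"clean"},
    "load":      {"transform"},
    "report":    {"load"},
}

# Topological ordering: each task appears only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

&lt;p&gt;Workflow orchestrators such as Apache Airflow apply the same idea, additionally running independent branches of the graph in parallel.&lt;/p&gt;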

&lt;p&gt;&lt;strong&gt;11.&lt;/strong&gt;  &lt;strong&gt;Retry Logic and Dead Letter Queues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Retry logic&lt;/em&gt;&lt;/strong&gt; - is a method used in software and systems development to automatically attempt an action again if it fails the first time. This helps to handle temporary issues, such as network interruptions or unavailable services, by giving the process another chance to succeed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dead Letter Queue&lt;/em&gt;&lt;/strong&gt;- is a special type of message queue that stores messages that fail to be processed successfully by consumers. It acts as a safety net for handling failed messages and helps in debugging and retrying failed tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A real-world use case of the Dead Letter Queue is at Uber, which leverages Apache Kafka with non-blocking request reprocessing and Dead Letter Queues to achieve decoupled, observable error-handling without disrupting real-time traffic.&lt;/p&gt;
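&lt;p&gt;The two patterns combine naturally: retry a failed message a bounded number of times, then park it in a dead letter queue so the rest of the stream keeps flowing. A minimal Python sketch (the messages and the failure rule are made up):&lt;/p&gt;

```python
# Sketch: bounded retries with a dead letter queue for permanent failures.

def process(message):
    """Stand-in consumer: fails on messages containing 'bad'."""
    if "bad" in message:
        raise ValueError(f"cannot process {message!r}")
    return message.upper()

def consume(messages, max_retries=3):
    results, dead_letter_queue = [], []
    for msg in messages:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(process(msg))
                break
            except ValueError:
                if attempt == max_retries:
                    # Exhausted retries: park the message for later inspection
                    # instead of blocking the rest of the stream.
                    dead_letter_queue.append(msg)
    return results, dead_letter_queue

ok, dlq = consume(["order-1", "bad-order", "order-2"])
print(ok, dlq)
```

&lt;p&gt;In production the retry delay would typically grow exponentially between attempts, and the dead letter queue would itself be a topic or queue rather than a list.&lt;/p&gt;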

&lt;p&gt;&lt;strong&gt;12.&lt;/strong&gt;  &lt;strong&gt;Backfilling and Reprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data Backfilling&lt;/em&gt;&lt;/strong&gt; involves filling in missing historical data that does not exist in the system or correcting stale data in the system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data Reprocessing&lt;/em&gt;&lt;/strong&gt; involves re-running pipelines over data that has already been processed, typically after fixing a bug or changing the transformation logic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;13.&lt;/strong&gt;  &lt;strong&gt;Data Governance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data governance is a framework of principles that manage data throughout its lifecycle, from collection and storage to processing and disposal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why it matters:&lt;/em&gt;&lt;/strong&gt; Without effective data governance, data inconsistencies in different systems across an organization might not get resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14.&lt;/strong&gt;  &lt;strong&gt;Data Versioning and Time Travel&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data Versioning&lt;/em&gt;&lt;/strong&gt; involves creation of a unique reference (query or ID) for a collection of data, to allow for quicker development and processing of data while reducing errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;&lt;em&gt;Time Travel&lt;/em&gt;&lt;/strong&gt; feature in data versioning enables versioning of big data stored in a data lake, allowing access to any historical version of the data, thus providing robust data management through rollback capabilities for bad writes/deletes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;15.&lt;/strong&gt;  &lt;strong&gt;Distributed Processing Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed data processing involves handling and analyzing data across multiple interconnected devices or nodes, leveraging the collective computing power of interconnected devices.&lt;/p&gt;

&lt;p&gt;Benefits of distributed processing include scalability, robust fault tolerance, enhanced system performance and efficient handling of large volumes of data.&lt;/p&gt;

&lt;p&gt;Real-world adoption of distributed concepts include: Fraud detection and risk management, Personalized product recommendation in e-commerce and Network Monitoring and Optimization.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data Warehouse: Exploring Key Components and Architecture</title>
      <dc:creator>Mwirigi Eric</dc:creator>
      <pubDate>Tue, 29 Jul 2025 09:03:15 +0000</pubDate>
      <link>https://forem.com/emwirigi/data-warehouse-exploring-key-components-and-architecture-4no6</link>
      <guid>https://forem.com/emwirigi/data-warehouse-exploring-key-components-and-architecture-4no6</guid>
      <description>&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Defining a Data Warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data warehouse is a centralized repository designed specifically to store, manage, and analyze large volumes of historical and current data from various sources within an organization. It is optimized for analytical processing and business intelligence activities.&lt;/p&gt;

&lt;p&gt;The following attributes are associated with a data warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Designed for analytical tasks using data from various applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Includes current and historical data to provide a historical perspective of information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Its usage is read-intensive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designed to handle massive amounts of data accumulated over periods of business operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Information is organized in well-defined tables with clear relationships for ease of access by users and analytical tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Data Warehouse Models/Types&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The three main types of data warehouses include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Enterprise Data Warehouse (EDW)&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The EDW serves as a central or main database to facilitate decision-making throughout the enterprise. The key benefits of having an EDW include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Access to cross-organizational information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to run complex queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access to detailed insights enabling data-driven decisions and early risk assessment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Operational Data Store (ODS)&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ODS is a temporary storage area for near real-time data, used for operational reporting and analysis, often reflecting the most up-to-date information. It focuses on current operational data as opposed to historical trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Data Mart&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A data mart is a subset of a DWH that supports a particular department, region or business unit. Typically, data marts are designed for specific user groups and their analytical needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;Components of Data Warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data warehouse's components work together to store, manage, and analyze vast amounts of data. The key components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Operational Source Systems:&lt;/em&gt;&lt;/strong&gt;  They provide raw data originating from various internal and external sources, such as operational systems, third-party providers, and web-based applications. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Load Manager:&lt;/em&gt;&lt;/strong&gt; Manages the ETL (Extract, Transform and Load) processes for data extraction and transformation, ensuring that the data is adequately prepared and meets the required format prior to entry into the warehouse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Warehouse Manager:&lt;/em&gt;&lt;/strong&gt; Oversees data storage, aggregation and analysis within the data warehouse, handling tasks like de-normalization, backup, collection and optimization for better performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Query Manager:&lt;/em&gt;&lt;/strong&gt; Handles user queries within the data warehouse, i.e. it supports querying, reporting and data retrieval, with functionality dependent on the available end-user tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Detailed Data:&lt;/em&gt;&lt;/strong&gt; Stores granular, raw data for complex analysis and reporting, providing comprehensive insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Summarized Data:&lt;/em&gt;&lt;/strong&gt; Stores predefined aggregations of detailed data for faster queries and reports, providing high-level insights for decision making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Archive and Backup Data&lt;/em&gt;&lt;/strong&gt;: Ensures data integrity and recovery through regular backups and archival storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Meta Data:&lt;/em&gt;&lt;/strong&gt; Contains information about data structure, source and transformational processes, thereby supporting the ETL processes, warehouse management and querying, by providing essential context for data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;End-user Access Tools:&lt;/em&gt;&lt;/strong&gt; Include analysis, reporting and data mining tools, enabling users to access, query and derive insights from the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
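&lt;p&gt;The split between detailed and summarized data can be sketched in a few lines. The following is a minimal illustration, not production code; the &lt;code&gt;sales_detail&lt;/code&gt; and &lt;code&gt;sales_summary&lt;/code&gt; table names are hypothetical, and SQLite is used only for portability:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Detailed data: granular, row-level facts kept for complex analysis.
cur.execute("CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [("2024-01-01", "A", 10.0), ("2024-01-01", "B", 5.0), ("2024-01-02", "A", 7.5)],
)

# Summarized data: a predefined aggregation materialized for fast,
# high-level reporting.
cur.execute(
    """CREATE TABLE sales_summary AS
       SELECT sale_date, SUM(amount) AS total_amount
       FROM sales_detail
       GROUP BY sale_date"""
)

for row in cur.execute("SELECT * FROM sales_summary ORDER BY sale_date"):
    print(row)
# ('2024-01-01', 15.0)
# ('2024-01-02', 7.5)
```

&lt;p&gt;Reports hit the small summary table instead of rescanning every detail row, which is the trade-off the two layers exist for.&lt;/p&gt;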

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; &lt;strong&gt;Data Warehouse Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data warehouse architecture refers to the design of an organization’s data collection and storage framework. It encompasses planning, designing, constructing, and managing the daily operational processes that determine how data is used for organizational intelligence and decision support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Components of Data warehouse Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Sources:&lt;/strong&gt; These are operational databases and external systems from which raw data is extracted, such as databases, spreadsheets, XML and JSON files, emails and images.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract, Transform and Load (ETL) Processes:&lt;/strong&gt; ETL processes are responsible for extracting data from the source systems, transforming it into a standardized format (thus ensuring data consistency), and loading it into the data warehouse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further, through data validation, cleansing and standardization, the ETL processes contribute to data integrity by ensuring that the data is accurate, complete and reliable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Staging Area:&lt;/strong&gt; This is a temporary storage location that holds the data before it is processed and integrated into the data warehouse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse Database:&lt;/strong&gt; This is the central repository where the cleansed, integrated, and historical data is stored. This database is optimized for analytical queries and reporting. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata Repository:&lt;/strong&gt; Metadata, or data about the data, is stored in this repository, providing information about the data warehouse's structure, content, and usage. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Front-end Tools:&lt;/strong&gt; Front-end tools (business intelligence tools) enable users to access, analyze, and visualize the data stored in the data warehouse, supporting informed decision-making.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
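&lt;p&gt;The extract-transform-load flow described above can be sketched minimally. The source records and the &lt;code&gt;orders&lt;/code&gt; target schema below are hypothetical; the point is only the shape of the three steps:&lt;/p&gt;

```python
import sqlite3

# Extract: hypothetical raw records pulled from an operational source system.
raw_orders = [
    {"id": "1", "customer": " alice ", "total": "19.99"},
    {"id": "2", "customer": "BOB", "total": "5.00"},
]

def transform(record):
    # Transform: validate, cleanse and standardize types/formats
    # before the data enters the warehouse.
    return (
        int(record["id"]),
        record["customer"].strip().title(),
        float(record["total"]),
    )

# Load: write the standardized rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [transform(r) for r in raw_orders]
)

print(warehouse.execute("SELECT customer, total FROM orders ORDER BY order_id").fetchall())
# [('Alice', 19.99), ('Bob', 5.0)]
```

&lt;p&gt;In practice the transform step is where most of the data-integrity work (validation rules, deduplication, conformance to the warehouse schema) happens.&lt;/p&gt;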

&lt;p&gt;&lt;strong&gt;2) Types of Data Warehouse Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-tier Architecture&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data warehouse is built on a single, centralized database that consolidates all data from various sources into one system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Suitability:&lt;/em&gt;&lt;/strong&gt; Suits small-scale applications and organizations with limited data processing needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Advantage:&lt;/em&gt;&lt;/strong&gt; Minimizes the number of layers and simplifies overall design, leading to faster processing and access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Disadvantage:&lt;/em&gt;&lt;/strong&gt; Lacks the flexibility and modularity of more complex architectures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Two-tier Architecture&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data warehouse connects directly to business intelligence (BI) tools, often through an online analytical processing (OLAP) system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Suitability:&lt;/em&gt;&lt;/strong&gt; Ideal for businesses with moderate data volumes and relatively simple reporting or analytic needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Advantage:&lt;/em&gt;&lt;/strong&gt; Provides faster access to data for analytic purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Disadvantage:&lt;/em&gt;&lt;/strong&gt; May face challenges in handling larger data volumes, as scaling becomes difficult due to the direct connection between the warehouse and BI tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Three-tier Architecture&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model separates the system into distinct layers, i.e., the data source layer, the staging area layer and the analytics layer, thus enabling efficient ETL processes, followed by analytics and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Suitability:&lt;/em&gt;&lt;/strong&gt; Ideal for large-scale enterprises that require scalability, flexibility and ability to handle large data volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Advantage:&lt;/em&gt;&lt;/strong&gt; Enables businesses to manage data more efficiently and supports advanced analytics and real-time reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Disadvantage:&lt;/em&gt;&lt;/strong&gt; More complex and costly to design, implement and maintain than the simpler architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3)&lt;/strong&gt; &lt;strong&gt;Data Warehouse Architecture Patterns/Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data warehouse schema is a blueprint of how data is related logically within a data warehouse.&lt;/p&gt;

&lt;p&gt;The basic components of all data warehouse schemas are the &lt;strong&gt;Fact&lt;/strong&gt; and &lt;strong&gt;Dimension&lt;/strong&gt; tables, whose roles are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact Tables:&lt;/strong&gt; Aggregate metrics, measurements or facts about business processes, and store the primary keys of the dimension tables as foreign keys. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension Tables:&lt;/strong&gt; Provide descriptive attributes needed to interpret the metrics provided for in the fact tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Fact  Tables&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Dimension  Tables&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Stores numerical metrics (measures)&lt;/td&gt;
&lt;td&gt;Provides  descriptive, categorical context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Type&lt;/td&gt;
&lt;td&gt;Numeric&lt;/td&gt;
&lt;td&gt;Textual  or categorical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;Compact;  uses keys and measures&lt;/td&gt;
&lt;td&gt;Wide;  contains attributes and hierarchies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Focus&lt;/td&gt;
&lt;td&gt;Supports  aggregation and analysis&lt;/td&gt;
&lt;td&gt;Optimized  for filtering and grouping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The following are the common schemas used in data warehousing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Star Schema&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stores data in a star format, consisting of a central table (fact table) and a number of directly connected tables (dimension tables). The fact table contains information about metrics or measures, while the dimension tables contain information about descriptive attributes.&lt;/p&gt;
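&lt;p&gt;A minimal star schema can be sketched as follows. The &lt;code&gt;fact_sales&lt;/code&gt;, &lt;code&gt;dim_product&lt;/code&gt; and &lt;code&gt;dim_date&lt;/code&gt; tables are hypothetical examples, built in SQLite for portability:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, connected directly to the fact table.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

-- Fact table: numeric measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL
);
""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-01', '2024-01')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 2, 1998.0)")

# A typical star-schema query: join the fact table to a dimension,
# then aggregate the measures by a descriptive attribute.
row = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category
""").fetchone()
print(row)  # ('Electronics', 1998.0)
```

&lt;p&gt;Note that every query needs only one join per dimension, which is why the star layout is the simplest and often the fastest to query.&lt;/p&gt;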

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake Schema&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The snowflake schema consists of a central fact table connected to dimension tables, which are further normalized into sub-dimension tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparison between Star and Snowflake Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Star  Schema&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Snowflake  Schema&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elements&lt;/td&gt;
&lt;td&gt;Single Fact  Table connected to multiple dimension tables with no sub-dimension tables&lt;/td&gt;
&lt;td&gt;Single Fact Table connects to multiple dimension tables that connect to multiple sub-dimension tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization&lt;/td&gt;
&lt;td&gt;Denormalized&lt;/td&gt;
&lt;td&gt;Normalized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of  dimensions&lt;/td&gt;
&lt;td&gt;Multiple  dimension tables map to a single Fact Table&lt;/td&gt;
&lt;td&gt;Dimension tables map to further sub-dimension tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Redundancy&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Fewer  foreign keys resulting in increased performance&lt;/td&gt;
&lt;td&gt;Decreased  performance compared to Star Schema from higher number of foreign keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple,  designed to be easy to understand&lt;/td&gt;
&lt;td&gt;More  complicated compared to Star Schema—can be more challenging to  understand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage Use&lt;/td&gt;
&lt;td&gt;Higher  disk space due to data redundancy&lt;/td&gt;
&lt;td&gt;Lower  disk space due to limited data redundancy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt;   &lt;strong&gt;Real World Applications of Data Warehouses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1)&lt;/strong&gt;   &lt;strong&gt;Spotify: Enhanced Customer Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spotify uses its data warehouse to curate customer insights, enabling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Creation of personalized playlists based on consumers' listening habits.&lt;/li&gt;
&lt;li&gt; Identification of emerging music trends across different demographics.&lt;/li&gt;
&lt;li&gt; Optimization of the user interface based on interaction patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2)&lt;/strong&gt;   &lt;strong&gt;Airbnb: Market Analysis and Pricing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Airbnb uses its data warehouse to analyze accommodation markets globally, thus enabling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dynamic pricing recommendations for hosts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identification of underserved market segments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Personalized search results based on user preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fraud detection and security enhancement.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3)&lt;/strong&gt;   &lt;strong&gt;Amazon: Supply Chain Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon’s data warehouse supports its global supply chain operations. Through analysis of historical order data and inventory levels, the company optimizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inventory placement to minimize shipping times.&lt;/li&gt;
&lt;li&gt;Staffing levels based on predicted order volumes.&lt;/li&gt;
&lt;li&gt;Routing efficiency for delivery networks.&lt;/li&gt;
&lt;li&gt;Procurement decisions for high-demand products. &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>datawarehouse</category>
    </item>
  </channel>
</rss>
