<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Eric Kahindi</title>
    <description>The latest articles on Forem by Eric Kahindi (@eric_kahindi_cfbfda3bd0f7).</description>
    <link>https://forem.com/eric_kahindi_cfbfda3bd0f7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3397763%2F4593e865-2d23-4d90-b151-a0f2a5b84780.png</url>
      <title>Forem: Eric Kahindi</title>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/eric_kahindi_cfbfda3bd0f7"/>
    <language>en</language>
    <item>
      <title>Implementing a CDC pipeline with Debezium</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Sat, 29 Nov 2025 21:40:23 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/implementing-a-cdc-pipeline-with-debezium-2lo1</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/implementing-a-cdc-pipeline-with-debezium-2lo1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This project is a comprehensive real-time data pipeline that captures, processes, and visualises cryptocurrency market data from Binance. It leverages Apache Kafka, PostgreSQL, Cassandra, and Grafana to build a scalable, event-driven architecture for financial data processing.&lt;/p&gt;

&lt;p&gt;If you need any more info, visit the &lt;a href="https://github.com/kazeric/binance_top_crypto_pipeline" rel="noopener noreferrer"&gt;repository&lt;/a&gt; to get the full picture with all the source files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion:&lt;/strong&gt; Python with Binance API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; Apache Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Data Capture:&lt;/strong&gt; Debezium PostgreSQL Connector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Database:&lt;/strong&gt; PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytical Database:&lt;/strong&gt; Apache Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization:&lt;/strong&gt; Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; Docker &amp;amp; Docker Compose&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Flow Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Write Path:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Binance API → ETL Service → PostgreSQL Insert
                              ↓
                         Debezium CDC
                              ↓
                         Kafka Topics
                              ↓
                         Cassandra Sink
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Read Path:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cassandra → Grafana Data Sources → Dashboards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Interaction Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Source Layer&lt;/strong&gt; (&lt;a href="https://www.binance.com/en/docs/api" rel="noopener noreferrer"&gt;Binance API&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposes RESTful endpoints for market data&lt;/li&gt;
&lt;li&gt;No built-in event streaming capability&lt;/li&gt;
&lt;li&gt;Requires polling via ETL producer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingestion Layer&lt;/strong&gt; (PostgreSQL + Binance ETL)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives transformed data from API&lt;/li&gt;
&lt;li&gt;Enables Write-Ahead Logging (WAL) for CDC&lt;/li&gt;
&lt;li&gt;Maintains logical replication slots for consistency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming Layer&lt;/strong&gt; (Apache Kafka)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decouples producers from consumers&lt;/li&gt;
&lt;li&gt;Provides topic-based pub/sub model&lt;/li&gt;
&lt;li&gt;Enables event replay and stream processing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Persistence Layer&lt;/strong&gt; (Cassandra)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for time-series analytical queries&lt;/li&gt;
&lt;li&gt;Distributed storage for fault tolerance&lt;/li&gt;
&lt;li&gt;Supports high-throughput writes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Visualization Layer&lt;/strong&gt; (Grafana)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries both PostgreSQL and Cassandra&lt;/li&gt;
&lt;li&gt;Real-time dashboard rendering&lt;/li&gt;
&lt;li&gt;Alert configuration capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Binance ETL Producer
&lt;/h3&gt;

&lt;p&gt;The ETL Producer is the entry point that orchestrates data collection from Binance. It fetches market data for the top cryptocurrency symbols.&lt;br&gt;
The producer identifies the top 5 crypto gainers in the 24-hour period before fetching detailed data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_top5_24hr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting to load top 5  data ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.binance.com/api/v3/ticker/24hr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Filter based on the usdt tethered cryptos  
&lt;/span&gt;        &lt;span class="n"&gt;usdt_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USDT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quoteVolume&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;askQty&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# sort the data
&lt;/span&gt;        &lt;span class="n"&gt;sorted_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;usdt_pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priceChangePercent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;top_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sorted_data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

        &lt;span class="c1"&gt;# push this to the other tasks with xcoms 
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sorted_data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the top 5 symbols are identified, the producer fetches and transforms three additional types of market data for each symbol in the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kline Data&lt;/strong&gt; (OHLCV Candlestick Data) - Fetching 5-minute candlestick data for the last 24 hours. Uses interval='5m' to capture granular price movements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent Trades&lt;/strong&gt; (Individual Trade Transactions) - Fetching up to 300 most recent trades. Each trade shows: who traded, at what price, how much, and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Book&lt;/strong&gt; (Bid/Ask Levels) - Fetching the current order book with up to 300 levels on each side. Returns: &lt;code&gt;{'bids': [[price, qty], ...], 'asks': [[price, qty], ...]}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
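&lt;p&gt;The kline step can be sketched roughly as follows. This is a hedged illustration rather than the repository's exact code: &lt;code&gt;transform_kline&lt;/code&gt; is a hypothetical helper, and the field indices follow the array layout of Binance's public &lt;code&gt;/api/v3/klines&lt;/code&gt; endpoint (open time, OHLCV, close time).&lt;/p&gt;

```python
# Hypothetical sketch: flatten one raw Binance kline row into a record
# ready for a PostgreSQL insert. Binance returns each kline as an array;
# the indices below follow the public /api/v3/klines response layout.

def transform_kline(raw, symbol, ranking):
    """Map a raw kline array to a dict keyed by column name."""
    return {
        "symbol": symbol,
        "ranking": ranking,
        "open_time": int(raw[0]),
        "open": float(raw[1]),
        "high": float(raw[2]),
        "low": float(raw[3]),
        "close": float(raw[4]),
        "volume": float(raw[5]),
        "close_time": int(raw[6]),
    }

# Example with an illustrative raw row (values are made up):
row = transform_kline(
    [1700000000000, "1.0", "2.0", "0.5", "1.5", "100.0", 1700000299999],
    "BTCUSDT",
    1,
)
```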

&lt;p&gt;The main function below then executes the fetch and transform functions for each symbol. A full run completes in approximately 5-10 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Initialize Cassandra schema
&lt;/span&gt;    &lt;span class="nf"&gt;setup_cassandra&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Fetch top 5 gainers (runs once)
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_top5_24hr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;transform_load_top5_24hr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# For each of the top 5 symbols
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;ranking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="c1"&gt;# Fetch and load klines (25 hours of 5-min data)
&lt;/span&gt;        &lt;span class="n"&gt;klines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_klines_24hrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;transform_load_klines_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;klines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Fetch and load recent trades (300 most recent)
&lt;/span&gt;        &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_recent_trades&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;transform_load_recent_trades&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trades&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Fetch and load order book (bid/ask levels)
&lt;/span&gt;        &lt;span class="n"&gt;ob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_OB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;transform_load_OB_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. PostgreSQL
&lt;/h3&gt;

&lt;p&gt;PostgreSQL serves as the operational database—the first landing zone for all market data. Its role is critical: it must capture every insert with precision for downstream CDC.&lt;/p&gt;

&lt;h4&gt;
  
  
  Database Initialization
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml excerpt&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debezium/postgres:15&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres -c wal_level=logical -c max_wal_senders=10 -c max_replication_slots=10&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbz&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbz&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cap_stock_db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical setting:&lt;/strong&gt; &lt;code&gt;wal_level=logical&lt;/code&gt; enables Logical Replication, which is required for Debezium CDC to function.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Debezium PostgreSQL Connector: The CDC Engine
&lt;/h3&gt;

&lt;p&gt;Debezium is the change data capture engine that transforms PostgreSQL into an event publisher.&lt;/p&gt;

&lt;h4&gt;
  
  
  How CDC Works
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Logical Replication Slot&lt;/strong&gt; — Debezium creates a slot that tracks WAL position&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL Scanning&lt;/strong&gt; — Reads Write-Ahead Log sequentially&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Decoding&lt;/strong&gt; — Converts log entries to JSON change events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Publishing&lt;/strong&gt; — Sends events to topics matching table names&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Docker setup
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;connect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/connect:3.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8083:8083"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;#Bootstrap&lt;/span&gt;
      &lt;span class="na"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka:9092&lt;/span&gt;
      &lt;span class="na"&gt;GROUP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;CONFIG_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect_configs&lt;/span&gt;
      &lt;span class="na"&gt;OFFSET_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect_offsets&lt;/span&gt;
      &lt;span class="na"&gt;STATUS_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect_statuses&lt;/span&gt;
      &lt;span class="na"&gt;HOST_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connect"&lt;/span&gt;
      &lt;span class="na"&gt;ADVERTISED_HOST_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connect"&lt;/span&gt;
      &lt;span class="na"&gt;ADVERTISED_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8083"&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_CONNECT_PLUGIN_PATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/kafka/connect,/kafka/plugins,/kafka/plugins/kafka-connect-cassandra-sink-1.7.3,/debezium/connect&lt;/span&gt;

    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./plugins/kafka-connect-cassandra-sink-1.7.3:/kafka/connect/kafka-connect-cassandra-sink-1.7.3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Connector Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cap-stock-connector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.postgresql.PostgresConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5432"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.dbname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cap_stock_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cap_stock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"slot.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cap_stock_slot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"publication.autocreate.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"filtered"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"public.kline_data, public.order_book, public.recent_trades, public.top_24hr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema-changes.cap_stock_db"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
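&lt;p&gt;With the stack up, the configuration above can be registered against Kafka Connect's REST API. The following is a minimal sketch, assuming the &lt;code&gt;8083:8083&lt;/code&gt; port mapping from the compose file; the &lt;code&gt;register&lt;/code&gt; helper is illustrative, not part of the repository.&lt;/p&gt;

```python
import json
import urllib.request

# The connector definition mirrors the JSON configuration shown above
# (trimmed to the core properties for brevity).
connector = {
    "name": "cap-stock-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "dbz",
        "database.password": "dbz",
        "database.dbname": "cap_stock_db",
        "topic.prefix": "cap_stock",
        "slot.name": "cap_stock_slot",
    },
}

payload = json.dumps(connector)

def register(url="http://localhost:8083/connectors"):
    """POST the connector config to Kafka Connect (requires the stack running)."""
    req = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Calling `register()` once is enough; Kafka Connect persists the configuration in its `connect_configs` topic and restarts the task automatically on failure.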



&lt;h3&gt;
  
  
  4. Apache Kafka: The Event Streaming Backbone
&lt;/h3&gt;

&lt;p&gt;Kafka acts as the central event hub, decoupling data producers (PostgreSQL via Debezium) from consumers (Cassandra Sink Connector).&lt;/p&gt;

&lt;h4&gt;
  
  
  Kafka Setup
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;zookeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/zookeeper:3.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zookeeper&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2181:2181"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/kafka:3.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;zookeeper&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;29092:29092"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_BROKER_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;ZOOKEEPER_CONNECT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zookeeper:2181&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:29092&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_ADVERTISED_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://kafka:9092,EXTERNAL://localhost:29092&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_LISTENER_SECURITY_PROTOCOL_MAP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_INTER_BROKER_LISTENER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;akhq&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tchiotludo/akhq:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;8080&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;8080&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AKHQ_CONFIGURATION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;akhq:&lt;/span&gt;
          &lt;span class="s"&gt;connections:&lt;/span&gt;
            &lt;span class="s"&gt;local:&lt;/span&gt;
              &lt;span class="s"&gt;properties:&lt;/span&gt;
                &lt;span class="s"&gt;bootstrap.servers: "kafka:9092"&lt;/span&gt;
              &lt;span class="s"&gt;connect:&lt;/span&gt;
                &lt;span class="s"&gt;- name: "connect"&lt;/span&gt;
                  &lt;span class="s"&gt;url: "http://connect:8083"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;connect&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four topics are created:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Partitions&lt;/th&gt;
&lt;th&gt;Retention&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cap_stock.public.kline_data&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;td&gt;Candlestick events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cap_stock.public.recent_trades&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;td&gt;Trade events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cap_stock.public.order_book&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;td&gt;Order book snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cap_stock.public.top_24hr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;td&gt;24hr statistics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To verify, navigate to &lt;code&gt;localhost:8080&lt;/code&gt;. Here you'll find the AKHQ Kafka UI running with all the information you might need.&lt;/p&gt;
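&lt;p&gt;Each of these topics carries Debezium's standard change-event envelope: an &lt;code&gt;op&lt;/code&gt; code (&lt;code&gt;c&lt;/code&gt; = create, &lt;code&gt;u&lt;/code&gt; = update, &lt;code&gt;d&lt;/code&gt; = delete, &lt;code&gt;r&lt;/code&gt; = snapshot read) plus &lt;code&gt;before&lt;/code&gt;/&lt;code&gt;after&lt;/code&gt; row images. A hedged sketch of what a consumer sees; the row columns here are illustrative, not the exact table schema.&lt;/p&gt;

```python
import json

# An illustrative Debezium change event for an insert into kline_data.
# The envelope fields (op, before, after, source) follow Debezium's
# documented event structure; the column values are made up.
event = json.loads("""
{
  "payload": {
    "op": "c",
    "before": null,
    "after": {"symbol": "BTCUSDT", "close": 1.5, "ranking": 1},
    "source": {"table": "kline_data"}
  }
}
""")

def extract_row(change):
    """Return the post-image for inserts/updates/snapshot reads, None for deletes."""
    payload = change["payload"]
    if payload["op"] in ("c", "u", "r"):
        return payload["after"]
    return None

row = extract_row(event)
```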

&lt;h3&gt;
  
  
  5. Cassandra Database: The Analytical Store
&lt;/h3&gt;

&lt;p&gt;Cassandra is the analytics powerhouse for our dashboard, purpose-built for time-series analytics with massive write throughput and fast range queries.&lt;br&gt;
But before anything else, remember to set up the database schema first, unless you want to spend hours debugging why your queries work in the terminal but nowhere else.&lt;/p&gt;
&lt;h4&gt;
  
  
  Cassandra Initialization
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_cassandra&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s|%(levelname)s|%(name)s|%(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;handlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FileHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cassandra_setup&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creating the connection ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# connecting to cassandra 
&lt;/span&gt;    &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cassandra&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connected ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creating the keyspace cap_stock ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# create the key space 
&lt;/span&gt;    &lt;span class="n"&gt;create_keyspace_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE KEYSPACE IF NOT EXISTS cap_stock WITH REPLICATION = {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SimpleStrategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replication_factor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; : 1 };
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_keyspace_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;use_keyspace_query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    USE cap_stock;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_keyspace_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;using the created keyspace ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creating the tables&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;create_kline_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS cap_stock.kline_data (
    symbol              text,
    k_open_time         timestamp,
    k_close_time        timestamp,
    open                double,
    high                double,
    low                 double,
    close               double,
    volume              double,
    quote_asset_volume  double,
    number_of_trades    int,
    tb_base_volume      double,
    tb_quote_volume     double,
    ranking             int,
    time_collected      timestamp,
    PRIMARY KEY ((symbol), k_open_time)
    ) WITH CLUSTERING ORDER BY (k_open_time DESC)
    AND compaction = {
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TimeWindowCompactionStrategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DAYS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    };

    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;create_OB_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS cap_stock.order_book (
    symbol          text,
    side            text,       -- &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;asks&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    ranking         int,        -- rank of the crypto in top performers
    price           double,
    quantity        double,
    time_collected  timestamp,  -- when this snapshot was taken
    PRIMARY KEY ((symbol, side, time_collected), price)
    ) WITH CLUSTERING ORDER BY (price DESC);

    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;create_recent_trades_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS cap_stock.recent_trades(
    symbol          text,
    trade_time      timestamp,  
    trade_id        bigint,      
    price           double,
    qty             double,
    quote_qty       double,
    is_buyer_maker  boolean,
    is_best_match   boolean,
    ranking         int,
    time_collected  timestamp,  
    PRIMARY KEY ((symbol), trade_time, trade_id)
    ) WITH CLUSTERING ORDER BY (trade_time DESC, trade_id DESC)
    AND compaction = {
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TimeWindowCompactionStrategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOURS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    };
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;create_top_24hrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS cap_stock.top_24hr (
    symbol                 text,
    time_collected         timestamp,    

    price_change           double,
    price_change_percent   double,
    weighted_avg_price     double,
    prev_close_price       double,
    last_price             double,
    last_qty               double,
    bid_price              double,
    bid_qty                double,
    ask_price              double,
    ask_qty                double,
    open_price             double,
    high_price             double,
    low_price              double,
    volume                 double,
    quote_volume           double,

    open_time              timestamp,    
    close_time             timestamp,    
    first_id               bigint,
    last_id                bigint,

    PRIMARY KEY ((symbol), time_collected)
    ) WITH CLUSTERING ORDER BY (time_collected DESC)
    AND compaction = {
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TimeWindowCompactionStrategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOURS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compaction_window_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    };
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_top_24hrs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_recent_trades_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_OB_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_kline_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  6. Kafka Connect (the same Connect worker that runs Debezium, repurposed)
&lt;/h3&gt;

&lt;p&gt;Kafka Connect is the middleware that runs connectors to move data between systems.&lt;/p&gt;
&lt;h4&gt;
  
  
  Cassandra Sink Connector Configuration
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;register_cassandra.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cassandra_sink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.datastax.oss.kafka.sink.CassandraSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cap_stock.public.kline_data, cap_stock.public.recent_trades ,cap_stock.public.top_24hr, cap_stock.public.order_book "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"contactPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cassandra"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"loadBalancing.localDc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datacenter1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;


    &lt;/span&gt;&lt;span class="nl"&gt;"topic.cap_stock.public.kline_data.cap_stock.kline_data.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol,k_open_time=value.k_open_time,k_close_time=value.k_close_time,open=value.open,high=value.high,low=value.low,close=value.close,volume=value.volume,quote_asset_volume=value.quote_asset_volume,number_of_trades=value.number_of_trades,tb_base_volume=value.tb_base_volume,tb_quote_volume=value.tb_quote_volume,ranking=value.ranking,time_collected=value.time_collected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"topic.cap_stock.public.recent_trades.cap_stock.recent_trades.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol,trade_time=value.time,trade_id=value.id,price=value.price,qty=value.qty,quote_qty=value.quote_qty,is_buyer_maker=value.is_buyer_maker,is_best_match=value.is_best_match,ranking=value.ranking,time_collected=value.time_collected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"topic.cap_stock.public.top_24hr.cap_stock.top_24hr.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, time_collected=value.time_collected,price_change=value.price_change,price_change_percent=value.price_change_percent,weighted_avg_price=value.weighted_avg_price,prev_close_price=value.prev_close_price,last_price=value.last_price,last_qty=value.last_qty,bid_price=value.bid_price,bid_qty=value.bid_qty,ask_price=value.ask_price,ask_qty=value.ask_qty,open_price=value.open_price,high_price=value.high_price,low_price=value.low_price,volume=value.volume,quote_volume=value.quote_volume,open_time=value.open_time,close_time=value.close_time,first_id=value.first_id,last_id=value.last_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"topic.cap_stock.public.order_book.cap_stock.order_book.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, side=value.side, ranking=value.ranking, price=value.price, quantity=value.quantity, time_collected=value.time_collected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;


    &lt;/span&gt;&lt;span class="nl"&gt;"key.converter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"org.apache.kafka.connect.storage.StringConverter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
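&lt;p&gt;The &lt;code&gt;mapping&lt;/code&gt; entries above are long and easy to mistype. A hypothetical helper (not part of the project) that generates them from column/field pairs:&lt;/p&gt;

```python
# Hypothetical helper that builds the verbose DataStax sink
# "topic.TOPIC.KEYSPACE.TABLE.mapping" entries from a list of
# (cassandra_column, kafka_field) pairs, instead of typing them by hand.

def sink_mapping(topic, keyspace, table, pairs):
    key = "topic.{}.{}.{}.mapping".format(topic, keyspace, table)
    value = ",".join("{}=value.{}".format(col, field) for col, field in pairs)
    return {key: value}

cfg = sink_mapping(
    "cap_stock.public.order_book", "cap_stock", "order_book",
    [("symbol", "symbol"), ("side", "side"), ("price", "price")],
)
```

The resulting dict can be merged into the connector's `config` object before posting it to Kafka Connect.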

&lt;h4&gt;
  
  
  Connector Registration
&lt;/h4&gt;

&lt;p&gt;This bash script registers both connectors from their configuration files, so that Debezium listens for Postgres changes and records them to Kafka, while the Cassandra sink listens to those Kafka topics and writes the changes on to Cassandra.&lt;/p&gt;

&lt;p&gt;Verify the status of the connection at &lt;code&gt;localhost:8080&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Register Postgres and Cassandra connectors to Kafka Connect&lt;/span&gt;

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8083/connectors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @register-postgres.json

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✓ Postgres connector registered"&lt;/span&gt;

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8083/connectors &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @register_cassandra.json

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✓ Cassandra connector registered"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
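&lt;p&gt;Beyond the registration calls, Kafka Connect's REST API also exposes a status endpoint (&lt;code&gt;GET /connectors/NAME/status&lt;/code&gt;). A sketch of interpreting its response shape, with the HTTP call itself omitted:&lt;/p&gt;

```python
# Sketch of interpreting a Kafka Connect status response
# (GET /connectors/NAME/status). The JSON shape shown below matches
# the standard Connect REST format; fetching it over HTTP is omitted.

def connector_healthy(status):
    """True when the connector and all of its tasks report RUNNING."""
    if status.get("connector", {}).get("state") != "RUNNING":
        return False
    return all(t.get("state") == "RUNNING" for t in status.get("tasks", []))

example = {
    "name": "cassandra_sink",
    "connector": {"state": "RUNNING"},
    "tasks": [{"id": 0, "state": "RUNNING"}],
}
# connector_healthy(example) is True
```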



&lt;h3&gt;
  
  
  7. Grafana: Real-Time Visualisation
&lt;/h3&gt;

&lt;p&gt;Grafana provides the user-facing dashboards that display market data in real-time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Grafana Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:main-ubuntu&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GF_SECURITY_ADMIN_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;GF_SECURITY_ADMIN_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3001:3000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana_data:/var/lib/grafana&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Grafana is running, create a new dashboard, navigate to its settings, add a variable named &lt;code&gt;symbol&lt;/code&gt;, then add panels built on the sample queries below.&lt;br&gt;
The variable lets you switch between different crypto symbols like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dgjrf6at3slm2ebc8nk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dgjrf6at3slm2ebc8nk.png" alt=" " width="800" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample Dashboard Queries
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--- Klines&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;k_open_time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;volume&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cap_stock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kline_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'$symbol'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;k_open_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;__from&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;k_open_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;__to&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;--- Winners at the moment &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;price_change_percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;volume&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cap_stock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_24hr&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;time_collected&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;__from&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;time_collected&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;__to&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;ALLOW&lt;/span&gt; &lt;span class="n"&gt;FILTERING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F366bc66r8v0oxlaztjg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F366bc66r8v0oxlaztjg3.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbs86ft5vf8f39rw2iii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbs86ft5vf8f39rw2iii.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38ak2b3439or0ojncvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38ak2b3439or0ojncvv.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A Data Engineering Success
&lt;/h2&gt;

&lt;p&gt;The Binance Top Crypto Pipeline successfully implements a sophisticated, resilient, and performant data architecture by strategically combining a relational anchor (PostgreSQL), an immutable log (Kafka), and a distributed analytical store (Cassandra), tied together with Change Data Capture (Debezium). &lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>dataengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Understanding Kafka Consumer Lag: Causes, Risks, and How to Fix It</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Mon, 10 Nov 2025 18:38:54 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/understanding-kafka-consumer-lag-causes-risks-and-how-to-fix-it-33e0</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/understanding-kafka-consumer-lag-causes-risks-and-how-to-fix-it-33e0</guid>
      <description>&lt;p&gt;Apache Kafka has become one of the most widely adopted distributed streaming platforms in modern event-driven architectures. At its core, Kafka acts as a highly durable, scalable message queuing system that supports real-time data pipelines, event streaming, and system decoupling. It enables producers to continuously publish messages into topics, while consumers read those messages at their own pace.&lt;/p&gt;

&lt;p&gt;Despite its distributed efficiency and fault tolerance, Kafka does not come without challenges. One of the most common performance bottlenecks encountered in Kafka systems is consumer lag.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Consumer Lag?
&lt;/h1&gt;

&lt;p&gt;Consumer lag occurs when the consumer is reading messages slower than the producer is writing them.&lt;/p&gt;

&lt;p&gt;In Kafka, each message is stored in a partition and assigned an offset (a sequential ID).&lt;br&gt;
This means at any time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latest produced offset&lt;/strong&gt; - The most recent message written to a partition by the producer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latest Consumer offset&lt;/strong&gt; - The latest message read and committed by a consumer&lt;/p&gt;

&lt;p&gt;Consumer Lag = Latest Produced Offset − Latest Consumer Offset&lt;/p&gt;
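&lt;p&gt;The formula, applied per partition and summed across a topic, looks like this in Python:&lt;/p&gt;

```python
# The lag formula above, per partition: the difference between the
# partition's latest produced (log-end) offset and the consumer
# group's committed offset.

def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag and the total across partitions."""
    lag = {p: log_end_offsets[p] - committed_offsets.get(p, 0)
           for p in log_end_offsets}
    return lag, sum(lag.values())

per_partition, total = consumer_lag(
    {0: 1500, 1: 1200, 2: 900},   # latest produced offsets
    {0: 1400, 1: 1200, 2: 650},   # latest committed offsets
)
# per_partition == {0: 100, 1: 0, 2: 250}, total == 350
```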

&lt;p&gt;If left unresolved, lag accumulates, delaying downstream processing, analytics, notifications, and system reactions.&lt;/p&gt;

&lt;p&gt;This could be detrimental, especially in safety-critical systems that rely on reliable, accurate, and timely messaging.&lt;/p&gt;

&lt;p&gt;Imagine sitting in a self-driving car that receives the instruction to turn left only after it has already passed the turn.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Why Consumer Lag Happens and How to Stop It
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Sudden Traffic Spikes
&lt;/h2&gt;

&lt;p&gt;An abrupt increase in message production, such as from a viral event or sensor data surge, can overwhelm consumers if the system isn't scaled well. For instance, IoT applications might experience this during peak hours.&lt;br&gt;
&lt;strong&gt;Symptom&lt;/strong&gt; - Rapid rise in log-end offsets relative to committed consumer offsets (growing lag)&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt; - Auto-scale consumers or use elastic resources like Kubernetes autoscaling. &lt;/p&gt;
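&lt;p&gt;A minimal, illustrative threshold rule for such scaling (the numbers and the cap at the partition count are assumptions, not Kafka defaults):&lt;/p&gt;

```python
# Illustrative scaling rule, not Kubernetes-specific: one consumer per
# LAG_PER_CONSUMER messages of backlog, capped at the partition count
# (extra consumers beyond that would sit idle).

def desired_consumers(total_lag, lag_per_consumer, partitions, current=1):
    needed = max(current, -(-total_lag // lag_per_consumer))  # ceiling division
    return min(needed, partitions)

desired_consumers(10_000, 2_000, partitions=5)  # 5
desired_consumers(3_000, 2_000, partitions=5)   # 2
```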

&lt;h2&gt;
  
  
  Partition Imbalances and Skew
&lt;/h2&gt;

&lt;p&gt;Usually, having more consumers is a good thing because of parallelism, but only if it is matched by enough partitions. Without proper partitioning, extra consumers ironically become a problem: each partition is consumed by at most one member of a group, so surplus consumers sit idle while still adding coordination overhead on the Kafka broker.&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt; - Size the partition count against the number of consumers in the group&lt;/p&gt;

&lt;h2&gt;
  
  
  Slow Consumer Processing
&lt;/h2&gt;

&lt;p&gt;Inefficient code within consumers, such as waiting for external APIs, complex transformations, or bugs causing retries, slows down message handling. If processing logic isn't optimized (e.g., breaking tasks into unnecessary steps), idle time accumulates.&lt;br&gt;
&lt;strong&gt;Symptom&lt;/strong&gt; - Prolonged processing times per message&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt; - write better code :), i.e., profile and optimise the processing logic, and implement asynchronous processing so that multiple messages can be handled concurrently&lt;/p&gt;
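&lt;p&gt;The asynchronous idea can be sketched in plain Python with asyncio; &lt;code&gt;handle_message&lt;/code&gt; here is a hypothetical stand-in for your real per-message work:&lt;/p&gt;

```python
import asyncio

async def handle_message(msg):
    # Stand-in for real per-message work (e.g., an external API call).
    await asyncio.sleep(0.01)
    return msg.upper()

async def process_batch(messages):
    # Process a polled batch concurrently instead of one by one,
    # so slow I/O on one message does not stall the others.
    return await asyncio.gather(*(handle_message(m) for m in messages))

results = asyncio.run(process_batch(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```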

&lt;h2&gt;
  
  
  Resource Constraints
&lt;/h2&gt;

&lt;p&gt;Insufficient CPU, memory, or network bandwidth on consumer hosts can bottleneck performance. This doesn't only apply to local machines; containerized environments with misconfigured limits exacerbate this.&lt;br&gt;
&lt;strong&gt;Symptom&lt;/strong&gt; - High system utilization metrics&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt; - Increase allocations; monitor with tools like Prometheus for CPU/memory usage, then scale up resources accordingly if need be &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Issues
&lt;/h2&gt;

&lt;p&gt;Suboptimal settings, such as small fetch sizes or improper offset management (e.g., auto-commit enabled without careful tuning), can lead to frequent but small polls, reducing throughput.&lt;br&gt;
&lt;strong&gt;Symptoms&lt;/strong&gt; - Frequent small fetches, commit failures&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer Side: Increase &lt;code&gt;fetch.max.bytes&lt;/code&gt; and &lt;code&gt;max.partition.fetch.bytes&lt;/code&gt; to fetch larger batches, reducing poll frequency. Adjust &lt;code&gt;fetch.max.wait.ms&lt;/code&gt; to wait longer for data if needed. Use manual offset commits for better control.&lt;/li&gt;
&lt;li&gt;Producer Side: Employ balanced partitioners like &lt;code&gt;RoundRobinPartitioner&lt;/code&gt; to distribute messages evenly. Reduce &lt;code&gt;batch.size&lt;/code&gt; to avoid overwhelming consumers with large bursts.&lt;/li&gt;
&lt;li&gt;Broker Side: Tune &lt;code&gt;num.network.threads&lt;/code&gt; and &lt;code&gt;num.io.threads&lt;/code&gt; for better request handling. Set &lt;code&gt;group.initial.rebalance.delay.ms&lt;/code&gt; to minimize unnecessary rebalances.&lt;/li&gt;
&lt;/ul&gt;
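&lt;p&gt;As a hedged sketch, the consumer-side settings above might look like this as &lt;code&gt;kafka-python&lt;/code&gt; &lt;code&gt;KafkaConsumer&lt;/code&gt; keyword arguments. The broker address, group ID, and the specific numbers are placeholders; tune them against your own workload:&lt;/p&gt;

```python
# Consumer-side settings discussed above, as kafka-python KafkaConsumer kwargs.
consumer_config = {
    "bootstrap_servers": "localhost:9092",   # placeholder broker address
    "group_id": "analytics-consumers",       # placeholder group ID
    "fetch_max_bytes": 52_428_800,           # fetch larger batches (50 MB total)
    "max_partition_fetch_bytes": 1_048_576,  # per-partition fetch cap (1 MB)
    "fetch_max_wait_ms": 500,                # wait longer for fuller fetches
    "enable_auto_commit": False,             # commit offsets manually instead
}

# Requires a running broker, so left commented out here:
# from kafka import KafkaConsumer
# consumer = KafkaConsumer("my-topic", **consumer_config)
print(consumer_config["enable_auto_commit"])  # False
```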

&lt;h2&gt;
  
  
  Partition Rebalancing
&lt;/h2&gt;

&lt;p&gt;When consumers join or leave a group, Kafka reassigns partitions, temporarily halting processing and spiking lag. Frequent rebalances due to unstable consumers amplify this.&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt; - Use sticky assignors (e.g., the cooperative sticky assignor); stabilize consumer instances to reduce churn.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Kafka enables real-time streaming at massive scale — but real-time only holds true if consumers can keep up. Consumer lag is not a system failure; it is a signal that the pipeline has hit a scaling or processing bottleneck.&lt;/p&gt;

&lt;p&gt;By monitoring offsets, scaling consumers, optimizing workload, and tuning Kafka configurations, you can transform lagging systems into high-performance streaming pipelines with predictable throughput.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>performance</category>
      <category>dataengineering</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Mon, 13 Oct 2025 08:52:51 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-3fb4</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-3fb4</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Docker is a very useful tool, not just for data engineers but for all developers alike. But before we get to it, let's try to understand the concept it was built on &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Containerisation?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This is essentially packaging a piece of software alongside all its dependencies, including environment variables, and even an operating system, and running it as its own isolated instance, separate from your machine. &lt;/li&gt;
&lt;li&gt;It's decoupling to the max, ensuring maximum portability for your application&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why containerise in data engineering
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Dependency Management&lt;/strong&gt; - A data workflow usually depends on various Python libraries, JVM versions, and specific tool versions. A container can encapsulate all of this into a single image, eliminating the need to set up each component manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability for Data pipelines&lt;/strong&gt; - Containerisation allows data pipelines to be scaled up dynamically when used with tools like Kubernetes. This can be done individually for each stage of the data pipeline or for the whole thing &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability and easy integration with Cloud&lt;/strong&gt; - Containers are running instances of a data pipeline completely decoupled from the machine running them, which means that they can run on any machine without that annoying, "It works on my machine", phrase. This means that it can easily be deployed on any cloud environment as well &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Getting Started with Docker
&lt;/h1&gt;

&lt;p&gt;OK, so here's a basic rundown of everything you need to know before we get our hands dirty.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; is one of the tools used to perform containerisation. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Engine&lt;/strong&gt; is the part of Docker that lives on your machine; it accesses your file system and applications and converts them into images. It also facilitates the running of these containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Hub&lt;/strong&gt; is an online registry of container images. It's where people can share the images they created for others to reuse, similar to GitHub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volumes&lt;/strong&gt; are a way to ensure that the storage of a running container is persisted beyond the life span of the container. It's like mapping a directory on your local machine to one on the container, such that changes made in that directory happen on both sides. Docker simply picks up where it left off when the container is restarted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Dockerfile&lt;/strong&gt; is a simple text file we use to give Docker instructions on how to build the image for the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An Image&lt;/strong&gt; is a snapshot of your application and its dependencies that can be called to create a Docker container, similar to an operating system image or a blueprint for a container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caches&lt;/strong&gt; - Docker images are built layer by layer; each layer is stored locally so it can be reused when rebuilding a container. This means the initial build might take a long time, but subsequent builds are lightning fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Container&lt;/strong&gt; is a running instance of the image. Docker uses an image to build and run a container.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember this: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkn663s4adz357o6qdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkn663s4adz357o6qdg.png" alt=" " width="663" height="218"&gt;&lt;/a&gt;&lt;br&gt;
We only have to worry about creating the Dockerfile; then we build the image and run the container using Docker commands&lt;/p&gt;

&lt;p&gt;To install Docker, you can follow this &lt;a href="https://docs.docker.com/engine/install/ubuntu/" rel="noopener noreferrer"&gt;article&lt;/a&gt; on their official website, or you can follow this &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-20-04" rel="noopener noreferrer"&gt;Digital Ocean article&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running a Data Processing Script
&lt;/h2&gt;

&lt;p&gt;To understand the rest of Docker, we'll take a practical approach and learn as we go. &lt;br&gt;
The goal is to create a simple data processing script that loads data into a pandas data frame, drops a column, and loads the result to a new, cleaned CSV.&lt;br&gt;
We'll also use docker volumes to persist the data &lt;/p&gt;

&lt;p&gt;Create a home folder for Docker and then move into it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir docker
cd docker 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In here, you can clone a repository I made on GitHub for this purpose&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/kazeric/docker_for_dataengineering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should create a folder with the following files &lt;br&gt;
&lt;strong&gt;app.py file&lt;/strong&gt; contains the code for the app you want to containerise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuvji68fzrhoagd04c9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuvji68fzrhoagd04c9.png" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cars.csv file&lt;/strong&gt; contains the data to be transformed &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zlkjxba1ln7ynwhj9i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zlkjxba1ln7ynwhj9i1.png" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt; contains the instructions for Docker to create the image &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnreaalq250pn27asfq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnreaalq250pn27asfq2.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a quick rundown of what it's doing &lt;br&gt;
&lt;code&gt;FROM python:3.10&lt;/code&gt; - this pulls a base Docker image from Docker Hub. &lt;em&gt;You can search "python:3.10 docker" and you'll find it in Docker Hub.&lt;/em&gt; Anything after this line modifies the base image, the specifications we need to run our application &lt;/p&gt;

&lt;p&gt;&lt;code&gt;COPY requirements.txt .&lt;/code&gt; - this line copies the file requirements.txt to "." which is the current working directory&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RUN pip install -r requirements.txt&lt;/code&gt; - this line tells the container to run the pip command in the terminal&lt;/p&gt;

&lt;p&gt;The rest are well explained by the comments&lt;/p&gt;
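&lt;p&gt;Putting the lines just described together, the Dockerfile in the screenshot looks roughly like this. Treat it as a sketch reconstructed from the explanation (the &lt;code&gt;WORKDIR&lt;/code&gt; and &lt;code&gt;CMD&lt;/code&gt; lines are assumptions based on the repo layout), not a verbatim copy:&lt;/p&gt;

```dockerfile
# Base image pulled from Docker Hub
FROM python:3.10

# Assumed working directory inside the container
WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the application code and data into the image
COPY app.py cars.csv ./

# Assumed entry point: run the script when the container starts
CMD ["python", "app.py"]
```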

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; &lt;br&gt;
The convention in Docker is that whenever there is a mapping, it reads host (source) on the left and container (destination) on the right, e.g., in ports &lt;code&gt;5432:5432&lt;/code&gt;, in commands like &lt;code&gt;COPY . .&lt;/code&gt;, and in volumes &lt;code&gt;./data:/opt/var/grafana/data&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now that we have it set up, let's run the container &lt;br&gt;
Build the image&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t my_app .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9227h49zowdhs7909lmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9227h49zowdhs7909lmi.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the container&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run my_app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hni6n51r4yjjryedjg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hni6n51r4yjjryedjg4.png" alt=" " width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Volumes
&lt;/h2&gt;

&lt;p&gt;Now, let's explore the new CSV created in Docker volumes &lt;/p&gt;

&lt;p&gt;Find the specific container built from the image&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker ps -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqvfx6rsvtwb1hdhq8o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqvfx6rsvtwb1hdhq8o7.png" alt=" " width="800" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get the container ID and use it to copy the container data to the host machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker cp &amp;lt;container_id&amp;gt;:/app/data/cleaned_cars.csv ./data/cleaned_cars.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your new CSV should look like this &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjp9ai40iihah0upbdym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjp9ai40iihah0upbdym.png" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, clean up all the unused resources to free up space on your machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker system prune -a 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70xsobeh8nazbl1emsnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70xsobeh8nazbl1emsnc.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Best Practices
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use small base images:&lt;/strong&gt; Start from lightweight images like python:3.10-slim or alpine instead of full OS images to reduce build size and speed up deployment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clean up unused resources:&lt;/strong&gt; Regularly remove unused containers, images, and networks with:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker system prune -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps your environment tidy and reduces disk usage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Secure your containers:&lt;/strong&gt; Avoid running containers as root. Use a non-root user and apply least-privilege principles. Keep your images updated to patch vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;
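&lt;p&gt;For the non-root point, a minimal Dockerfile addition might look like this (the user name is illustrative):&lt;/p&gt;

```dockerfile
# Create an unprivileged user and run the container as that user
RUN useradd --create-home appuser
USER appuser
```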

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Docker is a powerful tool that transforms how data engineers build, test, and deploy data pipelines — enabling consistent, reproducible environments across teams.&lt;/p&gt;

&lt;p&gt;By containerizing ETL jobs, analytics tools, or machine learning models, teams save time and avoid “it works on my machine” issues.&lt;/p&gt;

&lt;p&gt;For the next steps, you can check out Docker Compose, a simple yet powerful addition that makes running multi-container applications easy.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>containers</category>
      <category>tutorial</category>
      <category>docker</category>
    </item>
    <item>
      <title>A Beginner’s Guide to Big Data Analytics with Apache Spark and PySpark</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Mon, 29 Sep 2025 21:51:47 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-49f3</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-49f3</guid>
      <description>&lt;h1&gt;
  
  
  What is Apache Spark?
&lt;/h1&gt;

&lt;p&gt;We all know about pandas data frames and how they make handling data so quick and easy. You can transform your data (drop columns, change data types, filter nulls) all in just a few lines of code.&lt;/p&gt;

&lt;p&gt;But have you ever wondered what’s happening under the hood?&lt;br&gt;
Where is this data actually stored while you’re manipulating it?&lt;br&gt;
It’s definitely not in your database.&lt;br&gt;
And once your Python script ends, where does all of it go?&lt;/p&gt;

&lt;p&gt;The answer is simple: &lt;strong&gt;main memory (RAM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All those transformations you run in pandas are only possible because your dataset is small enough to fit in your RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now here’s the problem:&lt;/strong&gt; what happens when you’re working on massive projects, like training your own general-purpose LLM or crunching billions of rows of data?&lt;/p&gt;

&lt;p&gt;The reality is—you can’t (at least not efficiently) with pandas. Sure, you can try streaming data or working in batches (I personally tried both for Lingua Connect), but it quickly becomes complex for no real reason.&lt;/p&gt;

&lt;p&gt;And that’s where Apache Spark comes in.&lt;/p&gt;
&lt;h1&gt;
  
  
  Enter Spark
&lt;/h1&gt;

&lt;p&gt;Apache Spark is the hero you call when your data is so massive that your machine can’t handle it all at once.&lt;/p&gt;

&lt;p&gt;At its core, Apache Spark is an open-source, unified analytics engine designed for large-scale data processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short: Just add more machines!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s what I mean:&lt;/p&gt;

&lt;p&gt;Instead of relying on one machine’s memory, Spark distributes your dataset across multiple machines (nodes). Each node processes a chunk of the data in parallel, and Spark combines the results. To you, it feels like working on one logical machine—but behind the scenes, it’s a cluster doing the heavy lifting.&lt;/p&gt;

&lt;p&gt;The diagram below summarises its architecture in a nutshell&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdtpbmdey6rxr7vuzli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdtpbmdey6rxr7vuzli.png" alt=" " width="757" height="422"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  The Dimensions of Apache Spark
&lt;/h1&gt;

&lt;p&gt;Spark isn’t just about running queries faster. It’s a whole ecosystem. &lt;br&gt;
So after this article, you can decide which dimensions of Spark to follow, but here are the main ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark Core&lt;/strong&gt; – The foundation that handles memory management, job scheduling, and distributed task execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark SQL&lt;/strong&gt; – For working with structured data (tables, DataFrames, SQL queries).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt; – For real-time data processing (think logs, IoT data, live feeds).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLlib&lt;/strong&gt; – Spark’s built-in machine learning library for scalable ML models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphX&lt;/strong&gt; – For graph computations (social networks, recommendations, relationships).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's look into Spark Core and Spark SQL through the simplest path: PySpark &lt;/p&gt;
&lt;h1&gt;
  
  
  Apache Spark vs PySpark
&lt;/h1&gt;

&lt;p&gt;Now you might be wondering—what’s the difference between Apache Spark and PySpark?&lt;/p&gt;

&lt;p&gt;Put simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Spark is the engine. Think of it like the car’s engine—the part that actually does the work and propels the car forward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PySpark is the Python API for Spark. It’s the steering wheel you use to control the engine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So while Spark itself is written in Scala and Java, PySpark gives Python developers the ability to harness Spark’s distributed power without leaving the comfort of Python.&lt;/p&gt;
&lt;h1&gt;
  
  
  Set up
&lt;/h1&gt;

&lt;p&gt;I highly recommend using a virtual environment while working on this. You can set up the venv at the base of your working directory and activate it before installing PySpark.&lt;/p&gt;
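&lt;p&gt;If you haven't set one up before, a typical venv workflow looks like this (the folder name &lt;code&gt;.venv&lt;/code&gt; is just a common choice):&lt;/p&gt;

```shell
python3 -m venv .venv        # create the virtual environment
source .venv/bin/activate    # activate it for the current shell session
```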

&lt;p&gt;We'll start by installing the latest version of Spark  (This might take a while)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then unzip the Spark file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar -xzf spark-4.0.1-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then rename the Spark folder to make it easier to use and then move into it (the Spark home)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mv spark-4.0.1-bin-hadoop3/ spark/
cd spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this command to verify that your Spark version was downloaded correctly&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./bin/spark-submit --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get an output like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkogw9r7xtqd0wsidxrti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkogw9r7xtqd0wsidxrti.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, let's install PySpark&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Take Pyspark for a spin
&lt;/h1&gt;

&lt;p&gt;Create a new folder at the base of your working directory named files. This is where your code and data go&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch spark.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open it up in a text editor and connect to the virtual environment in which you installed PySpark&lt;br&gt;
Now, let's get started&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's start the Spark session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark = SparkSession.builder.appName('demo').getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts up everything, i.e., the entire architecture described above, like turning the key to start the engine.&lt;/p&gt;

&lt;p&gt;Create a dataframe&lt;br&gt;
An interesting thing to note: unlike in pandas, you typically build a Spark dataframe from a list of tuples (one tuple per row) plus a list of column names&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data=[("Eric", 25), ("Jane", 29), ("Sam", 35)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre69iq6b6x90h5nhyvzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre69iq6b6x90h5nhyvzx.png" alt=" " width="369" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some useful commands&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#identify the schema
df.printSchema()
#list out the columns 
df.columns
#count the number of rows 
df.count()
#creating data from CSV
df = spark.read.csv("demo.csv", header=True, inferSchema=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Spark SQL
&lt;/h1&gt;

&lt;p&gt;Now we can interact with our data using SQL, which is really cool &lt;br&gt;
First, we create a view from our dataframe (synonymous with database views)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.createOrReplaceTempView("Demo")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Now you can write your queries here &lt;br&gt;
Note that we're accessing the view from the Spark session we created above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.sql("SELECT * FROM Demo").show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Sun, 21 Sep 2025 17:28:52 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2e55</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2e55</guid>
      <description>&lt;p&gt;Welcome to the world of data streaming. The world of Kafka. Leave all of your previous data storage knowledge and prepare to "subscribe" to &lt;br&gt;
a whole new line of thought &lt;/p&gt;
&lt;h1&gt;
  
  
  What is Kafka
&lt;/h1&gt;

&lt;p&gt;Apache Kafka was originally created at LinkedIn and later donated to the Apache Software Foundation under an open-source license. This means you can take Kafka and modify it however you like to suit your needs. &lt;/p&gt;

&lt;p&gt;It is an open-source data streaming platform that uses the publish and subscribe model to decouple applications and reduce dependency on them.&lt;br&gt;
It does this by keeping logs of events between microservices in an application&lt;/p&gt;
&lt;h1&gt;
  
  
  Scenario
&lt;/h1&gt;

&lt;p&gt;Imagine you are the owner of a certain business with the structure shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froi3k90tmlfi3iimgz8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froi3k90tmlfi3iimgz8f.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything might seem ok at first, but what happens when one node in the system fails? &lt;br&gt;
Suppose, for instance, that the payment microservice fails. This situation might leave your clients waiting on a loading screen once the order is placed, for a payment fulfillment that will never come.&lt;br&gt;
Worse still, this order might be logged onto the analytics section, corrupting your data &lt;/p&gt;

&lt;p&gt;This is where Kafka shines&lt;br&gt;
It places itself between these microservices, acting as a middleman, which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminates the single points of failure &lt;/li&gt;
&lt;li&gt;Improves recoverability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobtz2mzssfvna2wkvh0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobtz2mzssfvna2wkvh0t.png" alt=" " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Kafka core concepts
&lt;/h1&gt;

&lt;p&gt;There are a few concepts you may need to wrap your head around when dealing with Kafka &lt;br&gt;
&lt;strong&gt;Publish-subscribe model&lt;/strong&gt; - a messaging architectural pattern where applications that generate data (publishers) send messages to an intermediary called a message broker without knowing who will receive them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This leads to the decoupling advantage above&lt;/li&gt;
&lt;li&gt;It also brings about interoperability as systems only need to talk to Kafka instead of creating custom integrations for each system &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Event-first architecture&lt;/strong&gt; - This way of thinking shifts the focus from requesting data or calling services to reacting to facts and business events as they occur. &lt;/p&gt;
&lt;h1&gt;
  
  
  Kafka Architectural elements
&lt;/h1&gt;

&lt;p&gt;Ok, now that that's out of the way, let's peer into the inner workings of Kafka by briefly describing its constitution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Record&lt;/strong&gt; - This, also known as an event, is the actual message that is produced by the publishing microservice &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt; - This is the publishing microservice that creates and sends the message to Kafka &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer&lt;/strong&gt; - This is the subscribing microservice that listens for and receives messages that come from Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic&lt;/strong&gt; - This is the Kafka equivalent of a database table. It is an immutable log of events that a consumer and producer subscribe to and publish to. You can have an orders topic for the orders microservice &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker&lt;/strong&gt; - This is the actual server on which Kafka runs &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka cluster&lt;/strong&gt; - A single Kafka instance can run on multiple servers (nodes/brokers). These servers make up the Kafka cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt; - This is the logical subdivision of a topic into partitions, which can be spread across the various nodes in the Kafka cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt; - Partitions can be copied across multiple nodes to create replicas that can act as backups in case one node fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookeeper&lt;/strong&gt; - Partitioning and replication can be tricky business, especially when it comes to issues of data consistency. Zookeeper is an external resource that solves this problem, handling the coordination and synchronization of Kafka.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KRaft&lt;/strong&gt; - This is a newer, built-in consensus mechanism, production-ready since Kafka 3.3, which handles Zookeeper's functions and eliminates Kafka's dependence on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Group&lt;/strong&gt; - Suppose your system replicates microservices to improve scalability and availability. Production is straightforward, as each producer replica simply sends its records to a topic. But if the consumer replicas each subscribe to that topic individually, every replica receives every message, so each record gets processed multiple times. This is a problem. 
&lt;strong&gt;Consumer groups&lt;/strong&gt; solve this issue, and here's how. 
Multiple consumer instances that belong to the same logical application or service are configured with the same group ID. This group ID identifies them as part of the same consumer group.
This lets Kafka coordinate consumption so that the topic's partitions are split among the replicas and messages are processed in parallel, each by exactly one member of the group.&lt;/li&gt;
&lt;/ul&gt;
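&lt;p&gt;To build intuition for how a group splits work, here is a toy sketch in plain Python. This is not the real Kafka rebalance protocol, just an illustration of partitions being divided round-robin among the consumers that share one group ID:&lt;/p&gt;

```python
# Toy illustration only: Kafka's actual assignment is done by the group
# coordinator, but the effect is similar to this round-robin split.
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# two replicas of the same service share the group ID "orders-service"
print(assign_partitions([0, 1, 2, 3], ["orders-a", "orders-b"]))
# {'orders-a': [0, 2], 'orders-b': [1, 3]}
```

&lt;p&gt;Each record is then processed by exactly one member of the group, instead of by every replica.&lt;/p&gt;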
&lt;h1&gt;
  
  
  Set up
&lt;/h1&gt;

&lt;p&gt;Awesome, now that we're all caught up, let's get hands-on by installing and running an instance of Kafka in our terminal&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;The Kafka console tools should not be used in a production environment unless absolutely necessary. Avoid them in production at all costs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But since we're running a simple single-partition Kafka instance on our PC, it should be fine. Follow the steps below.&lt;/p&gt;

&lt;p&gt;Install Java&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install default-jre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm Java installation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Make sure to use Java 11 or 17&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The commands below download Kafka, unzip it, and rename the extracted folder to the more readable name kafka, which will act as Kafka's home directory. Finally, move into the kafka directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
tar -xzvf kafka_2.13-3.3.1.tgz 
mv kafka_2.13-3.3.1/ kafka/
cd kafka/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start up Zookeeper&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start up Kafka itself (i.e., the Kafka server)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create a test topic in Kafka&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create   --topic test-topic   --bootstrap-server localhost:9092   --partitions 1   --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a producer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This will open up an interactive console where you can type in messages on your created test topic&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Create a consumer in a different terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when you type anything in the producer's interactive terminal, it will show up on the consumer's side. &lt;strong&gt;You are now streaming data with Kafka!&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Use cases
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Kafka console - (special case)&lt;/strong&gt;&lt;br&gt;
Producers and consumers are usually applications or microservices; however, the Kafka console allows the developer to act as the producer or consumer.&lt;/p&gt;

&lt;p&gt;This is not ideal in production because any wrong or misspelled record a developer types becomes an irreversible entry in that topic. &lt;br&gt;
Furthermore, if multiple microservices subscribe to the topic, it may trigger an unwanted, potentially catastrophic chain of events&lt;/p&gt;

&lt;p&gt;Nevertheless, it's still a useful feature in some scenarios&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;br&gt;
The console is handy when you need to confirm that the Kafka service you want is working as expected, just as we did in the setup section.&lt;br&gt;
This is useful when: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have deployed a new cluster and want to try it out&lt;/li&gt;
&lt;li&gt;You are debugging an issue with an existing cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backfilling data&lt;/strong&gt;&lt;br&gt;
Suppose your orders microservice crashed, and a few orders were placed before they could be pushed to the topic. Not to worry: if you keep a backup saved in a CSV file in case of failure, you can simply run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka-console-producer \
    --topic example-topic \
    --bootstrap-server localhost:9092 \
    &amp;lt; your_prepared_backorders.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Provided the schema aligns with the topic schema&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Netflix&lt;/strong&gt;&lt;br&gt;
When you think about Netflix, you think about instant access to your favorite movie or series. But what happens behind the scenes when you hit that “Play” button?&lt;br&gt;
Netflix uses Kafka to handle real-time monitoring and event processing across its entire platform.&lt;/p&gt;

&lt;p&gt;Every time you play, pause, fast-forward, or even hover over a title, an event is generated.&lt;/p&gt;

&lt;p&gt;Kafka acts as the middleman, transporting billions of these events per day into different services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommendations engine&lt;/strong&gt; – to suggest what you should watch next&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality of Service (QoS) monitoring&lt;/strong&gt; – to make sure the video resolution adjusts smoothly to your network&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational alerts&lt;/strong&gt; – so engineers can act if something breaks in the delivery pipeline&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Kafka, the massive amount of real-time events would overwhelm individual services. By centralizing these events, Netflix achieves scalability, resilience, and real-time personalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;br&gt;
Uber is not just a ride-hailing app. It’s a real-time logistics platform moving people, food, and even packages around cities worldwide.&lt;/p&gt;

&lt;p&gt;Here’s how Kafka fits in:&lt;/p&gt;

&lt;p&gt;Every trip generates a constant stream of GPS events from both driver and rider apps.&lt;/p&gt;

&lt;p&gt;Kafka ingests and streams these events in real-time to different services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Matching service – to connect you with the nearest driver&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETA calculation – to update arrival times dynamically as traffic changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Surge pricing – to adjust fares instantly during high demand&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fraud detection – to flag suspicious activity as it happens&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka enables Uber to handle millions of concurrent, low-latency events across geographies, ensuring rides, deliveries, and payments work seamlessly without bottlenecks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Production Best Practices
&lt;/h1&gt;

&lt;p&gt;Running Kafka on your laptop is fun for demos, but in production the story is very different. Major companies have learned (sometimes the hard way) that keeping Kafka reliable at scale requires following certain best practices&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partitioning and replication&lt;/strong&gt;&lt;br&gt;
In production, companies run clusters of multiple brokers spread across different machines or even data centers.&lt;/p&gt;

&lt;p&gt;Topics are partitioned for horizontal scalability and replicated (usually with a replication factor of 3) for fault tolerance.&lt;/p&gt;

&lt;p&gt;This way, even if one broker crashes, the cluster can continue without data loss.&lt;/p&gt;
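&lt;p&gt;As a sketch, the same &lt;code&gt;kafka-topics.sh&lt;/code&gt; tool we used earlier accepts &lt;code&gt;--partitions&lt;/code&gt; and &lt;code&gt;--replication-factor&lt;/code&gt; flags; on a cluster with at least three brokers, a production-style topic (the name &lt;code&gt;orders&lt;/code&gt; here is just an example) could be created like this:&lt;/p&gt;

```shell
bin/kafka-topics.sh --create \
    --topic orders \
    --bootstrap-server localhost:9092 \
    --partitions 6 \
    --replication-factor 3
```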

&lt;p&gt;&lt;strong&gt;Retention policy and capacity&lt;/strong&gt;&lt;br&gt;
Topics can grow endlessly in production, which is a problem when your machines have finite storage, so it is good to have a form of garbage collection ready.&lt;br&gt;
In production, teams set retention policies (e.g., 7 days or 30 days) to automatically delete old messages.&lt;/p&gt;

&lt;p&gt;They tune log compaction to retain only the latest value for each key when needed.&lt;/p&gt;
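&lt;p&gt;Both of these are per-topic settings. For example, with the stock &lt;code&gt;kafka-configs.sh&lt;/code&gt; tool (the topic name here is illustrative), a 7-day retention is 604800000 ms:&lt;/p&gt;

```shell
bin/kafka-configs.sh --alter \
    --bootstrap-server localhost:9092 \
    --entity-type topics \
    --entity-name orders \
    --add-config retention.ms=604800000
```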

&lt;p&gt;Storage, disk throughput, and network bandwidth are carefully planned before scaling up.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Kafka Tutorial for Beginners | Everything you need to get started, TechWorld with Nana&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Kafka Tutorial—Multi-chapter guide, Redpanda&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Featuring Apache Kafka in the Netflix Studio and Finance World, Confluent&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Kafka retention—Types, challenges, alternatives, Redpanda&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Creating your own data lake (MINIO+TRINO+GRAFANA)</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Fri, 19 Sep 2025 11:13:26 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/creating-your-own-data-lake-miniotrinografana-1no1</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/creating-your-own-data-lake-miniotrinografana-1no1</guid>
      <description>&lt;p&gt;Welcome, the full article might be a bit long so Ill break it into 3 parts &lt;br&gt;
For this one, we'll focus on miniO (the backbone of your datalake)&lt;/p&gt;

&lt;h1&gt;
  
  
  Getting Started with MinIO: Your Private S3-Compatible Data Lake
&lt;/h1&gt;

&lt;p&gt;MinIO is an open-source, high-performance object storage system that is fully compatible with the Amazon S3 API. Instead of storing data as files in directories, MinIO stores them as objects inside buckets. This makes it incredibly flexible and scalable, especially when working with large datasets.&lt;/p&gt;

&lt;p&gt;MinIO can be the storage layer of a data lake.&lt;br&gt;
On top of it, you connect tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark / Dask for big data processing&lt;/li&gt;
&lt;li&gt;Presto/Trino / Athena for SQL queries&lt;/li&gt;
&lt;li&gt;Grafana / Superset for visualization&lt;/li&gt;
&lt;li&gt;Airflow for orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Install it &lt;/p&gt;

&lt;p&gt;&lt;code&gt;wget https://dl.min.io/server/minio/release/linux-amd64/minio&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Make it executable &lt;/p&gt;

&lt;p&gt;&lt;code&gt;chmod +x minio&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Move the executable so that it can be run from anywhere &lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo mv minio /usr/local/bin&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Export the root user and password (start from here when restarting your server after installation)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export MINIO_ROOT_USER=&amp;lt;your-user-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export MINIO_ROOT_PASSWORD=&amp;lt;your-password&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Start the server &lt;/p&gt;

&lt;p&gt;&lt;code&gt;minio server ~/minio-data --console-address ":9001"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One thing to note: MinIO exposes its S3 API at "localhost:9000", which is why we pass ":9001" as the console address, keeping the API and the web console on separate ports &lt;br&gt;
Go to "localhost:9001" for your web UI &lt;/p&gt;
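&lt;p&gt;If you also install MinIO's command-line client, &lt;code&gt;mc&lt;/code&gt;, you can point it at the API port (9000, not the console's 9001) and manage buckets from the terminal; the alias name &lt;code&gt;local&lt;/code&gt; below is arbitrary:&lt;/p&gt;

```shell
mc alias set local http://localhost:9000 your-user-name your-password
mc mb local/coin-vis-automated   # create a bucket
mc ls local                      # list buckets
```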

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F546444vpjh0kzn3xbun4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F546444vpjh0kzn3xbun4.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then log in to your server to get a browser console like this &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty4wy4c9g35pcgymcylw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty4wy4c9g35pcgymcylw.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Add data to the server
&lt;/h2&gt;

&lt;p&gt;In this case, we can create an Airflow DAG that will load the data for us. &lt;br&gt;
I'm using CoinGecko to get data on a few coins (get your API key first, add it to a .env file, then load it as I did)&lt;br&gt;
Then transforming the data using pandas to put it in the right format &lt;br&gt;
Finally, saving it into MinIO as Parquet files&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;If you're not familiar with Airflow, you can just create a separate file and run the "coin_vis_etl" function&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Notice I'm using port 9000, not 9001&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow import DAG
from airflow.operators.python import PythonOperator 
import requests
import pandas as pd
import pyarrow as pa
import pyarrow.fs  # needed so that pa.fs is available below
import pyarrow.parquet as pq
import os
from dotenv import load_dotenv
from datetime import datetime


def coin_vis_etl():
    load_dotenv()


    MY_API_KEY = os.getenv("MY_API_KEY")
    crypto_list = ['bitcoin', 'solana', 'ethereum', 'hyperliquid', 'binancecoin']

    for crypto in crypto_list:
        #load the data from the api 
        url = f"https://api.coingecko.com/api/v3/coins/{crypto}/market_chart?vs_currency=usd&amp;amp;days=30"
        headers = {
            'accepts':'application/json',
            'x-cg-demo-api-key': MY_API_KEY    
            }
        response = requests.get(url, headers=headers)

        #define custom metadata
        # use .encode() if you're adding a formatted string instead of a bytes literal
        custom_metadata ={
            b"source": b"coingecko API",
            b"coin_name": f"{crypto}".encode()
        }

        if response.status_code == 200:
            data = response.json()
            temp = pd.DataFrame(data['prices'], columns=[f"{crypto}_timestamps", f"{crypto}_prices"])

            # we can just add the columns since the time stamps are the same
            temp[f"{crypto}_market_caps"]= [x[1] for x in data["market_caps"]]
            temp[f"{crypto}_total_volumes"] = [x[1] for x in data["total_volumes"]]

            # change timestamps to the real ones 
            temp[f"{crypto}_timestamps"] = pd.to_datetime(temp[f"{crypto}_timestamps"], unit='ms')

            # use pyarrow to change them parquets 
            table = pa.Table.from_pandas(temp)

            # add the custom metadata to the table's schema
            existing_metadata = table.schema.metadata or {}
            new_metadata = {**custom_metadata, **existing_metadata}
            table = table.replace_schema_metadata(new_metadata)

            # finally, write the data to MinIO using pyarrow
            fs = pa.fs.S3FileSystem(
                access_key="eric",
                secret_key='eric1234',
                endpoint_override ="http://localhost:9000"
            )

            pq.write_table(
                table, 
                f"coin-vis-automated/coin_vis_{crypto}.parquet",
                filesystem=fs
            )
            print(f'{crypto} is done')

        else:
            print(f"API error at {crypto}, {response.content}")


with DAG(
    dag_id = 'coin_vis_etl',
    start_date = datetime(2025, 9, 15),
    schedule_interval= '@hourly',
    catchup= False

) as dag:
    runetl=PythonOperator(
        task_id = 'coin_vis_etl',
        python_callable = coin_vis_etl
    )

    runetl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there you go, you've just created your own S3-compatible object storage &lt;/p&gt;

</description>
      <category>minio</category>
      <category>trino</category>
      <category>grafana</category>
    </item>
    <item>
      <title>Why I’m Switching to Parquet for Data Storage</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Mon, 15 Sep 2025 05:49:08 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/why-im-switching-to-parquet-for-data-storage-3ckm</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/why-im-switching-to-parquet-for-data-storage-3ckm</guid>
      <description>&lt;p&gt;The first time I came across Parquet files was during my fourth-year project. I kept seeing Hugging Face recommend them whenever I uploaded a custom dataset, and I wondered: why are they so obsessed with this file format?&lt;/p&gt;

&lt;p&gt;Fast forward to today, as I dive deeper into object storage and data lakes, Parquet shows up everywhere again. After some research and hands-on work, I finally get it: this format is not just hype. It’s genuinely better for large-scale data.&lt;/p&gt;

&lt;p&gt;First, let's get the &lt;strong&gt;benefits&lt;/strong&gt; out of the way &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They’re simple&lt;/strong&gt; – easy to read/write with common libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They’re fast&lt;/strong&gt; – optimized for data retrieval queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They’re compact&lt;/strong&gt; – storing the same data in less storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They’re schema-aware&lt;/strong&gt; – built-in metadata and structure make them perfect for data lakes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I saw all this, my thoughts were: this can't be right... right?&lt;/p&gt;

&lt;h1&gt;
  
  
  So let's unpack the benefits
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Simple to Use
&lt;/h2&gt;

&lt;p&gt;Converting a CSV to Parquet takes just a few lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Example dataframe
df = pd.DataFrame({
    "timestamp": ["2025-09-10", "2025-09-11"],
    "symbol": ["BTC", "BTC"],
    "close": [57800, 58800]
})

table = pa.Table.from_pandas(df)
pq.write_table(table, "crypto.parquet")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it — you now have a Parquet file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Faster Queries
&lt;/h2&gt;

&lt;p&gt;Unlike CSVs, Parquet is a columnar storage format. Instead of organizing data row by row, it stores values by column, just like a pandas dataframe.&lt;br&gt;
Think of it like this: CSV is like a text document; Parquet is like a database table optimized for analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need just one column? Parquet can read only that column instead of scanning the whole file.&lt;/li&gt;
&lt;li&gt;Query engines (Spark, DuckDB, etc.) can skip irrelevant chunks entirely.
This makes querying large datasets significantly faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  More Compact
&lt;/h2&gt;

&lt;p&gt;Parquet files store the same data in less space than a CSV would, and here's how they achieve it.&lt;br&gt;
Parquet is highly compressed by design. A few tricks it uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integers - binary encoding (fewer bytes than text).&lt;/li&gt;
&lt;li&gt;Strings - dictionary encoding (e.g., "BTC" stored once, then referenced by index).&lt;/li&gt;
&lt;li&gt;Repeated values - run-length encoding (RLE). &lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Schema and Metadata
&lt;/h2&gt;

&lt;p&gt;This is where data lakes come in. &lt;br&gt;
Parquet files are &lt;strong&gt;schema-aware&lt;/strong&gt;: every file carries a schema, which is basically the blueprint of the data inside it&lt;br&gt;
It describes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Column names&lt;/li&gt;
&lt;li&gt;Data types (int, float, string, timestamp, boolean, etc.)&lt;/li&gt;
&lt;li&gt;Nullable or not&lt;/li&gt;
&lt;li&gt;Nested structures (if any)
You can even define your own schema:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("timestamp", pa.timestamp("s")),
    ("symbol", pa.string()),
    ("price", pa.float64())
])

data = [
    ["2025-09-10 00:00:00", "BTC", 57800.0],
    ["2025-09-11 00:00:00", "ETH", 1800.5],
]

# build one Arrow array per column, using the types from the schema
columns = [pa.array(col, type=field.type) for col, field in zip(zip(*data), schema)]
table = pa.Table.from_arrays(columns, schema=schema)
pq.write_table(table, "crypto_with_schema.parquet")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Parquet also supports &lt;strong&gt;rich metadata&lt;/strong&gt;, which is essential in data lakes. Without metadata, a data lake quickly turns into a “data swamp.”&lt;br&gt;
Every Parquet file already stores basic metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Column count&lt;/li&gt;
&lt;li&gt;Data types&lt;/li&gt;
&lt;li&gt;Compression info&lt;/li&gt;
&lt;li&gt;Column statistics (min/max values, null counts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But you can also add custom metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pq.write_table(
    table,
    "btc_prices.parquet",
    metadata={
        b"source": b"CoinGecko",
        b"pipeline": b"airflow-crypto-etl",
        b"tokens": b"BTC,ETH,SOL,HYPE,BNB"
    }
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Querying in a Data Lake
&lt;/h1&gt;

&lt;p&gt;Once your Parquet files are in object storage (e.g., S3), you can query them directly with modern engines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT timestamp, close
FROM 's3://crypto-data/*.parquet'
WHERE symbol = 'BTC' AND timestamp &amp;gt;= '2025-09-01'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes Parquet a natural fit for tools like Spark, DuckDB, and Presto.&lt;/p&gt;

&lt;h1&gt;
  
  
  Bottom Line
&lt;/h1&gt;

&lt;p&gt;Parquet isn’t just another file format. It’s a compact, fast, and schema-aware way of storing data that plays perfectly with modern data lakes.&lt;br&gt;
If you care about performance, storage efficiency, and long-term scalability, Parquet is a no-brainer.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>datalake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why you need to learn Apache Airflow - right now</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Mon, 08 Sep 2025 15:41:39 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/why-you-need-to-learn-apache-airflow-right-now-502j</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/why-you-need-to-learn-apache-airflow-right-now-502j</guid>
      <description>&lt;h1&gt;
  
  
  The standard
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What it is and how we got here
&lt;/h2&gt;

&lt;p&gt;Apache Airflow is an open-source workflow orchestration platform, used to author, schedule, and monitor complex data workflows in a reliable and scalable way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fun fact:&lt;/strong&gt; Apache Airflow is a tool that was developed by Airbnb, yes, the one with the houses and apartments.&lt;/p&gt;

&lt;p&gt;In a nutshell, Airflow allows you to define workflows as Directed Acyclic Graphs (DAGs) of tasks, written in Python. Each task might involve data extraction, transformation, loading (ETL), model training, reporting, or any other step in a data pipeline&lt;/p&gt;

&lt;p&gt;Ever since its inception, it has only grown more popular, becoming a staple for companies and individual data engineers alike.&lt;br&gt;
In my opinion, there's quite a bit of a learning curve, especially in the setup, but once you get past that, it's smooth sailing. Here's why: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python-based&lt;/strong&gt; – Workflows are defined in Python code, which means it's relatively easy to pick up, even for beginners &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt; – Because it’s Python-based, it integrates easily with existing systems and APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt; – Suitable for startups and individual developers, to enterprises and large corporations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debugability&lt;/strong&gt; – The UI and logs make debugging pipelines straightforward. It's really intuitive &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ecosystem Support&lt;/strong&gt; – Many cloud providers (AWS MWAA, Google Cloud Composer, Astronomer) offer managed Airflow services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proven Track Record&lt;/strong&gt; – Used by tech giants and enterprises for mission-critical pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  A brief tour
&lt;/h1&gt;

&lt;p&gt;Let's walk through how to set up and run a data pipeline. I'll explain more as we go along &lt;br&gt;
The pipelines are defined as Directed Acyclic Graphs (DAGs)&lt;/p&gt;

&lt;p&gt;So first, Airflow has two sides: the web server UI and the scheduler&lt;/p&gt;
&lt;h2&gt;
  
  
  Web server
&lt;/h2&gt;

&lt;p&gt;The web server is the main GUI, which acts as command central. This is where we can do all sorts of things with the DAGs, like debugging and monitoring them.&lt;br&gt;
Run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apache web-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbbahppp30ga99re4zeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbbahppp30ga99re4zeo.png" alt="A picture or the out come" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the server is started, you can view the GUI at the default port 8080. It should look something like this &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7zk70s0sowwjjj0fi9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7zk70s0sowwjjj0fi9h.png" alt=" " width="800" height="450"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduler
&lt;/h2&gt;

&lt;p&gt;This is the core of Apache Airflow. It decides when and what should be done &lt;/p&gt;

&lt;p&gt;Think of the web server as the dashboard (the speedometer), and the scheduler as the engine that makes the car move, driving the dashboard readings&lt;/p&gt;

&lt;p&gt;It parses the DAGs, schedules tasks, manages dependencies, dispatches work, and handles catchup &amp;amp; backfill&lt;/p&gt;

&lt;p&gt;In a separate terminal, run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qvlim5bbsebx41r3h79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qvlim5bbsebx41r3h79.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple DAG
&lt;/h2&gt;

&lt;p&gt;Now, in the previous screenshots, the line "export AIRFLOW_HOME=$(pwd)/airflow" is a simple organisational step.&lt;br&gt;
Airflow automatically creates an airflow folder; this line tells Airflow to create it in your current directory instead.&lt;/p&gt;

&lt;p&gt;In a separate terminal, inside this "airflow" folder, create a folder named dags. There, create a Python file for your DAGs with the text editor of your choice.&lt;/p&gt;
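&lt;p&gt;In shell terms, the steps above amount to something like the following, run from your project directory (the DAG filename here is illustrative):&lt;/p&gt;

```shell
# Keep Airflow's metadata and config alongside the project
export AIRFLOW_HOME=$(pwd)/airflow

# Airflow discovers DAG files in $AIRFLOW_HOME/dags
mkdir -p "$AIRFLOW_HOME/dags"

# Create an empty Python file to hold the DAG definition
touch "$AIRFLOW_HOME/dags/fetch_load.py"
```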

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0ft8zvpt6jczzf00y2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0ft8zvpt6jczzf00y2q.png" alt=" " width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, proceed to write your DAG.&lt;br&gt;
This is a simple DAG, in Python, for the extraction step of the ETL process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs45j4u7v3r0bucjhmraa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs45j4u7v3r0bucjhmraa.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Run your DAG
&lt;/h2&gt;

&lt;p&gt;Now restart your scheduler and web server, and confirm that your DAG is present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7p1x6w81cbn1y8yx9y3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7p1x6w81cbn1y8yx9y3.png" alt=" " width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fetch load DAG is present, and you can click the play button to run it.&lt;br&gt;
Click on your DAG's name for closer inspection or more options for that specific DAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figwg4bwzgg8w04hoh2cv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figwg4bwzgg8w04hoh2cv.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k1i1bar0qs5odx3m1vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k1i1bar0qs5odx3m1vc.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I find the logs to be the most helpful part, especially when debugging.&lt;br&gt;
For instance, in the failed run below, I hadn't properly defined the schema before executing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq318gobjam32n7e3d889.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq318gobjam32n7e3d889.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>airflow</category>
      <category>python</category>
    </item>
    <item>
      <title>Setting Up PostgreSQL on a Virtual Machine</title>
      <dc:creator>Eric Kahindi</dc:creator>
      <pubDate>Sun, 03 Aug 2025 17:54:17 +0000</pubDate>
      <link>https://forem.com/eric_kahindi_cfbfda3bd0f7/setting-up-postgresql-on-a-virtual-machine-231f</link>
      <guid>https://forem.com/eric_kahindi_cfbfda3bd0f7/setting-up-postgresql-on-a-virtual-machine-231f</guid>
      <description>&lt;p&gt;In this guide, we’ll walk through the steps to set up a PostgreSQL server on a virtual machine (VM). This is ideal for anyone learning backend development, data engineering or deploying small projects in a cloud environment.&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating the Virtual Machine
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Choose Your Cloud Provider
&lt;/h2&gt;

&lt;p&gt;Before installing PostgreSQL, we need a VM running on a cloud provider.&lt;br&gt;
While you can use any cloud provider of your choice (AWS, Google Cloud, DigitalOcean, etc.), we'll focus on Microsoft Azure for this tutorial.&lt;br&gt;
Azure provides a generous free tier for trying out its products, with a special offer for students, who get about $100 in free credit for a whole year.&lt;/p&gt;
&lt;h2&gt;
  
  
  Create your account
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Visit the Azure for Students page&lt;/li&gt;
&lt;li&gt;Sign up using your student email address&lt;/li&gt;
&lt;li&gt;Verify your student status&lt;/li&gt;
&lt;li&gt;Complete the registration process&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Create a Virtual Machine
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Log into the Azure Portal&lt;/li&gt;
&lt;li&gt;Click "Create a resource"&lt;/li&gt;
&lt;li&gt;Search for "Virtual Machine" and select it&lt;/li&gt;
&lt;li&gt;Choose your preferred Linux distribution (Ubuntu 20.04 LTS recommended)&lt;/li&gt;
&lt;li&gt;Select an appropriate VM size (B1s or B2s for testing purposes)&lt;/li&gt;
&lt;li&gt;Configure authentication (SSH public key recommended)&lt;/li&gt;
&lt;li&gt;Configure networking settings&lt;/li&gt;
&lt;li&gt;Review and create your VM&lt;/li&gt;
&lt;/ul&gt;
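&lt;p&gt;If you prefer the command line over the portal, the same VM can be sketched with the Azure CLI. The resource group, VM name, and image alias below are illustrative; pick the Ubuntu LTS image and size you actually want:&lt;/p&gt;

```shell
# Create a resource group, then a small Ubuntu VM with SSH key auth
az group create --name pg-demo-rg --location eastus

az vm create \
  --resource-group pg-demo-rg \
  --name pg-vm \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --admin-username azureuser \
  --generate-ssh-keys
```

&lt;p&gt;The --generate-ssh-keys flag creates a key pair if you don't already have one, matching the SSH-key authentication recommended above.&lt;/p&gt;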
&lt;h1&gt;
  
  
  Connect and set up PostgreSQL
&lt;/h1&gt;

&lt;p&gt;Once your VM is running, you can connect to it using SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh username@your-vm-public-ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Install PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Update your system packages first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt upgrade -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install PostgreSQL with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install postgresql postgresql-contrib -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs PostgreSQL along with the contrib package, which provides additional extensions and utilities for customising PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start PostgreSQL
&lt;/h2&gt;

&lt;p&gt;First, start the PostgreSQL instance. Then enable it so that it starts automatically when the server boots, and finally check its status to confirm the changes have taken effect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl start postgresql
sudo systemctl enable postgresql
sudo systemctl status postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Log in to PostgreSQL instance
&lt;/h2&gt;

&lt;p&gt;Switch to the default postgres user to use the PostgreSQL command-line interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -i -u postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch the PostgreSQL interactive terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Navigate and Use PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Now that we're logged into psql, we can execute meta-commands or SQL queries.&lt;br&gt;
Here is a list of common ones.&lt;br&gt;
List all available databases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE USER lux WITH PASSWORD '****';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give the user superuser privileges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER USER lux WITH SUPERUSER;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit the PostgreSQL shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\q
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit back to your regular Linux user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Configure PostgreSQL for Remote Access
&lt;/h1&gt;

&lt;p&gt;By default, PostgreSQL only accepts connections from localhost. Let’s change that.&lt;/p&gt;

&lt;p&gt;Navigate to the PostgreSQL configuration directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /etc/postgresql/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find the directory for your PostgreSQL version and navigate into it, then into the main configuration directory, and open postgresql.conf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo nano postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find the line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_addresses = 'localhost'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change it to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_addresses = '*'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows any host to connect to the PostgreSQL instance; if you prefer, you can instead list the specific IP addresses that should be allowed.&lt;br&gt;
Save and exit the file (Ctrl + X in nano).&lt;/p&gt;
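&lt;p&gt;Note that listen_addresses alone isn't enough: PostgreSQL also consults pg_hba.conf for client authentication rules, and a restart is needed for the changes to take effect. A minimal (and permissive) sketch, assuming the default Debian/Ubuntu layout:&lt;/p&gt;

```shell
# Allow any IPv4 client to authenticate with a password
# (tighten the 0.0.0.0/0 CIDR to specific addresses in production)
echo "host  all  all  0.0.0.0/0  md5" | sudo tee -a /etc/postgresql/*/main/pg_hba.conf

# Restart so both postgresql.conf and pg_hba.conf changes take effect
sudo systemctl restart postgresql

# From your local machine, test the connection
psql -h your-vm-public-ip -U lux -d postgres
```

&lt;p&gt;You may also need to open port 5432 in your VM's network security group before remote connections will reach the server.&lt;/p&gt;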

&lt;h1&gt;
  
  
  Done
&lt;/h1&gt;

&lt;p&gt;You now have a fully functioning PostgreSQL server running on a virtual machine.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
