<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Denzel Kanyeki</title>
    <description>The latest articles on Forem by Denzel Kanyeki (@dkkinyua).</description>
    <link>https://forem.com/dkkinyua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1171630%2F5287df81-0524-44a4-adf7-f27b37582ef8.jpeg</url>
      <title>Forem: Denzel Kanyeki</title>
      <link>https://forem.com/dkkinyua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dkkinyua"/>
    <language>en</language>
    <item>
      <title>Building a Reddit Sentiment Pipeline using Python, PostgreSQL, VADER, Airflow, Grafana, Prometheus and StatsD</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 22 Sep 2025 09:39:34 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-reddit-sentiment-pipeline-using-python-postgresql-vader-airflow-grafana-prometheus-5gdn</link>
      <guid>https://forem.com/dkkinyua/building-a-reddit-sentiment-pipeline-using-python-postgresql-vader-airflow-grafana-prometheus-5gdn</guid>
      <description>&lt;p&gt;Data is everywhere, but making sense of it requires collecting, cleaning, storing, and analyzing it effectively. In this blog, I’ll walk you through how I built a &lt;strong&gt;Reddit Sentiment Analysis pipeline&lt;/strong&gt; that fetches Reddit posts, analyzes their sentiment, stores the results in &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and visualizes insights in &lt;strong&gt;Grafana&lt;/strong&gt;, all orchestrated with &lt;strong&gt;Apache Airflow&lt;/strong&gt; and containerized using &lt;strong&gt;Docker Compose&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to check out the repository, here is the &lt;a href="https://github.com/dkkinyua/RedditSentimentAnalysis" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extract Reddit posts from specific subreddits using the &lt;code&gt;praw&lt;/code&gt; library (Python Reddit API Wrapper).
&lt;/li&gt;
&lt;li&gt;Clean and analyze sentiment with &lt;strong&gt;VADER Sentiment Analyzer&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Load processed data into &lt;strong&gt;Postgres&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Orchestrate everything with &lt;strong&gt;Airflow DAGs&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Visualize insights on a &lt;strong&gt;Grafana Dashboard&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Get &lt;strong&gt;email notifications&lt;/strong&gt; when DAGs fail or succeed via SMTP.&lt;/li&gt;
&lt;li&gt;Monitor metrics and service health using &lt;strong&gt;Prometheus&lt;/strong&gt;, &lt;strong&gt;StatsD&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The snapshot below shows the project architecture:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gs848bbqxm2kbr92naw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gs848bbqxm2kbr92naw.jpg" alt="Architecture" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Extract Reddit Data
&lt;/h3&gt;

&lt;p&gt;Using the &lt;code&gt;praw&lt;/code&gt; Python library, I pull the &lt;strong&gt;top 50 posts&lt;/strong&gt; from subreddits like &lt;code&gt;r/kenya&lt;/code&gt; and &lt;code&gt;r/nairobi&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;reddit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;praw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Reddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRET_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data pipeline by u/user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;

          &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
          &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
          &lt;span class="n"&gt;subs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kenya&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nairobi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
              &lt;span class="n"&gt;subreddit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reddit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subreddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subreddit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selftext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selftext&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subreddit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_utc&lt;/span&gt;
                  &lt;span class="p"&gt;})&lt;/span&gt;

          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Transform &amp;amp; Analyze
&lt;/h3&gt;

&lt;p&gt;Posts are cleaned (special characters removed, &lt;code&gt;N/A&lt;/code&gt; placeholders filtered out) and passed into the &lt;strong&gt;VADER Sentiment Analyzer&lt;/strong&gt;, which returns:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pos&lt;/code&gt; → Positive score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;neu&lt;/code&gt; → Neutral score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;neg&lt;/code&gt; → Negative score
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compound&lt;/code&gt; → Overall sentiment
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then classify sentiment into &lt;strong&gt;Positive, Neutral, or Negative&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_text&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;fetch posts from Postgres, analyze sentiment from posts, and save results back to a different psql table.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Impoting these modules inside this task helps with run time
&lt;/span&gt;        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vaderSentiment.vaderSentiment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentimentIntensityAnalyzer&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;utils.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_sentiment&lt;/span&gt;

        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentimentIntensityAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Load from DB
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit_posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Clean text off the N/A tags
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Apply sentiment
&lt;/span&gt;            &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;polarity_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_sentiment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Save results
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit_sentiment_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error analyzing or saving data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
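
&lt;p&gt;The &lt;code&gt;get_sentiment&lt;/code&gt; helper referenced above maps the &lt;code&gt;compound&lt;/code&gt; score to a label. Here is a minimal sketch, assuming the conventional VADER cutoffs of ±0.05:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_sentiment(compound: float) -&gt; str:
    # conventional VADER thresholds: &gt;= 0.05 positive, &lt;= -0.05 negative
    if compound &gt;= 0.05:
        return 'Positive'
    elif compound &lt;= -0.05:
        return 'Negative'
    return 'Neutral'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;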



&lt;h3&gt;
  
  
  Step 3: Load into PostgreSQL
&lt;/h3&gt;

&lt;p&gt;The raw posts from PRAW are stored in a &lt;code&gt;reddit_posts&lt;/code&gt; table, and the processed results in a &lt;code&gt;reddit_sentiment_analysis&lt;/code&gt; table. This separation of concerns follows a medallion architecture: a &lt;strong&gt;bronze layer&lt;/strong&gt; holds raw data without any transformations (preventing data loss), a &lt;strong&gt;silver layer&lt;/strong&gt; holds semi-structured data, and a &lt;strong&gt;gold layer&lt;/strong&gt; serves as the "single source of truth" for decision-making, ready for dashboards in Grafana.&lt;/p&gt;
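
&lt;p&gt;As a rough sketch of how the layers map onto Postgres (the &lt;code&gt;reddit&lt;/code&gt; schema name comes from the code above; the layer-to-table mapping is my summary of the setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sqlalchemy import create_engine, text

engine = create_engine(DB_URL)  # DB_URL as used in the DAG tasks
with engine.begin() as conn:
    # one schema holds both tables used in this project
    conn.execute(text('CREATE SCHEMA IF NOT EXISTS reddit'))

# bronze: reddit.reddit_posts               -- raw PRAW output, no transformations
# gold:   reddit.reddit_sentiment_analysis  -- scored data, ready for Grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;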

&lt;h3&gt;
  
  
  Step 4: Orchestration with Airflow
&lt;/h3&gt;

&lt;p&gt;Two DAGs coordinate the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;reddit_dag.py&lt;/code&gt; → Extract &amp;amp; store raw posts.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;analyze_dag.py&lt;/code&gt; → Waits for the first DAG via Airflow's &lt;code&gt;ExternalTaskSensor&lt;/code&gt;, which pokes &lt;code&gt;reddit_dag&lt;/code&gt; until it completes, then runs sentiment analysis and sends email notifications.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read more about &lt;code&gt;ExternalTaskSensor&lt;/code&gt; in the &lt;a href="https://airflow.apache.org/docs/apache-airflow/2.1.2/_api/airflow/sensors/external_task/index.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
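
&lt;p&gt;For illustration, here is a minimal sketch of how the sensor can be wired into &lt;code&gt;analyze_dag&lt;/code&gt; (the task id and timings are assumptions based on the description above, not the project's exact settings):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from airflow.sensors.external_task import ExternalTaskSensor

# pokes reddit_dag until its run completes, then lets analyze_dag proceed
wait_for_reddit = ExternalTaskSensor(
    task_id='wait_for_reddit_dag',
    external_dag_id='reddit_dag',
    external_task_id=None,  # None waits for the whole DAG run
    poke_interval=60,       # seconds between pokes
    timeout=60 * 60,        # fail after one hour of waiting
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the sensor matches on identical logical dates by default, so both DAGs need aligned schedules (or an &lt;code&gt;execution_delta&lt;/code&gt;).&lt;/p&gt;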

&lt;h3&gt;
  
  
  Step 5: Visualize in Grafana
&lt;/h3&gt;

&lt;p&gt;Grafana queries Postgres directly and visualizes metrics like sentiment trends, subreddit comparisons, and post volume. It also visualizes Airflow metrics collected via StatsD and Prometheus.&lt;/p&gt;

&lt;p&gt;StatsD is integrated into this project as a metrics collector for Airflow. Airflow emits metrics such as &lt;code&gt;dag_processing.processes&lt;/code&gt; and &lt;code&gt;executor.open_slots&lt;/code&gt;, which capture scheduler and worker health, DAG processing times, and task execution details. These metrics are forwarded to monitoring backends and visualized in Grafana dashboards for observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1b85dkbfa41hpl55agy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1b85dkbfa41hpl55agy.jpg" alt="Airflow x StatsD" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is used as the metrics storage and querying system. The Prometheus service is configured in the &lt;code&gt;docker-compose.yml&lt;/code&gt; and &lt;code&gt;prometheus.yml&lt;/code&gt; files; it scrapes the metrics that Airflow exposes at the &lt;code&gt;/metrics&lt;/code&gt; endpoint via a Prometheus exporter. Prometheus then makes these metrics queryable using PromQL and visualizable in Grafana.&lt;/p&gt;
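
&lt;p&gt;For reference, a minimal scrape job in &lt;code&gt;prometheus.yml&lt;/code&gt; could look like the following (the exporter host and port are assumptions that depend on the service names in your &lt;code&gt;docker-compose.yml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
  - job_name: 'airflow'
    scrape_interval: 15s
    static_configs:
      # assumed service name/port of the StatsD exporter container
      - targets: ['statsd-exporter:9102']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;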

&lt;h3&gt;
  
  
  To add Prometheus as a data source on Grafana and visualize metrics:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Head over to &lt;code&gt;localhost:3000&lt;/code&gt; to access the Grafana service
&lt;/h4&gt;

&lt;p&gt;Use &lt;code&gt;admin&lt;/code&gt; as both the username and password; you will then be prompted to change your password for security reasons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkou0shxm1rwgvvbb6dlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkou0shxm1rwgvvbb6dlf.png" alt="login page" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Adding Prometheus as a data source
&lt;/h4&gt;

&lt;p&gt;On the left side, click on 'Add Data Source' and select Prometheus as the data source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyay38baevy4tgqx6gr1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyay38baevy4tgqx6gr1z.png" alt="datasource page" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Import a custom dashboard
&lt;/h4&gt;

&lt;p&gt;To use a custom dashboard, click on the &lt;strong&gt;Import Dashboard&lt;/strong&gt; option and paste the contents of &lt;code&gt;dashboard.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad1ctsoxewjb6udw9mpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad1ctsoxewjb6udw9mpg.png" alt="Import dashboard" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the metrics dashboard showing the scheduler heartbeat rate, DAGs loaded, DAG processing duration, queued tasks, and the time taken by the executor to run tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09fmighwjhdqk1ldzu9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09fmighwjhdqk1ldzu9i.png" alt="Metrics dashboard" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights generated from this project
&lt;/h2&gt;

&lt;p&gt;The dashboard below visualizes the data generated from this project based on sentiment scores. Here is the &lt;a href="https://deecodes.grafana.net/public-dashboards/8fe4cbfc00dd48e08038c5ae040c8dfb" rel="noopener noreferrer"&gt;link&lt;/a&gt; to this dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp9pyw9o5q16saznvxb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp9pyw9o5q16saznvxb6.png" alt="findings" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6syjw1tau3cbsfjkelh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6syjw1tau3cbsfjkelh.png" alt="Findings 2" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;66% of the posts show positive sentiment, while 23% show negative sentiment and 7% show neutral sentiment.&lt;/li&gt;
&lt;li&gt;r/Nairobi shows more positive-sentiment posts than r/Kenya, along with a higher number of posts from users.&lt;/li&gt;
&lt;li&gt;r/Nairobi has more posts than r/Kenya, with 659 posts compared to r/Kenya's 616.&lt;/li&gt;
&lt;li&gt;The highest sentiment was recorded on 15 September 2025 (0.483), and the lowest on 12 September 2025 (0.213).&lt;/li&gt;
&lt;li&gt;Upvote score does not correlate with sentiment score: a low upvote score does not imply low sentiment, and vice versa.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project highlights how well-designed data pipelines can unlock valuable insights from unstructured data sources like Reddit, while following best practices in data engineering and analytics.&lt;/p&gt;

&lt;p&gt;Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at &lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt; or visit any of my social media platforms linked on my GitHub page.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>data</category>
      <category>programming</category>
    </item>
    <item>
      <title>Docker for Data Engineers: The Complete Beginner’s Guide</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Wed, 03 Sep 2025 11:33:43 +0000</pubDate>
      <link>https://forem.com/dkkinyua/docker-for-data-engineers-the-complete-beginners-guide-4ik0</link>
      <guid>https://forem.com/dkkinyua/docker-for-data-engineers-the-complete-beginners-guide-4ik0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine being able to package your entire data pipeline into a neat, portable box that runs anywhere without the dreaded &lt;em&gt;“but it works on my machine”&lt;/em&gt; excuse. That’s the magic of Docker.&lt;/p&gt;

&lt;p&gt;For data engineers, whose workflows often span multiple environments, including local machines, cloud servers, and clusters, Docker provides consistency, scalability, and speed. But before we dive in, let’s get an idea of what Docker is and its history.&lt;/p&gt;

&lt;p&gt;Docker is a virtualization tool first released in 2013 by Solomon Hykes at dotCloud, a Platform as a Service (PaaS) company. It quickly gained popularity for making containerization mainstream. Instead of running heavy, resource-consuming virtual machines, Docker introduced lightweight, portable containers. Fast forward to today, and Docker has become a cornerstone of modern data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functions of Docker in Data Engineering
&lt;/h2&gt;

&lt;p&gt;Below are the functions that make Docker indispensable to data engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Portability – Package your pipelines and dependencies into a single container that runs anywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency – No more environment conflicts between dev, staging, and production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Isolation – Each container runs independently, so one failing service won’t break the rest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficiency – Containers use fewer resources compared to virtual machines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability – Run multiple containers simultaneously and scale your pipelines with ease.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Docker vs Virtual Machines (VMs)
&lt;/h2&gt;

&lt;p&gt;The table below shows the main differences between a Virtual Machine and Docker.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Docker (Containers)&lt;/th&gt;
&lt;th&gt;Virtual Machines (VMs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Startup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Minutes, as it has to boot a full OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low. Shares host OS kernel&lt;/td&gt;
&lt;td&gt;High. Each VM runs its own OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Process-level isolation&lt;/td&gt;
&lt;td&gt;Full OS-level isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs anywhere with Docker installed&lt;/td&gt;
&lt;td&gt;Limited, requires hypervisor support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High, lightweight and fast&lt;/td&gt;
&lt;td&gt;Lower, more overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natively runs only on Linux distros e.g. Ubuntu, Arch Linux; Windows and macOS need Docker Desktop&lt;/td&gt;
&lt;td&gt;Supports multiple Operating Systems (OS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Virtualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtualizes the OS Application layer&lt;/td&gt;
&lt;td&gt;Virtualizes the OS Application and Kernel layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;QUICK NOTE&lt;/strong&gt;: The kernel allows for communication between the machine/hardware and the application layer; see the Traditional OS architecture chart below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdr2z136na4qujrrhlsy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdr2z136na4qujrrhlsy.jpg" alt="Traditional OS Architecture" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Compatibility, Its Limitations, and the Solution
&lt;/h2&gt;

&lt;p&gt;Initially, Docker was designed to run on Linux machines, hence it does not run natively on Windows and macOS. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd9vhnszmm874kgvdds6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd9vhnszmm874kgvdds6.jpg" alt="Incompatibility Architecture diagram" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To sort this issue out, Docker later introduced Docker Desktop, which adds a hypervisor layer (its own lightweight Linux OS kernel) to support Windows and Mac machines, enabling them to build and run containers as if they were on Linux. It bridges the gap seamlessly for developers and data engineers. On Windows, this is powered by WSL2 (Windows Subsystem for Linux 2) or Hyper-V; on macOS, Docker Desktop uses a LinuxKit VM. &lt;/p&gt;

&lt;p&gt;The chart below shows how the hypervisor layer helped Windows and Mac machines run Docker smoothly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgz88h62x2f12mkm3oz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wgz88h62x2f12mkm3oz.jpg" alt="Compatibility diagram" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Images and Docker Containers
&lt;/h2&gt;

&lt;p&gt;A Docker image is a read-only template containing an application and all its necessary components, including the application code, libraries, dependencies, and configuration files. It serves as a blueprint for creating Docker containers, which are isolated, runnable instances of the application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8rddwwyvuhiyc6lmzx6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8rddwwyvuhiyc6lmzx6.jpg" alt="Docker image" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Docker container is a lightweight, isolated, and executable software package that bundles an application and all its dependencies, including the runtime, libraries, and configuration files, into a single unit. It is a running instance of a Docker image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F522ibx3c7a8l80wblb19.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F522ibx3c7a8l80wblb19.jpg" alt="Docker containers" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker and Docker Compose
&lt;/h2&gt;

&lt;p&gt;Running one container is cool, but data engineers usually need multiple services for their pipelines to run effectively e.g., PostgreSQL + Airflow + Spark. That’s where Docker Compose comes in handy.&lt;/p&gt;

&lt;p&gt;With a simple &lt;code&gt;docker-compose.yml&lt;/code&gt; file, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Define multiple services e.g., &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;spark&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure networks and volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start everything with a single command, &lt;code&gt;docker compose up&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of a &lt;code&gt;docker-compose.yml&lt;/code&gt; file with defined services &lt;code&gt;producer&lt;/code&gt;, &lt;code&gt;trainer&lt;/code&gt;, &lt;code&gt;fraud_consumer&lt;/code&gt; and &lt;code&gt;transaction_consumer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TOPIC=${TOPIC}&lt;/span&gt;

    &lt;span class="na"&gt;transaction_consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/transaction_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;

    &lt;span class="na"&gt;fraud_consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/fraud_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;USERNAME=${USERNAME}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PASSWORD=${PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;

    &lt;span class="na"&gt;trainer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/model.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Public and Private Docker Registries
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Docker Registry&lt;/strong&gt; is where images are stored and shared.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public Registry&lt;/strong&gt;: &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; is the most popular public Docker registry in the world. You can pull official images like &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, or &lt;code&gt;spark&lt;/code&gt; using &lt;code&gt;docker pull IMAGE_NAME:TAG&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  docker pull postgres:13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Private Registry&lt;/strong&gt;: Companies often host private registries for security and control. Examples: AWS Elastic Container Registry (ECR), Google Container Registry (GCR), or self-hosted Harbor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a data engineer, you might pull public images from Docker Hub but pull/push your team’s custom images to a private registry.&lt;/p&gt;
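
&lt;p&gt;As a sketch, publishing a custom image to a private registry generally looks like this (the registry address below is a hypothetical AWS ECR URL):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# tag the local image with the registry address, then push it
docker tag my-python-app 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:v1
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;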

&lt;h2&gt;
  
  
  Port Binding
&lt;/h2&gt;

&lt;p&gt;Containers run in isolation. To make a service inside a container accessible from your machine, you bind container ports to host ports.&lt;/p&gt;

&lt;p&gt;For example, PostgreSQL runs on port 5432. To access it from your laptop, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="nt"&gt;--name&lt;/span&gt; my_postgres postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first &lt;code&gt;5432&lt;/code&gt; is the host port (your machine).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second &lt;code&gt;5432&lt;/code&gt; is the container port (inside the container).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, you can connect to &lt;code&gt;localhost:5432&lt;/code&gt; on your machine, and Docker routes traffic into the container.&lt;/p&gt;
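
&lt;p&gt;As a quick sanity check, here is a minimal sketch that connects from the host machine (assuming &lt;code&gt;psycopg2-binary&lt;/code&gt; is installed and the container above is running):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

# localhost:5432 is the host side of the -p 5432:5432 binding
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    user='postgres',    # default user in the official image
    password='secret',  # matches POSTGRES_PASSWORD in the run command
    dbname='postgres',
)
with conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone()[0])
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;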

&lt;h2&gt;
  
  
  Building custom images using Dockerfile
&lt;/h2&gt;

&lt;p&gt;Sometimes the official images aren’t enough; you’ll need to build your own image. That’s where a Dockerfile comes in.&lt;/p&gt;

&lt;p&gt;The example shows a custom Python image with Pandas installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# start from official Python image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10&lt;/span&gt;

&lt;span class="c"&gt;# set working directory&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# copy project files&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app&lt;/span&gt;

&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas

&lt;span class="c"&gt;# run the script&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "main.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, to build and run our Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; my-python-app &lt;span class="nb"&gt;.&lt;/span&gt;

docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; pandas-prep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To learn more about Docker commands, visit this &lt;a href="https://docs.docker.com/get-started/docker_cheatsheet.pdf" rel="noopener noreferrer"&gt;cheat sheet&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mini Project: PostgreSQL and pgAdmin4 User Interface with Docker and Docker Compose
&lt;/h2&gt;

&lt;p&gt;Let's practice what we've learnt by building a simple Docker project. We’ll run a &lt;strong&gt;PostgreSQL database&lt;/strong&gt; and &lt;strong&gt;pgAdmin&lt;/strong&gt;, a web user interface that allows data engineers to access PostgreSQL from the web.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file:
&lt;/h3&gt;

&lt;p&gt;Let's define our services in our &lt;code&gt;docker-compose.yml&lt;/code&gt; file. Since we are not building a custom Dockerfile in this mini project, we can pull images from Docker Hub using the &lt;code&gt;image&lt;/code&gt; parameter inside the &lt;code&gt;postgres&lt;/code&gt; and &lt;code&gt;pgadmin&lt;/code&gt; services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo_db&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pg_data:/var/lib/postgresql/data&lt;/span&gt;

  &lt;span class="na"&gt;pgadmin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dpage/pgadmin4&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_EMAIL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin@admin.com&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5000:80"&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pg_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Always place your environment variables inside a &lt;code&gt;.env&lt;/code&gt; file to avoid leaking sensitive data. For this mini project the hard-coded credentials are fine, but do use a &lt;code&gt;.env&lt;/code&gt; file in future projects. To reference an environment variable inside a &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
        &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres_db&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_USER=${POSTGRES_USER}&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD=${POSTGRES_PASSWORD}&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_DB=${POSTGRES_DB}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
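
&lt;p&gt;The matching &lt;code&gt;.env&lt;/code&gt; file sits next to &lt;code&gt;docker-compose.yml&lt;/code&gt; and is picked up automatically by Docker Compose (the values below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POSTGRES_USER=admin
POSTGRES_PASSWORD=secret
POSTGRES_DB=demo_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;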



&lt;h3&gt;
  
  
  b. Start the services
&lt;/h3&gt;

&lt;p&gt;To start our services, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="c"&gt;# -d runs the services in the background&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Postgres will now be available on &lt;code&gt;localhost:5432&lt;/code&gt; and pgAdmin UI will be available on &lt;code&gt;localhost:5000&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Access pgAdmin
&lt;/h3&gt;

&lt;p&gt;You can now access your pgAdmin instance on &lt;code&gt;localhost:5000&lt;/code&gt;.&lt;br&gt;
Let's log in with the credentials we specified in our &lt;code&gt;docker-compose.yml&lt;/code&gt; file, which are &lt;code&gt;PGADMIN_DEFAULT_EMAIL&lt;/code&gt; and &lt;code&gt;PGADMIN_DEFAULT_PASSWORD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffa2dtk8lomuinrm5w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffa2dtk8lomuinrm5w6.png" alt="pgAdmin dashboard" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now also add a new server in pgAdmin with the following credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host: &lt;code&gt;postgres&lt;/code&gt;, the service name from Docker Compose&lt;/li&gt;
&lt;li&gt;Username: &lt;code&gt;admin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Password: &lt;code&gt;secret&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3b59iixit8yhabe7u6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3b59iixit8yhabe7u6t.png" alt="psql database" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You now have a working PostgreSQL database with a nice web UI, all running inside Docker containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For data engineers, Docker is like a Swiss Army knife: it simplifies workflows, ensures consistency, and allows rapid experimentation without breaking your system.&lt;/p&gt;

&lt;p&gt;Whether you’re spinning up a PostgreSQL database, deploying Airflow pipelines, or running Spark clusters, Docker makes it smooth and repeatable. So next time you hear &lt;em&gt;“it worked on my machine”&lt;/em&gt;, you’ll know Docker could’ve saved the day. Where did you first use Docker, and how has it helped you containerize your applications/pipelines? Please share in the comment section!&lt;/p&gt;

&lt;p&gt;Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at &lt;em&gt;&lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;&lt;/em&gt; or visit any of my social media platforms linked on my &lt;a href="https://github.com/dkkinyua" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; page.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building a Fraud Detection Pipeline using Python, PostgreSQL, Apache Kafka, PySpark, Grafana and Scikit-learn</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 18 Aug 2025 12:49:14 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-fraud-detection-pipeline-using-python-postgresql-apache-kafka-pyspark-grafana-and-10d5</link>
      <guid>https://forem.com/dkkinyua/building-a-fraud-detection-pipeline-using-python-postgresql-apache-kafka-pyspark-grafana-and-10d5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Fraud doesn’t happen once in a while; it happens every second. Every card swipe, every online purchase, every wire transfer has a chance of being fraudulent. And by the time fraud is detected in batch reports, the damage is usually done.&lt;/p&gt;

&lt;p&gt;I wanted to explore a question: what if fraud could be detected as it happens? Could a pipeline be built that not only processes financial transactions in real time, but also applies machine learning to flag suspicious activity on the fly?&lt;/p&gt;

&lt;p&gt;In this project, I built a fraud detection system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Produces synthetic transaction data using Faker. Read more &lt;a href="https://mailchimp.com/resources/what-is-synthetic-data/" rel="noopener noreferrer"&gt;here&lt;/a&gt; about why synthetic data is often preferred over real data for model training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trains an Isolation Forest machine learning model on transactional data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses Streamlit to visualize model performance and provide explainability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streams transactions into PostgreSQL, which integrates with Grafana for real-time dashboards and monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is fully containerized with Docker for deployment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;The flowchart below shows the architecture used in this project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jv3t0b4r0nr9k9tk26v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jv3t0b4r0nr9k9tk26v.jpg" alt="Project Architecture" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kafka Producer
&lt;/h3&gt;

&lt;p&gt;The Kafka producer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses Faker to generate synthetic users and transactions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Faker&lt;/span&gt;

&lt;span class="n"&gt;faker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;email&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how data is structured in our Kafka topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c5223ee0-c3b4-4149-ac2c-93daaedb16a7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8026c2d5-aaf6-4197-9242-0ecf633853f2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-08-06T06:12:33.763770"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Marie"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Walton"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Malaysia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Malaysia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;14129.53&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"device"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tablet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_fraud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simulates both legitimate and fraudulent events: fraud rules include high transaction amounts, location mismatches, and unusual devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streams messages to the Kafka topic (&lt;code&gt;transactions-data&lt;/code&gt;) with keys as hostnames and values as JSON payloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configured with &lt;code&gt;KafkaProducer&lt;/code&gt; and SASL/SCRAM authentication for secure cloud Kafka hosted on Redpanda, as sketched below.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
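
&lt;p&gt;The full producer isn't reproduced in the post, but the sketch below shows the general shape of the rule-based labelling and the SASL/SCRAM producer configuration using &lt;code&gt;kafka-python&lt;/code&gt;. The broker address, credentials, and thresholds are placeholders, not the repo's actual values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch: rule-based fraud labelling plus a SASL/SCRAM-authenticated
# producer. Broker, credentials and thresholds below are placeholders.
import json
import random
import socket

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="YOUR_REDPANDA_BROKER:9092",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-256",
    sasl_plain_username="YOUR_USERNAME",
    sasl_plain_password="YOUR_PASSWORD",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def label_transaction(tx, home_location):
    # Mirror the rules above: high amounts, location mismatches and
    # unusual devices are treated as fraud signals.
    high_amount = tx["amount"] &amp;gt; 10000
    location_mismatch = tx["transaction_location"] != home_location
    unusual_device = tx["device"] not in ("Mobile", "Desktop", "Tablet")
    return int(high_amount or location_mismatch or unusual_device)

tx = {
    "amount": round(random.uniform(5, 15000), 2),
    "transaction_location": random.choice(["Malaysia", "Kenya", "Brazil"]),
    "device": random.choice(["Mobile", "Desktop", "Tablet", "Smartwatch"]),
}
tx["is_fraud"] = label_transaction(tx, home_location="Malaysia")

# Keys are hostnames, values are JSON payloads like the one shown above.
producer.send("transactions-data", key=socket.gethostname(), value=tx)
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;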

&lt;h3&gt;
  
  
  2. Kafka Consumers
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a. &lt;code&gt;transaction_consumer&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Built on Spark Structured Streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reads data continuously via &lt;code&gt;readStream()&lt;/code&gt; from the Kafka topic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Defines a schema for the transaction JSON payload (sketched after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Transforms raw Kafka records into a structured DataFrame with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;transaction_id&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;device&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Writes all non-fraud transactions via &lt;code&gt;writeStream()&lt;/code&gt; into the database table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shop.transactions&lt;/code&gt;, used for downstream analytics and model training.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
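
&lt;p&gt;&lt;code&gt;transform_data()&lt;/code&gt; itself isn't shown in the post, so here is a minimal sketch of how a Structured Streaming job typically parses this topic. The schema mirrors the JSON payload shown earlier; the broker address is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of the read-and-parse step. The broker address is a
# placeholder; the schema follows the payload structure shown earlier.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructType

spark = SparkSession.builder.appName("TransactionConsumer").getOrCreate()

schema = (
    StructType()
    .add("user_id", StringType())
    .add("transaction_id", StringType())
    .add("timestamp", StringType())
    .add("first_name", StringType())
    .add("last_name", StringType())
    .add("location", StringType())
    .add("transaction_location", StringType())
    .add("amount", DoubleType())
    .add("device", StringType())
    .add("is_fraud", IntegerType())
)

def transform_data():
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "YOUR_BROKER:9092")
        .option("subscribe", "transactions-data")
        .option("startingOffsets", "latest")
        .load()
    )
    # Kafka values arrive as bytes: cast to string, parse the JSON,
    # then keep only non-fraud rows for shop.transactions.
    return (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), schema).alias("data"))
        .select("data.*")
        .filter(col("is_fraud") == 0)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;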

&lt;h4&gt;
  
  
  b. &lt;code&gt;fraud_consumer&lt;/code&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Also powered by Spark Structured Streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reads from the same Kafka topic but keeps only fraudulent events (&lt;code&gt;is_fraud&lt;/code&gt; = 1) for counterchecking; see the sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Converts timestamps into SQL timestamp type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Persists flagged frauds into the database table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shop.fraud_transactions&lt;/code&gt;, used for fraud analysis, validation, and dashboards.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
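
&lt;p&gt;Under the same assumptions as the sketch above, the fraud-side filter-and-cast step would look roughly like this (&lt;code&gt;parsed_df&lt;/code&gt; stands for the parsed stream before any fraud filter is applied):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of the fraud_consumer's filter-and-cast step.
from pyspark.sql.functions import col, to_timestamp

def filter_frauds(parsed_df):
    # parsed_df: the parsed stream from the previous sketch,
    # before the is_fraud == 0 filter is applied
    return (
        parsed_df.filter(col("is_fraud") == 1)
        .withColumn("timestamp", to_timestamp(col("timestamp")))
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;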

&lt;p&gt;For &lt;code&gt;transaction_consumer&lt;/code&gt;, the code snippet below shows how data is written into Postgres using &lt;code&gt;writeStream()&lt;/code&gt; from PySpark's Structured Streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
transform_data() reads streaming data from the topic and transforms it.
write_as_batch() streams data in micro-batches suitable for writing to Postgres.

&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_transactions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/fraud_detection/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;foreachBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_as_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error streaming data to shop.transactions: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
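
&lt;p&gt;&lt;code&gt;write_as_batch()&lt;/code&gt; is referenced above but not shown. A common way to implement such a micro-batch writer is Spark's JDBC sink; in the sketch below, the connection URL and credentials are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch of write_as_batch(): each micro-batch is appended to
# Postgres via Spark's JDBC sink. URL and credentials are placeholders.
def write_as_batch(batch_df, batch_id):
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/YOUR_DB")
        .option("dbtable", "shop.transactions")
        .option("user", "YOUR_USER")
        .option("password", "YOUR_PASSWORD")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;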



&lt;h3&gt;
  
  
  3. Database and visualization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL acts as the sink for both consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Grafana connects to PostgreSQL for real-time analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transaction trends&lt;/li&gt;
&lt;li&gt;Fraud vs non-fraud comparisons&lt;/li&gt;
&lt;li&gt;Top 5 fraudulent transactions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;A Streamlit application provides ML model performance visualization.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Below are snapshots of the Grafana dashboard and Streamlit visualization. To access the dashboard, please follow this &lt;a href="https://deecodes.grafana.net/public-dashboards/ac5cf512ae80435c8bdb1adad4d1a27c" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk3ea5t3r5wcg3yux21i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk3ea5t3r5wcg3yux21i.png" alt="Grafana dashboard" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkfc3c8tlxmvrym8xi44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkfc3c8tlxmvrym8xi44.png" alt="Streamlit" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8svmoqijw7odvltjaej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8svmoqijw7odvltjaej.png" alt="Streamlit" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Machine Learning, Teaching the System to Spot the Odd Ones Out
&lt;/h3&gt;

&lt;p&gt;Fraudulent behavior often hides in subtle patterns. To detect anomalies, an Isolation Forest model is trained offline on features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transaction amount&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Device frequency (how common a device is across transactions)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Location change (user’s home vs transaction location)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trained model is saved to &lt;code&gt;jobs/isolation_forest.pkl&lt;/code&gt;, and fraud predictions are visualized in Streamlit, providing explainability alongside Grafana’s real-time metrics.&lt;/p&gt;

&lt;p&gt;The code snippet below shows how the model is trained offline by reading data from a PostgreSQL table and saving the fitted model to a &lt;code&gt;.pkl&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# imports 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;device_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction_location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
        &lt;span class="n"&gt;rare_devices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device_freq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;n_estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;contamination&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="c1"&gt;# expecting that 10% is fraudulent 
&lt;/span&gt;
        &lt;span class="n"&gt;iso_forest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IsolationForest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contamination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contamination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;iso_forest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs/isolation_forest.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model already exists at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Overwriting...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iso_forest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Job loaded into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error training model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
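
&lt;p&gt;Once the model is saved, scoring new transactions comes down to loading the &lt;code&gt;.pkl&lt;/code&gt; file and mapping scikit-learn's outputs back to fraud flags: &lt;code&gt;IsolationForest.predict()&lt;/code&gt; returns -1 for anomalies and 1 for inliers. A minimal sketch, assuming a &lt;code&gt;features_df&lt;/code&gt; built with the same three columns as in &lt;code&gt;train_model()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged scoring sketch. features_df is assumed to have the same
# columns as in train_model(): amount, device_freq, location_change.
import joblib

def score_transactions(features_df):
    model = joblib.load("jobs/isolation_forest.pkl")
    preds = model.predict(features_df)  # -1 = anomaly, 1 = inlier
    return (preds == -1).astype(int)    # 1 = predicted fraud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;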



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Fraud detection is more than a data problem; it’s a streaming, scaling, real-time challenge.&lt;/p&gt;

&lt;p&gt;This project showed how Kafka + Spark can handle the firehose of financial events, while PostgreSQL + Grafana turn raw data into live insights. On top of that, machine learning with Isolation Forest gives the system predictive power to flag anomalies before they cause damage.&lt;/p&gt;

&lt;p&gt;What excites me most is how modular this architecture is. Want to replace Isolation Forest with a deep learning model? Easy. Want to scale Kafka across clusters? Done. Want to plug Grafana into Prometheus for infrastructure monitoring? Straightforward.&lt;/p&gt;

&lt;p&gt;With open-source tools, fraud detection at scale is no longer locked away in banks’ back rooms; anyone can build it, understand it, and improve it.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/FraudDetectionSystem" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re curious, clone the repo, run it locally with Docker, and see fraud detection in action in real time. I’d appreciate your feedback!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Real-Time Data Pipeline using Binance Websocket API, PySpark, Kafka and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 04 Aug 2025 15:04:10 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-real-time-data-pipeline-using-binance-websocket-api-pyspark-kafka-and-grafana-59l4</link>
      <guid>https://forem.com/dkkinyua/building-a-real-time-data-pipeline-using-binance-websocket-api-pyspark-kafka-and-grafana-59l4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Ever wondered how data is consumed in real time via WebSockets? In this project, I built a real-time streaming pipeline that captures live market data from Binance WebSockets, processes and stores it, and visualizes it on a Grafana dashboard that auto-refreshes every 30 seconds. The flow is designed with scalability, containerization, and observability in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyr2oyke5hwzvis00sx7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyr2oyke5hwzvis00sx7.jpg" alt="Workflow diagram" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Producer&lt;/strong&gt;: The producer connects to the Binance WebSocket API i.e., 1-minute Kline and order book streams, and pushes the data to Kafka topics hosted on Confluent Cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Topics&lt;/strong&gt;: Hosted on Confluent Cloud, topics named &lt;em&gt;kline_data&lt;/em&gt; and &lt;em&gt;order_book_data&lt;/em&gt; temporarily store the real-time streaming data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Consumer with PySpark&lt;/strong&gt;: The Spark jobs consume data from the Kafka topics using &lt;strong&gt;PySpark Structured Streaming&lt;/strong&gt;, process it, and write it to a PostgreSQL database using &lt;code&gt;readStream()&lt;/code&gt; and &lt;code&gt;writeStream()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL Database&lt;/strong&gt;: The Postgres database stores the cleaned and structured data for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Grafana Dashboard&lt;/strong&gt;: Connects to the PostgreSQL database as a data source and displays real-time insights with a 30-second auto-refresh interval.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Kafka producer and consumer are also containerized using Docker for efficient real-time data streaming, with the scripts running every minute.&lt;/p&gt;

&lt;p&gt;For the project's code and more information, visit the &lt;a href="https://github.com/dkkinyua/BinanceWebsocketPipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kafka hosted on Confluent Cloud for message streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker for the containerization of all services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Binance WebSockets as the real-time data source&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PySpark Structured Streaming for transformation and ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL for data storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana Cloud for visualization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Why use websockets instead of the Binance REST APIs?
&lt;/h3&gt;

&lt;p&gt;When building trading bots, real-time dashboards, or streaming data pipelines from cryptocurrency exchanges like Binance, developers and engineers often face a key question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should I use the Binance REST API or the WebSocket API?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The differences between the two are highlighted below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;REST API&lt;/th&gt;
&lt;th&gt;WebSocket API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Communication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Client initiates one request per data need using &lt;code&gt;requests&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;A persistent, real-time two-way connection using &lt;code&gt;websocket-client&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher latency (request → wait → response); the client waits for each response from the server&lt;/td&gt;
&lt;td&gt;Ultra-low latency as data is pushed as it changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inefficient for frequent updates&lt;/td&gt;
&lt;td&gt;Highly Efficient as there's a single connection for continuous updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One-time data queries, low-frequency polling&lt;/td&gt;
&lt;td&gt;Real-time market data, price tracking, live feeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless (no ongoing link)&lt;/td&gt;
&lt;td&gt;Stateful (persistent connection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High if polled frequently&lt;/td&gt;
&lt;td&gt;Low, since updates are sent only when needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Freshness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snapshot of current state only&lt;/td&gt;
&lt;td&gt;Stream of live updates as events occur&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
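
&lt;p&gt;To make the contrast concrete, polling the spot REST &lt;code&gt;bookTicker&lt;/code&gt; endpoint looks like the snippet below; every iteration is a full request/response round trip, which is exactly the overhead the WebSocket stream avoids:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# REST-style polling: one full HTTP round trip per snapshot.
import time

import requests

URL = "https://api.binance.com/api/v3/ticker/bookTicker"

while True:
    resp = requests.get(URL, params={"symbol": "BTCUSDT"}, timeout=10)
    print(resp.json())  # a single snapshot, already stale on arrival
    time.sleep(5)       # lower the interval and network load grows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;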

&lt;p&gt;For more details on the WebSocket API, visit this &lt;a href="https://developers.binance.com/docs/derivatives/usds-margined-futures/websocket-api-general-info" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have published another blog where I used the Binance REST APIs to extract, transform, and load data into Postgres and Cassandra databases using Change Data Capture (CDC); you can access it &lt;a href="https://dev.to/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka producer and consumer
&lt;/h3&gt;

&lt;p&gt;Our Kafka producers and consumers are built in Python using the &lt;code&gt;confluent-kafka&lt;/code&gt; package, whose &lt;code&gt;confluent_kafka.Producer&lt;/code&gt; and &lt;code&gt;confluent_kafka.Consumer&lt;/code&gt; clients connect to Confluent Cloud to produce data to and consume data from our Kafka topics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's produce data to a Kafka topic, &lt;strong&gt;order_book_data&lt;/strong&gt;.
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;First, install the &lt;code&gt;confluent-kafka&lt;/code&gt; and other necessary packages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;confluent-kafka websocket-client rel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Import the &lt;code&gt;confluent_kafka.Producer&lt;/code&gt; to create a Producer client.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Producer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebSocketApp&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_BOOTSTRAP_SERVER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.username&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;security.protocol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sasl.mechanisms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To stream data from WebSockets, use the &lt;code&gt;websocket-client&lt;/code&gt; package. Define the helper functions &lt;code&gt;on_open&lt;/code&gt;, &lt;code&gt;on_close&lt;/code&gt;, &lt;code&gt;on_error&lt;/code&gt;, and &lt;code&gt;on_message&lt;/code&gt;. The &lt;code&gt;on_message&lt;/code&gt; function consumes data from the WebSocket and sends the resulting payload to the Kafka topic, as shown below.
For more information on consuming data from WebSockets, visit the &lt;a href="https://pypi.org/project/websocket-client/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# helper functions
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order book websocket open for connections...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_status_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order book connection closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_status_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;There is an error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestbidprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestbidqty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestaskprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bestaskqty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eventtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactiontime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_book_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book_data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data sent to order_book_data topic successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error producing data to topic: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
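
&lt;p&gt;If you want positive confirmation instead of the optimistic print above, &lt;code&gt;confluent-kafka&lt;/code&gt; supports per-message delivery callbacks, which &lt;code&gt;producer.poll(0)&lt;/code&gt; services asynchronously. An optional addition to the snippet above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Optional delivery callback: poll(0) services it asynchronously, so the
# print fires only after the broker has acknowledged the message.
def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

# reuses producer and book_data from the on_message snippet above
producer.produce(
    'order_book_data',
    json.dumps(book_data).encode('utf-8'),
    callback=delivery_report,
)
producer.poll(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;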



&lt;ul&gt;
&lt;li&gt;Let's define a function &lt;code&gt;get_data()&lt;/code&gt; which configures &lt;code&gt;WebSocketApp&lt;/code&gt;. The &lt;strong&gt;Registered Event Listener&lt;/strong&gt; &lt;code&gt;rel&lt;/code&gt; package is a cross-platform asynchronous event dispatcher primarily designed for network applications and is used with &lt;code&gt;websocket-client&lt;/code&gt; to handle events as they occur during connection.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@bookTicker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# explained in the README 
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/stream?streams=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebSocketApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_open&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_close&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_forever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reconnect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;NOTE: Using &lt;code&gt;streams = '/'.join([f"{symbol}@bookTicker" for symbol in symbols])&lt;/code&gt; is good practice: it combines all symbols into a single multi-stream request instead of one request per symbol, which the WebSocket API supports and which is far more efficient. The request URL then looks like this: &lt;code&gt;wss://fstream.binance.com/stream?streams=btcusdt@bookTicker/dotusdt@bookTicker/avaxusdt@bookTicker...&lt;/code&gt;. Check the Binance &lt;a href="https://developers.binance.com/docs/derivatives/usds-margined-futures/websocket-api-general-info" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more information.&lt;/p&gt;
&lt;/blockquote&gt;
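
&lt;p&gt;For example, with three symbols the combined stream path expands like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# How the combined stream path is built for multiple symbols.
symbols = ["btcusdt", "ethusdt", "solusdt"]
streams = '/'.join(f"{symbol}@bookTicker" for symbol in symbols)
print(streams)
# btcusdt@bookTicker/ethusdt@bookTicker/solusdt@bookTicker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;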

&lt;ul&gt;
&lt;li&gt;Let's use &lt;code&gt;if __name__ == '__main__'&lt;/code&gt; to run the producer whenever the script is executed directly. &lt;code&gt;producer.flush()&lt;/code&gt; flushes all pending messages to the topic when the connection terminates, preventing data loss.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flushing all pending messages...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is received in our topics as shown in the snapshots below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqrtlp2j92p7ymj3rmpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqrtlp2j92p7ymj3rmpd.png" alt="Kline snapshot" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi736tfpgrp299pwmftcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi736tfpgrp299pwmftcc.png" alt="Order Book Data snapshot" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's consume data from the Kafka topic and stream data to a Postgres database using PySpark's Structured Streaming.
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Install the necessary packages required.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Import necessary modules and functions, and configure Spark for our application using &lt;code&gt;SparkSession&lt;/code&gt; from &lt;code&gt;pyspark.sql&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BooleanType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkPostgresConsumer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local[*]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spark.ui.port&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4041&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.jars.packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql:postgresql:42.7.7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a schema and its data types for our DataFrame. This lets Spark know beforehand which fields and data types the incoming messages contain.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;klineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BooleanType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numTrades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closeTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define our streaming DataFrame using &lt;code&gt;readStream&lt;/code&gt;, and configure the Kafka connection settings.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_TOPIC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startingOffsets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.security.protocol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.mechanism&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.jaas.config&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;org.apache.kafka.common.security.plain.PlainLoginModule required username=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; password=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_CONFLUENT_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The data coming from the topic is not in our ideal structure, as the payload we want is inside the &lt;code&gt;value&lt;/code&gt; column. We need to cast it to a string and parse it as JSON.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code 
&lt;/span&gt;    &lt;span class="n"&gt;json_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAST (value AS STRING) as parsed_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parsed_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now we can run our transformations on &lt;code&gt;json_df&lt;/code&gt;, the DataFrame holding the parsed JSON data, using Spark's SQL functions (&lt;code&gt;col&lt;/code&gt;, &lt;code&gt;cast&lt;/code&gt;, &lt;code&gt;alias&lt;/code&gt;) inside our Python code. After running the transformations, return the transformed DataFrame for micro-batching.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numTrades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_trades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;klineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isKlineClosed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closeTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starttime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a helper function &lt;code&gt;write_each_batch()&lt;/code&gt; which writes data into the Postgres database in micro-batches, which is ideal for streaming.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_each_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DB_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;websocket.kline_1m_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error writing batch data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a function &lt;code&gt;write_to_db&lt;/code&gt; which streams data into the Postgres database using &lt;code&gt;writeStream&lt;/code&gt; with &lt;code&gt;foreachBatch(write_each_batch)&lt;/code&gt;, appending to the table if it already exists in our database.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/binance_streaming/kline_stream_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d_%H-%M-%S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;foreachBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_each_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into Postgres successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading data to db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is now streaming into Postgres successfully. Run a &lt;code&gt;SELECT * FROM websocket.kline_1m_data&lt;/code&gt; query to check.&lt;/p&gt;
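
&lt;p&gt;If you prefer verifying from Python instead of a SQL client, below is a minimal sketch using pandas and SQLAlchemy. Note that it assumes a standard &lt;code&gt;postgresql://&lt;/code&gt; connection string in a hypothetical &lt;code&gt;PSQL_URI&lt;/code&gt; environment variable, not the JDBC-style &lt;code&gt;DB_URL&lt;/code&gt; used by Spark above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# minimal verification sketch; PSQL_URI is a hypothetical env var holding
# a 'postgresql://user:pass@host:5432/db' string (not the JDBC DB_URL above)
import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.getenv("PSQL_URI"))

# peek at a few rows written by the streaming job
check_df = pd.read_sql("SELECT * FROM websocket.kline_1m_data LIMIT 10", con=engine)
print(check_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;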

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4kmph49qabuq75evl5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4kmph49qabuq75evl5k.png" alt="Database snapshot" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Containerization using Docker and Docker Compose.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt; is an open-source containerization platform that lets you package your application and its dependencies into containers: lightweight, standalone units that can run anywhere, regardless of the host system.&lt;/p&gt;

&lt;p&gt;Think of a container like a "mini virtual machine", but faster and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Compose&lt;/strong&gt; is a tool for defining and running multi-container Docker applications.&lt;/p&gt;

&lt;p&gt;You need to install Docker for this project. Follow this &lt;a href="https://docs.docker.com/engine/install/" rel="noopener noreferrer"&gt;link&lt;/a&gt; to install Docker.&lt;/p&gt;

&lt;p&gt;For this project, different services have been defined into containers to run individually, which are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kline-producer&lt;/code&gt;: Producer for the 1-minute klines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kline-consumer&lt;/code&gt;: Consumer for the 1-minute klines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;book-producer&lt;/code&gt;: Producer for the order book data&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;book-consumer&lt;/code&gt;: Consumer for the order book data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;First, define a &lt;code&gt;Dockerfile&lt;/code&gt; in the root of your project. A Dockerfile contains the instructions Docker needs to build the image your project runs in.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# imports the Python image from docker hub&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10-slim&lt;/span&gt;

&lt;span class="c"&gt;# installs java for pyspark to run smoothly&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;openjdk-17-jre-headless &lt;span class="se"&gt;\
&lt;/span&gt;curl &lt;span class="se"&gt;\
&lt;/span&gt;gnupg &lt;span class="se"&gt;\
&lt;/span&gt;apt-transport-https &lt;span class="se"&gt;\
&lt;/span&gt;ca-certificates &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;apt-get clean &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# sets java environment vars&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="$JAVA_HOME/bin:$PATH"&lt;/span&gt;

&lt;span class="c"&gt;# copies all contents into a folder app/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /app/&lt;/span&gt;

&lt;span class="c"&gt;# sets app/ as the current working directory, same as cd in linux&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# installs all python packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file in the project root, which defines our different services.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kline-producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/kline_producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;

    &lt;span class="na"&gt;kline-consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/kline_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;

    &lt;span class="na"&gt;book-producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/book_producer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;

    &lt;span class="na"&gt;book-consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3 scripts/book_consumer.py&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_SECRET_KEY=${KAFKA_SECRET_KEY}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;KAFKA_API_KEY=${KAFKA_API_KEY}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BOOTSTRAP_SERVER=${BOOTSTRAP_SERVER}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_URL=${DB_URL}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To boot up our containers, run the following command in your terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If you need to stop the containers at any time, press &lt;code&gt;Ctrl + C&lt;/code&gt; in the terminal where Compose is running, or run the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below are some snapshots from Docker Desktop and the terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzw8tgzhejshj75nkcj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzw8tgzhejshj75nkcj8.png" alt="Docker desktop snapshot" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gt4u0r4s0co8b7zixjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gt4u0r4s0co8b7zixjd.png" alt="terminal snapshot" width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization in Grafana
&lt;/h3&gt;

&lt;p&gt;The data is then visualized on a Grafana dashboard with a 30-second auto-refresh interval so the dashboard always displays fresh data.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uk1xu6a1q5ko1hjdcw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uk1xu6a1q5ko1hjdcw3.png" alt="PDF snapshot" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2wz4k13894sg1edguwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2wz4k13894sg1edguwo.png" alt="Live snapshot" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project showcases how to build a real-time, containerized data pipeline using Binance WebSockets, Kafka, PySpark, PostgreSQL, and Grafana. By leveraging WebSockets for low-latency streaming and Docker for seamless deployment, we created a scalable and efficient system for live crypto data processing and visualization.&lt;/p&gt;

&lt;p&gt;Be sure to check out the GitHub repository for the full codebase, and don’t forget to share your feedback or fork the project to make it your own.&lt;/p&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>data</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Using Data Engineering to Track Food Prices and Inflation in Kenya from 2006 to 2025</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Wed, 23 Jul 2025 09:48:24 +0000</pubDate>
      <link>https://forem.com/dkkinyua/using-data-engineering-to-track-food-prices-and-inflation-in-kenya-from-2006-to-2025-47ja</link>
      <guid>https://forem.com/dkkinyua/using-data-engineering-to-track-food-prices-and-inflation-in-kenya-from-2006-to-2025-47ja</guid>
      <description>&lt;p&gt;Rising food prices continue to be a critical issue affecting many households in Kenya. As a data engineer passionate about impactful projects, I built a data pipeline to collect, model, and analyze food price and inflation trends in Kenya, shedding light on various patterns from 2006 to 2024.&lt;/p&gt;

&lt;p&gt;By building this pipeline, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor inflation’s impact on staple food prices in various areas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provide data-backed insights to policy makers, NGOs, and supply chain analysts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve food security planning through trend forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Objectives
&lt;/h2&gt;

&lt;p&gt;To build a scalable data pipeline and star-schema data warehouse that analyzes historical food prices across Kenyan markets and derives insights into inflation trends, particularly for food categories such as vegetables, fruits, meat and eggs, and milk and dairy, among others.&lt;/p&gt;

&lt;p&gt;The GitHub repository for this project can be accessed &lt;a href="https://github.com/dkkinyua/FoodPricesMonitoring" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Source, Food Prices&lt;/strong&gt;: &lt;a href="https://data.humdata.org/dataset/kenya-food-prices" rel="noopener noreferrer"&gt;Humanitarian Data Exchange (HDX)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Source, Inflation&lt;/strong&gt;: &lt;a href="https://api.worldbank.org/v2/country/ke/indicator/FP.CPI.TOTL" rel="noopener noreferrer"&gt;World Bank Indicator API v2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline&lt;/strong&gt;: Python + &lt;code&gt;pandas&lt;/code&gt; for extraction &amp;amp; transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL with dimensional modeling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration and Automation&lt;/strong&gt;: Apache Airflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slideshow presentation&lt;/strong&gt;: MS Powerpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;The flowchart below shows the project architecture flow from start to finish, with the tech stack used in every step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4kfhsty5p1ppektfi1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4kfhsty5p1ppektfi1v.png" alt="Project Architecture Diagram" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Architecture
&lt;/h2&gt;

&lt;p&gt;The database design is a star schema: one central fact table surrounded by dimension tables. The fact table contains the measurements plus the foreign keys that map to each dimension table, while the dimension tables hold the descriptive attributes, i.e. the units of analysis of the data. For more information on different schemas, visit &lt;a href="https://medium.com/@StefanoMeloccaro/data-warehouse-schemas-explained-star-snowflake-galaxy-d2c508d9ac5f" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog.&lt;/p&gt;
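
&lt;p&gt;As a minimal sketch of what this looks like in practice, the fact table references each dimension through a foreign key. The &lt;code&gt;fact_prices&lt;/code&gt; name and exact column set below are illustrative (the real DDL lives in the repository), and the connection reuses the &lt;code&gt;PSQL_URI&lt;/code&gt; environment variable from the loading step later in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# illustrative star-schema DDL; table/column names are a sketch, not the repo's exact schema
import os

from sqlalchemy import create_engine, text

engine = create_engine(os.getenv("PSQL_URI"))

ddl = """
CREATE TABLE IF NOT EXISTS foodprices.dim_market (
    market_id INT PRIMARY KEY,
    province  TEXT,
    county    TEXT,
    latitude  FLOAT,
    longitude FLOAT
);

-- the fact table stores the measurements plus foreign keys into each dimension
CREATE TABLE IF NOT EXISTS foodprices.fact_prices (
    date         DATE,
    price        FLOAT,
    usdprice     FLOAT,
    market_id    INT REFERENCES foodprices.dim_market (market_id),
    commodity_id INT
);
"""

with engine.begin() as conn:
    conn.execute(text(ddl))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;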

&lt;p&gt;The database structure diagram below shows how the schema is set up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiqqnrfiu3otaepvl8j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiqqnrfiu3otaepvl8j9.png" alt="Database Diagram" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline Overview
&lt;/h2&gt;

&lt;p&gt;This project follows a batch data pipeline that extracts food price and inflation data from APIs, transforms it for consistency, and loads it into a PostgreSQL database for analysis and downstream use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Data Extraction.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monthly food price and inflation data is downloaded in CSV format from the relevant APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only relevant fields are retained, such as geodata (latitude, longitude), commodities, dates, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of a code snippet to extract data from the HDX Kenya data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://data.humdata.org/dataset/e0d3fba6-f9a2-45d7-b949-140c455197ff/resource/517ee1bf-2437-4f8c-aa1b-cb9925b9d437/download/wfp_food_prices_ken.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading data from CSV file: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b. Data Transformation
&lt;/h3&gt;

&lt;p&gt;Below are some of the transformations applied to the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clean inconsistent units and datatypes, e.g. kgs, liters, floats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalize market names and commodity labels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fill in missing dates and reformat for time-series alignment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter out years with no data, e.g. 2006–2018 for the vegetables category.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is a code snippet for transforming food data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usdprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usdprice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commodity_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commodity_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;admin1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;province&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;admin2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;county&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading dataframe from previous task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  c. Loading data into a PostgreSQL database
&lt;/h3&gt;

&lt;p&gt;After transformation, our data is ready for loading into the database. The dimension tables are loaded first and the fact table last, so that every foreign key in the fact table references an existing dimension row.&lt;/p&gt;

&lt;p&gt;Below is a code snippet for loading market data into the market dimension table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loading_market_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;market_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;market_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;province&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;county&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;market_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PSQL_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;market_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dim_market&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;foodprices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into PostgreSQL successfully&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into dim_market table: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading clean dataframe from previous task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
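
&lt;p&gt;For completeness, here is a minimal sketch of how the fact-table load could follow the dimension loads. The table and column names (&lt;code&gt;fact_prices&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;) are illustrative rather than the exact schema; the repository holds the definitive version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def loading_fact_data(clean_df):
    # Illustrative column list; adjust to the actual fact-table schema
    fact_df = clean_df[['market_id', 'commodity_id', 'date', 'price']]
    engine = create_engine(os.getenv("PSQL_URI"))
    try:
        # Dimensions are loaded first, so these foreign keys resolve
        fact_df.to_sql('fact_prices', con=engine, index=False, schema='foodprices', if_exists='append')
        print("Data loaded into fact_prices successfully")
    except Exception as e:
        print(f"Error loading into fact_prices table: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;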



&lt;h3&gt;
  
  
  d. Workflow Automation using Airflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Airflow DAGs are used to schedule the pipeline to run monthly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each task is logged, and failures are handled through retries and error alerts (see the &lt;code&gt;default_args&lt;/code&gt; sketch below).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logs are also tracked to preserve data lineage across the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Airflow is then deployed to production on an Azure Virtual Machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
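
&lt;p&gt;The DAG below references a &lt;code&gt;default_args&lt;/code&gt; dictionary. A minimal sketch of what it can contain is shown here; the retry counts and addresses are illustrative, not the project's exact values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import timedelta

# Illustrative defaults: retry failed tasks and alert by email
default_args = {
    'owner': 'airflow',
    'retries': 2,                         # retry each failed task twice
    'retry_delay': timedelta(minutes=5),  # wait five minutes between retries
    'email_on_failure': True,             # needs SMTP configured for Airflow
    'email': ['you@example.com'],         # hypothetical recipient address
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;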

&lt;p&gt;Below is an example of the DAG used to automate the inflation data pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@monthly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.worldbank.org/v2/country/ke/indicator/FP.CPI.TOTL?format=json&amp;amp;date=2006:2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;inflation_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cpi_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cpi_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;inflation_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error extracting CPI data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inflation_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_inflation_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PSQL_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inflation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into inflation_data successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into inflation_data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                Inflation pipeline DAG has run successfully, please check it out
                &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inflation Pipeline DAG Notification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;sender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SENDER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;receipients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RECEIPIENT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;pwd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MIMEText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;receipients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SMTP_SSL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smtp.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;465&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pwd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;smtp_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receipients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email sent successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sending email to receipient error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_inflation_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;email_sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;email_sent&lt;/span&gt;

&lt;span class="n"&gt;inflation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inflation_pipeline_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  e. Visualization and analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformed data is analyzed in Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trends such as monthly price inflation, food price differences per county, and the average price change of different food categories over time are tracked (a sample query is sketched below).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
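
&lt;p&gt;Grafana runs SQL against the Postgres database for each panel. As an illustration, here is the kind of aggregation behind a county-level price panel, executed from Python; the &lt;code&gt;fact_prices&lt;/code&gt; table name is the hypothetical one from the load sketch above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.getenv("PSQL_URI"))

# Average commodity price per county, the kind of series a panel plots
query = """
    SELECT m.county, AVG(f.price) AS avg_price
    FROM foodprices.fact_prices f
    JOIN foodprices.dim_market m ON m.market_id = f.market_id
    GROUP BY m.county
"""
df = pd.read_sql(query, con=engine)
print(df.sort_values('avg_price', ascending=False).head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;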

&lt;p&gt;To explore the Grafana dashboard, follow this &lt;a href="https://deecodes.grafana.net/goto/5YR4POQHR?orgId=1" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;p&gt;For most food categories, prices rose sharply from the start of 2022 to the end of 2023. Several factors likely contributed to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Currency depreciation&lt;/strong&gt;: As the inflation chart shows, inflation in Kenya quadrupled between 2020 and 2023, which likely contributed to the spike in food prices around 2022.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drought and other climatic conditions&lt;/strong&gt;: In 2022, Kenya experienced below-average rainfall, particularly in the northeastern regions, which suffered prolonged drought. The year was also marked by significant temperature increases, making it one of the warmest on record. This likely reduced food production and pushed prices up, especially in arid and semi-arid areas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High input costs&lt;/strong&gt;: The Russian-Ukrainian war disrupted key inputs such as fuel, which the agricultural sector depends on to transport food from farms to markets, driving food prices higher.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges faced during building this pipeline
&lt;/h2&gt;

&lt;h4&gt;
  
  
  a. Missing data
&lt;/h4&gt;

&lt;p&gt;Some food categories and commodities were missing data for certain time periods; for example, the vegetables and fruits category had no data from 2006 to 2018.&lt;/p&gt;

&lt;h4&gt;
  
  
  b. Data Normalization
&lt;/h4&gt;

&lt;p&gt;Inconsistent data had to be normalized into consistent units of mass, price, etc.&lt;br&gt;
Some commodities were priced in kilograms and others in 50-kilogram bags, so the data had to be normalized for easier analysis.&lt;/p&gt;
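
&lt;p&gt;A minimal sketch of that normalization, assuming hypothetical &lt;code&gt;unit&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt; columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Convert every commodity to a price per kilogram (names are illustrative)
unit_to_kg = {'KG': 1, '50 KG': 50, '90 KG': 90}

df['kg'] = df['unit'].map(unit_to_kg)
df = df.dropna(subset=['kg'])               # drop units we cannot convert
df['price_per_kg'] = df['price'] / df['kg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;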

&lt;h4&gt;
  
  
  c. Lack of data sources
&lt;/h4&gt;

&lt;p&gt;Data sources were lacking for some Kenyan counties and commodities, so the analysis was limited to the commodities with available data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project was a deep dive into how data engineering can be used for social impact. If you're working on something similar or want to collaborate, feel free to connect!&lt;/p&gt;

&lt;p&gt;Also, like, comment and follow me for more data engineering content.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>A Real-Time Earthquake Monitoring Pipeline with Kafka, MySQL, PostgreSQL, and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 14 Jul 2025 13:37:08 +0000</pubDate>
      <link>https://forem.com/dkkinyua/a-real-time-earthquake-monitoring-pipeline-with-kafka-mysql-postgresql-and-grafana-22cd</link>
      <guid>https://forem.com/dkkinyua/a-real-time-earthquake-monitoring-pipeline-with-kafka-mysql-postgresql-and-grafana-22cd</guid>
      <description>&lt;p&gt;In this project, I designed and built an end-to-end real-time data pipeline that monitors earthquakes from the USGS Earthquake API. The pipeline extracts and loads data into a MySQL database, captures changes via a MySQL Debezium CDC connector, streams it through into a Kafka topic, sinks it into PostgreSQL, and visualizes it in Grafana Cloud. The workflow is automated using Apache Airflow to run hourly.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/EarthquakeMonitoringPipeline" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Workflow Chart
&lt;/h2&gt;

&lt;p&gt;This pipeline ensures that earthquake data is extracted, streamed, stored, and visualized in near real-time, fully automated and orchestrated with Apache Airflow to run hourly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut1iciuhy9hk89g8zwk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut1iciuhy9hk89g8zwk2.png" alt="Project Workflow" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Earthquake Data Extraction from the USGS Earthquake API
&lt;/h3&gt;

&lt;p&gt;The pipeline begins by calling the USGS Earthquake GeoJSON API, which provides up-to-date global earthquake data. The script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulls earthquake events for the past 24 hours&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extracts key fields: magnitude, place, time, alert, coordinates, and more&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Triggers an email alert to Gmail if the quake magnitude ≥ 4.5&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This extraction script is written in Python using &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;smtplib&lt;/code&gt; (for sending emails via SMTP), &lt;code&gt;sqlalchemy&lt;/code&gt;, and &lt;code&gt;pymysql&lt;/code&gt;.&lt;/p&gt;
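
&lt;p&gt;A minimal sketch of the extraction and alert check, assuming the public USGS daily GeoJSON feed; the SMTP call itself is omitted for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"

response = requests.get(URL)
response.raise_for_status()
events = response.json()["features"]

for event in events:
    props = event["properties"]
    mag = props.get("mag")
    if mag is not None and mag &amp;gt;= 4.5:
        # An email alert would be sent here via smtplib
        print(f"ALERT: M{mag} at {props['place']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;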

&lt;h3&gt;
  
  
  2. Loading into MySQL Database
&lt;/h3&gt;

&lt;p&gt;The extracted data is then cleaned and loaded into a MySQL database table called &lt;code&gt;earthquake_data&lt;/code&gt;. This serves as the staging area where new and changed data is logged before flowing downstream; a minimal load sketch follows the schema list below.&lt;/p&gt;

&lt;p&gt;The schema includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event ID&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Place and location coordinates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Magnitude and depth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alert level&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timestamps (time, updated)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;USGS metadata like URL, IDs, and status&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
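
&lt;p&gt;A minimal sketch of the load step, assuming a pandas dataframe &lt;code&gt;df&lt;/code&gt; holding the fields above and a &lt;code&gt;MYSQL_URI&lt;/code&gt; environment variable (both names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from sqlalchemy import create_engine

# pymysql is the driver behind a mysql+pymysql:// connection URI
engine = create_engine(os.getenv("MYSQL_URI"))

try:
    df.to_sql('earthquake_data', con=engine, index=False, if_exists='append')
    print("Earthquake data loaded into MySQL successfully")
except Exception as e:
    print(f"Error loading into earthquake_data: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;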

&lt;p&gt;Below is a snapshot of the data in a MySQL database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo5f8kn4wueuyytg8b4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo5f8kn4wueuyytg8b4r.png" alt="MySQL database" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Change Data Capture with MySQL Debezium Connector
&lt;/h3&gt;

&lt;p&gt;To stream changes in the database in real time, the Debezium MySQL CDC Connector on Confluent Cloud listens to the binlog (binary log) of the MySQL database. It captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;New earthquake events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updates (e.g., revised magnitudes or alert levels)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deletions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes are published to a Kafka topic in &lt;code&gt;JSON_SR&lt;/code&gt; (JSON Schema Registry) format. This is very important as our Postgres Sink connector only accepts input in &lt;code&gt;JSON_SR&lt;/code&gt; and not in &lt;code&gt;JSON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the MySQL CDC connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74xpu62w3d9jdk7f3ble.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74xpu62w3d9jdk7f3ble.png" alt="MySQL connector" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Kafka Streams Earthquake Events to a PostgreSQL Sink
&lt;/h3&gt;

&lt;p&gt;Next, the PostgreSQL Sink Connector subscribes to the same Kafka topic. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reads the &lt;code&gt;after&lt;/code&gt; field from the change event&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deserializes the JSON_SR format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inserts or updates the data into a PostgreSQL table for analytics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;TIP: You must configure the connector to match table naming formats and use primary key fields to support upserts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below is a snapshot of the Postgres Sink Connector and the data in the Postgres sink:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi3ty7ik9ae8zo87upzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi3ty7ik9ae8zo87upzj.png" alt="Postgres Connector" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe21mplshsvogxg4e2xp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe21mplshsvogxg4e2xp2.png" alt="Postgres Sink" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Real-Time Analytics &amp;amp; Visualization in Grafana
&lt;/h3&gt;

&lt;p&gt;The final PostgreSQL database now holds real-time earthquake records, ready to be queried.&lt;/p&gt;

&lt;p&gt;Using Grafana Cloud, several dashboards were created, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Quakes Per Minute – a time series of seismic activity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alerts by Magnitude – categorize quakes into green/yellow/orange/red&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive World Map – plot quakes by longitude/latitude using heat maps or markers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top 5 Quake Locations – based on frequency&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to interact with the dashboard, follow this &lt;a href="https://deecodes.grafana.net/public-dashboards/d71fa80f57fb4d68b98a04197f2c44b6" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below is a snapshot of the dashboard on Grafana Cloud:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9gyn3b3oihbkl47smy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9gyn3b3oihbkl47smy4.png" alt="Grafana Dash" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Automated Hourly Execution with Apache Airflow
&lt;/h3&gt;

&lt;p&gt;The entire process is scheduled and monitored using Apache Airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A DAG (Directed Acyclic Graph) defines the workflow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Runs every hour to ensure near real-time data ingestion, and sends an email whether a DAG run succeeds or fails&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow provides visibility into task failures, retries, and pipeline health via its web UI.&lt;/p&gt;

&lt;p&gt;Below is a successful DAG run for the pipeline and email alert task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08my62jts06o7m2m8wfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08my62jts06o7m2m8wfa.png" alt="DAG run" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why are the connectors running on Confluent Cloud?
&lt;/h2&gt;

&lt;p&gt;Running our Debezium CDC connector and sink connector on Confluent Cloud offers many advantages over managing connectors manually on self-managed Kafka. The table below summarizes them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Confluent Cloud&lt;/th&gt;
&lt;th&gt;Self-Managed Kafka&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed by Confluent, no need to manage servers&lt;/td&gt;
&lt;td&gt;Requires manual setup, scaling, and maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connector Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy UI for deploying and managing connectors&lt;/td&gt;
&lt;td&gt;Manual deployment and monitoring required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seamless automatic scaling based on workloads&lt;/td&gt;
&lt;td&gt;Must manually configure and monitor scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in enterprise-grade security (encryption, ACLs)&lt;/td&gt;
&lt;td&gt;Responsible for securing infrastructure yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Schema Registry with simple integration&lt;/td&gt;
&lt;td&gt;Needs separate setup and maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rich monitoring dashboards and alerting included&lt;/td&gt;
&lt;td&gt;Need to set up your own monitoring stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Upgrades &amp;amp; Patching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handled automatically by Confluent&lt;/td&gt;
&lt;td&gt;You are responsible for patching/upgrading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Availability &amp;amp; Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SLA-backed uptime with high availability&lt;/td&gt;
&lt;td&gt;Need to architect HA setups manually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faster Time to Market&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quick setup, lets you focus on building data pipelines&lt;/td&gt;
&lt;td&gt;Slower due to infrastructure and connector setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access to Confluent support and managed services&lt;/td&gt;
&lt;td&gt;Community support or paid enterprise support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project enhanced my hands-on skills in API data extraction, real-time streaming with Kafka, and Change Data Capture (CDC) using Debezium. I gained practical experience building end-to-end data pipelines, automating workflows with Airflow, and creating real-time dashboards in Grafana. Overall, it was a great opportunity to apply data engineering concepts to a real-world monitoring use case.&lt;/p&gt;

&lt;p&gt;If you have any questions, don't hesitate to reach out. Please leave a like, comment and follow for more informative data engineering content!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>data</category>
    </item>
    <item>
      <title>Building a Real-Time Crypto Pipeline with Binance APIs, PostgreSQL, Debezium, Kafka, Spark &amp; Cassandra</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 04 Jul 2025 18:57:11 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl</link>
      <guid>https://forem.com/dkkinyua/building-a-real-time-crypto-pipeline-with-binance-apis-postgresql-debezium-kafka-spark--5gbl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog post, I’ll walk you through a &lt;strong&gt;real-time crypto data pipeline&lt;/strong&gt; I built using a powerful stack:&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;PostgreSQL&lt;/strong&gt; for structured storage, used as the data source&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Debezium + Confluent Kafka&lt;/strong&gt; for Change Data Capture (CDC)&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Apache Spark&lt;/strong&gt; for stream processing&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Apache Cassandra&lt;/strong&gt; for fast NoSQL storage, hosted on an Azure Ubuntu virtual machine and used as the data sink.&lt;/p&gt;

&lt;p&gt;This architecture enables near real-time data extraction and transformation, ideal for live dashboards, analytics, distributed systems, and downstream uses such as building machine learning models.&lt;/p&gt;

&lt;p&gt;This project is divided into four phases:&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 1.
&lt;/h3&gt;

&lt;p&gt;Extracting data from the Binance API endpoints, transforming and loading the data into a PostgreSQL database using Python scripts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 2.
&lt;/h3&gt;

&lt;p&gt;Using a Debezium source connector to capture changes from the PostgreSQL database into a Kafka topic that stores the change logs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 3.
&lt;/h3&gt;

&lt;p&gt;Using a Spark Streaming job as a sink to send the replicated data from the Kafka topics into the Cassandra keyspace &lt;code&gt;binance_test&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 4.
&lt;/h3&gt;

&lt;p&gt;Visualizing the data in Apache Cassandra on a Grafana dashboard using a PostgreSQL bridge to pull data from Cassandra.&lt;/p&gt;

&lt;p&gt;For more technical details on the project, check the linked &lt;a href="https://github.com/dkkinyua/BinanceCryptoPipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Change Data Capture (CDC)?
&lt;/h2&gt;

&lt;p&gt;Change Data Capture (CDC) is a technique used in data pipelines to identify and track changes made to data in a database. It captures these changes, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt;, and makes them available for use in other systems, often in real-time or near real-time. CDC helps improve data efficiency and consistency across systems by only replicating the changed data, rather than the entire dataset. &lt;/p&gt;

&lt;p&gt;There are three types of CDC mechanisms namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log-based CDC&lt;/strong&gt;: This captures changes by reading the database’s transaction logs, which are used internally to ensure durability and consistency. For example, PostgreSQL uses Write-Ahead Logs (WALs) to record all changes before they're applied to the actual database. Tools like Debezium tap into these logs using logical decoding and convert them into structured JSON messages. These include:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;before&lt;/code&gt;: The state of the record before the change.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;after&lt;/code&gt;: The new state after the change.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;source&lt;/code&gt;: Metadata about the origin of the change e.g., database, table, LSN, timestamp.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of a log captured from Postgres into a Kafka Topic via a Debezium connector in JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BTCUSDT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61550.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.5.0.Final"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgresql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ts_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"snapshot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crypto_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sequence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[null,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;240300112&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latest_prices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"txId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;78321&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lsn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;240300112&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xmin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ts_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1724822831210&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query-based CDC&lt;/strong&gt;: This method involves periodically querying the source database to identify changes, usually by comparing the last updated timestamps or using &lt;code&gt;ROWVERSION&lt;/code&gt; fields.
Below is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2025-07-02 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger-based CDC&lt;/strong&gt;: This method uses database triggers to detect changes and write them to an audit/log table. An example includes:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;audit_changes&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;log_user_changes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;log_user_changes()&lt;/code&gt; would write the old and new row values to a separate &lt;code&gt;user_audit_log&lt;/code&gt; table which will store the changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Requirements
&lt;/h2&gt;

&lt;p&gt;This project requires the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;python 3.x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pyspark 3.5.x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cassandra-driver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;confluent-kafka&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sqlalchemy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;psycopg2-binary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Confluent Cloud cluster&lt;/li&gt;
&lt;li&gt;Debezium's PostgresCDCConnectorV2 connector, available on Confluent Cloud&lt;/li&gt;
&lt;li&gt;A Microsoft Azure Virtual Machine to host Apache Cassandra (you can also host it locally)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7b24y8leracspsrb9xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7b24y8leracspsrb9xx.png" alt="Project Workflow" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1.
&lt;/h3&gt;

&lt;p&gt;The Python script extracts and transforms five types of data from the Binance REST API endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Recent Trades&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Order Book Depth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;30-minute Klines / Candlesticks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;24h Daily Ticker prices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latest Prices&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of the extraction, transformation and loading of the &lt;code&gt;latest_prices&lt;/code&gt; data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;coin_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/v3/ticker/price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;coin_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching recent trades: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coin_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latest_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;binance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest price data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading latest prices to db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After transformation, it loads the data into the PostgreSQL database's &lt;code&gt;binance&lt;/code&gt; schema. The PostgreSQL tables created are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.recent_trades&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.order_book&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.klines&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.daily_ticker&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binance.latest_prices&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Change Data Capture with Confluent Cloud (Debezium)
&lt;/h3&gt;

&lt;p&gt;Head over to Confluent Cloud; if you do not have an account, create a free one to claim $400 in free credits, then create an environment and a cluster. You may be prompted to add a payment method during setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y8yofe0sfm1nduj2dxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y8yofe0sfm1nduj2dxx.png" alt="Cluster Setup" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating a cluster, select &lt;strong&gt;Connectors&lt;/strong&gt; on the left-hand side, search for the &lt;strong&gt;Postgres CDC Source V2 (Debezium)&lt;/strong&gt; connector, and start the setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylzg0mrhgedfpddholtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylzg0mrhgedfpddholtn.png" alt="CDC Setup" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwovfzlhgs7cs97fj917.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwovfzlhgs7cs97fj917.png" alt="CDC Setup" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the connector receiving data from the Postgres database and storing it in Kafka topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqep0pcdabt0r9iuud0j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqep0pcdabt0r9iuud0j5.png" alt="Connecting to Kafka" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Streaming from Kafka into Cassandra with Spark
&lt;/h3&gt;

&lt;p&gt;This phase is a Python script containing a Spark Structured Streaming job that acts as a data sink from Kafka into Apache Cassandra.&lt;/p&gt;

&lt;p&gt;It consumes the change logs from the various Kafka topics as JSON, flattens the Debezium envelope fields (&lt;code&gt;after&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;op&lt;/code&gt;, &lt;code&gt;ts_ms&lt;/code&gt;, and &lt;code&gt;transaction&lt;/code&gt;), transforms the records, and loads them into Cassandra hosted on an Azure Ubuntu VM.&lt;/p&gt;
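
&lt;p&gt;Below is a minimal sketch of such a sink for a single table, assuming a hypothetical topic name, an illustrative subset of the schema, and placeholder connection details; the real job covers all five topics and needs the Kafka and Cassandra connector packages on the Spark classpath:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("binance-cdc-sink").getOrCreate()

# Illustrative subset of the Debezium envelope for one table
schema = StructType([
    StructField("after", StructType([
        StructField("symbol", StringType()),
        StructField("price", StringType()),
    ])),
    StructField("op", StringType()),
    StructField("ts_ms", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "BOOTSTRAP_SERVER:9092")  # placeholder
       .option("subscribe", "cdc.binance.latest_prices")            # placeholder topic
       .load())

# Parse the JSON value and flatten the envelope down to the row payload
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("d"))
          .select("d.after.*", "d.op", "d.ts_ms"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the matching Cassandra table
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="binance", table="latest_prices")            # placeholder
     .save())

parsed.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;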

&lt;p&gt;The data received in Cassandra is as follows for the &lt;code&gt;klines&lt;/code&gt; table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuq4ju0wujm3uvwuuecy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuq4ju0wujm3uvwuuecy.png" alt="Klines example" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Visualizing the Data in Grafana
&lt;/h3&gt;

&lt;p&gt;After streaming data into Cassandra, we need to visualize this data on a Grafana dashboard to derive insights from the data.&lt;/p&gt;

&lt;p&gt;For this project, I used Grafana Cloud. There is a Cassandra plugin that can be used as a data source, but it only supports Cassandra 3.0.0 and older, and I am running 4.1.9, so I ran into issues connecting Grafana directly to Cassandra.&lt;/p&gt;

&lt;p&gt;Instead, I built a simple Cassandra-to-Postgres bridge that copies the data into a PostgreSQL database, which Grafana supports natively, to draw insights from the data.&lt;/p&gt;
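
&lt;p&gt;As a rough sketch, the bridge simply reads rows out of Cassandra with the Python driver and appends them to Postgres with pandas; the keyspace, table, host, and credentials below are placeholders (see the authentication note that follows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from sqlalchemy import create_engine

# Placeholder host and credentials
auth = PlainTextAuthProvider(username="cassandra", password="CASSANDRA_PWD")
cluster = Cluster(["VM_PUBLIC_IP"], auth_provider=auth)
session = cluster.connect("binance")  # keyspace

rows = session.execute("SELECT symbol, price, time FROM latest_prices")
df = pd.DataFrame(list(rows))  # driver rows behave like namedtuples

engine = create_engine("DB_URI")  # placeholder Postgres connection string
df.to_sql("latest_prices", con=engine, schema="binance", index=False, if_exists="append")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;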

&lt;blockquote&gt;
&lt;p&gt;NOTE: By default, Cassandra ships with the authenticator set to &lt;code&gt;AllowAllAuthenticator&lt;/code&gt; and the authorizer set to &lt;code&gt;AllowAllAuthorizer&lt;/code&gt;, which disables authentication and allows any user to access Cassandra's shell, &lt;code&gt;cqlsh&lt;/code&gt;. Open the config file, replace the defaults with the values shown below, and restart the Cassandra service:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/cassandra/cassandra.yaml &lt;span class="c"&gt;# Cassandra's config file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;authenticator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PasswordAuthenticator&lt;/span&gt;

&lt;span class="na"&gt;authorizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CassandraAuthorizer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then exit the editor and run the following command to restart the Cassandra service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8boahgh72d0s14f7vpdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8boahgh72d0s14f7vpdf.png" alt="Dashboard image 1" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrla74vzxgx4573tuxqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrla74vzxgx4573tuxqu.png" alt="Dashboard image 2" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Insights gained from the data
&lt;/h3&gt;

&lt;p&gt;The dashboards reveal several valuable insights into crypto trading patterns. The first chart shows that XRPUSDT, AVAXUSDT, BTCUSDT, and DOGEUSDT are among the most frequently traded symbols in the trading day, indicating high liquidity or market interest. &lt;/p&gt;

&lt;p&gt;The ETHUSDT volume trend indicates significant fluctuations, with notable spikes between 13:00 and 14:00, suggesting peak trading hours possibly influenced by market news or institutional actions. &lt;/p&gt;

&lt;p&gt;The candlestick chart for BTCUSDT displays hourly open, high, low, and close prices, offering a useful tool for technical analysis and identifying market trends or reversal patterns.&lt;/p&gt;

&lt;p&gt;Additionally, the quantity of coins ordered throughout the trading day highlights DOGEUSDT and ADAUSDT as dominant in trade volume, with DOGE exceeding 270,000 units, likely due to its lower price and retail trader appeal. Overall, the visualizations suggest strong market activity around a few key coins, with volume and price data providing critical insights for real-time monitoring, trading strategy, or machine learning applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project was a rewarding journey through the world of real-time data engineering. It demonstrates how Change Data Capture can save time and resources while updating our databases for streaming applications and data pipelines. This stack can easily be used for downstream use cases e.g. data analytics or even real-time ML predictions for crypto coins.&lt;/p&gt;

&lt;p&gt;Do you have any questions? Don't forget to like, comment and follow for more data engineering tutorials and content!&lt;/p&gt;

</description>
      <category>python</category>
      <category>cryptocurrency</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Tracking Kenya’s External Debt Using Python, PostgreSQL, and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 20 Jun 2025 21:35:21 +0000</pubDate>
      <link>https://forem.com/dkkinyua/tracking-kenyas-external-debt-using-python-postgresql-and-grafana-1h61</link>
      <guid>https://forem.com/dkkinyua/tracking-kenyas-external-debt-using-python-postgresql-and-grafana-1h61</guid>
      <description>&lt;p&gt;We always hear about Kenya’s external rising debt in headlines, but how fast is it growing? And what are the trends year over year?&lt;/p&gt;

&lt;p&gt;As a data engineer, I wanted to answer these questions with data, not just opinions. So I built a pipeline that connects the dots: from pulling debt data via the World Bank API, transforming it using pandas, storing it in a PostgreSQL database, and visualizing the story through Grafana.&lt;/p&gt;

&lt;p&gt;You can get the project code on &lt;a href="https://github.com/dkkinyua/KenyanExternalDebtAnalysis" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python for scripting and data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas for data transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL for data storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana for interactive dashboards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;World Bank API as the data source&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Extracting Data from the World Bank API
&lt;/h3&gt;

&lt;p&gt;I used the World Bank API to fetch Kenya’s external debt stock from 2010 to 2023 using the &lt;code&gt;requests&lt;/code&gt; library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.worldbank.org/v2/country/KE/indicator/DT.DOD.DECT.CD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2010:2024&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rel_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;rel_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countryName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debtValue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indicator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rel_data&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Transforming Data with Pandas
&lt;/h3&gt;

&lt;p&gt;I cleaned and processed the data by dropping NaN/missing values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Loading into a PostgreSQL database
&lt;/h3&gt;

&lt;p&gt;Load the data into a PostgreSQL database for easier integration with the Grafana dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;external_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Data loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading into db: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Building Grafana dashboards
&lt;/h3&gt;

&lt;h4&gt;
  
  
  A. Creating and configuring Grafana
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;Grafana Cloud&lt;/a&gt; and create a new account.&lt;/li&gt;
&lt;li&gt;Head over to Dashboards and click on Data Sources&lt;/li&gt;
&lt;li&gt;On your right-hand side, click on Add Data Source and connect your PostgreSQL database&lt;/li&gt;
&lt;li&gt;Click on &lt;em&gt;Create a dashboard&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  B. Dashboards
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kfqehvyavwr0akt2aqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kfqehvyavwr0akt2aqn.png" alt="Grafana Dashboard" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights Derived
&lt;/h2&gt;

&lt;p&gt;External debt grew steadily between 2010 and 2023, a 383% increase over 13 years, pointing to an overreliance on external borrowing to finance development.&lt;/p&gt;
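
&lt;p&gt;As a quick sanity check, a 383% total increase over 13 years implies an average annual growth rate of roughly 13%:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Implied compound annual growth rate from the 383% total increase
total_growth = 4.83  # final debt / initial debt, i.e. a 383% increase
years = 13
cagr = total_growth ** (1 / years) - 1
print(f"{cagr:.1%}")  # ~12.9% per year
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;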

&lt;p&gt;From the bar chart in the dashboard, 2014-2015 and 2017-2018 stand out as periods of high borrowing. The 2017 spike was the highest of them all, which coincided with the 2017 election period.&lt;/p&gt;

&lt;p&gt;Debt growth slowed between 2021 and 2023, which suggests either deliberate efforts to rein in external borrowing or pressure from debt servicing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical or logical challenges encountered
&lt;/h2&gt;

&lt;p&gt;Some of the technical or logical challenges encountered include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent time formats&lt;/li&gt;
&lt;li&gt;Dealing with missing/NaN values&lt;/li&gt;
&lt;li&gt;The API returns nested JSON or XML, which is not always straightforward for pandas ingestion (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;The data covers only external debt and does not include other indicators for analyzing broader debt patterns&lt;/li&gt;
&lt;/ul&gt;
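
&lt;p&gt;For the nested JSON in particular, &lt;code&gt;pandas.json_normalize&lt;/code&gt; can flatten the response in one call; a minimal sketch, assuming &lt;code&gt;data&lt;/code&gt; is the parsed response from the &lt;code&gt;extract()&lt;/code&gt; function above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Element 0 of the response is pagination metadata; element 1 holds the records
records = data[1]
df = pd.json_normalize(records)

# Nested keys become dotted columns, e.g. "country.id" and "indicator.value"
df = df.rename(columns={"country.id": "countryName", "date": "year", "value": "debtValue"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;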

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project gives hands-on experience with end-to-end ETL pipeline design, from data extraction from the World Bank API using &lt;code&gt;requests&lt;/code&gt; to transformation using pandas, loading, and visualization using Grafana. It's a solid foundation for building more robust data pipelines.&lt;/p&gt;

&lt;p&gt;For more blogs like this, please like, comment and follow me to stay updated in the data engineering world!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Extracting Data from the Premier League YouTube Channel Using `googleapiclient`, PySpark, Airflow, PostgreSQL and Grafana</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Thu, 12 Jun 2025 14:34:56 +0000</pubDate>
      <link>https://forem.com/dkkinyua/extracting-data-from-the-premier-league-youtube-channel-using-googleapiclient-pyspark-airflow-29ie</link>
      <guid>https://forem.com/dkkinyua/extracting-data-from-the-premier-league-youtube-channel-using-googleapiclient-pyspark-airflow-29ie</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, I’ll walk you through how I built a scalable ETL data pipeline to extract and analyze video content from the &lt;a href="https://www.youtube.com/@PremierLeague" rel="noopener noreferrer"&gt;Premier League YouTube Channel&lt;/a&gt;, using the YouTube Data API, PySpark, Airflow, and PostgreSQL, and visualized the data on a Grafana dashboard.&lt;/p&gt;

&lt;p&gt;Here is the GitHub &lt;a href="https://github.com/dkkinyua/pl-youtube-analytics" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1: Extract Data from YouTube API
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;googleapiclient&lt;/code&gt;, I connected to the YouTube Data API and extracted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The channel's uploads playlist ID&lt;/li&gt;
&lt;li&gt;All video IDs, processed in batches of 50&lt;/li&gt;
&lt;li&gt;Metadata for each video: title, view count, like count, comment count, and published date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data was written to a JSON file for further processing and transformation.&lt;/p&gt;
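
&lt;p&gt;Here is a condensed sketch of that extraction, with a placeholder API key and channel ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUTUBE_API_KEY")  # placeholder key

# 1. Find the channel's uploads playlist ID
channel = youtube.channels().list(part="contentDetails", id="CHANNEL_ID").execute()
uploads_id = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# 2. Page through the playlist collecting video IDs
video_ids, token = [], None
while True:
    page = youtube.playlistItems().list(
        part="contentDetails", playlistId=uploads_id, maxResults=50, pageToken=token
    ).execute()
    video_ids += [item["contentDetails"]["videoId"] for item in page["items"]]
    token = page.get("nextPageToken")
    if not token:
        break

# 3. Fetch metadata in batches of 50 (the API's per-request limit)
videos = []
for i in range(0, len(video_ids), 50):
    batch = youtube.videos().list(
        part="snippet,statistics", id=",".join(video_ids[i:i + 50])
    ).execute()
    videos += batch["items"]

with open("videos.json", "w") as f:
    json.dump(videos, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;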

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911z4wtyu915xqkl6wv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911z4wtyu915xqkl6wv6.png" alt="JSON File Output" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2: Transform Data with PySpark and Load into Postgres
&lt;/h2&gt;

&lt;p&gt;Next, I used PySpark to clean and process the data by creating a Spark DataFrame and transforming it through a temporary view.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiLine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;videos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT 
            title,
            videoId,
            CAST(viewCount as INT) as viewCount,
            CASE
                WHEN CAST(viewCount AS INT) &amp;gt;= 100000 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;very viral&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
                WHEN CAST(viewCount AS INT) BETWEEN 10000 AND 99999 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;viral&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
                ELSE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;normal viewing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
            END AS video_virality,
            CAST(commentCount AS INT) AS commentCount,
            CAST(likeCount AS INT) AS likeCount,
            CAST(publishedAt AS DATE) AS date
        FROM videos
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transformations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cast viewCount, likeCount, and commentCount to integer values for easier analysis.&lt;/li&gt;
&lt;li&gt;Bucketed videos into virality tiers based on viewCount, via the &lt;code&gt;CASE&lt;/code&gt; expression above.&lt;/li&gt;
&lt;li&gt;Calculated an engagement rate with the formula &lt;code&gt;(likes + comments) / views&lt;/code&gt; (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
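
&lt;p&gt;A minimal sketch of that engagement-rate column, assuming the &lt;code&gt;new_df&lt;/code&gt; produced by the query above and guarding against zero views:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import functions as F

new_df = new_df.withColumn(
    "engagement_rate",
    F.when(F.col("viewCount") &gt; 0,
           (F.col("likeCount") + F.col("commentCount")) / F.col("viewCount"))
)  # rows with zero views are left null by the unmatched when()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;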

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0xlc6lwgbcezoe2egf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0xlc6lwgbcezoe2egf.png" alt="PySpark Instance" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading into Postgres
&lt;/h3&gt;

&lt;p&gt;To load the data into a Postgres database, I used the &lt;code&gt;df.write.jdbc()&lt;/code&gt; method from PySpark.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: Make sure the PostgreSQL JDBC driver is on the Spark classpath (via the &lt;code&gt;spark.jars&lt;/code&gt; or &lt;code&gt;spark.jars.packages&lt;/code&gt; config) to avoid errors while loading data into the database. Also, use the JDBC URI to connect to the database; it doesn't include the user and password, as those are set in the &lt;code&gt;properties&lt;/code&gt; variable.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
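
&lt;p&gt;One way to pull the driver in is via the &lt;code&gt;spark.jars.packages&lt;/code&gt; config when building the session; the driver version below is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pl-youtube-analytics")
         # Fetches the PostgreSQL JDBC driver from Maven at startup
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
         .getOrCreate())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;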

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avnadmin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DB_PWD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# code
&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="n"&gt;new_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pl_analytics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded into Postgres DB successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded Spark Dataframe into db error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3: Orchestrate with Airflow
&lt;/h2&gt;

&lt;p&gt;To automate the workflow weekly, I created two Python modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;extract_data.py: Extracts data from YouTube and writes to JSON&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;transform_and_load.py: Reads the JSON, processes with PySpark, and loads into PostgreSQL&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I defined a DAG in extract_transform.py with the extract and transform-and-load tasks, using the PythonOperator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pl_analytics_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This DAG contains the pipeline for extracting, transforming and loading data from the Youtube API&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_task&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform_and_load&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform_task&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21zjg7acuoo03g04ojro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21zjg7acuoo03g04ojro.png" alt="Airflow Graph view" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Visualizing with Grafana
&lt;/h2&gt;

&lt;p&gt;I connected the PostgreSQL database to Grafana Cloud and created a dashboard to draw insights from the collected data. Some of the panels include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Line chart of views over time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top 10 most viewed videos (bar chart)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Histogram of engagement rates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scatter plot of video length vs. views/likes/comments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpyld7gas9yte353ydsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpyld7gas9yte353ydsb.png" alt="Grafana dashboard" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project gives hands-on experience with end-to-end ETL pipeline design, from data extraction from Google APIs using &lt;code&gt;googleapiclient&lt;/code&gt; to transformation using PySpark, loading, and visualization. It's a solid foundation for building more robust data pipelines.&lt;/p&gt;

&lt;p&gt;For more blogs like this, please like, comment and follow me to stay updated in the data engineering world!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testing Airflow DAGs 101: A Practical Guide for Modern Data Teams.</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Fri, 06 Jun 2025 15:58:00 +0000</pubDate>
      <link>https://forem.com/dkkinyua/testing-airflow-dags-101-a-practical-guide-for-modern-data-teams-315j</link>
      <guid>https://forem.com/dkkinyua/testing-airflow-dags-101-a-practical-guide-for-modern-data-teams-315j</guid>
      <description>&lt;p&gt;Apache Airflow is an essential tool for data engineers. Data engineers use Airflow to automate, schedule and manage tasks within Directed Acyclic Graphs (DAGs) to ensure that data pipelines are being run effectively.&lt;/p&gt;

&lt;p&gt;When pushing our DAGs to production, we, as data engineers, run the risk of pushing broken DAGs to production, breaking data pipelines and costing the data team and the whole organization time and money in the process. This is the reason why testing our DAGs before pushing them to production is key in maintaining robust and scalable data pipelines, while saving time and money for the organization.&lt;/p&gt;

&lt;p&gt;Testing Airflow DAGs used to be hard: engineers had to create many DAGs on the production server just to test their code, which took up a lot of space and wasted a lot of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Test Airflow DAGs?
&lt;/h2&gt;

&lt;p&gt;Testing Airflow DAGs is not just about catching bugs. It is about ensuring data integrity, reducing operational risk, and speeding up development. Without tests, a small change in your pipeline could have unintended consequences downstream, ranging from data loss to duplicated reports and dashboards.&lt;/p&gt;

&lt;p&gt;The following are reasons why testing DAGs is important in the data lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevent failures before they happen, by identifying breaking points in our code even before pushing the DAG to production&lt;/li&gt;
&lt;li&gt;Test logic and dependencies without waiting for a full run&lt;/li&gt;
&lt;li&gt;Catch regressions and breaking changes early&lt;/li&gt;
&lt;li&gt;Save time and resources for the data team and the whole organization in the long run&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of tests that can be done on Airflow DAGs.
&lt;/h2&gt;

&lt;p&gt;There are various tests that can be conducted on Airflow DAGs which include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests using the &lt;code&gt;pytest&lt;/code&gt; or &lt;code&gt;unittest&lt;/code&gt; modules from Python&lt;/li&gt;
&lt;li&gt;Integration tests by setting up test DAGs and mock DAGs&lt;/li&gt;
&lt;li&gt;End-to-end tests by setting up test environments and conducting full DAG runs using the Airflow CLI or &lt;code&gt;breeze&lt;/code&gt; (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
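
&lt;p&gt;For end-to-end runs, Airflow 2.5+ also lets you trigger a full in-process run straight from Python with &lt;code&gt;dag.test()&lt;/code&gt;, which is handy in CI; a minimal sketch, assuming a hypothetical module path for the DAG:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical import path; point it at wherever your DAG factory lives
from dags.user_data import user_data_dag

dag = user_data_dag()  # the @dag-decorated function returns a DAG object
dag.test()             # runs every task in-process, in dependency order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;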

&lt;h3&gt;
  
  
  1. Unit tests.
&lt;/h3&gt;

&lt;p&gt;Unit tests are all about separating your logic from the DAG and testing each task or piece of logic independently.&lt;br&gt;
This allows us to identify as many breaking points as possible in our code and fix them before pushing to production.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example of a unit test in Airflow.
&lt;/h4&gt;

&lt;p&gt;Assume we have a task in our DAG that calculates the average age of users, taking in a list of user data like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_data_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;user_data_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;# code and other tasks
&lt;/span&gt;  &lt;span class="nd"&gt;@task&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_avg_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;total_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_age&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's write unit tests for this task's logic, covering several cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dags&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_avg_age&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_valid_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_empty_list&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calculate_average_age_with_one_user&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catches bugs early&lt;/li&gt;
&lt;li&gt;Broad angle of logic testing&lt;/li&gt;
&lt;li&gt;Faster feedback&lt;/li&gt;
&lt;li&gt;Encourages writing modular code&lt;/li&gt;
&lt;li&gt;Easier to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't cover the entire scope of the project/DAG&lt;/li&gt;
&lt;li&gt;Time-consuming to write and maintain&lt;/li&gt;
&lt;li&gt;Doesn't uncover DAG-related bugs and errors&lt;/li&gt;
&lt;li&gt;Provides a false sense of security, as it doesn't guarantee that the DAG will run successfully in production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Integration tests.
&lt;/h3&gt;

&lt;p&gt;While unit tests focus on isolated logic, integration tests verify how multiple components (e.g. tasks) interact with each other, such as tasks passing data via Airflow XCom, making API calls, or performing database operations.&lt;/p&gt;

&lt;p&gt;These tests simulate a portion (or all) of your DAG logic to ensure everything works together as expected.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example of an integration test in Airflow
&lt;/h4&gt;

&lt;p&gt;Let's define two tasks: &lt;code&gt;extract()&lt;/code&gt;, which returns a list of dictionaries containing user data, and &lt;code&gt;transform()&lt;/code&gt;, which calculates the average age of the users in the data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;integration_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;avg_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average age is: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_age&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;avg_age&lt;/span&gt;

    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;integration_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's an integration test for this DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_integration_dag.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DagBag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskInstance&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_taskflow_xcom_passes_correctly&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;dag_bag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DagBag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dags/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag_bag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integration_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;

    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;exec_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;extract_ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TaskInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exec_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extract_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignore_ti_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# XCom value from extract
&lt;/span&gt;    &lt;span class="n"&gt;users_from_xcom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users_from_xcom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;users_from_xcom&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;transform_ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TaskInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exec_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignore_ti_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;transform_ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates real behaviour during DAG runs&lt;/li&gt;
&lt;li&gt;More realistic scope than unit tests&lt;/li&gt;
&lt;li&gt;Useful in staging or pre-production test environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harder to maintain&lt;/li&gt;
&lt;li&gt;Complex setup&lt;/li&gt;
&lt;li&gt;Slower to execute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. End-to-end tests.
&lt;/h3&gt;

&lt;p&gt;End-to-end (E2E) testing is the practice of executing an entire DAG, or a significant portion of it, in an environment that mimics production. It tests not just your task logic, but also task dependencies, retries, data movement, and overall orchestration.&lt;/p&gt;

&lt;p&gt;It simulates a DAG run using real data or test data and ensures the entire flow behaves as expected from task execution to XComs, file movement, database writes, or API calls.&lt;/p&gt;
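
&lt;p&gt;Because E2E tests exercise real orchestration, they usually need a dedicated test environment. As a minimal sketch (assuming Airflow 2.5+, where the &lt;code&gt;dag.test()&lt;/code&gt; method covered in method 5 below is available, and a disposable metadata database), you could run the whole &lt;code&gt;integration_dag&lt;/code&gt; from pytest and assert on the final run state:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tests/test_e2e_dag.py
# A sketch, not a prescription: assumes Airflow &amp;gt;= 2.5, where dag.test()
# runs the DAG in-process and returns the finished DagRun, plus a
# throwaway metadata DB and the integration_dag defined above in dags/
from airflow.models import DagBag
from airflow.utils.state import State

def test_integration_dag_end_to_end():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    dag = dag_bag.get_dag("integration_dag")
    assert dag is not None

    # Runs every task in dependency order in one process, XComs included
    dagrun = dag.test()
    assert dagrun.state == State.SUCCESS
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;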

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests the full workflow&lt;/li&gt;
&lt;li&gt;Runs in a production-like environment, so the DAG behaves as it would in production&lt;/li&gt;
&lt;li&gt;Uncovers integration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow to run&lt;/li&gt;
&lt;li&gt;Harder to set up and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Testing using &lt;code&gt;airflow dags test&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;airflow dags test&lt;/code&gt; command allows you to manually run a DAG from the command line for a given execution date, task by task, without waiting for the scheduler or triggering a full DAG run in the database.&lt;/p&gt;

&lt;p&gt;It simulates a DAG run without scheduling, backfilling, or logging metadata to the database.&lt;/p&gt;

&lt;p&gt;To use this command in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow dags &lt;span class="nb"&gt;test &lt;/span&gt;sample_etl_dag 2023-01-01
&lt;span class="c"&gt;# Syntax: airflow dags test &amp;lt;dag_id&amp;gt; &amp;lt;execution_date&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;NOTE: This runs locally and writes no run metadata to the database. Keep in mind, though, that your task code still executes for real, so calls to external sources, e.g. APIs, will actually be made.&lt;/p&gt;
&lt;/blockquote&gt;
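
&lt;p&gt;If you also want this check to run in CI, one option is to wrap the command in a pytest test. A small sketch, assuming the &lt;code&gt;airflow&lt;/code&gt; CLI is on the test runner's PATH:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tests/test_dags_cli.py
# Sketch: shells out to the airflow CLI with the same DAG id and date as above
import subprocess

def test_sample_etl_dag_runs_via_cli():
    result = subprocess.run(
        ["airflow", "dags", "test", "sample_etl_dag", "2023-01-01"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means a task failed or the DAG did not parse
    assert result.returncode == 0, result.stderr
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;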

&lt;h3&gt;
  
  
  5. Using &lt;code&gt;dag.test()&lt;/code&gt; method.
&lt;/h3&gt;

&lt;p&gt;We can call the &lt;code&gt;dag.test()&lt;/code&gt; method at the end of our DAG file to test the DAG whenever the file itself is run.&lt;/p&gt;

&lt;p&gt;It is handy for local development testing, as the DAG runs in-process without needing a scheduler or webserver.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deecodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@hourly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello there, Airflow is running!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dag_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# let's test this DAG
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dag_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the file in the terminal to test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 test_dag.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.238+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4435&lt;span class="o"&gt;}&lt;/span&gt; INFO - dagrun &lt;span class="nb"&gt;id&lt;/span&gt;: test_dag
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.278+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4451&lt;span class="o"&gt;}&lt;/span&gt; INFO - created dagrun &amp;lt;DagRun test_dag @ 2025-06-06 15:51:51.044158+00:00: manual__2025-06-06T15:51:51.044158+00:00, state:running, queued_at: None. externally triggered: False&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.349+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4396&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] starting &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello &lt;span class="nv"&gt;map_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:51.350+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4399&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] running task &amp;lt;TaskInstance: test_dag.say_hello manual__2025-06-06T15:51:51.044158+00:00 &lt;span class="o"&gt;[&lt;/span&gt;scheduled]&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06 18:51:53,606] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:3134&lt;span class="o"&gt;}&lt;/span&gt; INFO - Exporting &lt;span class="nb"&gt;env &lt;/span&gt;vars: &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_OWNER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'deecodes'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'test_dag'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TASK_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'say_hello'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_EXECUTION_DATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-06T15:51:51.044158+00:00'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TRY_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_RUN_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'manual__2025-06-06T15:51:51.044158+00:00'&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.606+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:3134&lt;span class="o"&gt;}&lt;/span&gt; INFO - Exporting &lt;span class="nb"&gt;env &lt;/span&gt;vars: &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_OWNER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'deecodes'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'test_dag'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TASK_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'say_hello'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_EXECUTION_DATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-06T15:51:51.044158+00:00'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_TRY_NUMBER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="nv"&gt;AIRFLOW_CTX_DAG_RUN_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'manual__2025-06-06T15:51:51.044158+00:00'&lt;/span&gt;
Task instance is &lt;span class="k"&gt;in &lt;/span&gt;running state
 Previous state of the Task instance: queued
Current task name:say_hello state:scheduled start_date:None
Dag name:test_dag and current dag run status:running
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.611+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:732&lt;span class="o"&gt;}&lt;/span&gt; INFO - ::endgroup::
Hello there, Airflow is running!
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06 18:51:53,614] &lt;span class="o"&gt;{&lt;/span&gt;python.py:240&lt;span class="o"&gt;}&lt;/span&gt; INFO - Done. Returned value was: None
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.614+0300] &lt;span class="o"&gt;{&lt;/span&gt;python.py:240&lt;span class="o"&gt;}&lt;/span&gt; INFO - Done. Returned value was: None
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.624+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:341&lt;span class="o"&gt;}&lt;/span&gt; INFO - ::group::Post task execution logs
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.625+0300] &lt;span class="o"&gt;{&lt;/span&gt;taskinstance.py:353&lt;span class="o"&gt;}&lt;/span&gt; INFO - Marking task as SUCCESS. &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_dag, &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello, &lt;span class="nv"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual__2025-06-06T15:51:51.044158+00:00, &lt;span class="nv"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20250606T155151, &lt;span class="nv"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;, &lt;span class="nv"&gt;end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20250606T155153
Task instance &lt;span class="k"&gt;in &lt;/span&gt;success state
 Previous state of the Task instance: running
Dag name:test_dag queued_at:None
Task &lt;span class="nb"&gt;hostname&lt;/span&gt;:Denzo.localdomain operator:_PythonDecoratedOperator
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.696+0300] &lt;span class="o"&gt;{&lt;/span&gt;dag.py:4410&lt;span class="o"&gt;}&lt;/span&gt; INFO - &lt;span class="o"&gt;[&lt;/span&gt;DAG TEST] end task &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;say_hello &lt;span class="nv"&gt;map_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.707+0300] &lt;span class="o"&gt;{&lt;/span&gt;dagrun.py:854&lt;span class="o"&gt;}&lt;/span&gt; INFO - Marking run &amp;lt;DagRun test_dag @ 2025-06-06 15:51:51.044158+00:00: manual__2025-06-06T15:51:51.044158+00:00, state:running, queued_at: None. externally triggered: False&amp;gt; successful
Dag run &lt;span class="k"&gt;in &lt;/span&gt;success state
Dag run start:2025-06-06 15:51:51.044158+00:00 end:2025-06-06 15:51:53.708913+00:00
&lt;span class="o"&gt;[&lt;/span&gt;2025-06-06T18:51:53.710+0300] &lt;span class="o"&gt;{&lt;/span&gt;dagrun.py:905&lt;span class="o"&gt;}&lt;/span&gt; INFO - DagRun Finished: &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_dag, &lt;span class="nv"&gt;execution_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:51.044158+00:00, &lt;span class="nv"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual__2025-06-06T15:51:51.044158+00:00, &lt;span class="nv"&gt;run_start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:51.044158+00:00, &lt;span class="nv"&gt;run_end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:51:53.708913+00:00, &lt;span class="nv"&gt;run_duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.664755, &lt;span class="nv"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;success, &lt;span class="nv"&gt;external_trigger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False, &lt;span class="nv"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual, &lt;span class="nv"&gt;data_interval_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 14:00:00+00:00, &lt;span class="nv"&gt;data_interval_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2025-06-06 15:00:00+00:00, &lt;span class="nv"&gt;dag_hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows for faster local testing&lt;/li&gt;
&lt;li&gt;Simple debugging&lt;/li&gt;
&lt;li&gt;No running Airflow deployment (scheduler or webserver) required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This method is not scalable across many DAGs&lt;/li&gt;
&lt;li&gt;It doesn't test the DAG's integrity&lt;/li&gt;
&lt;li&gt;It doesn't simulate a true scheduled DAG run&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Testing your Airflow DAGs may feel like overkill when you are moving fast, but the long-term stability and confidence it brings are well worth the effort. Treat your data pipelines like production code, because they are.&lt;/p&gt;

&lt;p&gt;By following the practical steps in this guide, you and your team will be better equipped to deliver robust, scalable pipelines that do not break when it matters most: in production.&lt;/p&gt;

&lt;p&gt;To learn more about testing in Airflow, visit the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/2.6.0/core-concepts/executor/debug.html" rel="noopener noreferrer"&gt;Airflow's testing documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.astronomer.io/docs/learn/testing-airflow/" rel="noopener noreferrer"&gt;Astronomer's Testing documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html" rel="noopener noreferrer"&gt;Airflow's best practices in building DAGs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=wkAb8mDDHMs" rel="noopener noreferrer"&gt;etsy-dagtest, on Apache Airflow's YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this post has enriched your knowledge of testing Airflow DAGs. Please leave a like, comment, and follow for more data engineering posts just like this one.&lt;/p&gt;

&lt;p&gt;Do you have any other methods for testing your Airflow DAGs before pushing them to production? Let me know down in the comments!&lt;br&gt;
For anything else, please email me at &lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataeng</category>
      <category>python</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building a Stock Data Pipeline with requests, Apache Airflow and PostgreSQL</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Mon, 26 May 2025 07:30:21 +0000</pubDate>
      <link>https://forem.com/dkkinyua/building-a-stock-data-pipeline-with-requests-apache-airflow-and-postgresql-42nf</link>
      <guid>https://forem.com/dkkinyua/building-a-stock-data-pipeline-with-requests-apache-airflow-and-postgresql-42nf</guid>
      <description>&lt;p&gt;In this tutorial, I will walk you through how I built a fully functional ETL pipeline using Apache Airflow to extract stock data from the AlphaVantage API, transform it using pandas, and load it into a PostgreSQL database. The pipeline is modular and scalable and is a great introduction to Airflow DAGs, task scheduling and orchestration.&lt;/p&gt;

&lt;p&gt;You can access the GitHub repository &lt;a href="https://github.com/dkkinyua/stock-data-pipeline" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline pulls IBM's daily, weekly and monthly stock time series from AlphaVantage and performs three key steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt;: Retrieve data from the API using the &lt;code&gt;requests&lt;/code&gt; library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Clean and convert the data to a structured &lt;code&gt;pandas&lt;/code&gt; DataFrame.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Store the cleaned data in a PostgreSQL database using SQLAlchemy's &lt;code&gt;create_engine&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;The following packages are required to run this project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- apache-airflow==2.10.5
- python
- pandas
- requests
- sqlalchemy
- psycopg2-binary
- python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stock-data-pipeline/
├── dags/
│   ├── daily_pipeline.py
│   ├── weekly_pipeline.py
│   ├── monthly_pipeline.py
├── airflow.cfg
├── airflow.db
├── webserver_pid.py
├── .gitignore
├── .env
├── requirements.txt
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE for Windows users:&lt;/strong&gt; Airflow does not support Windows natively. Use WSL (Windows Subsystem for Linux) to run this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Clone the repository
&lt;/h3&gt;

&lt;p&gt;To clone this repository, open your terminal and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dkkinyua/stock-data-pipeline.git
&lt;span class="nb"&gt;cd &lt;/span&gt;stock-data-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Set up a virtual environment
&lt;/h3&gt;

&lt;p&gt;Install and activate your virtual environment using &lt;code&gt;venv&lt;/code&gt; or any other virtual environment provider, such as &lt;code&gt;virtualenv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Creating a virtual environment for each project is good practice: it prevents package mix-ups and lets each project work with compatible package versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv your_env
&lt;span class="nb"&gt;source &lt;/span&gt;your_env/bin/activate   &lt;span class="c"&gt;# Because Windows users will be &lt;/span&gt;
running Airflow on WSL, use this &lt;span class="nb"&gt;command &lt;/span&gt;to activate your environment.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your virtual environment is active, the environment name will show next to your directory in the terminal as shown in the snapshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxag881qggfojfd50kft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxag881qggfojfd50kft.png" alt="Environment Variable Snapshot" width="709" height="22"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Install dependencies
&lt;/h3&gt;

&lt;p&gt;The project structure has a &lt;code&gt;requirements.txt&lt;/code&gt; file containing all the packages required for this project.&lt;/p&gt;

&lt;p&gt;To install these packages, run the following command in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Configure Apache Airflow
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a. Set your Airflow home path and initialize Airflow's SQLite database:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/airflow
airflow db init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;airflow db init&lt;/code&gt; command creates a folder &lt;code&gt;airflow/&lt;/code&gt; in the project's root directory.&lt;/p&gt;

&lt;h4&gt;
  
  
  b. Create an admin user:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow &lt;span class="nb"&gt;users &lt;/span&gt;create &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--firstname&lt;/span&gt; YourName &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--lastname&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt; admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  c. Start the webserver and scheduler:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow webserver &amp;amp; airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Airflow UI will run on &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm3s90692lxvbfytdgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm3s90692lxvbfytdgj.png" alt="Airflow UI Login page" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the DAG: &lt;code&gt;daily_pipeline.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This DAG extracts IBM’s daily stock data using AlphaVantage's API and runs at midnight daily.&lt;/p&gt;

&lt;h4&gt;
  
  
  DAG setup
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deecodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily_data_pipeline_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# code
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  a. &lt;code&gt;extract_daily_data&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_daily_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&amp;amp;symbol=IBM&amp;amp;apikey=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt;: Store your API keys and credentials in a &lt;code&gt;.env&lt;/code&gt; file and add it to &lt;code&gt;.gitignore&lt;/code&gt;. A &lt;code&gt;.gitignore&lt;/code&gt; file tells Git not to track certain files and folders, keeping sensitive information out of your repository.&lt;/p&gt;
&lt;/blockquote&gt;
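
&lt;p&gt;As an illustration, here is a minimal sketch of that pattern with &lt;code&gt;python-dotenv&lt;/code&gt; (the variable names &lt;code&gt;API_KEY&lt;/code&gt; and &lt;code&gt;DB_URL&lt;/code&gt; are assumptions matching the snippets in this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# top of dags/daily_pipeline.py (illustrative)
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from the .env file

API_KEY = os.getenv("API_KEY")  # assumed .env key for AlphaVantage
DB_URL = os.getenv("DB_URL")    # assumed .env key for the Postgres URL
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;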

&lt;h4&gt;
  
  
  b. &lt;code&gt;transform_daily_data&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_daily_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
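
&lt;p&gt;The body above is abbreviated. For context, here is a hedged sketch of what the transformation could look like, assuming AlphaVantage's usual &lt;code&gt;"Time Series (Daily)"&lt;/code&gt; payload shape (date-keyed records with string fields like &lt;code&gt;"1. open"&lt;/code&gt; and &lt;code&gt;"4. close"&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def transform_daily_data(data):
    # Assumed payload key; adjust if the API response differs
    series = data["Time Series (Daily)"]
    df = pd.DataFrame.from_dict(series, orient="index").astype(float)
    df = df.rename(columns={
        "1. open": "Open",
        "2. high": "High",
        "3. low": "Low",
        "4. close": "Close",
        "5. volume": "Volume",
    })
    # Move the date index into a column, then index by parsed dates
    df = df.reset_index().rename(columns={"index": "Date"})
    df["Date"] = pd.to_datetime(df["Date"])
    df.set_index("Date", inplace=True)
    return df
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;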



&lt;h4&gt;
  
  
  c. &lt;code&gt;load_to_db&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily_stock_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting Task Sequence with Airflow's XCom
&lt;/h3&gt;

&lt;p&gt;Each task returns a value when it completes. We use Airflow's XCom feature to share this data between tasks, which also establishes the task sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_daily_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_daily_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
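
&lt;p&gt;For the return values to travel through XCom like this, each function is decorated as a task inside the DAG function. A condensed sketch of the wiring (assuming the TaskFlow API from the DAG setup above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@dag(dag_id='daily_data_pipeline_dag', default_args=default_args,
     start_date=datetime(2025, 5, 13), schedule='@daily')
def extract_data():

    @task()
    def extract_daily_data():
        ...  # requests.get(...) as shown above

    @task()
    def transform_daily_data(data):
        ...  # build the DataFrame as shown above

    @task()
    def load_to_db(df):
        ...  # df.to_sql(...) as shown above

    data = extract_daily_data()
    df = transform_daily_data(data)
    load_to_db(df)

daily_dag = extract_data()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;One caveat: Airflow serializes XCom values as JSON by default, so passing a raw DataFrame between tasks generally requires enabling XCom pickling or a custom XCom backend.&lt;/p&gt;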



&lt;p&gt;Trigger the DAG from the Airflow UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff644bwbt5q3tox1xonpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff644bwbt5q3tox1xonpb.png" alt="DAG overview" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check success status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczk4gepi6s80kwjip2i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczk4gepi6s80kwjip2i8.png" alt="Run DAG" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inspect task flow in the graph view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0hhy2hamrkzbaq1diu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g0hhy2hamrkzbaq1diu.png" alt="Task Sequence" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Confirm the data loaded in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkb6k4apz5oqcetko4w3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkb6k4apz5oqcetko4w3.png" alt="Postgres check" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve walked through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to extract and clean stock data from AlphaVantage.&lt;/li&gt;
&lt;li&gt;How to configure and run DAGs with Apache Airflow.&lt;/li&gt;
&lt;li&gt;How to load structured data into a PostgreSQL database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have questions or suggestions, feel free to open a pull request or email me at &lt;strong&gt;&lt;a href="mailto:denzelkinyua11@gmail.com"&gt;denzelkinyua11@gmail.com&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, please like, share, comment and follow me for more data engineering blogs and projects, and to stay in the know about the data engineering world.&lt;/p&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>dataengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>dbt for Normies - A Complete Guide.</title>
      <dc:creator>Denzel Kanyeki</dc:creator>
      <pubDate>Sun, 11 May 2025 15:43:35 +0000</pubDate>
      <link>https://forem.com/dkkinyua/dbt-for-normies-a-complete-guide-4bj2</link>
      <guid>https://forem.com/dkkinyua/dbt-for-normies-a-complete-guide-4bj2</guid>
      <description>&lt;p&gt;Whether you are new to data engineering or you are quite conversant with the dataverse, you should have heard or used &lt;strong&gt;Data Build Tool (dbt)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So, what is &lt;code&gt;dbt&lt;/code&gt;? &lt;code&gt;dbt&lt;/code&gt; is a transformation tool that uses three languages to help data engineers transform data efficiently in ETL and ELT pipeline development. These three languages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured Query Language (SQL)&lt;/strong&gt;: This is the main language used in &lt;code&gt;dbt&lt;/code&gt;. It is used to transform data efficiently inside databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML&lt;/strong&gt;: This is the language used for &lt;code&gt;dbt&lt;/code&gt;'s configuration files, e.g. the &lt;code&gt;dbt_project.yml&lt;/code&gt; file, which contains configuration settings for our models, tests and other &lt;code&gt;dbt&lt;/code&gt; essentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jinja&lt;/strong&gt;: This is a powerful templating engine used in &lt;code&gt;dbt&lt;/code&gt;. It is also used elsewhere, e.g. in Python web frameworks (Django, Flask), where it is combined with HTML templates to create robust, functional web applications. Using Jinja in &lt;code&gt;dbt&lt;/code&gt; lets us reuse code across the files in a project, which makes it very valuable. Read more about Jinja &lt;a href="https://jinja.palletsprojects.com/en/stable/intro/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of writing massive SQL scripts, dbt helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write &lt;strong&gt;modular SQL&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;version-controlled&lt;/strong&gt; transformations&lt;/li&gt;
&lt;li&gt;Test and &lt;strong&gt;document&lt;/strong&gt; your models with an easy-to-navigate UI&lt;/li&gt;
&lt;li&gt;Embrace software engineering best practices for data analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;dbt (data build tool) works exceptionally well with CTEs (Common Table Expressions) because they align closely with dbt’s core principles: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clarity&lt;/li&gt;
&lt;li&gt;modularity&lt;/li&gt;
&lt;li&gt;reusability&lt;/li&gt;
&lt;li&gt;testability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a brief introduction to what &lt;code&gt;dbt&lt;/code&gt; is and what it entails. Let's dive deeper, get to know the basics, install &lt;code&gt;dbt-core&lt;/code&gt;, write our first &lt;code&gt;dbt&lt;/code&gt; model and learn more about this powerful tool for data engineers.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Installing and initializing &lt;code&gt;dbt&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;To install dbt, we can use either &lt;code&gt;dbt-core&lt;/code&gt; or &lt;code&gt;dbt-cloud&lt;/code&gt;. For this blog, we will install &lt;code&gt;dbt-core&lt;/code&gt;; &lt;code&gt;dbt-cloud&lt;/code&gt; will be covered in another post.&lt;/p&gt;

&lt;p&gt;To work with &lt;code&gt;dbt&lt;/code&gt;, you need to have the following prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experience installing Python packages in the terminal&lt;/li&gt;
&lt;li&gt;Beginner to advanced SQL knowledge&lt;/li&gt;
&lt;li&gt;Eagerness to learn!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before installing &lt;code&gt;dbt&lt;/code&gt;, make sure you have created and activated a virtual environment. This helps prevent package version mix-ups and errors.&lt;br&gt;
To create and activate a virtual environment for your operating system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv your_env

&lt;span class="nb"&gt;source &lt;/span&gt;your_env/bin/activate &lt;span class="c"&gt;# Linux / MacOS&lt;/span&gt;
your_env&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate &lt;span class="c"&gt;# Windows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install &lt;code&gt;dbt&lt;/code&gt;, run the command that matches the database/data source/data warehouse you would like to connect to, e.g. &lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;redshift&lt;/code&gt;, &lt;code&gt;snowflake&lt;/code&gt; or &lt;code&gt;bigquery&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-postgres &lt;span class="c"&gt;# If you choose postgres&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-redshift &lt;span class="c"&gt;# If you choose redshift&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-snowflake &lt;span class="c"&gt;# If you choose snowflake&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-bigquery &lt;span class="c"&gt;# If you choose bigquery&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following commands to initialize your dbt project and move into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
dbt init your_project_name
&lt;span class="nb"&gt;cd &lt;/span&gt;your_project_name

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will prompt you to enter your data source credentials, e.g. port, host, user and password. Running this command for the first time also sets up a &lt;code&gt;.dbt/&lt;/code&gt; folder in your &lt;code&gt;HOME&lt;/code&gt; directory for global configuration files, e.g. the &lt;code&gt;profiles.yml&lt;/code&gt; file that dbt uses to connect to your data warehouse.&lt;/p&gt;

&lt;p&gt;It also creates important folders and files for us, like &lt;code&gt;models/&lt;/code&gt;, which holds our models, &lt;code&gt;dbt_project.yml&lt;/code&gt;, which contains our project's configuration settings, and &lt;code&gt;tests/&lt;/code&gt; for our tests, among others.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.  Creating our first &lt;code&gt;dbt&lt;/code&gt; model.
&lt;/h2&gt;

&lt;p&gt;To create our first model, head over to the &lt;code&gt;models/&lt;/code&gt; folder and create a file; name it as you like. Let's call it &lt;code&gt;high_cost_houses.sql&lt;/code&gt;. My data warehouse contains Kenyan house data ranging from low-cost to high-cost houses, and I need to pull out all the high-cost ones.&lt;/p&gt;

&lt;p&gt;By default, when dbt runs a model it materializes the result as a view in our schema, where we can go and query the results in our database. We can change that in the &lt;code&gt;dbt_project.yml&lt;/code&gt; file; I'll show you how to do that later on, don't worry!&lt;/p&gt;

&lt;p&gt;For our first model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;high_cost_houses&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;house_data&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;
    &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;high_cost_houses&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query fetches the houses priced at 50 million and above. To run this model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a view in our data warehouse, and you can go check it out under your schema's views.&lt;/p&gt;
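
&lt;p&gt;As a quick sanity check, you can query the new view straight from your SQL client. The schema below is an assumption; dbt builds into whatever target schema you configured in &lt;code&gt;profiles.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- Assumes dbt built the model into the "public" schema
SELECT titles, locations, prices
FROM public.high_cost_houses
ORDER BY prices DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;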

&lt;h3&gt;
  
  
  a. Modularity in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dbt&lt;/code&gt; combines Jinja templates and SQL so that we can reuse our models inside other models without repeating ourselves. This keeps our models reusable and maintainable. For example, to use our &lt;code&gt;high_cost_houses.sql&lt;/code&gt; in another model, we can use &lt;code&gt;ref()&lt;/code&gt; to reference that model as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Let's get houses that are in Runda from our high_cost_houses&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'high_cost_houses'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Runda'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have reused our model in our new model using &lt;code&gt;{{ ref('high_cost_houses') }}&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Materialization in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Remember when I said I would show you how to override dbt's default materialization from a view to a table? Let's do it.&lt;br&gt;
Head over to your &lt;code&gt;dbt_project.yml&lt;/code&gt; and change the materialization from view to table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dbt_practice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# change this from view to table&lt;/span&gt;
    &lt;span class="na"&gt;high_cost_houses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt; &lt;span class="c1"&gt;# previously: view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables are faster to query but take up more storage; views are cheaper to store but re-run their underlying query every time you select from them.&lt;/p&gt;
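
&lt;p&gt;You can also override the materialization for a single model right inside the model file with a &lt;code&gt;config()&lt;/code&gt; block. A minimal sketch, added at the top of &lt;code&gt;high_cost_houses.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- models/high_cost_houses.sql
-- This config applies to this model only and takes
-- precedence over the settings in dbt_project.yml
{{ config(materialized='table') }}

select titles, locations, prices
from public.house_data
where prices &amp;gt;= 50000000
order by prices desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;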

&lt;h2&gt;
  
  
  3. dbt Sources, Seeds and Analyses.
&lt;/h2&gt;

&lt;p&gt;Sources are the raw tables already sitting in our data warehouse; seeds are CSV files that we load into the warehouse (with &lt;code&gt;dbt seed&lt;/code&gt;) for analysis; and analyses are one-off SQL queries that are compiled but never materialized into our data warehouse as tables or views. If we want to materialize them, we should move them to the &lt;code&gt;models/&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;To reference a source in dbt, we use the &lt;code&gt;source('source_name', 'table_name')&lt;/code&gt; function. Make sure you have declared the source in a schema file under &lt;code&gt;models/&lt;/code&gt;, e.g. &lt;code&gt;models/schema.yml&lt;/code&gt;, as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database_loading&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;house_data_loading&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reference it as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'database_loading'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'house_data_loading'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
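

&lt;p&gt;For completeness, here is what an analysis might look like. The file name and query are hypothetical; save it under the &lt;code&gt;analyses/&lt;/code&gt; folder and &lt;code&gt;dbt compile&lt;/code&gt; will render it without materializing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- analyses/price_summary.sql (hypothetical one-off query)
SELECT locations, COUNT(*) AS n_houses, AVG(prices) AS avg_price
FROM {{ ref('high_cost_houses') }}
GROUP BY locations
ORDER BY avg_price DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;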



&lt;h2&gt;
  
  
  4. Testing in &lt;code&gt;dbt&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In dbt, there are two types of tests:&lt;/p&gt;

&lt;p&gt;a.  &lt;strong&gt;Singular tests&lt;/strong&gt; - These are tests written against a single model that don't need to be repeated in other models. They validate the outcome of that model. For example, in our &lt;code&gt;high_cost_houses.sql&lt;/code&gt; model, no house should have a price less than 50 million, so let's write a test to validate this.&lt;/p&gt;

&lt;p&gt;Head over to the &lt;code&gt;tests/&lt;/code&gt; folder and create a new test file called &lt;code&gt;check_prices.sql&lt;/code&gt; and write the following test query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="n"&gt;high_cost_houses&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000000&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After writing our test, run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the test fails, it means some rows have a price below 50 million, and we should go and inspect them.&lt;/p&gt;
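
&lt;p&gt;By default a failing test errors out the run. If you would rather be warned than have the run fail, dbt lets you set a severity inside the test file with a &lt;code&gt;config()&lt;/code&gt; block; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- tests/check_prices.sql: report failing rows as a warning
{{ config(severity='warn') }}

SELECT *
FROM {{ ref('high_cost_houses') }}
WHERE prices &amp;lt; 50000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;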

&lt;p&gt;b. &lt;strong&gt;Generic tests&lt;/strong&gt; - These are tests that can be applied to many tables, following the Don't Repeat Yourself (DRY) principle. &lt;br&gt;
dbt ships with built-in generic tests, and you can also write your own. Examples of the built-in ones include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;not_null&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;unique&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;accepted_values&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;relationships&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, we would like to check whether any locations are empty in the high cost and low cost tables in our data warehouse. Let's write a generic test that checks for empty values using SQL and Jinja; save it under &lt;code&gt;tests/generic/&lt;/code&gt; (or &lt;code&gt;macros/&lt;/code&gt;) so dbt can discover it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;empty_locations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endtest&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To attach the test to these tables, create a &lt;code&gt;test.yml&lt;/code&gt; schema file under &lt;code&gt;models/&lt;/code&gt; and register the test as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;empty_locations&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;empty_locations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw4i8yxw6sg0hqw6pht8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw4i8yxw6sg0hqw6pht8.png" alt="dbt Test Result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will run our tests in our models and validate our output.&lt;/p&gt;

&lt;p&gt;We can also use dbt's built-in generic tests in our &lt;code&gt;test.yml&lt;/code&gt; file without writing any custom SQL. To do so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high_cost_houses&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;locations&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will work the same way!&lt;/p&gt;

&lt;h2&gt;
  
  
  5. &lt;code&gt;dbt&lt;/code&gt; Macros
&lt;/h2&gt;

&lt;p&gt;Macros are dbt's equivalent of functions. They make code reusable, modular, scalable and easier to maintain.&lt;/p&gt;

&lt;p&gt;They are written with Jinja templates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;macro_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;--code&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a macro that calculates the average house price per location in Kenya.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;macros/&lt;/code&gt; folder, create a file &lt;code&gt;avg_house_price.sql&lt;/code&gt; and write the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Macro to get avg prices of houses per location&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;get_avg_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="p"&gt;}},&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="p"&gt;}})&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use our macro in our model, create a new model &lt;code&gt;avg_houses_per_location.sql&lt;/code&gt; in our &lt;code&gt;models&lt;/code&gt; folder and write the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;get_avg_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'locations'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'prices'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'house_data'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See how easy it is to create models with macros! It makes work easier, as I can reuse that macro in other models, as shown below.&lt;/p&gt;
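
&lt;p&gt;For instance, reusing the same macro against a hypothetical &lt;code&gt;low_cost_house_data&lt;/code&gt; table takes a single line in a new model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- models/avg_low_cost_per_location.sql (table name is hypothetical)
{{ get_avg_prices('locations', 'prices', 'low_cost_house_data') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;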

&lt;h2&gt;
  
  
  6. &lt;code&gt;dbt&lt;/code&gt; packages.
&lt;/h2&gt;

&lt;p&gt;Imagine what languages would be like without packages, e.g. Python without &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;pandas&lt;/code&gt;. Calling APIs or cleaning data would be tedious for any engineer or developer.&lt;/p&gt;

&lt;p&gt;dbt has packages that make running models, testing and building data pipelines easier. Two of the most widely used packages are &lt;code&gt;dbt_expectations&lt;/code&gt; and &lt;code&gt;dbt_utils&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Go through the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; for more dbt packages.&lt;/p&gt;

&lt;p&gt;Let's import &lt;code&gt;dbt_utils&lt;/code&gt; from dbt Hub into our project.&lt;br&gt;
In the &lt;code&gt;packages.yml&lt;/code&gt; file at the project root, configure the package as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.3.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following command in your terminal to install the package so we can use it in our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt deps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxcuf3loml04dxkpnkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxcuf3loml04dxkpnkc.png" alt="Install dbt packages"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can use &lt;code&gt;dbt_utils&lt;/code&gt; in our code! For example, the &lt;code&gt;star&lt;/code&gt; macro below expands into a comma-separated list of every column in the referenced model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;dbt_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;star&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'users'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
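

&lt;p&gt;&lt;code&gt;star&lt;/code&gt; also takes an &lt;code&gt;except&lt;/code&gt; argument to leave columns out of the expansion. A sketch, assuming a hypothetical &lt;code&gt;users&lt;/code&gt; model with a &lt;code&gt;password&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- Select every column from users except password
SELECT {{ dbt_utils.star(from=ref('users'), except=['password']) }}
FROM {{ ref('users') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;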



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;dbt is quite diverse, and we've covered only the fundamentals in this blog. For more information on dbt, check the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=""&gt;Youtube, Introduction to dbt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=""&gt;dbt's Youtubve channel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for reading this post; please like, comment and share. It will be much appreciated!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>programming</category>
      <category>dbt</category>
    </item>
  </channel>
</rss>
