<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rotich Kelly</title>
    <description>The latest articles on Forem by Rotich Kelly (@kelly_10).</description>
    <link>https://forem.com/kelly_10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3001130%2F04c11377-5215-47a0-9f93-5d01db4e0002.jpg</url>
      <title>Forem: Rotich Kelly</title>
      <link>https://forem.com/kelly_10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kelly_10"/>
    <language>en</language>
    <item>
      <title>From Binance to Grafana: Building a Real-Time Crypto Dashboard with CDC &amp; Cassandra</title>
      <dc:creator>Rotich Kelly</dc:creator>
      <pubDate>Thu, 03 Jul 2025 13:10:11 +0000</pubDate>
      <link>https://forem.com/kelly_10/from-binance-to-grafana-building-a-real-time-crypto-dashboard-with-cdc-cassandra-4c26</link>
      <guid>https://forem.com/kelly_10/from-binance-to-grafana-building-a-real-time-crypto-dashboard-with-cdc-cassandra-4c26</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In this project, I set out to solve a common challenge in crypto analytics: how do you reliably stream, store, and visualize fast-moving market data in real time?&lt;/p&gt;

&lt;p&gt;To tackle this, I built a production-grade real-time data pipeline that extracts data from the Binance API, loads it into a relational database like PostgreSQL, replicates updates into Cassandra using Change Data Capture (CDC) via Debezium, and finally visualizes key metrics using Grafana.&lt;/p&gt;

&lt;p&gt;The goal? Create a seamless, scalable pipeline capable of tracking market trends, spotting top-performing tokens, and powering real-time dashboards with zero manual refreshes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Project Matters&lt;/strong&gt;&lt;br&gt;
Cryptocurrency markets operate 24/7 with volatile price action and massive data velocity. Traditional batch pipelines or manual scripts simply can’t keep up. I wanted to build something that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously ingest price, volume, and trade data from Binance&lt;/li&gt;
&lt;li&gt;Store data in a queryable format for structured analytics&lt;/li&gt;
&lt;li&gt;Replicate changes across systems without breaking consistency&lt;/li&gt;
&lt;li&gt;Enable live, real-time dashboards to support fast decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This documentation walks through each phase, from API extraction to CDC to visualization, and shares lessons learned along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Extraction&lt;/strong&gt;&lt;br&gt;
Here are the five endpoints I used for this project, each chosen to cover a critical slice of the crypto market landscape:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latest Prices – /api/v3/ticker/price&lt;/strong&gt;&lt;br&gt;
This endpoint returns the latest market price for every trading pair. It’s lightweight and perfect for fast polling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;24h Ticker Stats – /api/v3/ticker/24hr&lt;/strong&gt;&lt;br&gt;
Provides daily stats like price change %, volume, and high/low prices, essential for tracking top gainers and market trends.&lt;br&gt;
Key fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;priceChangePercent&lt;/li&gt;
&lt;li&gt;highPrice&lt;/li&gt;
&lt;li&gt;volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Order Book – /api/v3/depth&lt;/strong&gt;&lt;br&gt;
Returns bid/ask levels for a given trading pair, offering insights into market depth and liquidity. Useful for advanced analytics like order flow tracking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent Trades – /api/v3/trades&lt;/strong&gt;&lt;br&gt;
Gives a rolling list of individual trade executions, including price, quantity, and timestamps — key for volatility and momentum indicators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klines (Candlesticks) – /api/v3/klines&lt;/strong&gt;&lt;br&gt;
Returns historical OHLCV candlestick data for any symbol over a chosen interval (e.g., 1m, 5m, 1h). This powers most visual time-series analysis in Grafana.&lt;/p&gt;
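&lt;p&gt;Each kline arrives as a plain array, so a small parser keeps the column order straight. A minimal Python sketch (the array order follows Binance's API docs; the dict keys used for storage are my own naming, and the real extraction scripts live in the repo):&lt;/p&gt;

```python
# Sketch of one klines pull. Each kline is an array ordered
# [open_time, open, high, low, close, volume, ...] per Binance's docs.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_kline(k):
    """Shape one raw kline array into a labeled OHLCV dict."""
    return {
        "open_time": k[0],
        "open": float(k[1]),
        "high": float(k[2]),
        "low": float(k[3]),
        "close": float(k[4]),
        "volume": float(k[5]),
    }

def fetch_klines(symbol, interval="1m", limit=100):
    """Pull candlesticks for a symbol over a chosen interval."""
    qs = urlencode({"symbol": symbol, "interval": interval, "limit": limit})
    with urlopen(f"https://api.binance.com/api/v3/klines?{qs}", timeout=10) as r:
        return [parse_kline(k) for k in json.load(r)]
```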

&lt;p&gt;&lt;strong&gt;Binance Extraction Script Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send API request to a specific Binance endpoint (e.g., /ticker/price).&lt;/li&gt;
&lt;li&gt;Parse JSON response and extract key fields like symbol, price, volume, timestamp.&lt;/li&gt;
&lt;li&gt;Format and clean data (add timestamps, remove duplicates).&lt;/li&gt;
&lt;li&gt;Insert into PostgreSQL staging tables with proper indexing.&lt;/li&gt;
&lt;li&gt;Log activity and handle API failures with retries.&lt;/li&gt;
&lt;/ol&gt;
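&lt;p&gt;The workflow above can be sketched in Python. This is illustrative only: the row shape and the requests-style session are assumptions, and the actual scripts (with the PostgreSQL insert and retry logic) live in the repo:&lt;/p&gt;

```python
# Illustrative sketch of steps 1-3: request, parse key fields, clean.
from datetime import datetime, timezone

BASE_URL = "https://api.binance.com"

def to_price_row(ticker):
    """Shape one /api/v3/ticker/price object into a (symbol, price, ts) row."""
    return (
        ticker["symbol"],
        float(ticker["price"]),
        datetime.now(timezone.utc).isoformat(),  # ingestion timestamp
    )

def fetch_latest_prices(session):
    """Poll the endpoint with a requests-style session; return deduped rows."""
    resp = session.get(f"{BASE_URL}/api/v3/ticker/price", timeout=10)
    resp.raise_for_status()
    seen, rows = set(), []
    for t in resp.json():
        if t["symbol"] not in seen:  # drop duplicate symbols
            seen.add(t["symbol"])
            rows.append(to_price_row(t))
    return rows
```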

&lt;p&gt;&lt;strong&gt;Clone the Project Repository&lt;/strong&gt;&lt;br&gt;
We start by cloning the CDC project files including table schemas, Debezium config, and consumer scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/yourusername/crypto-cdc-pipeline.git
cd crypto-cdc-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install Java, Kafka, and Zookeeper&lt;/strong&gt;&lt;br&gt;
Kafka and Debezium require Java, so we install OpenJDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y default-jdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, download and extract Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://downloads.apache.org/kafka/3.6.0/kafka_2.13-3.6.0.tgz
tar -xzf kafka_2.13-3.6.0.tgz
mv kafka_2.13-3.6.0 kafka &amp;amp;&amp;amp; cd kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start Zookeeper and Kafka in two separate terminal windows or tmux panes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1
bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminal 2
bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install &amp;amp; Configure Debezium Kafka Connect&lt;/strong&gt;&lt;br&gt;
Debezium captures Postgres changes via Kafka Connect. We install it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~/kafka
mkdir plugins &amp;amp;&amp;amp; cd plugins
wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/2.5.1.Final/debezium-connector-postgres-2.5.1.Final-plugin.tar.gz
tar -xvf debezium-connector-postgres-2.5.1.Final-plugin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update Kafka Connect config to load this plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano ../config/connect-distributed.properties

# Add or edit:
plugin.path=/home/youruser/kafka/plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start Kafka Connect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~/kafka
bin/connect-distributed.sh config/connect-distributed.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create Required Internal Kafka Topics&lt;/strong&gt;&lt;br&gt;
These are used by Kafka Connect to manage connector state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-configs --partitions 1 --replication-factor 1 --config cleanup.policy=compact

bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-offsets --partitions 1 --replication-factor 1 --config cleanup.policy=compact

bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-status --partitions 1 --replication-factor 1 --config cleanup.policy=compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prepare PostgreSQL for CDC&lt;/strong&gt;&lt;br&gt;
In DBeaver, log in to your database and create a publication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PUBLICATION my_publication FOR TABLE binance.binance_latest_prices, binance.binance_24h_stats, binance.binance_klines, binance.binance_order_book, binance.binance_recent_trades;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure your user has replication rights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Register the Debezium Connector&lt;/strong&gt;&lt;br&gt;
Create the config file postgres-connector.json:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "&amp;lt;aiven-host&amp;gt;",
    "database.port": "25060",
    "database.user": "&amp;lt;username&amp;gt;",
    "database.password": "&amp;lt;password&amp;gt;",
    "database.dbname": "defaultdb",
    "topic.prefix": "crypto_pg",
    "plugin.name": "pgoutput",
    "publication.name": "my_publication",
    "slot.name": "cdc_slot",
    "table.include.list": "binance.binance_latest_prices,binance.binance_24h_stats,binance.binance_klines,binance.binance_order_book,binance.binance_recent_trades",
    "database.sslmode": "require"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  --data @postgres-connector.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8083/connectors/postgres-connector/status | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install Cassandra&lt;/strong&gt;&lt;br&gt;
Note: Cassandra’s cqlsh tooling doesn’t support Python 3.12+, so use Python 3.10 or 3.11.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check your messages in the topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic &amp;lt;your-topic-name&amp;gt; \
  --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt update
sudo apt install cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Cassandra service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl start cassandra
sudo systemctl status cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create Cassandra Keyspace &amp;amp; Tables&lt;/strong&gt;&lt;br&gt;
Enter the CQL shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cqlsh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a keyspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE crypto_data WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE crypto_data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create each table one by one (the Cassandra schema is in the repo):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Python Consumers for CDC&lt;/strong&gt;&lt;br&gt;
Each script listens to a Kafka topic and writes to a Cassandra table.&lt;br&gt;
Run each script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python consumer_latest_prices.py
python consumer_24h_stats.py
python consumer_klines.py
python consumer_order_book.py
python consumer_recent_trades.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
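&lt;p&gt;Each consumer follows the same shape: deserialize the Debezium envelope, take the &lt;code&gt;after&lt;/code&gt; row image, and write it to Cassandra. A minimal sketch (the topic, keyspace, and table names here are illustrative; the real scripts are in the repo):&lt;/p&gt;

```python
# Minimal shape of one CDC consumer: read Debezium change events from a
# Kafka topic and upsert the new row state into Cassandra.
import json

def extract_after_image(event_bytes):
    """Debezium wraps each change in an envelope; we want the new row state."""
    envelope = json.loads(event_bytes)
    payload = envelope.get("payload", envelope)
    return payload.get("after")  # None for delete events

def run_consumer():  # needs Kafka and Cassandra running
    from kafka import KafkaConsumer          # kafka-python
    from cassandra.cluster import Cluster    # cassandra-driver

    session = Cluster(["127.0.0.1"]).connect("crypto_data")
    consumer = KafkaConsumer(
        "crypto_pg.binance.binance_latest_prices",  # illustrative topic name
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        row = extract_after_image(msg.value)
        if row:
            session.execute(
                "INSERT INTO latest_prices (symbol, price) VALUES (%s, %s)",
                (row["symbol"], float(row["price"])),
            )
```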



&lt;p&gt;&lt;strong&gt;Connect Cassandra to Grafana&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grafana-cli plugins install hadesarchitect-cassandra-datasource
systemctl restart grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add datasource:&lt;/strong&gt; In Grafana UI, go to Configuration → Data Sources and add “Apache Cassandra” (community).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure:&lt;/strong&gt; Set Contact point to hostname:9042 (e.g. localhost:9042 if on the same VM), then fill in Keyspace, User, and Password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test &amp;amp; save:&lt;/strong&gt; Click Save &amp;amp; Test. Once “Database Connection OK” appears, you can create dashboards querying Cassandra.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e5zfpo1ou0odjqdwzur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e5zfpo1ou0odjqdwzur.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F636cea3231fmcpi5cnx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F636cea3231fmcpi5cnx0.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the repo:&lt;br&gt;
&lt;a href="https://github.com/KellyKiprop/Binance-CDC-Project" rel="noopener noreferrer"&gt;https://github.com/KellyKiprop/Binance-CDC-Project&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>“Who Will Win Formula 1 in 2025? I Asked the Data.”</title>
      <dc:creator>Rotich Kelly</dc:creator>
      <pubDate>Mon, 16 Jun 2025 14:02:44 +0000</pubDate>
      <link>https://forem.com/kelly_10/who-wins-in-formula-1-2025-i-asked-the-data-3jnl</link>
      <guid>https://forem.com/kelly_10/who-wins-in-formula-1-2025-i-asked-the-data-3jnl</guid>
      <description>&lt;p&gt;&lt;em&gt;(Spoiler: It’s Not Just About Speed, It’s About Python and Prediction Power.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What do you get when you cross a Formula 1 obsession with a machine learning brain? A Saturday spent not watching races, but building models to predict the next F1 world champion. And yes, I let the data do the talking, because who needs crystal balls when you’ve got RandomForestRegressor and Linear Regression models?&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;From Track to Tech: The Data Science Pitstop&lt;/strong&gt;&lt;br&gt;
Every great F1 race begins in the garage, and so did mine, with historical data on drivers, races, and standings. I engineered features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age &amp;amp; Experience (because wisdom can beat raw speed)&lt;/li&gt;
&lt;li&gt;Wins per Season (obviously)&lt;/li&gt;
&lt;li&gt;Team Affiliation, color-coded for storytelling flair&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It wasn’t just about crunching numbers. I wanted to tell a story, one where the data reveals the drama.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Models: Linear vs Random Forest&lt;/strong&gt;&lt;br&gt;
I started simple: a Linear Regression model. It was decent, but something was missing; it just couldn’t capture the twists and turns of an F1 season.&lt;br&gt;
Then I unleashed the big gun: the Random Forest Regressor.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;RMSE&lt;/th&gt;&lt;th&gt;R² Score&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Linear Regression&lt;/td&gt;&lt;td&gt;~1499&lt;/td&gt;&lt;td&gt;0.28&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Random Forest&lt;/td&gt;&lt;td&gt;~1284&lt;/td&gt;&lt;td&gt;0.47&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The improvement? Like going from a 6-second pit stop to a clean 2.1.&lt;/p&gt;
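&lt;p&gt;For anyone who wants to reproduce the comparison, here is its general shape on synthetic data (the features are stand-ins for the real engineered ones; scikit-learn is assumed, and the actual training code lives in the repo):&lt;/p&gt;

```python
# Sketch of the model comparison: fit Linear Regression and a Random
# Forest on the same features and compare RMSE / R^2. The target mixes a
# linear and a squared term, which a linear model cannot fully capture.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))  # stand-ins for age, experience, wins_per_season
y = 50 * X[:, 2] + 10 * X[:, 1] ** 2 + rng.normal(scale=5, size=200)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X, y)
    pred = model.predict(X)
    rmse = mean_squared_error(y, pred) ** 0.5
    print(type(model).__name__, round(rmse, 1), round(r2_score(y, pred), 2))
```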




&lt;p&gt;&lt;strong&gt;Predicting the 2025 Grid: Who Tops the Podium?&lt;/strong&gt;&lt;br&gt;
After training and testing, here’s what my model predicted for 2025:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Max Verstappen – The king stays king.&lt;/li&gt;
&lt;li&gt;Lando Norris – The new kid has talent.&lt;/li&gt;
&lt;li&gt;Charles Leclerc – The Ferrari fire is still burning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And here's the fun part: visualized with team colors, so you don’t just see the points, you feel the rivalry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c47czi5k964ic3cwpnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c47czi5k964ic3cwpnl.png" alt="Formula1 team colors" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data is the New Commentator&lt;/strong&gt;&lt;br&gt;
I didn’t just want to model the data, I wanted to speak its language. With visualizations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Driver performance over years&lt;/li&gt;
&lt;li&gt;Age vs points (yes, experience matters)&lt;/li&gt;
&lt;li&gt;Team-colored bar charts of predicted standings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…it became less about predictions and more about painting the future of Formula 1.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;br&gt;
This is just Lap 1. I’m exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time dashboards&lt;/li&gt;
&lt;li&gt;Weather &amp;amp; track condition variables&lt;/li&gt;
&lt;li&gt;Constructor vs Driver impact modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still Building!!&lt;br&gt;
Check it out -&amp;gt; &lt;a href="https://github.com/KellyKiprop/Formula1_2025_Prediction" rel="noopener noreferrer"&gt;https://github.com/KellyKiprop/Formula1_2025_Prediction&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Podcasts to Pipelines: Building a YouTube Analytics Engine Inspired by Mic Cheque</title>
      <dc:creator>Rotich Kelly</dc:creator>
      <pubDate>Sat, 07 Jun 2025 20:01:55 +0000</pubDate>
      <link>https://forem.com/kelly_10/from-podcasts-to-pipelines-building-a-youtube-analytics-engine-inspired-by-mic-cheque-2a79</link>
      <guid>https://forem.com/kelly_10/from-podcasts-to-pipelines-building-a-youtube-analytics-engine-inspired-by-mic-cheque-2a79</guid>
      <description>&lt;p&gt;&lt;em&gt;"Now when you watch it you'll understand!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ever been so hooked on a podcast that you ended up building a full-blown data pipeline because of it? No? Just me? Cool. Let me tell you the story anyway.&lt;/p&gt;

&lt;p&gt;It all started with the Mic Cheque Podcast—a brilliant blend of humor, deep takes, and real talk that kept popping up on my YouTube feed. As a data engineering enthusiast and a fan of the pod, I had one question buzzing in my head:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What makes some podcast episodes go viral while others stay in the shadows?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Idea&lt;/strong&gt;&lt;br&gt;
What if I could track, analyze, and visualize the performance of the podcast episodes using actual YouTube data?&lt;br&gt;
Boom—project idea locked. I decided to build a fully automated YouTube Data Pipeline with the goal of creating a live analytics dashboard to answer burning questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which episodes are going viral?&lt;/li&gt;
&lt;li&gt;What days do high-performing episodes drop?&lt;/li&gt;
&lt;li&gt;Is there a pattern between guest appearances and views?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Stack&lt;/strong&gt;&lt;br&gt;
To make it real, I pulled out the big guns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python + Airflow: For automating the entire pipeline from extract to load&lt;/li&gt;
&lt;li&gt;YouTube API: For fetching episode metadata and stats&lt;/li&gt;
&lt;li&gt;PostgreSQL (Aiven): As the data warehouse&lt;/li&gt;
&lt;li&gt;Apache Spark: For heavy lifting (a.k.a. transforming the raw data)&lt;/li&gt;
&lt;li&gt;Grafana: For visualizing performance trends that even a podcast guest would appreciate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Flow&lt;/strong&gt;&lt;br&gt;
Here’s what the pipeline does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract data from YouTube using the YouTube Data API&lt;/li&gt;
&lt;li&gt;Transform it using Spark (cleaning, enriching with time-based insights, classifying performance)&lt;/li&gt;
&lt;li&gt;Load it into a PostgreSQL instance hosted on Aiven&lt;/li&gt;
&lt;li&gt;Visualize the trends using Grafana—complete with charts showing view counts, likes, comments, publishing patterns, and a "performance class" metric&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the best part? All this runs automatically thanks to Airflow.&lt;/p&gt;
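&lt;p&gt;The "performance class" metric reduces to a simple bucketing rule in the transform step. A sketch as a plain function, with made-up thresholds (the real cutoffs live in the pipeline's Spark code):&lt;/p&gt;

```python
# Illustrative version of the performance-class idea: bucket an episode
# by view count. Thresholds here are invented for the example.
def performance_class(views):
    if views >= 100_000:
        return "viral"
    if views >= 20_000:
        return "strong"
    if views >= 5_000:
        return "steady"
    return "quiet"
```

In the pipeline this runs inside the Spark transform, so every loaded row carries its class and Grafana can pie-chart the distribution directly.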

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;&lt;br&gt;
After plugging it into Grafana, the dashboard popped! Pie charts for performance class, time series of views by month, and even weekday publishing trends. It's like giving a brain to your favorite podcast channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Learned&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow can be your best friend or your worst enemy (don’t fight it—configure it properly!)&lt;/li&gt;
&lt;li&gt;PostgreSQL on Aiven is smooth, but Grafana’s port configs can mess you up if you're not careful&lt;/li&gt;
&lt;li&gt;Data pipelines are lit when they bring your passions and skills together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Wrap-Up&lt;/strong&gt;&lt;br&gt;
This started as a fun side project but turned into a powerful learning experience. Whether you're a Mic Cheque stan, a data nerd, or both—I'd highly recommend building a pipeline around something you genuinely love.&lt;/p&gt;

&lt;p&gt;Your data has a story. You just need to build the mic for it.&lt;/p&gt;

&lt;p&gt;Want to see the code? Check out the GitHub Repo &lt;br&gt;
&lt;a href="https://github.com/KellyKiprop/Youtube-Data-Pipeline" rel="noopener noreferrer"&gt;https://github.com/KellyKiprop/Youtube-Data-Pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewyz284mbo6kz66rrnno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewyz284mbo6kz66rrnno.png" alt="Image description" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojqinwzlj021xtun03kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojqinwzlj021xtun03kh.png" alt="Image description" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Automated Crypto Price Tracking with Apache Airflow &amp; CoinGecko</title>
      <dc:creator>Rotich Kelly</dc:creator>
      <pubDate>Sat, 07 Jun 2025 13:20:01 +0000</pubDate>
      <link>https://forem.com/kelly_10/how-i-automated-crypto-price-tracking-with-apache-airflow-coingecko-28ab</link>
      <guid>https://forem.com/kelly_10/how-i-automated-crypto-price-tracking-with-apache-airflow-coingecko-28ab</guid>
      <description>&lt;p&gt;&lt;em&gt;"Because waking up at 2am to check Bitcoin prices isn’t scalable."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Idea Behind the Project&lt;/strong&gt;&lt;br&gt;
Crypto markets never sleep, and neither should your data pipeline. I built a fully automated ETL pipeline using Apache Airflow that extracts hourly snapshots of crypto data from CoinGecko, stores them in a PostgreSQL database (on Aiven), and sets the stage for real-time analysis, dashboards, and trading models.&lt;/p&gt;

&lt;p&gt;Whether you're tracking Bitcoin moonshots or studying volume dips in altcoins, this pipeline's got your back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What It Actually Does&lt;/strong&gt;&lt;br&gt;
Every hour, my pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulls real-time data from the CoinGecko API for 15 top cryptocurrencies (BTC, ETH, SOL, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Captures price, market cap, and trading volume&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stores the data in a time-series friendly PostgreSQL table: crypto.crypto_prices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retries automatically on failure and logs each step in Airflow’s UI &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual refresh. No scripts to rerun. Just clean, structured data—hour after hour.&lt;/p&gt;
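&lt;p&gt;The hourly extract boils down to one API call plus row shaping. A sketch (the coin list, column order, and session object are illustrative; CoinGecko's /coins/markets endpoint and its field names are real):&lt;/p&gt;

```python
# Sketch of the hourly extract step: pull prices for a few coins from
# CoinGecko's /coins/markets endpoint and shape rows for the
# crypto.crypto_prices table. Column order here is illustrative.
from datetime import datetime, timezone

COINS = "bitcoin,ethereum,solana"  # the real DAG tracks 15 ids

def to_rows(payload):
    """Shape the API response into (name, symbol, price, mcap, volume, ts)."""
    ts = datetime.now(timezone.utc)
    return [
        (c["name"], c["symbol"].upper(), c["current_price"],
         c["market_cap"], c["total_volume"], ts)
        for c in payload
    ]

def extract(session):  # wired into the DAG as a task; needs network
    resp = session.get(
        "https://api.coingecko.com/api/v3/coins/markets",
        params={"vs_currency": "usd", "ids": COINS},
        timeout=10,
    )
    resp.raise_for_status()
    return to_rows(resp.json())
```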

&lt;p&gt;&lt;strong&gt;Stack Breakdown&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Tool&lt;/th&gt;&lt;th&gt;Role&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Apache Airflow&lt;/td&gt;&lt;td&gt;DAG orchestration + scheduling&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Python 3.12&lt;/td&gt;&lt;td&gt;Core logic and scripting&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CoinGecko API&lt;/td&gt;&lt;td&gt;Crypto market data&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;PostgreSQL (Aiven)&lt;/td&gt;&lt;td&gt;Cloud-hosted database storage&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;psycopg2&lt;/td&gt;&lt;td&gt;PostgreSQL database connector&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;python-dotenv&lt;/td&gt;&lt;td&gt;Secure secret management via .env&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;How I Built It&lt;/strong&gt;&lt;br&gt;
This project lives in my GitHub repo, and it’s fully reproducible:&lt;/p&gt;

&lt;p&gt;Clone the repo&lt;br&gt;
git clone &lt;a href="https://github.com/KellyKiprop/Crypto-price-pipeline.git" rel="noopener noreferrer"&gt;https://github.com/KellyKiprop/Crypto-price-pipeline.git&lt;/a&gt;&lt;br&gt;
cd Crypto-price-pipeline&lt;br&gt;
Create a virtual environment &amp;amp; install dependencies&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&lt;br&gt;
python -m venv venv&lt;br&gt;
source venv/bin/activate  # Or venv\Scripts\activate on Windows&lt;br&gt;
pip install -r requirements.txt&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Set up environment variables&lt;br&gt;
Create a .env file in the root folder with your PostgreSQL config:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&lt;br&gt;
DB_NAME=defaultdb&lt;br&gt;
DB_HOST=your-db-host.aivencloud.com&lt;br&gt;
DB_USER=avnadmin&lt;br&gt;
DB_PASSWORD=yourpassword&lt;br&gt;
DB_PORT=17440&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Initialize and start Airflow&lt;br&gt;
&lt;code&gt;&lt;br&gt;
airflow db init&lt;br&gt;
airflow webserver --port 8080&lt;br&gt;
airflow scheduler&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Trigger the DAG from the UI&lt;br&gt;
Visit localhost:8080, toggle coin_price_etl_dag, and trigger a run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing the Results&lt;/strong&gt;&lt;br&gt;
Check the crypto.crypto_prices table in your database — you should see hourly updates like:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;name&lt;/th&gt;&lt;th&gt;symbol&lt;/th&gt;&lt;th&gt;price&lt;/th&gt;&lt;th&gt;market_cap&lt;/th&gt;&lt;th&gt;total_volume&lt;/th&gt;&lt;th&gt;timestamp&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Bitcoin&lt;/td&gt;&lt;td&gt;BTC&lt;/td&gt;&lt;td&gt;71,231&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;...&lt;/td&gt;&lt;td&gt;2025-06-07T12:00:00Z&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You can now plug this into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards&lt;/li&gt;
&lt;li&gt;ML models for price prediction&lt;/li&gt;
&lt;li&gt;Long-term trend analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s a hands-on intro to Airflow and ETL pipelines&lt;/li&gt;
&lt;li&gt;Crypto data is noisy and dynamic — perfect for learning real-time data workflows&lt;/li&gt;
&lt;li&gt;You’ll walk away with a reusable pattern for any public API + DB project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;About the Author&lt;br&gt;
Kelly Kiprop&lt;br&gt;
&lt;a href="mailto:kipropkelly4@gmail.com"&gt;kipropkelly4@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project uses Airflow to extract hourly crypto data from CoinGecko and store it in PostgreSQL. It’s cloud-ready, beginner-friendly, and built for real-time insight.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>cryptocurrency</category>
    </item>
    <item>
      <title>Starting my Data Analyst Journey at LuxDev bootcamp</title>
      <dc:creator>Rotich Kelly</dc:creator>
      <pubDate>Sun, 13 Apr 2025 13:46:38 +0000</pubDate>
      <link>https://forem.com/kelly_10/starting-my-data-analyst-journey-1100</link>
      <guid>https://forem.com/kelly_10/starting-my-data-analyst-journey-1100</guid>
      <description>&lt;p&gt;I’m super excited to share that I’ve officially kicked off my journey into the world of Data Analytics .Only two weeks ago, I enrolled in the Lux Dev Bootcamp, and it has been a enriching experience so far its been an enriching ride. I'm currently deep into by week 2, with a focus on solidifying the foundations of Excel and SQL the pillars of any strong data analysis workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why data analysis?&lt;/strong&gt;&lt;br&gt;
Data surrounds us, and I’ve always wondered how raw data can tell such strong stories. Whether it’s uncovering insights, making more informed decisions, or solving real problems, data is the key to them all. That’s what drew me to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'm Learning Right Now&lt;/strong&gt;&lt;br&gt;
Excel: From pivot tables to formulas, I'm learning to manipulate and analyze data like a pro.&lt;br&gt;
SQL: I’m building up my querying skills—getting comfortable with SELECT statements, filtering data, and joining tables to extract meaningful insights.&lt;br&gt;
Power BI: I’ve started exploring how to bring data to life through interactive dashboards and visual storytelling—it's a game changer!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Putting Skills Into Practice&lt;/strong&gt;&lt;br&gt;
Right now, I’m working on a sales data project, a fast food shop selling pizza to be specific,  which is helping me apply what I’ve learned in a real-world context. I’m analyzing trends, identifying key metrics, and working my way toward building my very first dashboard!&lt;/p&gt;

&lt;p&gt;It’s exhilarating, and there’s nothing quite like seeing raw data come to life visually. I can’t wait to post the finished product soon!&lt;/p&gt;

&lt;p&gt;What I’m finding enjoyable about the bootcamp is the hands-on learning, not just theory: we’re doing practical exercises and real case studies. We’re all in the same boat, and there’s so much encouragement and collaboration. Each week builds on the last, and I can already feel my confidence growing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Goals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mastering Excel, SQL, and data visualization tools&lt;/li&gt;
&lt;li&gt;Completing more real-world projects and case studies&lt;/li&gt;
&lt;li&gt;Building a strong data portfolio&lt;/li&gt;
&lt;li&gt;Landing my first job as a junior data analyst&lt;/li&gt;
&lt;li&gt;Staying curious, continuing to learn, and enjoying the ride with this amazing tech community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make data fun!&lt;/p&gt;

&lt;p&gt;#DataAnalytics #SQL #Excel #PowerBI #DataDashboard #SalesProject #LuxDev #BootcampJourney #CareerSwitch #DevCommunity&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
