Hey Devs 👋,
If you’re starting out in data engineering or curious how real-world data pipelines work, this post is for you.
As an Associate Data Engineer Intern, I wanted to go beyond watching tutorials and actually build a working pipeline — one that pulls real-world data daily, processes it, stores it, and is fully containerized.
So I picked something simple but meaningful: global COVID-19 stats.
Here’s a breakdown of what I built, how it works, and what I learned.
📊 What This Pipeline Does
This mini-project automates the following:
✅ Pulls daily global COVID-19 stats from a public API
✅ Uses Airflow to schedule and monitor the task
✅ Stores the results in a PostgreSQL database
✅ Runs everything inside Docker containers
It's a beginner-friendly, end-to-end project to get your hands dirty with core data engineering tools.
🧰 The Tech Stack
- Python — for the main fetch/store logic
- Airflow — to orchestrate and schedule tasks
- PostgreSQL — for storing daily data
- Docker — to containerize and simplify setup
- disease.sh API — open-source COVID-19 stats API
⚙️ How It Works (Behind the Scenes)
- The Airflow DAG triggers once per day
- A Python script sends a request to the COVID-19 API
- The script parses the JSON response
- The cleaned data is inserted into a PostgreSQL table
- Airflow logs everything (success/failure) in its UI
Everything runs locally via docker-compose — one command and you're up and running.
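To make that concrete, here's a minimal sketch of what a daily DAG like this could look like. The DAG ID, task ID, and function name are my own illustrative choices, not the exact code in the repo:

```python
# dags/covid_dag.py -- a minimal sketch, not the exact repo code
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_and_store():
    """Fetch the latest stats and insert them into Postgres.
    In the real project this logic lives in scripts/ -- see the sketches below."""
    ...


with DAG(
    dag_id="covid_daily_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",  # trigger once per day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_store_covid_stats",
        python_callable=fetch_and_store,
    )
```

Keeping the DAG file thin and the fetch/insert logic in `scripts/` makes the DAG easier to read and test on its own.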
🗂️ Project Structure
```
airflow-docker/
├── dags/                 # Airflow DAG (main logic)
├── scripts/              # Python file to fetch + insert data
├── docker-compose.yaml   # Setup for Airflow + Postgres
├── logs/                 # Logs generated by Airflow
└── plugins/              # (Optional) Airflow plugins
```
You can check the full repo here:
👉 GitHub: mohhddhassan/covid-data-pipeline
🧠 Key Learnings
✅ How to build and run a simple Airflow DAG
✅ Using Docker to spin up services like Postgres & Airflow
✅ How Python connects to a DB and inserts structured data
✅ Observing how tasks are logged, retried, and managed in Airflow
This small project gave me confidence in how the core parts of a pipeline talk to each other.
🔍 Sample Output from API
Here’s a snippet of the JSON response from the API:
```json
{
  "cases": 708128930,
  "deaths": 7138904,
  "recovered": 0,
  "updated": 1717689600000
}
```
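For context, a fetch step like this can be written with `requests` against disease.sh's global summary endpoint. This is just a sketch; the function name and the exact fields kept are assumptions on my part:

```python
# scripts/fetch_covid.py -- illustrative sketch, not the exact repo code
from datetime import date

import requests

API_URL = "https://disease.sh/v3/covid-19/all"  # global summary endpoint


def fetch_stats() -> dict:
    """Fetch the latest global stats and keep only the fields we store."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()  # fail loudly so Airflow marks the task as failed
    data = response.json()
    return {
        "date": date.today().isoformat(),
        "total_cases": data["cases"],
        "total_deaths": data["deaths"],
        "recovered": data["recovered"],
    }
```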
And here’s a sample SQL insert triggered via Python:
```sql
INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
VALUES ('2025-06-06', 708128930, 7138904, 0);
```
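On the Python side, that insert might look something like the following `psycopg2` sketch. The connection values are placeholders for whatever your docker-compose defines, and the `covid_stats` table is assumed to match the columns above:

```python
# Sketch of the insert step using psycopg2 (connection values are placeholders)
import psycopg2


def insert_stats(stats: dict) -> None:
    conn = psycopg2.connect(
        host="postgres",      # the Postgres service name from docker-compose
        dbname="airflow",
        user="airflow",
        password="airflow",
    )
    try:
        # `with conn` commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
                VALUES (%s, %s, %s, %s)
                """,
                (
                    stats["date"],
                    stats["total_cases"],
                    stats["total_deaths"],
                    stats["recovered"],
                ),
            )
    finally:
        conn.close()
```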
🔧 What’s Next?
I’m planning to:
🚧 Add deduplication logic so it doesn’t insert the same data twice (see the sketch after this list)
📊 Maybe create a Streamlit dashboard on top of the database
⚙️ Play with sensors, templates, and XComs in Airflow
⚡ Extend the pipeline with ClickHouse for OLAP-style analytics
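For the deduplication idea, one common approach is a unique constraint on `date` plus an upsert. Here's a rough sketch — it assumes `covid_stats.date` is declared UNIQUE, which the current schema may not have:

```python
# One way to deduplicate: a unique constraint on date plus an upsert.
# Assumes covid_stats(date) has a UNIQUE constraint -- a sketch, not repo code.
UPSERT_SQL = """
INSERT INTO covid_stats (date, total_cases, total_deaths, recovered)
VALUES (%s, %s, %s, %s)
ON CONFLICT (date) DO UPDATE
SET total_cases  = EXCLUDED.total_cases,
    total_deaths = EXCLUDED.total_deaths,
    recovered    = EXCLUDED.recovered;
"""
```

With this in place, re-running the DAG for the same day simply refreshes that day's row instead of adding a duplicate.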
📌 Why You Should Try Something Like This
If you're learning data engineering:
- Start small, but make it real
- Use public APIs to practice fetching and storing data
- Wrap it with orchestration + containerization — it’s closer to the real thing
This project taught me way more than passively following courses ever could.
🙋‍♂️ About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
🚀 Learning in public, one pipeline at a time.