<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ali Sher</title>
    <description>The latest articles on Forem by Ali Sher (@sher213).</description>
    <link>https://forem.com/sher213</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3323320%2Ffb2c0e28-a9a8-4024-bad7-6d1137e1e40b.jpeg</url>
      <title>Forem: Ali Sher</title>
      <link>https://forem.com/sher213</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sher213"/>
    <language>en</language>
    <item>
      <title>Grants to Investments Part 2-3: Models and Pipelines</title>
      <dc:creator>Ali Sher</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:59:32 +0000</pubDate>
      <link>https://forem.com/sher213/grants-to-investments-part-2-3-models-and-pipelines-35ij</link>
      <guid>https://forem.com/sher213/grants-to-investments-part-2-3-models-and-pipelines-35ij</guid>
      <description>&lt;h1&gt;
  
  
  🚀 Grants ETL Pipeline — Rust + Transformer-Based Classification
&lt;/h1&gt;

&lt;h2&gt;
  
  
  📌 Overview
&lt;/h2&gt;

&lt;p&gt;I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;High-performance data extraction&lt;/strong&gt; using Rust&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Semantic classification&lt;/strong&gt; using BERT (zero-shot)&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Structured output&lt;/strong&gt; ready for downstream analytics and dashboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Extraction Layer (Rust)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The Grants Canada portal has &lt;strong&gt;no accessible API&lt;/strong&gt; — only an HTML-rendered search interface. I needed a way to extract structured data at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;I built a custom scraper targeting the paginated search endpoint:&lt;br&gt;
&lt;a href="https://search.open.canada.ca/grants/?page=%7B%7D&amp;amp;sort=agreement_start_date+desc" rel="noopener noreferrer"&gt;https://search.open.canada.ca/grants/?page={}&amp;amp;sort=agreement_start_date+desc&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Decisions
&lt;/h3&gt;

&lt;p&gt;I started with Python but &lt;strong&gt;switched to Rust&lt;/strong&gt; for performance at scale. The Rust scraper uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scraper&lt;/code&gt; — for HTML parsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;csv&lt;/code&gt; — for structured output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scraper is designed to handle large-scale ingestion efficiently, without excessive memory use or runtime.&lt;/p&gt;
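&lt;p&gt;To make the paging concrete, here is a minimal sketch of the URL construction, shown in Python for brevity (the actual implementation is Rust, using the &lt;code&gt;scraper&lt;/code&gt; crate for HTML parsing and the &lt;code&gt;csv&lt;/code&gt; crate for output); only the query parameters visible in the endpoint above are assumed:&lt;/p&gt;

```python
from urllib.parse import urlencode

BASE = "https://search.open.canada.ca/grants/"

def page_url(page):
    """Build the URL for one page of the paginated search results."""
    # The endpoint is paginated with a simple page parameter, sorted by
    # agreement start date (newest first), as in the URL shown above.
    query = urlencode({"page": page, "sort": "agreement_start_date desc"})
    return BASE + "?" + query

# The full loop then fetches page_url(1), page_url(2), ... in turn,
# parses each result card into a record, and appends it to the CSV.
print(page_url(1))
```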

&lt;h3&gt;
  
  
  Outcome
&lt;/h3&gt;

&lt;p&gt;✅ Successfully extracted structured grant data into CSV&lt;br&gt;
✅ Significantly faster ingestion vs. the prior Python-based workflow&lt;/p&gt;

&lt;h3&gt;
  
  
  📄 Sample Record
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agreement:        European Space Agency (ESA)'s Space Weather Training Course
Agreement Number: 25COBLLAMY
Date Range:       Mar 11, 2026 → Mar 27, 2026
Description:      Supports Canadian students attending international space training events
Recipient:        Canadian Space Agency
Amount:           $1,000.00
Location:         La Prairie, Quebec, CA
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  🧠 Transformation + Classification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Objective
&lt;/h3&gt;

&lt;p&gt;Categorize grants into &lt;strong&gt;meaningful sectors&lt;/strong&gt; for analytics and discovery — making the data explorable beyond raw fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  Categories
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Housing &amp;amp; Shelter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Education &amp;amp; Training&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment &amp;amp; Entrepreneurship&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Business &amp;amp; Innovation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Health &amp;amp; Wellness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Environment &amp;amp; Energy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Community &amp;amp; Nonprofits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research &amp;amp; Academia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indigenous Programs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Public Safety &amp;amp; Emergency Services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agriculture &amp;amp; Rural Development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arts, Culture &amp;amp; Heritage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Civic &amp;amp; Democratic Engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🤖 Model Choice
&lt;/h3&gt;

&lt;p&gt;I evaluated two approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional ML (clustering)&lt;/td&gt;
&lt;td&gt;Unsupervised clusters don't map cleanly onto named categories; supervised alternatives need labeled data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT via Hugging Face (zero-shot)&lt;/td&gt;
&lt;td&gt;✅ Selected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why zero-shot BERT?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No labeled dataset required&lt;/li&gt;
&lt;li&gt;Strong semantic understanding out-of-the-box&lt;/li&gt;
&lt;li&gt;Fast to implement and iterate&lt;/li&gt;
&lt;/ul&gt;
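&lt;p&gt;For reference, the &lt;code&gt;classifier&lt;/code&gt; used in the next snippet can be built with Hugging Face's &lt;code&gt;pipeline&lt;/code&gt; API. A minimal sketch; the model name here is an assumption (it is the library's documented default for this task), not necessarily the exact checkpoint used in the project:&lt;/p&gt;

```python
def make_classifier():
    """Construct a zero-shot text classifier (requires the
    `transformers` package). The model name is an assumption:
    it is the library default for this task, not confirmed by
    the project itself."""
    from transformers import pipeline
    return pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",
    )
```

&lt;p&gt;Calling &lt;code&gt;make_classifier()&lt;/code&gt; downloads the model on first use, so it is best done once before the classification loop.&lt;/p&gt;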

&lt;h3&gt;
  
  
  ⚙️ Inference Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running classification...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicted_category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.&lt;/p&gt;
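&lt;p&gt;The shape of each pipeline result makes that mapping easy to isolate and test. A small helper, exercised on a stubbed result (the example scores are made up for illustration):&lt;/p&gt;

```python
# A Hugging Face zero-shot pipeline returns, for each input text, a dict:
#   {"sequence": text, "labels": [ranked labels], "scores": [descending scores]}

def top_prediction(result):
    """Pick the highest-scoring label and its score from one result."""
    return {
        "predicted_category": result["labels"][0],
        "confidence_score": result["scores"][0],
    }

# Stubbed result for illustration only; the scores are invented.
stub = {
    "sequence": "Supports Canadian students attending space training events",
    "labels": ["Education and Training", "Research and Academia"],
    "scores": [0.81, 0.12],
}
print(top_prediction(stub))
# {'predicted_category': 'Education and Training', 'confidence_score': 0.81}
```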




&lt;h2&gt;
  
  
  🧼 Data Quality
&lt;/h2&gt;

&lt;p&gt;The source data was &lt;strong&gt;highly structured and clean&lt;/strong&gt;, which meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal preprocessing required&lt;/li&gt;
&lt;li&gt;Faster iteration on modeling and pipeline integration&lt;/li&gt;
&lt;li&gt;No time lost on data wrangling before getting to the interesting parts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📦 Next Steps
&lt;/h2&gt;

&lt;p&gt;The pipeline is actively being extended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗄️ &lt;strong&gt;Load Layer&lt;/strong&gt; → Persist classified data in a database&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Analytics Dashboard&lt;/strong&gt; → Visualize funding trends by category, region, and time&lt;/li&gt;
&lt;li&gt;⏱️ &lt;strong&gt;Pipeline Orchestration&lt;/strong&gt; → Automate ingestion + inference end-to-end&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rust is a legit choice for ETL scraping&lt;/strong&gt; — not just systems programming. The performance gains over Python are real and measurable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot BERT punches above its weight&lt;/strong&gt; for classification tasks without labeled data. It's a great first-pass model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular pipeline design pays off early&lt;/strong&gt; — separating extraction, transformation, and load made iteration much faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't over-engineer&lt;/strong&gt; — the right tool for each layer matters more than using a single stack.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🔗 Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📁 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Sher213/GrantsInvestments" rel="noopener noreferrer"&gt;github.com/Sher213/GrantsInvestments&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Open to opportunities in Data Science, ML Engineering, and Data Engineering — feel free to reach out at &lt;a href="mailto:alisher213@outlook.com"&gt;alisher213@outlook.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>etl</category>
      <category>ai</category>
      <category>datascience</category>
      <category>rust</category>
    </item>
    <item>
      <title>Grants to Investments Part 1: The Data</title>
      <dc:creator>Ali Sher</dc:creator>
      <pubDate>Tue, 08 Jul 2025 20:09:59 +0000</pubDate>
      <link>https://forem.com/sher213/grants-to-investments-part-1-the-data-334h</link>
      <guid>https://forem.com/sher213/grants-to-investments-part-1-the-data-334h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ve28vtwi4mvsqcasfmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ve28vtwi4mvsqcasfmt.png" alt=" " width="466" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was brainstorming ideas for my next project while browsing the resources available at my current co-op with the Ontario Public Service when an idea struck me.&lt;/p&gt;

&lt;p&gt;Now, everyone knows that government grants are a huge opportunity for companies to jumpstart their journeys, and the attention AI is getting makes the topic all the hotter. Just browse the Government of Canada's Grants and Contributions page at &lt;a href="https://search.open.canada.ca/grants/" rel="noopener noreferrer"&gt;https://search.open.canada.ca/grants/&lt;/a&gt; and you will see a myriad of listings.&lt;/p&gt;

&lt;p&gt;So I thought, why not use this wealth of resources to create a solution that helps people judge which public sectors are getting the most funding?&lt;/p&gt;

&lt;p&gt;This is where my current project steps in. Put simply, I am ideating a solution that &lt;strong&gt;helps people find investment opportunities using publicly available grant knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The steps are simple (in theory). The project will feature an ETL pipeline where I ingest data and feed it to a model that determines which sectors are the &lt;em&gt;hottest&lt;/em&gt; of the week (or some other timeframe). An LLM will then provide a quick summary of the AI opportunities in each sector and whether that sector (and a listing of its grants/companies) is worth looking into. Ideally, I will also extract data from another API source, such as market data, to support these findings (making a note to myself to include that &lt;strong&gt;these are NOT investment advice, etc., etc.&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;The data pulled will help people see where government spending (&lt;em&gt;their&lt;/em&gt; money!) is going, as well as which companies are benefitting as a result, not only for themselves but, as good companies do, for the people as well.&lt;/p&gt;

&lt;p&gt;But, first things first - how do I create a model to determine which grant fits into which category? Well, I have selected the following categories/sectors to look at:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;CATEGORIES = [
    "Housing &amp;amp; Shelter",
    "Education &amp;amp; Training",
    "Employment &amp;amp; Entrepreneurship",
    "Business &amp;amp; Innovation",
    "Health &amp;amp; Wellness",
    "Environment &amp;amp; Energy",
    "Community &amp;amp; Nonprofits",
    "Research &amp;amp; Academia",
    "Indigenous Programs",
    "Public Safety &amp;amp; Emergency Services",
    "Agriculture &amp;amp; Rural Development",
    "Arts, Culture &amp;amp; Heritage",
    "Civic &amp;amp; Democratic Engagement"
]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;These are focal points for the Canadian government and will serve as a good basis to build a classification model. The flow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract grants of a given time period.&lt;/li&gt;
&lt;li&gt;A classification model will determine which sector the grants belong to.&lt;/li&gt;
&lt;li&gt;Use an LLM/algorithm to determine which sectors are hottest.&lt;/li&gt;
&lt;li&gt;Compare data to market data (extracting recipient names, descriptions, etc.).&lt;/li&gt;
&lt;li&gt;Provide a summary to the user via frontend and save weekly results to the database.&lt;/li&gt;
&lt;/ol&gt;
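&lt;p&gt;The flow above can be sketched in a few lines. Step 3, for example, might start as simply as counting classified grants per sector; everything here (names, data, the ranking heuristic) is a hypothetical illustration, not the project's actual code:&lt;/p&gt;

```python
from collections import Counter

def rank_sectors(classified):
    """Step 3 (simplified): rank sectors by number of grants received.
    `classified` is a list of (grant, sector) pairs from step 2."""
    return Counter(sector for _, sector in classified).most_common()

# Hypothetical output of step 2: (grant, sector) pairs.
classified = [
    ("ESA training course", "Education and Training"),
    ("Rural broadband fund", "Agriculture and Rural Development"),
    ("STEM outreach grant", "Education and Training"),
]
print(rank_sectors(classified))
# [('Education and Training', 2), ('Agriculture and Rural Development', 1)]
```

&lt;p&gt;A real ranking would likely weight by dollar amount and recency rather than raw counts, but the structure stays the same.&lt;/p&gt;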

&lt;p&gt;The question now is: how do we train the model? Well, let's do it ourselves! I have created a Python script that uses the open.canada.ca API to download a CSV of grants, which are then categorized by an LLM. This dataset will serve to train the model down the road. For now, you can find the data-mining and collection script here: &lt;a href="https://github.com/Sher213/GrantsInvestments/tree/main" rel="noopener noreferrer"&gt;https://github.com/Sher213/GrantsInvestments/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To really challenge myself, the ETL (and model) will all be done in &lt;strong&gt;Rust&lt;/strong&gt;! I think it will be a really fun and novel experience.&lt;/p&gt;

&lt;p&gt;More to come!&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
