<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ali Sher</title>
    <description>The latest articles on Forem by Ali Sher (@sher213).</description>
    <link>https://forem.com/sher213</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3323320%2Ffb2c0e28-a9a8-4024-bad7-6d1137e1e40b.jpeg</url>
      <title>Forem: Ali Sher</title>
      <link>https://forem.com/sher213</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sher213"/>
    <language>en</language>
    <item>
      <title>Grants to Investments Part 2-3: Models and Pipelines</title>
      <dc:creator>Ali Sher</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:59:32 +0000</pubDate>
      <link>https://forem.com/sher213/grants-to-investments-part-2-3-models-and-pipelines-35ij</link>
      <guid>https://forem.com/sher213/grants-to-investments-part-2-3-models-and-pipelines-35ij</guid>
      <description>&lt;h1&gt;
  
  
  🚀 Grants ETL Pipeline — Rust + Transformer-Based Classification
&lt;/h1&gt;

&lt;h2&gt;
  
  
  📌 Overview
&lt;/h2&gt;

&lt;p&gt;I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;High-performance data extraction&lt;/strong&gt; using Rust&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Semantic classification&lt;/strong&gt; using BERT (zero-shot)&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Structured output&lt;/strong&gt; ready for downstream analytics and dashboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Extraction Layer (Rust)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The Grants Canada portal has &lt;strong&gt;no accessible API&lt;/strong&gt; — only an HTML-rendered search interface. I needed a way to extract structured data at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;I built a custom scraper targeting the paginated search endpoint:&lt;br&gt;
&lt;a href="https://search.open.canada.ca/grants/?page=%7B%7D&amp;amp;sort=agreement_start_date+desc" rel="noopener noreferrer"&gt;https://search.open.canada.ca/grants/?page={}&amp;amp;sort=agreement_start_date+desc&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Decisions
&lt;/h3&gt;

&lt;p&gt;I started with Python but &lt;strong&gt;switched to Rust&lt;/strong&gt; for performance at scale. The Rust scraper uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scraper&lt;/code&gt; — for HTML parsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;csv&lt;/code&gt; — for structured output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scraper is designed to handle large-scale ingestion efficiently, without excessive memory use or runtime.&lt;/p&gt;
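&lt;p&gt;To make the paging concrete, here is a minimal sketch of the URL construction, shown in Python for brevity (the actual implementation is Rust, using the &lt;code&gt;scraper&lt;/code&gt; crate for HTML parsing and the &lt;code&gt;csv&lt;/code&gt; crate for output); only the query parameters visible in the endpoint above are assumed:&lt;/p&gt;

```python
from urllib.parse import urlencode

BASE = "https://search.open.canada.ca/grants/"

def page_url(page):
    """Build the URL for one page of the paginated search results."""
    # The endpoint is paginated with a simple page parameter, sorted by
    # agreement start date (newest first), as in the URL shown above.
    query = urlencode({"page": page, "sort": "agreement_start_date desc"})
    return BASE + "?" + query

# The full loop then fetches page_url(1), page_url(2), ... in turn,
# parses each result card into a record, and appends it to the CSV.
print(page_url(1))
```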

&lt;h3&gt;
  
  
  Outcome
&lt;/h3&gt;

&lt;p&gt;✅ Successfully extracted structured grant data into CSV&lt;br&gt;
✅ Significantly faster ingestion vs. the prior Python-based workflow&lt;/p&gt;

&lt;h3&gt;
  
  
  📄 Sample Record
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agreement:        European Space Agency (ESA)'s Space Weather Training Course
Agreement Number: 25COBLLAMY
Date Range:       Mar 11, 2026 → Mar 27, 2026
Description:      Supports Canadian students attending international space training events
Recipient:        Canadian Space Agency
Amount:           $1,000.00
Location:         La Prairie, Quebec, CA
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  🧠 Transformation + Classification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Objective
&lt;/h3&gt;

&lt;p&gt;Categorize grants into &lt;strong&gt;meaningful sectors&lt;/strong&gt; for analytics and discovery — making the data explorable beyond raw fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  Categories
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Housing &amp;amp; Shelter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Education &amp;amp; Training&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment &amp;amp; Entrepreneurship&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Business &amp;amp; Innovation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Health &amp;amp; Wellness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Environment &amp;amp; Energy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Community &amp;amp; Nonprofits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research &amp;amp; Academia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indigenous Programs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Public Safety &amp;amp; Emergency Services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agriculture &amp;amp; Rural Development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arts, Culture &amp;amp; Heritage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Civic &amp;amp; Democratic Engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🤖 Model Choice
&lt;/h3&gt;

&lt;p&gt;I evaluated two approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional ML (clustering)&lt;/td&gt;
&lt;td&gt;Unsupervised clusters don't map cleanly onto named categories; supervised alternatives need labeled data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT via Hugging Face (zero-shot)&lt;/td&gt;
&lt;td&gt;✅ Selected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why zero-shot BERT?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No labeled dataset required&lt;/li&gt;
&lt;li&gt;Strong semantic understanding out-of-the-box&lt;/li&gt;
&lt;li&gt;Fast to implement and iterate&lt;/li&gt;
&lt;/ul&gt;
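&lt;p&gt;For reference, the &lt;code&gt;classifier&lt;/code&gt; used in the next snippet can be built with Hugging Face's &lt;code&gt;pipeline&lt;/code&gt; API. A minimal sketch; the model name here is an assumption (it is the library's documented default for this task), not necessarily the exact checkpoint used in the project:&lt;/p&gt;

```python
def make_classifier():
    """Construct a zero-shot text classifier (requires the
    `transformers` package). The model name is an assumption:
    it is the library default for this task, not confirmed by
    the project itself."""
    from transformers import pipeline
    return pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",
    )
```

&lt;p&gt;Calling &lt;code&gt;make_classifier()&lt;/code&gt; downloads the model on first use, so it is best done once before the classification loop.&lt;/p&gt;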

&lt;h3&gt;
  
  
  ⚙️ Inference Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running classification...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicted_category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.&lt;/p&gt;
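&lt;p&gt;The shape of each pipeline result makes that mapping easy to isolate and test. A small helper, exercised on a stubbed result (the example scores are made up for illustration):&lt;/p&gt;

```python
# A Hugging Face zero-shot pipeline returns, for each input text, a dict:
#   {"sequence": text, "labels": [ranked labels], "scores": [descending scores]}

def top_prediction(result):
    """Pick the highest-scoring label and its score from one result."""
    return {
        "predicted_category": result["labels"][0],
        "confidence_score": result["scores"][0],
    }

# Stubbed result for illustration only; the scores are invented.
stub = {
    "sequence": "Supports Canadian students attending space training events",
    "labels": ["Education and Training", "Research and Academia"],
    "scores": [0.81, 0.12],
}
print(top_prediction(stub))
# {'predicted_category': 'Education and Training', 'confidence_score': 0.81}
```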




&lt;h2&gt;
  
  
  🧼 Data Quality
&lt;/h2&gt;

&lt;p&gt;The source data was &lt;strong&gt;highly structured and clean&lt;/strong&gt;, which meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal preprocessing required&lt;/li&gt;
&lt;li&gt;Faster iteration on modeling and pipeline integration&lt;/li&gt;
&lt;li&gt;No time lost on data wrangling before getting to the interesting parts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📦 Next Steps
&lt;/h2&gt;

&lt;p&gt;The pipeline is actively being extended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗄️ &lt;strong&gt;Load Layer&lt;/strong&gt; → Persist classified data in a database&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Analytics Dashboard&lt;/strong&gt; → Visualize funding trends by category, region, and time&lt;/li&gt;
&lt;li&gt;⏱️ &lt;strong&gt;Pipeline Orchestration&lt;/strong&gt; → Automate ingestion + inference end-to-end&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rust is a legit choice for ETL scraping&lt;/strong&gt; — not just systems programming. The performance gains over Python are real and measurable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot BERT punches above its weight&lt;/strong&gt; for classification tasks without labeled data. It's a great first-pass model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular pipeline design pays off early&lt;/strong&gt; — separating extraction, transformation, and load made iteration much faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't over-engineer&lt;/strong&gt; — the right tool for each layer matters more than using a single stack.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🔗 Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📁 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Sher213/GrantsInvestments" rel="noopener noreferrer"&gt;github.com/Sher213/GrantsInvestments&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Open to opportunities in Data Science, ML Engineering, and Data Engineering — feel free to reach out at &lt;a href="mailto:alisher213@outlook.com"&gt;alisher213@outlook.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>etl</category>
      <category>ai</category>
      <category>datascience</category>
      <category>rust</category>
    </item>
    <item>
      <title>Grants to Investments Part 1: The Data</title>
      <dc:creator>Ali Sher</dc:creator>
      <pubDate>Tue, 08 Jul 2025 20:09:59 +0000</pubDate>
      <link>https://forem.com/sher213/grants-to-investments-part-1-the-data-334h</link>
      <guid>https://forem.com/sher213/grants-to-investments-part-1-the-data-334h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ve28vtwi4mvsqcasfmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ve28vtwi4mvsqcasfmt.png" alt=" " width="466" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was brainstorming ideas for my next project while browsing the resources available at my current co-op with the Ontario Public Service when an idea struck me.&lt;/p&gt;

&lt;p&gt;Now, everyone knows that government grants are a huge opportunity for companies to jumpstart their journeys, and the attention AI is getting makes the topic all the hotter. Just browse the Government of Canada's Grants and Contributions page at &lt;a href="https://search.open.canada.ca/grants/" rel="noopener noreferrer"&gt;https://search.open.canada.ca/grants/&lt;/a&gt; and you will see a myriad of listings.&lt;/p&gt;

&lt;p&gt;So I thought, why not use this wealth of resources to create a solution that helps people judge which public sectors are getting the most funding?&lt;/p&gt;

&lt;p&gt;This is where my current project steps in. Put simply, I am ideating a solution that &lt;strong&gt;helps people find investment opportunities using publicly available grant knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The steps are simple (in theory). The project will feature an ETL pipeline where I ingest data and feed it to a model that determines which sectors are the &lt;em&gt;hottest&lt;/em&gt; of the week (or some other timeframe). An LLM will then provide a quick summary of the AI opportunities in each sector and whether that sector (and a listing of its grants/companies) is worth looking into. Ideally, I will also extract data from another API source, such as market data, to support these findings (making a note to myself to include that &lt;strong&gt;these are NOT investment advice, etc., etc.&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;The data pulled will help people see where government spending (&lt;em&gt;their&lt;/em&gt; money!) is going, as well as which companies are benefitting as a result, not only for themselves but, as good companies do, for the people as well.&lt;/p&gt;

&lt;p&gt;But, first things first - how do I create a model to determine which grant fits into which category? Well, I have selected the following categories/sectors to look at:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;CATEGORIES = [
    "Housing &amp;amp; Shelter",
    "Education &amp;amp; Training",
    "Employment &amp;amp; Entrepreneurship",
    "Business &amp;amp; Innovation",
    "Health &amp;amp; Wellness",
    "Environment &amp;amp; Energy",
    "Community &amp;amp; Nonprofits",
    "Research &amp;amp; Academia",
    "Indigenous Programs",
    "Public Safety &amp;amp; Emergency Services",
    "Agriculture &amp;amp; Rural Development",
    "Arts, Culture &amp;amp; Heritage",
    "Civic &amp;amp; Democratic Engagement"
]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;These are focal points for the Canadian government and will serve as a good basis to build a classification model. The flow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract grants of a given time period.&lt;/li&gt;
&lt;li&gt;A classification model will determine which sector the grants belong to.&lt;/li&gt;
&lt;li&gt;Use an LLM/algorithm to determine which sectors are hottest.&lt;/li&gt;
&lt;li&gt;Compare data to market data (extracting recipient names, descriptions, etc.).&lt;/li&gt;
&lt;li&gt;Provide a summary to the user via frontend and save weekly results to the database.&lt;/li&gt;
&lt;/ol&gt;
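&lt;p&gt;The flow above can be sketched in a few lines. Step 3, for example, might start as simply as counting classified grants per sector; everything here (names, data, the ranking heuristic) is a hypothetical illustration, not the project's actual code:&lt;/p&gt;

```python
from collections import Counter

def rank_sectors(classified):
    """Step 3 (simplified): rank sectors by number of grants received.
    `classified` is a list of (grant, sector) pairs from step 2."""
    return Counter(sector for _, sector in classified).most_common()

# Hypothetical output of step 2: (grant, sector) pairs.
classified = [
    ("ESA training course", "Education and Training"),
    ("Rural broadband fund", "Agriculture and Rural Development"),
    ("STEM outreach grant", "Education and Training"),
]
print(rank_sectors(classified))
# [('Education and Training', 2), ('Agriculture and Rural Development', 1)]
```

&lt;p&gt;A real ranking would likely weight by dollar amount and recency rather than raw counts, but the structure stays the same.&lt;/p&gt;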

&lt;p&gt;The question now is: how do we train the model? Well, let's do it ourselves! I have created a Python script that uses the open.canada.ca API to download a CSV of grants, which are then categorized by an LLM. This dataset will serve to train the model down the road. For now, you can find the data-mining and collection script here: &lt;a href="https://github.com/Sher213/GrantsInvestments/tree/main" rel="noopener noreferrer"&gt;https://github.com/Sher213/GrantsInvestments/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To really challenge myself, the ETL (and model) will all be done in &lt;strong&gt;Rust&lt;/strong&gt;! I think it will be a really fun and novel experience.&lt;/p&gt;

&lt;p&gt;More to come!&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
