<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Bean</title>
    <description>The latest articles on Forem by David Bean (@dave_bean).</description>
    <link>https://forem.com/dave_bean</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3538472%2Fbd89546f-a9f5-4cb9-b6ef-7f3b5b43f69b.png</url>
      <title>Forem: David Bean</title>
      <link>https://forem.com/dave_bean</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dave_bean"/>
    <language>en</language>
    <item>
      <title>Building My First ML Data Pipeline</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Tue, 21 Oct 2025 22:18:54 +0000</pubDate>
      <link>https://forem.com/dave_bean/building-my-first-ml-data-pipeline-three-days-one-deployed-dashboard-and-a-lesson-about-letting-2dif</link>
      <guid>https://forem.com/dave_bean/building-my-first-ml-data-pipeline-three-days-one-deployed-dashboard-and-a-lesson-about-letting-2dif</guid>
      <description>&lt;h2&gt;
  
  
  Three Days, One Deployed Dashboard, and a Lesson About Letting Data Drive Business Questions
&lt;/h2&gt;

&lt;p&gt;I just finished my first complete machine learning project—a renewable energy investment analysis dashboard that's now live on Streamlit Cloud. Three days of work. 181,915 rows of data. And one really important lesson: your initial business problem is probably wrong.&lt;/p&gt;

&lt;p&gt;I'm a software engineer learning ML with Claude designing my course. This project clarified a lot about how data science work actually happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: When Your Business Problem Meets Reality
&lt;/h2&gt;

&lt;p&gt;I started with a plan: build a tool to help optimize fossil fuel plant modernization schedules based on renewable production patterns. Sounded reasonable. Turned out to be impossible with my data.&lt;/p&gt;

&lt;p&gt;I had a renewable energy dataset covering 52 countries from 2010-2022. Six energy types. Good coverage. But after loading it into the interactive EDA dashboard I'd built the previous week, reality hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset showed production, not capacity or demand&lt;/li&gt;
&lt;li&gt;Renewables depend on weather—you can't schedule them&lt;/li&gt;
&lt;li&gt;No grid data, no regional breakdowns&lt;/li&gt;
&lt;li&gt;Historical trends can't predict modernization timing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My business problem didn't match what the data could actually answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pivot:&lt;/strong&gt; I asked a different question. Instead of "when should plants modernize," I asked "which countries represent the best opportunities for battery storage investments based on renewable penetration, growth rates, and energy mix diversity?"&lt;/p&gt;

&lt;p&gt;That question? The data could answer it perfectly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Learned: Validate Before You Commit
&lt;/h3&gt;

&lt;p&gt;The EDA dashboard from Week 2 was useful here. Twenty minutes of exploration showed me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale mismatches (totals mixed with individual sources)&lt;/li&gt;
&lt;li&gt;Missing data patterns (expected in first-year entries)&lt;/li&gt;
&lt;li&gt;Distribution issues (couldn't fix with log transforms)&lt;/li&gt;
&lt;li&gt;Time coverage worked for trend analysis&lt;/li&gt;
&lt;/ul&gt;
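&lt;p&gt;Those checks don't require anything fancy; two pandas one-liners surface most of them (toy data here, not the real dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Norway", "Norway", "Iceland"],
    "year": [2010, 2011, 2010],
    "production_gwh": [np.nan, 1200.0, 950.0],
})

# Share of missing values per column: quickly surfaces patterns like
# "first-year entries are blank".
print(df.isna().mean())

# Per-column min/max/quartiles: wildly different magnitudes in one
# value column hint that categories are being mixed.
print(df["production_gwh"].describe())
```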

&lt;p&gt;Claude pointed out the business problem didn't match the data. You deal with the situation you're in, so we pivoted to a question the data could actually answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1 Continued: The Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Coming from C++ where I think about data flow and single responsibilities, I built a five-function pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;load_and_clean&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;calculate_metrics&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function takes a DataFrame, returns a DataFrame, has one clear job, prints progress, and handles edge cases.&lt;/p&gt;
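&lt;p&gt;As a sketch, the skeleton of that pattern looks something like this (the column names, cleaning rules, and toy metric are illustrative stand-ins, not the project's actual code):&lt;/p&gt;

```python
import pandas as pd

def load_and_clean(path):
    """Load the raw CSV and drop rows missing production values."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["production_gwh"])
    print(f"load_and_clean: {len(df)} rows")
    return df

def filter_sources(df):
    """Keep discrete renewable sources; drop aggregate totals."""
    keep = ["Hydro", "Wind", "Solar", "Geothermal", "Other"]
    df = df[df["energy_type"].isin(keep)]
    print(f"filter_sources: {len(df)} rows")
    return df

def aggregate(df):
    """Sum production per country."""
    out = df.groupby("country", as_index=False)["production_gwh"].sum()
    print(f"aggregate: {len(out)} countries")
    return out

def calculate_metrics(df):
    """Derive a toy opportunity score (stand-in for the real formula)."""
    df = df.assign(score=df["production_gwh"].rank(pct=True) * 100)
    print("calculate_metrics: done")
    return df

def rank(df):
    """Order countries by score, best first."""
    out = df.sort_values("score", ascending=False).reset_index(drop=True)
    print("rank: done")
    return out
```

&lt;p&gt;Because every stage shares the DataFrame-in, DataFrame-out contract, each one can be tested in isolation and the whole chain stays easy to reorder.&lt;/p&gt;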

&lt;h3&gt;
  
  
  The Scale Problem I Almost Got Wrong
&lt;/h3&gt;

&lt;p&gt;Early on, my visualizations looked terrible. Some categories showed values 100x larger than others. My first instinct: log transformation.&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;

&lt;p&gt;The real issue: my data mixed individual renewable sources (Hydro = 1,000 GWh) with aggregate totals (Total Electricity = 200,000 GWh). These shouldn't be on the same chart at all.&lt;/p&gt;

&lt;p&gt;Solution: Filter out aggregates entirely. Keep only the discrete renewable sources.&lt;/p&gt;

&lt;p&gt;This wasn't a math problem—it was a data structure problem. No transformation fixes a fundamental category mismatch.&lt;/p&gt;
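&lt;p&gt;A quick way to confirm that diagnosis before reaching for any transforms (the category names here are stand-ins):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "energy_type": ["Hydro", "Wind", "Total Electricity", "Solar"],
    "production_gwh": [1_000.0, 800.0, 200_000.0, 300.0],
})

# Per-category maxima make aggregate rows obvious: "Total Electricity"
# sits two orders of magnitude above any single source.
print(df.groupby("energy_type")["production_gwh"].max())

# The structural fix: drop aggregates, keep discrete sources.
sources = df[~df["energy_type"].str.startswith("Total")]
print(sources["energy_type"].tolist())
```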

&lt;h2&gt;
  
  
  Day 2: When Your Model Is "Wrong" (But Actually Right)
&lt;/h2&gt;

&lt;p&gt;I trained a Random Forest model to predict storage infrastructure scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Percentages of Hydro, Wind, Solar, Geothermal, Other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Storage need score (0-100)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; R² = 0.948&lt;/li&gt;
&lt;/ul&gt;
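&lt;p&gt;The post doesn't include the training code, but the shape of it, with synthetic stand-in data and a made-up target formula, is roughly:&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in features: five energy-mix percentages per country-year.
X = rng.uniform(0, 100, size=(300, 5))   # hydro, wind, solar, geo, other
# Stand-in target: a score that depends on the mix (not the real formula).
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on training data only, and keep it around: the model
# expects scaled inputs at prediction time forever after.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(scaler.transform(X_train), y_train)

print("R^2 on held-out data:", model.score(scaler.transform(X_test), y_test))
```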

&lt;p&gt;Model worked. Then I tested extreme cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100% Hydro:&lt;/strong&gt; Score 56.21&lt;br&gt;&lt;br&gt;
&lt;strong&gt;100% Wind:&lt;/strong&gt; Score 31.37&lt;/p&gt;

&lt;p&gt;Wait. Wind is intermittent—shouldn't it need MORE storage than stable hydro? Why was my model backwards?&lt;/p&gt;

&lt;p&gt;I debugged for 15 minutes before realizing: the model wasn't wrong. My assumption was.&lt;/p&gt;

&lt;p&gt;My Day 1 scoring formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;storage_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;renewable_share&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;growth_rate&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;diversity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This measured &lt;strong&gt;investment opportunity&lt;/strong&gt;, not &lt;strong&gt;technical storage need&lt;/strong&gt;. Countries with high hydro (Norway, Iceland) scored high because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very high renewable penetration (27-30%)&lt;/li&gt;
&lt;li&gt;Mature markets ready for more storage&lt;/li&gt;
&lt;li&gt;High penetration signals strong renewable commitment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model learned exactly what I trained it on. I just forgot what I'd actually built versus what I thought I was building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Models optimize for your training signal, not your intentions. When behavior seems wrong, check what you actually trained it on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 3: Production Deployment Teaches Fast
&lt;/h2&gt;

&lt;p&gt;I built a four-tab Streamlit dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overview: Top 10 investment opportunities&lt;/li&gt;
&lt;li&gt;Country Analysis: Interactive comparisons&lt;/li&gt;
&lt;li&gt;Predictions: ML model with input sliders&lt;/li&gt;
&lt;li&gt;Technical Details: Full methodology&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Building for production exposed design flaws I'd never catch in a Jupyter notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Path Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Local:&lt;/strong&gt; &lt;code&gt;model = joblib.load('storage_model.pkl')&lt;/code&gt; worked fine&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Streamlit Cloud:&lt;/strong&gt; &lt;code&gt;FileNotFoundError&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Why? My dashboard lived in a &lt;code&gt;src/&lt;/code&gt; subfolder, models in the parent directory. Relative paths resolved from where the code runs, not where the file lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;current_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;parent_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;storage_model.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
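&lt;p&gt;The same idea reads a little cleaner with &lt;code&gt;pathlib&lt;/code&gt; (a sketch; the &lt;code&gt;parent.parent&lt;/code&gt; hop assumes the &lt;code&gt;src/&lt;/code&gt; layout described above):&lt;/p&gt;

```python
from pathlib import Path

# Resolve relative to this source file, not the process's working
# directory, so the path survives deployment to Streamlit Cloud.
HERE = Path(__file__).resolve().parent
MODEL_PATH = HERE.parent / "storage_model.pkl"
print(MODEL_PATH.name)
```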



&lt;h3&gt;
  
  
  Problem 2: Requirements File Location
&lt;/h3&gt;

&lt;p&gt;Streamlit Cloud looks for &lt;code&gt;requirements.txt&lt;/code&gt; at repository root, not in subdirectories. Took two deployment failures to figure this out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Feature Scaling
&lt;/h3&gt;

&lt;p&gt;Almost made a critical mistake: feeding raw percentages directly to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;hydro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Right:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;hydro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;input_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Scale first!
&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models trained on scaled features expect scaled inputs. Skip this step and the predictions still run without error; they're just silently wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Development and production environments have different problems. Same issues I deal with in systems work—environment differences, dependencies, synchronization—show up in ML deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Three Days Produced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard&lt;/strong&gt; with public URL&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt; with professional README&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Trained ML model&lt;/strong&gt; (three deployment patterns: batch/API/edge)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Complete data pipeline&lt;/strong&gt; with reproducible preprocessing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; with screenshots&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top investment opportunities identified:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Netherlands (63.08) - 838% growth rate&lt;/li&gt;
&lt;li&gt;Iceland (62.05) - 29.5% renewable penetration&lt;/li&gt;
&lt;li&gt;Norway (59.47) - Strong baseline, steady growth&lt;/li&gt;
&lt;li&gt;Hungary (52.82) - 658% growth, emerging market&lt;/li&gt;
&lt;li&gt;UK (48.90) - Large market, 504% growth&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Technical stats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;181,915 data points processed&lt;/li&gt;
&lt;li&gt;52 countries analyzed&lt;/li&gt;
&lt;li&gt;156 months of time series&lt;/li&gt;
&lt;li&gt;8,033 predictions/second (batch)&lt;/li&gt;
&lt;li&gt;89.4 KB model (ONNX edge deployment)&lt;/li&gt;
&lt;li&gt;R² = 0.948&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Actually Surprised Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Preprocessing Takes Most of the Time
&lt;/h3&gt;

&lt;p&gt;In C++, optimization takes most of the time. In ML, data cleaning and feature engineering dominated. Good preprocessing makes modeling straightforward. Bad preprocessing makes it impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Production Deployment Shows Problems Fast
&lt;/h3&gt;

&lt;p&gt;Jupyter notebooks hide issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Path dependencies&lt;/li&gt;
&lt;li&gt;Environment differences&lt;/li&gt;
&lt;li&gt;Feature scaling synchronization&lt;/li&gt;
&lt;li&gt;Input validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploying early forced me to deal with these.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The README Matters
&lt;/h3&gt;

&lt;p&gt;I spent 30 minutes writing a professional README:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business problem clearly stated&lt;/li&gt;
&lt;li&gt;Technical approach explained&lt;/li&gt;
&lt;li&gt;Setup instructions&lt;/li&gt;
&lt;li&gt;Screenshots&lt;/li&gt;
&lt;li&gt;Live demo URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project looks more complete with good documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. End-to-End Matters More Than Depth
&lt;/h3&gt;

&lt;p&gt;I could've spent three days optimizing model accuracy from 0.948 to 0.952. Instead I built a complete pipeline: data → model → deployment → documentation.&lt;/p&gt;

&lt;p&gt;When it comes to actual job hunting, I'm betting the complete pipeline matters more than the extra accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Bugs I Hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bug 1:&lt;/strong&gt; Streamlit Cloud couldn't find plotly module&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; &lt;code&gt;requirements.txt&lt;/code&gt; in wrong directory&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Moved to repo root, specified &lt;code&gt;plotly&amp;gt;=5.0.0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2:&lt;/strong&gt; Model files not loading&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Relative paths broken in cloud environment&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Used &lt;code&gt;os.path.dirname(__file__)&lt;/code&gt; for portable paths&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3:&lt;/strong&gt; "Random Forest" truncated in UI columns&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Text too long for column width&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Made it a subheader instead of metric in column&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 4:&lt;/strong&gt; Predictions looked weird&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Forgot to scale input features&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Applied scaler before model.predict()&lt;/p&gt;

&lt;p&gt;Claude caught most of these during code review. I understand the patterns now—scoping issues, path management, feature preprocessing flow. I'm delegating implementation details and focusing on understanding architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This was Portfolio Project 1 of 6. Each project adds new capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 (Done):&lt;/strong&gt; Data analysis dashboard, traditional ML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2:&lt;/strong&gt; Traditional ML pipeline with feature engineering
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3:&lt;/strong&gt; Deep learning computer vision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4:&lt;/strong&gt; Generative AI with LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 5:&lt;/strong&gt; MLOps with CI/CD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 6:&lt;/strong&gt; ML systems engineering specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal isn't just learning ML—it's building a portfolio proving I can deliver production ML systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools That Helped
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; (dashboard framework)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plotly&lt;/strong&gt; (interactive viz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scikit-learn&lt;/strong&gt; (Random Forest, preprocessing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; (data manipulation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit Cloud&lt;/strong&gt; (deployment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; (course design, code review, debugging partner)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Live Demo &amp;amp; Code
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;strong&gt;Live Dashboard:&lt;/strong&gt; &lt;a href="https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/" rel="noopener noreferrer"&gt;https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
💻 &lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;https://github.com/bean2778/ai_learning_2025&lt;/a&gt;&lt;br&gt;&lt;br&gt;
📊 &lt;strong&gt;Dataset:&lt;/strong&gt; Global Renewable Energy Production (2010-2022)&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About this series:&lt;/strong&gt; I'm a software engineer learning machine learning with Claude designing my curriculum. Week 3 done: EDA, problem formulation, first portfolio project deployed. More posts coming on traditional ML, deep learning, and production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="http://www.linkedin.com/in/bean2778" rel="noopener noreferrer"&gt;www.linkedin.com/in/bean2778&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;https://github.com/bean2778/ai_learning_2025&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Previous: &lt;a href="https://dev.to/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam"&gt;blog 2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Next:&lt;/strong&gt; Traditional ML fundamentals—supervised learning, evaluation metrics, bias-variance tradeoff.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Time: 3 days (Days 19-21 of 270-day roadmap)&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Status: Portfolio Project 1 complete ✅&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Coffee consumed: Enough&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Blog Post 2: NumPy Through a C++ Programmer's Eyes</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Sat, 11 Oct 2025 01:54:30 +0000</pubDate>
      <link>https://forem.com/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam</link>
      <guid>https://forem.com/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam</guid>
      <description>&lt;h1&gt;
  
  
  Blog Post 2: NumPy Through a C++ Programmer's Eyes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Week Two: Finally Writing Code That Feels Fast
&lt;/h2&gt;

&lt;p&gt;Week two of my ML learning journey, and I'm starting to see why Python dominates machine learning despite being "slow."&lt;/p&gt;

&lt;p&gt;The secret? Most of the time, you're not actually running Python.&lt;/p&gt;

&lt;p&gt;This week was all about NumPy and pandas - the foundations of pretty much every ML library. And as someone who's written a lot of C++ code focused on performance, watching NumPy operations run was genuinely satisfying. These aren't slow Python loops. They're compiled C code operating on contiguous arrays, using SIMD instructions where possible.&lt;/p&gt;

&lt;p&gt;It's basically everything I love about C++ performance, wrapped in Python's convenience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 8: Building Image Transformations Without Image Libraries
&lt;/h2&gt;

&lt;p&gt;The first challenge: implement image transformations (rotate, flip, crop, brightness adjustment) using &lt;strong&gt;only NumPy&lt;/strong&gt;. No OpenCV, no PIL for the actual transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rotation algorithm was the fun part.&lt;/strong&gt; I knew I needed to rotate 90° clockwise, but which operations exactly? After some debugging with test patterns (red left half, blue right half), I figured it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rotate_90&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Transpose (swap rows and columns)
&lt;/span&gt;    &lt;span class="n"&gt;transposed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Flip vertically
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transposed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transpose alone doesn't give rotation - you need transpose + flip. I only really understood this after printing intermediate steps and tracing through what should happen to each quadrant.&lt;/p&gt;

&lt;p&gt;When Claude suggested &lt;code&gt;np.transpose(image, (1, 0, 2))&lt;/code&gt;, I made myself stop and ask: what does that tuple actually mean? Turns out &lt;code&gt;(1, 0, 2)&lt;/code&gt; means "put axis 1 first, axis 0 second, keep axis 2 third." So columns become rows, rows become columns, color channels stay unchanged. The debugging process of creating test patterns and visualizing transformations taught me more than just reading documentation would have.&lt;/p&gt;
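&lt;p&gt;NumPy also ships a reference you can check a hand-rolled rotation against: &lt;code&gt;np.rot90&lt;/code&gt;, where &lt;code&gt;k=-1&lt;/code&gt; means one clockwise quarter-turn. A quick self-check on a random image:&lt;/p&gt;

```python
import numpy as np

def rotate_90_cw(image):
    # Transpose swaps rows and columns; a left-right flip then turns
    # that into a clockwise quarter-turn (an up-down flip would give
    # counter-clockwise instead).
    return np.flip(np.transpose(image, (1, 0, 2)), axis=1)

img = np.random.default_rng(0).integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
assert np.array_equal(rotate_90_cw(img), np.rot90(img, k=-1))
print("matches np.rot90(img, k=-1)")
```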

&lt;p&gt;&lt;strong&gt;The performance difference is wild.&lt;/strong&gt; Every operation works on entire arrays at once. No loops over millions of pixels. &lt;code&gt;image * brightness_factor&lt;/code&gt; multiplies every single pixel value in one vectorized operation. This is the SIMD parallelism I'm used to from C++, but I didn't have to write it myself.&lt;/p&gt;
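&lt;p&gt;One caveat I'd flag with &lt;code&gt;image * brightness_factor&lt;/code&gt; (my note, not part of the original exercise): on &lt;code&gt;uint8&lt;/code&gt; arrays you have to clip before converting back, or values past 255 wrap around:&lt;/p&gt;

```python
import numpy as np

img = np.full((2, 2, 3), 200, dtype=np.uint8)

# Multiplying by a float promotes to float64; the cast back to uint8
# wraps values past 255 unless you clip first (200 * 1.5 = 300 would
# otherwise come back as 44).
brightened = np.clip(img * 1.5, 0, 255).astype(np.uint8)
print(brightened[0, 0])   # [255 255 255]
```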

&lt;h2&gt;
  
  
  Days 9-10: Pandas Element-Wise Operators Are Not Python Operators
&lt;/h2&gt;

&lt;p&gt;Pandas threw me for a loop because it looks like regular Python but behaves completely differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The element-wise operator confusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I kept trying to write conditionals like normal Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This doesn't work:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# ERROR!
&lt;/span&gt;
&lt;span class="c1"&gt;# You need element-wise operators:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  &lt;span class="c1"&gt;# Works!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;|&lt;/code&gt; for OR, &lt;code&gt;&amp;amp;&lt;/code&gt; for AND, &lt;code&gt;~&lt;/code&gt; for NOT. Always. This tripped me up for a solid day until it finally clicked: these operators work on entire columns at once, not single values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The groupby-aggregate pattern is everywhere:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pattern appears constantly in ML preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate total spending per customer
&lt;/span&gt;&lt;span class="n"&gt;customer_totals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Map those totals back to every row
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_totals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Split the data into groups, apply some aggregation, combine the results back. Once I understood this pattern, tons of feature engineering operations made sense.&lt;/p&gt;
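&lt;p&gt;Worth knowing: pandas collapses the groupby-then-map-back dance into one step with &lt;code&gt;transform&lt;/code&gt;, which returns the aggregate already aligned to the original rows:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "amount": [10.0, 20.0, 5.0],
})

# transform('sum') broadcasts each group's total back onto every row of
# that group, so no separate map step is needed.
df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")
print(df["customer_total"].tolist())   # [30.0, 30.0, 5.0]
```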

&lt;p&gt;&lt;strong&gt;The CSV string conversion gotcha:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My favorite bug of the week: integration tests failed because the CSV columns came through my pipeline as strings, not numbers. My unit tests all passed (they used real Python numbers), but the moment the complete pipeline read from an actual file, everything broke.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CSV gives you strings:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# ['1', '2', '3'] - all strings!
&lt;/span&gt;
&lt;span class="c1"&gt;# Need explicit conversion:
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [1, 2, 3]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly why you need integration tests, not just unit tests. Different test types catch different bugs.&lt;/p&gt;
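&lt;p&gt;One way to catch this class of bug at the boundary is to force dtypes at read time, so a column that can't convert fails fast instead of silently arriving as strings. A small sketch (toy column, not my actual pipeline):&lt;/p&gt;

```python
import io
import pandas as pd

# Simulate the round-trip that unit tests skip: numbers serialized to CSV
# text, then read back the way the real pipeline reads them
csv_text = "amount\n1\n2\n3\n"

# Forcing the dtype at the boundary raises immediately on bad data
df = pd.read_csv(io.StringIO(csv_text), dtype={"amount": "float64"})

assert df["amount"].tolist() == [1.0, 2.0, 3.0]
```

&lt;p&gt;An integration test that exercises this read path would have caught my bug on day one.&lt;/p&gt;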

&lt;h2&gt;
  
  
  Day 12: The 150x Speedup
&lt;/h2&gt;

&lt;p&gt;This was the most satisfying day. I had a function that processed transactions using &lt;code&gt;.apply()&lt;/code&gt; with lambdas and some iterrows loops. It worked. It was slow. Claude challenged me to optimize it using vectorization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow version: 0.46 seconds for 10k rows (21,559 rows/second)&lt;/li&gt;
&lt;li&gt;Fast version: 0.003 seconds for 10k rows (3,249,635 rows/second)
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speedup: 150x faster&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same input. Same output (verified with &lt;code&gt;pd.testing.assert_frame_equal()&lt;/code&gt;). Just replaced Python loops with vectorized NumPy operations.&lt;/p&gt;

&lt;p&gt;The key transformations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SLOW - apply with lambda
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# FAST - vectorized multiplication  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SLOW - apply with if/elif/else function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categorize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# FAST - np.select with conditions
&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; &lt;code&gt;.apply()&lt;/code&gt; and &lt;code&gt;.iterrows()&lt;/code&gt; are 150x slower because they're Python loops in disguise. Every iteration has interpreter overhead. Vectorized operations run in compiled C code with no per-element overhead.&lt;/p&gt;

&lt;p&gt;This isn't "premature optimization." This is fundamental to how you write pandas code. You can't just "optimize later" - you need to think vectorized from the start.&lt;/p&gt;
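&lt;p&gt;One habit that made the rewrite safe: prove the slow and fast paths agree before deleting the slow one. A sketch with toy columns:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=1_000),
    "quantity": rng.integers(1, 10, size=1_000),
})

# Slow path: one Python function call per row
slow = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Fast path: a single vectorized operation over whole columns
fast = df["price"] * df["quantity"]

# Identical results means the optimization is safe to ship
pd.testing.assert_series_equal(slow, fast)
```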

&lt;h2&gt;
  
  
  Days 13-14: Making Data Problems Visible
&lt;/h2&gt;

&lt;p&gt;The weekend project was building a data quality dashboard. I took the matplotlib visualizations from Day 13 and wrapped them in a Streamlit app.&lt;/p&gt;

&lt;p&gt;The result: upload any CSV, instantly see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amount distribution (with outliers highlighted in red)&lt;/li&gt;
&lt;li&gt;Time series (with missing data periods shaded)&lt;/li&gt;
&lt;li&gt;Age distribution (valid vs impossible values)&lt;/li&gt;
&lt;li&gt;Category balance (class imbalance visualization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus automated detection of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Statistical outliers&lt;/li&gt;
&lt;li&gt;Invalid ages (negative or &amp;gt;120)&lt;/li&gt;
&lt;li&gt;Negative amounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each issue is reported with specific counts and a recommendation.&lt;/p&gt;
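&lt;p&gt;The detection logic itself is plain pandas, which keeps it testable outside Streamlit. A rough sketch of what such checks might look like (function names and thresholds here are hypothetical, not my exact code):&lt;/p&gt;

```python
import pandas as pd

def find_missing(df: pd.DataFrame) -> int:
    """Total count of missing cells across the frame."""
    return int(df.isna().sum().sum())

def find_invalid_ages(df: pd.DataFrame) -> int:
    """Ages that are negative or over 120 are physically impossible."""
    bad = (df["age"] < 0) | (df["age"] > 120)
    return int(bad.sum())

def find_negative_amounts(df: pd.DataFrame) -> int:
    """Transaction amounts below zero are flagged as suspect."""
    return int((df["amount"] < 0).sum())

df = pd.DataFrame({
    "age": [34, -1, 130, 52],
    "amount": [10.0, None, -5.0, 20.0],
})

print(find_missing(df), find_invalid_ages(df), find_negative_amounts(df))  # 1 2 1
```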

&lt;p&gt;&lt;strong&gt;What I learned about Streamlit:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's refreshingly simple. The entire script reruns on every user interaction, which sounds inefficient but makes the programming model dead simple. No state management, no callbacks, no frontend/backend separation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uploaded_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose a CSV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;uploaded_file&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Show visualizations...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Upload → Process → Display. No web development required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "calculate once, use twice" pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I caught myself calling the same detection functions multiple times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inefficient - calls function twice:
&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found missing values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My C++ performance instincts kicked in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Better - calculate once:
&lt;/span&gt;&lt;span class="n"&gt;missing_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;missing_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found missing values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a huge deal for small datasets, but good habits matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What My C++ Background Got Right and Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What transferred well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance awareness:&lt;/strong&gt; I instinctively noticed when operations might be slow and looked for vectorized alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory layout intuition:&lt;/strong&gt; Understanding that NumPy arrays are contiguous in memory made sense immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type thinking:&lt;/strong&gt; Python's type hints feel natural. When pandas operations convert uint8 to float64, I notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging mindset:&lt;/strong&gt; Add logging, test edge cases, isolate the problem systematically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I had to unlearn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loops are fine → Loops are death:&lt;/strong&gt; In C++, loops are normal. In pandas, they're 150x slower. This is a fundamental mental shift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control flow is explicit → Control flow is vectorized:&lt;/strong&gt; Can't use &lt;code&gt;if/elif/else&lt;/code&gt; on arrays. Must use &lt;code&gt;np.select()&lt;/code&gt; or &lt;code&gt;np.where()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build from scratch → Use the ecosystem:&lt;/strong&gt; C++ culture is "roll your own." Python ML culture is "there's definitely a library for that."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest surprise: &lt;strong&gt;NumPy gives me C++ performance without writing C++&lt;/strong&gt;. Most of the time. When I eventually need even more speed, the roadmap has me writing custom C++ extensions. But for now, vectorized NumPy is fast enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Discovery-Based Learning Struggle
&lt;/h2&gt;

&lt;p&gt;The hardest part of this week wasn't the code - it was staying curious instead of copying solutions.&lt;/p&gt;

&lt;p&gt;When Claude suggested using &lt;code&gt;np.transpose(image, (1, 0, 2))&lt;/code&gt; for rotation, I had to force myself to stop and ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does the &lt;code&gt;(1, 0, 2)&lt;/code&gt; tuple actually mean?&lt;/li&gt;
&lt;li&gt;Why those specific numbers?&lt;/li&gt;
&lt;li&gt;What happens if I change the order?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns a 5-minute "just make it work" into a 20-minute learning session where I actually understand axis manipulation.&lt;/p&gt;

&lt;p&gt;Same with &lt;code&gt;pd.to_numeric(..., errors='coerce')&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does 'coerce' do?&lt;/li&gt;
&lt;li&gt;What are the alternatives?&lt;/li&gt;
&lt;li&gt;When would I use 'raise' or 'ignore' instead?&lt;/li&gt;
&lt;/ul&gt;
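&lt;p&gt;Here's roughly how those modes differ in practice (worth noting: &lt;code&gt;errors='ignore'&lt;/code&gt;, which returned the input unchanged, is deprecated in recent pandas versions):&lt;/p&gt;

```python
import pandas as pd

s = pd.Series(["1", "2", "oops"])

# 'coerce' silently turns unparseable values into NaN
coerced = pd.to_numeric(s, errors="coerce")
assert coerced.isna().tolist() == [False, False, True]

# 'raise' (the default) fails loudly instead, which suits strict pipelines
try:
    pd.to_numeric(s, errors="raise")
except ValueError:
    print("raise mode surfaced the bad value")
```

&lt;p&gt;Which one you want depends on the pipeline: &lt;code&gt;coerce&lt;/code&gt; when downstream code can handle NaN, &lt;code&gt;raise&lt;/code&gt; when bad data should stop the run.&lt;/p&gt;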

&lt;p&gt;It's slower. Sometimes frustrating. But it's the difference between having code that works vs understanding why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Tripped Me Up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "Rumpelstiltskin problem" is real.&lt;/strong&gt; The hardest part of learning pandas isn't understanding concepts - it's knowing what operations exist and what they're called. &lt;/p&gt;

&lt;p&gt;I can't use &lt;code&gt;.mask()&lt;/code&gt; if I don't know it exists. I can't search for "how to do X" if I don't know X is called "broadcasting." This is where having Claude as a guide helps - it can suggest the right operation for the problem, then I go understand how it works.&lt;/p&gt;
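&lt;p&gt;For anyone else who didn't know &lt;code&gt;.mask()&lt;/code&gt; existed, a quick illustration (with its mirror, &lt;code&gt;.where()&lt;/code&gt;):&lt;/p&gt;

```python
import pandas as pd

s = pd.Series([10, -3, 25, -1])

# .mask() replaces values WHERE the condition is True...
no_negatives = s.mask(s < 0, 0)
assert no_negatives.tolist() == [10, 0, 25, 0]

# ...while .where() keeps values where the condition is True
kept = s.where(s > 0, 0)
assert kept.tolist() == [10, 0, 25, 0]
```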

&lt;p&gt;&lt;strong&gt;NaN propagation is weird.&lt;/strong&gt; Coming from languages where NULL works differently, pandas' NaN behavior took getting used to. It silently propagates through operations in ways that break boolean logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Without na=False, NaN breaks filtering:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns [True, False, NaN, True]
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  &lt;span class="c1"&gt;# ERROR!
&lt;/span&gt;
&lt;span class="c1"&gt;# Must handle explicitly:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns [True, False, False, True]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Week 2 vs Week 1
&lt;/h2&gt;

&lt;p&gt;Week 1 was about development practices (testing, error handling, packaging). Week 2 was about the actual data manipulation tools (NumPy, pandas, visualization).&lt;/p&gt;

&lt;p&gt;Both feel essential. You can't build production ML without both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean code that doesn't crash (Week 1)&lt;/li&gt;
&lt;li&gt;Fast data processing that scales (Week 2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination is what makes ML engineering work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Week 3 starts traditional machine learning - linear models, decision trees, ensemble methods. Still using Claude's discovery-based approach: here's the problem, here's the documentation, now figure it out.&lt;/p&gt;

&lt;p&gt;I'm getting more comfortable with this pattern. The first few days I wanted explicit instructions. Now I appreciate the struggle - it's where the learning happens.&lt;/p&gt;

&lt;p&gt;Also: I've told Claude to start writing most of my tests because I understand the patterns now. Learning to delegate to AI is part of learning with AI.&lt;/p&gt;

&lt;p&gt;Two weeks in. Still no neural networks. Just data engineering foundations. And honestly? I'm starting to understand why everyone says data engineering is 80% of ML work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About this series:&lt;/strong&gt; I'm a software engineer learning ML using a custom roadmap designed by Claude. The approach focuses on production skills and problem-solving over tutorials. Week 2 complete: NumPy, pandas, and an interactive data quality dashboard. All code and daily summaries on [GitHub link].&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Feedback welcome: Did the C++ perspective add value or just clutter? Should I include more code examples or keep it high-level?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why a C++ Systems Engineer is Learning Machine Learning</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Fri, 03 Oct 2025 20:44:40 +0000</pubDate>
      <link>https://forem.com/dave_bean/why-a-c-systems-engineer-is-learning-machine-learning-3ffn</link>
      <guid>https://forem.com/dave_bean/why-a-c-systems-engineer-is-learning-machine-learning-3ffn</guid>
      <description>&lt;p&gt;&lt;em&gt;A senior systems programmer's journey into AI/ML - Week 1 reflections&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision
&lt;/h2&gt;

&lt;p&gt;After spending over a decade building high-performance C++ systems in defense and aerospace, I've made a decision: I'm learning machine learning. Not casually browsing tutorials on weekends, but committing to a structured 12-month roadmap with one hour of focused work every single day.&lt;/p&gt;

&lt;p&gt;Why? Because the intersection of systems engineering and ML represents one of the most valuable skill combinations in tech right now. MLOps engineers see 9.8× demand growth with salaries averaging $122k-$167k. More importantly, most ML practitioners lack deep systems knowledge, while most systems engineers don't understand ML. I'm betting that bridging this gap is worth the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Feels Different
&lt;/h2&gt;

&lt;p&gt;I've looked at ML courses before. They all seem to follow the same pattern: install Anaconda, run some scikit-learn examples, train a model on the Iris dataset, celebrate. That's fine for getting started, but it doesn't prepare you for production systems where models fail silently, data pipelines break, and performance matters.&lt;/p&gt;

&lt;p&gt;So I chose a different approach: a &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;discovery-based roadmap&lt;/a&gt; that prioritizes production skills from day one. Instead of copying tutorial code, I solve problems by reading documentation, debugging issues independently, and building understanding through experimentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mindset shift:&lt;/strong&gt; I'm not learning to run ML models. I'm learning to build ML systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 1: Building Something Real
&lt;/h2&gt;

&lt;p&gt;Most "Week 1 ML" tutorials have you print "Hello World" and maybe plot a graph. My Week 1 looked different.&lt;/p&gt;

&lt;p&gt;I built a &lt;a href="https://github.com/bean2778/ai_learning_2025/tree/main/day_02" rel="noopener noreferrer"&gt;data quality checker&lt;/a&gt;. Sounds boring, right? But here's the thing - I have no idea what makes good ML data. I'm literally learning this from an AI assistant (Claude) in real-time, using a roadmap designed to make me figure things out rather than copy-paste solutions.&lt;/p&gt;

&lt;p&gt;The framework analyzes numeric, categorical, and temporal data. It detects outliers, finds missing values, identifies data quality issues. It has 44 tests because I spent two full days just writing tests.&lt;/p&gt;

&lt;p&gt;But honestly? I don't know if these are the &lt;em&gt;right&lt;/em&gt; checks for ML. I'm a C++ guy who knows about memory management and thread safety. Data quality for machine learning? That's completely new territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 1: Just Make It Not Crash
&lt;/h3&gt;

&lt;p&gt;First day, I wrote a function to check data quality. Coming from C++, my instinct was to write something that handles edge cases without dying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_data_quality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;clean_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clean_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;no valid data points&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Continue with analysis...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI teaching me asked: "Why not just let it crash with an error?"&lt;/p&gt;

&lt;p&gt;Because in my world, if your distributed system crashes because someone passed it bad data, you've failed. You handle errors gracefully, you log what happened, you return something useful.&lt;/p&gt;

&lt;p&gt;Apparently that's also important for ML pipelines. Who knew? (Everyone who does ML, probably. But I didn't.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 2-3: Making It Installable
&lt;/h3&gt;

&lt;p&gt;While I was setting up proper Python packaging with &lt;code&gt;pyproject.toml&lt;/code&gt;, I kept thinking "this seems like overkill for a learning project."&lt;/p&gt;

&lt;p&gt;But the roadmap insisted: documentation, logging, proper module structure from day one. Not because the code is complex, but because production habits need to be habitual.&lt;/p&gt;

&lt;p&gt;Fine. I wrote docstrings. I set up logging. I made it pip installable.&lt;/p&gt;
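&lt;p&gt;For reference, a minimal &lt;code&gt;pyproject.toml&lt;/code&gt; is all it takes (the package name here is hypothetical, not my actual project):&lt;/p&gt;

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "data-quality-checker"   # hypothetical package name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["pandas"]
```

&lt;p&gt;After that, &lt;code&gt;pip install -e .&lt;/code&gt; makes the package importable from anywhere in the environment.&lt;/p&gt;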

&lt;p&gt;Two days later when I had to debug why my tests were failing, those logs saved me 30 minutes of confusion. The docstrings reminded me what I was trying to do. Point taken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 4-5: Testing Like My Career Depends On It
&lt;/h3&gt;

&lt;p&gt;I spent two days writing tests. Not "does it run" tests. Real tests. Unit tests, integration tests, property-based tests using a library called Hypothesis that generates random inputs to find bugs.&lt;/p&gt;

&lt;p&gt;Hypothesis found actual bugs I never would have caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Floating-point precision issues with large numbers&lt;/li&gt;
&lt;li&gt;Numerical overflow with extreme values
&lt;/li&gt;
&lt;li&gt;CSV type conversion errors where pandas read numbers as strings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where my C++ background actually helped. I know what edge cases look like. I know that "works on my machine" isn't good enough. I know that systems fail in weird ways when you least expect it.&lt;/p&gt;

&lt;p&gt;Turns out that's useful for ML too. Data is messy. Edge cases are everywhere. Tests catch problems before they break production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 6-7: The "I Have No Idea" Moment
&lt;/h3&gt;

&lt;p&gt;Weekend project: add temporal data analysis. Dates, timestamps, time series stuff.&lt;/p&gt;

&lt;p&gt;I built gap detection - finding missing dates in time series data. The algorithm calculates time deltas between dates, finds the most common one, flags anything bigger as a gap.&lt;/p&gt;
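&lt;p&gt;The core of the idea fits in a few lines (toy dates, not my actual implementation):&lt;/p&gt;

```python
import pandas as pd

# Infer the expected spacing from the most common delta,
# then flag anything larger as a gap
dates = pd.to_datetime([
    "2025-01-01", "2025-01-02", "2025-01-03",
    "2025-01-07",  # three days missing before this one
    "2025-01-08",
]).sort_values()

deltas = pd.Series(dates).diff().dropna()
expected = deltas.mode()[0]        # most common spacing: 1 day
gaps = deltas[deltas > expected]   # anything bigger is a gap

print(len(gaps))  # 1
```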

&lt;p&gt;Then Claude (the AI helping me learn, and write these blog posts) asked: "What temporal quality checks matter most for ML?"&lt;/p&gt;

&lt;p&gt;My answer: "I really have no idea. I'm doing this whole course to find that out."&lt;/p&gt;

&lt;p&gt;And you know what? That was the right answer.&lt;/p&gt;

&lt;p&gt;Claude's response: "Start simple, document your assumptions, make it observable, iterate later. This is how real ML engineering works. Even senior engineers build V1 without knowing all requirements."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That&lt;/strong&gt; was valuable. Not because I learned some ML best practice, but because I learned it's okay to not know. You build something reasonable, you see how it's used, you improve it.&lt;/p&gt;

&lt;p&gt;This actually feels familiar. People think defense/aerospace work is all upfront specs and formal requirements. Reality? You get dropped into a mess of legacy systems, vague requirements, and contradictory stakeholder demands, then you hack your way through until something works. ML engineering sounds similar, just with different tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Integer Problem
&lt;/h3&gt;

&lt;p&gt;Here's a fun debugging story. I wrote a dispatcher that automatically figures out if your data is numeric, categorical, or temporal (dates/times).&lt;/p&gt;

&lt;p&gt;Initial version routed &lt;code&gt;[1, 2, 3, 4, 5]&lt;/code&gt; to the temporal analyzer. Why? Because pandas happily interprets small integers as offsets from the Unix epoch (midnight on January 1, 1970). So &lt;code&gt;[1, 2, 3, 4, 5]&lt;/code&gt; parsed as a perfectly valid sequence of timestamps.&lt;/p&gt;

&lt;p&gt;That's... not what anyone would expect.&lt;/p&gt;

&lt;p&gt;Solution: Only test large integers (&amp;gt;946684800, roughly year 2000) as potential timestamps. Small integers default to numeric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;946684800&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Large integers: might be Unix timestamps
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raise&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;temp_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;span class="c1"&gt;# Small ints: skip temporal test, treat as numeric
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have no idea if this is how production ML systems handle this. But it makes sense, tests pass, and it solves the immediate problem. V2 can be smarter if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pandas is kind of amazing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coming from C++ where you manually manage everything, pandas feels like cheating:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Frequency distribution in one line
&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Date parsing with error handling
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What would be 20-30 lines of careful C++ becomes a method call. I can see why everyone uses this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not knowing is fine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That moment when Claude asked what ML engineers need for temporal data and I said "I have no idea" - that felt vulnerable. Like admitting I don't know what I'm doing.&lt;/p&gt;

&lt;p&gt;But it led to the best insight of the week: nobody knows everything upfront. You build something reasonable, document your assumptions, ship it, learn from how it's used, improve it later.&lt;/p&gt;

&lt;p&gt;That's actually freeing. I can stop trying to make perfect decisions with incomplete information and just... build something that works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Systems thinking transfers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My C++ experience helped with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture decisions (I used the Strategy pattern without even thinking about it)&lt;/li&gt;
&lt;li&gt;Understanding when to optimize vs. when good enough is fine&lt;/li&gt;
&lt;li&gt;Knowing that defensive programming matters&lt;/li&gt;
&lt;li&gt;Writing code that won't confuse me in six months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I'm learning entirely new patterns: how pandas works, why statistical validation matters, what makes data "good" for ML (still figuring this one out).&lt;/p&gt;

&lt;p&gt;It's weirdly complementary. Systems knowledge gives me structure. ML is teaching me to think about data differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built (In Plain English)
&lt;/h2&gt;

&lt;p&gt;The data quality framework has three analyzers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Numeric:&lt;/strong&gt; Checks numbers - calculates mean, standard deviation, finds outliers using a 2-sigma rule. I don't know if 2-sigma is the right threshold for ML, but it's what I learned in college and it seems reasonable.&lt;/p&gt;
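&lt;p&gt;A minimal sketch of the 2-sigma idea (hypothetical names, not the analyzer's actual code):&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch: flag values more than `sigmas` standard deviations
# from the mean. `find_outliers` is an illustrative name, not the
# project's real function.
def find_outliers(values, sigmas=2.0):
    s = pd.Series(values, dtype="float64")
    mean, std = s.mean(), s.std()
    return s[(s - mean).abs() > sigmas * std].tolist()

print(find_outliers([10, 11, 9, 10, 12, 100]))  # flags the 100
```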

&lt;p&gt;&lt;strong&gt;Categorical:&lt;/strong&gt; Checks text/category data - counts unique values, finds frequency distribution, identifies the most and least common items. Warns you if you accidentally passed it numbers.&lt;/p&gt;
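&lt;p&gt;Something like this, roughly (again, made-up names for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch of the categorical checks: unique count, frequency
# table, most common value, and a warning if the data looks numeric.
def analyze_categorical(values):
    s = pd.Series(values)
    counts = s.value_counts()
    return {
        "unique": int(s.nunique()),
        "frequencies": counts.to_dict(),
        "most_common": counts.index[0],
        "warning": ("looks numeric - wrong analyzer?"
                    if pd.api.types.is_numeric_dtype(s) else None),
    }

report = analyze_categorical(["wind", "solar", "wind", "hydro", "wind"])
```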

&lt;p&gt;&lt;strong&gt;Temporal:&lt;/strong&gt; Checks dates/times - finds the date range, detects gaps in time series (like missing days of sensor data), tries to figure out if data is regular (daily readings) or irregular (random events).&lt;/p&gt;

&lt;p&gt;Plus a dispatcher that looks at your data, figures out which type it probably is, and routes it to the right analyzer. Uses something called Yamane's formula for sampling so it doesn't have to look at every single item in huge datasets.&lt;/p&gt;
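&lt;p&gt;For reference, Yamane's formula is n = N / (1 + N * e^2), where N is the population size and e is the margin of error. How the dispatcher applies the resulting sample is my guess; this just shows the formula itself:&lt;/p&gt;

```python
import math

# Yamane's formula for sample size: n = N / (1 + N * e**2).
# N = population size, e = desired margin of error (0.05 = 5%).
def yamane_sample_size(population, margin_of_error=0.05):
    return math.ceil(population / (1 + population * margin_of_error ** 2))

print(yamane_sample_size(181_915))  # the dataset's row count -> 400
```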

&lt;p&gt;Is this what professional ML engineers use? I have literally no idea. But it works, it has tests, and it solves problems I can understand: don't let bad data silently break your stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality Check
&lt;/h2&gt;

&lt;p&gt;Here's what I don't know yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What actually makes data "good" for ML models&lt;/li&gt;
&lt;li&gt;When my outlier detection would help vs. hurt&lt;/li&gt;
&lt;li&gt;Whether these are the right data quality checks&lt;/li&gt;
&lt;li&gt;How real ML pipelines handle this stuff&lt;/li&gt;
&lt;li&gt;Literally anything about neural networks, transformers, or the AI stuff people talk about&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what I do know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to write code that handles errors gracefully&lt;/li&gt;
&lt;li&gt;How to test thoroughly&lt;/li&gt;
&lt;li&gt;How to structure projects so they don't become unmaintainable messes&lt;/li&gt;
&lt;li&gt;How to read documentation and figure stuff out&lt;/li&gt;
&lt;li&gt;That pandas is really handy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Week 1 taught me that systems engineering skills transfer to ML tooling, even when I don't know the ML part yet. The fundamentals are the same: handle errors, test thoroughly, document clearly, build things that won't break six months from now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Week: NumPy
&lt;/h2&gt;

&lt;p&gt;Week 2 is about NumPy - arrays, vectorization, memory layout, all that stuff. Coming from C++, this actually sounds interesting. Arrays and memory? That's my comfort zone.&lt;/p&gt;

&lt;p&gt;The roadmap says I'll be doing image transformations using only NumPy (no OpenCV). Not sure why yet, but I'm guessing it's about understanding how the low-level stuff works before using the high-level libraries.&lt;/p&gt;

&lt;p&gt;After that: actual machine learning. Linear models, decision trees, ensemble methods. The stuff that makes predictions.&lt;/p&gt;

&lt;p&gt;But first: arrays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Document This?
&lt;/h2&gt;

&lt;p&gt;A few reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accountability&lt;/strong&gt; - Harder to skip days when you've committed publicly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perspective&lt;/strong&gt; - I'm learning this as a complete ML beginner but an experienced systems engineer. Maybe that viewpoint helps someone else in the same boat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt; - Most learning blogs are polished success stories. I'm sharing the actual process: bugs, confusion, "I have no idea" moments, and figuring it out anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection&lt;/strong&gt; - If you're also transitioning into ML from systems/C++/infrastructure work, or if you're interested in the production/systems side of ML, let's talk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Commitment
&lt;/h2&gt;

&lt;p&gt;One hour per day. Seven days a week. For twelve months.&lt;/p&gt;

&lt;p&gt;That's what the roadmap promised, anyway. Reality? More like 1.5-2 hours most days. Turns out AIs are optimistic about how long things take. They're great at designing curricula but bad at estimating "figure out why your import statement doesn't work" time.&lt;/p&gt;

&lt;p&gt;Day 7 was supposed to include writing a report generator in 10 minutes. I know string formatting - I didn't need a lesson on that. So I just had the AI write that function. It was 120 lines long. I don't know why it thought that was a 10-minute task, but that's the way it is, I guess.&lt;/p&gt;

&lt;p&gt;Other things take longer because you hit a real problem. Type detection ambiguity. CSV parsing weirdness. Tests that fail for mysterious reasons. That's where the actual learning happens.&lt;/p&gt;

&lt;p&gt;Week 1: Probably 10-12 hours total, one complete portfolio project.&lt;/p&gt;

&lt;p&gt;If I keep this pace: more like 500-700 hours over the year instead of 365, but still very achievable. The consistency matters more than the exact hours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; ✅&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Weeks 2-8:&lt;/strong&gt; Traditional ML&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Months 3-6:&lt;/strong&gt; Deep Learning&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Months 7-12:&lt;/strong&gt; Specialization (probably ML Systems Engineering - combining C++ performance work with ML)&lt;/p&gt;

&lt;p&gt;One hour at a time. Or two. We'll see how optimistic Claude gets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me: &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devjournal</category>
      <category>career</category>
      <category>cpp</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
