<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Bean</title>
    <description>The latest articles on Forem by David Bean (@dave_bean).</description>
    <link>https://forem.com/dave_bean</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3538472%2Fbd89546f-a9f5-4cb9-b6ef-7f3b5b43f69b.png</url>
      <title>Forem: David Bean</title>
      <link>https://forem.com/dave_bean</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dave_bean"/>
    <language>en</language>
    <item>
      <title>Building My First ML Data Pipeline</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Tue, 21 Oct 2025 22:18:54 +0000</pubDate>
      <link>https://forem.com/dave_bean/building-my-first-ml-data-pipeline-three-days-one-deployed-dashboard-and-a-lesson-about-letting-2dif</link>
      <guid>https://forem.com/dave_bean/building-my-first-ml-data-pipeline-three-days-one-deployed-dashboard-and-a-lesson-about-letting-2dif</guid>
      <description>&lt;h2&gt;
  
  
  Three Days, One Deployed Dashboard, and a Lesson About Letting Data Drive Business Questions
&lt;/h2&gt;

&lt;p&gt;I just finished my first complete machine learning project—a renewable energy investment analysis dashboard that's now live on Streamlit Cloud. Three days of work. 181,915 rows of data. And one really important lesson: your initial business problem is probably wrong.&lt;/p&gt;

&lt;p&gt;I'm a software engineer learning ML with Claude designing my course. This project clarified a lot about how data science work actually happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: When Your Business Problem Meets Reality
&lt;/h2&gt;

&lt;p&gt;I started with a plan: build a tool to help optimize fossil fuel plant modernization schedules based on renewable production patterns. Sounded reasonable. Turned out to be impossible with my data.&lt;/p&gt;

&lt;p&gt;I had a renewable energy dataset covering 52 countries from 2010-2022. Six energy types. Good coverage. But after loading it into the interactive EDA dashboard I'd built the previous week, reality hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset showed production, not capacity or demand&lt;/li&gt;
&lt;li&gt;Renewables depend on weather—you can't schedule them&lt;/li&gt;
&lt;li&gt;No grid data, no regional breakdowns&lt;/li&gt;
&lt;li&gt;Historical trends can't predict modernization timing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My business problem didn't match what the data could actually answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pivot:&lt;/strong&gt; I asked a different question. Instead of "when should plants modernize," I asked "which countries represent the best opportunities for battery storage investments based on renewable penetration, growth rates, and energy mix diversity?"&lt;/p&gt;

&lt;p&gt;That question? The data could answer it perfectly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Learned: Validate Before You Commit
&lt;/h3&gt;

&lt;p&gt;The EDA dashboard from Week 2 was useful here. Twenty minutes of exploration showed me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale mismatches (totals mixed with individual sources)&lt;/li&gt;
&lt;li&gt;Missing data patterns (expected in first-year entries)&lt;/li&gt;
&lt;li&gt;Distribution issues (couldn't fix with log transforms)&lt;/li&gt;
&lt;li&gt;Time coverage worked for trend analysis&lt;/li&gt;
&lt;/ul&gt;
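&lt;p&gt;Those checks don't require anything fancy; two pandas one-liners surface most of them (toy data here, not the real dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Norway", "Norway", "Iceland"],
    "year": [2010, 2011, 2010],
    "production_gwh": [np.nan, 1200.0, 950.0],
})

# Share of missing values per column: quickly surfaces patterns like
# "first-year entries are blank".
print(df.isna().mean())

# Per-column min/max/quartiles: wildly different magnitudes in one
# value column hint that categories are being mixed.
print(df["production_gwh"].describe())
```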

&lt;p&gt;Claude pointed out the business problem didn't match the data. You deal with the situation you're in, so we pivoted to a question the data could actually answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1 Continued: The Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Coming from C++ where I think about data flow and single responsibilities, I built a five-function pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;load_and_clean&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;calculate_metrics&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function takes a DataFrame, returns a DataFrame, has one clear job, prints progress, and handles edge cases.&lt;/p&gt;
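&lt;p&gt;As a sketch, the skeleton of that pattern looks something like this (the column names, cleaning rules, and toy metric are illustrative stand-ins, not the project's actual code):&lt;/p&gt;

```python
import pandas as pd

def load_and_clean(path):
    """Load the raw CSV and drop rows missing production values."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["production_gwh"])
    print(f"load_and_clean: {len(df)} rows")
    return df

def filter_sources(df):
    """Keep discrete renewable sources; drop aggregate totals."""
    keep = ["Hydro", "Wind", "Solar", "Geothermal", "Other"]
    df = df[df["energy_type"].isin(keep)]
    print(f"filter_sources: {len(df)} rows")
    return df

def aggregate(df):
    """Sum production per country."""
    out = df.groupby("country", as_index=False)["production_gwh"].sum()
    print(f"aggregate: {len(out)} countries")
    return out

def calculate_metrics(df):
    """Derive a toy opportunity score (stand-in for the real formula)."""
    df = df.assign(score=df["production_gwh"].rank(pct=True) * 100)
    print("calculate_metrics: done")
    return df

def rank(df):
    """Order countries by score, best first."""
    out = df.sort_values("score", ascending=False).reset_index(drop=True)
    print("rank: done")
    return out
```

&lt;p&gt;Because every stage shares the DataFrame-in, DataFrame-out contract, each one can be tested in isolation and the whole chain stays easy to reorder.&lt;/p&gt;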

&lt;h3&gt;
  
  
  The Scale Problem I Almost Got Wrong
&lt;/h3&gt;

&lt;p&gt;Early on, my visualizations looked terrible. Some categories showed values 100x larger than others. My first instinct: log transformation.&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;

&lt;p&gt;The real issue: my data mixed individual renewable sources (Hydro = 1,000 GWh) with aggregate totals (Total Electricity = 200,000 GWh). These shouldn't be on the same chart at all.&lt;/p&gt;

&lt;p&gt;Solution: Filter out aggregates entirely. Keep only the discrete renewable sources.&lt;/p&gt;

&lt;p&gt;This wasn't a math problem—it was a data structure problem. No transformation fixes a fundamental category mismatch.&lt;/p&gt;
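&lt;p&gt;A quick way to confirm that diagnosis before reaching for any transforms (the category names here are stand-ins):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "energy_type": ["Hydro", "Wind", "Total Electricity", "Solar"],
    "production_gwh": [1_000.0, 800.0, 200_000.0, 300.0],
})

# Per-category maxima make aggregate rows obvious: "Total Electricity"
# sits two orders of magnitude above any single source.
print(df.groupby("energy_type")["production_gwh"].max())

# The structural fix: drop aggregates, keep discrete sources.
sources = df[~df["energy_type"].str.startswith("Total")]
print(sources["energy_type"].tolist())
```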

&lt;h2&gt;
  
  
  Day 2: When Your Model Is "Wrong" (But Actually Right)
&lt;/h2&gt;

&lt;p&gt;I trained a Random Forest model to predict storage infrastructure scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Percentages of Hydro, Wind, Solar, Geothermal, Other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Storage need score (0-100)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; R² = 0.948&lt;/li&gt;
&lt;/ul&gt;
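&lt;p&gt;The post doesn't include the training code, but the shape of it, with synthetic stand-in data and a made-up target formula, is roughly:&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in features: five energy-mix percentages per country-year.
X = rng.uniform(0, 100, size=(300, 5))   # hydro, wind, solar, geo, other
# Stand-in target: a score that depends on the mix (not the real formula).
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on training data only, and keep it around: the model
# expects scaled inputs at prediction time forever after.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(scaler.transform(X_train), y_train)

print("R^2 on held-out data:", model.score(scaler.transform(X_test), y_test))
```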

&lt;p&gt;Model worked. Then I tested extreme cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100% Hydro:&lt;/strong&gt; Score 56.21&lt;br&gt;&lt;br&gt;
&lt;strong&gt;100% Wind:&lt;/strong&gt; Score 31.37&lt;/p&gt;

&lt;p&gt;Wait. Wind is intermittent—shouldn't it need MORE storage than stable hydro? Why was my model backwards?&lt;/p&gt;

&lt;p&gt;I debugged for 15 minutes before realizing: the model wasn't wrong. My assumption was.&lt;/p&gt;

&lt;p&gt;My Day 1 scoring formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;storage_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;renewable_share&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;growth_rate&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;diversity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This measured &lt;strong&gt;investment opportunity&lt;/strong&gt;, not &lt;strong&gt;technical storage need&lt;/strong&gt;. Countries with high hydro (Norway, Iceland) scored high because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very high renewable penetration (27-30%)&lt;/li&gt;
&lt;li&gt;Mature markets ready for more storage&lt;/li&gt;
&lt;li&gt;High penetration signals strong renewable commitment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model learned exactly what I trained it on. I just forgot what I'd actually built versus what I thought I was building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Models optimize for your training signal, not your intentions. When behavior seems wrong, check what you actually trained it on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 3: Production Deployment Teaches Fast
&lt;/h2&gt;

&lt;p&gt;I built a four-tab Streamlit dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overview: Top 10 investment opportunities&lt;/li&gt;
&lt;li&gt;Country Analysis: Interactive comparisons&lt;/li&gt;
&lt;li&gt;Predictions: ML model with input sliders&lt;/li&gt;
&lt;li&gt;Technical Details: Full methodology&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Building for production exposed design flaws I'd never catch in a Jupyter notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Path Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Local:&lt;/strong&gt; &lt;code&gt;model = joblib.load('storage_model.pkl')&lt;/code&gt; worked fine&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Streamlit Cloud:&lt;/strong&gt; &lt;code&gt;FileNotFoundError&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Why? My dashboard lived in a &lt;code&gt;src/&lt;/code&gt; subfolder, models in the parent directory. Relative paths resolved from where the code runs, not where the file lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;current_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;parent_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;storage_model.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
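&lt;p&gt;The same idea reads a little cleaner with &lt;code&gt;pathlib&lt;/code&gt; (a sketch; the &lt;code&gt;parent.parent&lt;/code&gt; hop assumes the &lt;code&gt;src/&lt;/code&gt; layout described above):&lt;/p&gt;

```python
from pathlib import Path

# Resolve relative to this source file, not the process's working
# directory, so the path survives deployment to Streamlit Cloud.
HERE = Path(__file__).resolve().parent
MODEL_PATH = HERE.parent / "storage_model.pkl"
print(MODEL_PATH.name)
```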



&lt;h3&gt;
  
  
  Problem 2: Requirements File Location
&lt;/h3&gt;

&lt;p&gt;Streamlit Cloud looks for &lt;code&gt;requirements.txt&lt;/code&gt; at repository root, not in subdirectories. Took two deployment failures to figure this out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Feature Scaling
&lt;/h3&gt;

&lt;p&gt;Almost made a critical mistake: feeding raw percentages directly to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;hydro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Right:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;hydro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;input_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Scale first!
&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models trained on scaled features expect scaled inputs. Skip this step and the predictions still run without error; they're just silently wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Development and production environments have different problems. Same issues I deal with in systems work—environment differences, dependencies, synchronization—show up in ML deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Three Days Produced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard&lt;/strong&gt; with public URL&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt; with professional README&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Trained ML model&lt;/strong&gt; (three deployment patterns: batch/API/edge)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Complete data pipeline&lt;/strong&gt; with reproducible preprocessing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; with screenshots&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top investment opportunities identified:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Netherlands (63.08) - 838% growth rate&lt;/li&gt;
&lt;li&gt;Iceland (62.05) - 29.5% renewable penetration&lt;/li&gt;
&lt;li&gt;Norway (59.47) - Strong baseline, steady growth&lt;/li&gt;
&lt;li&gt;Hungary (52.82) - 658% growth, emerging market&lt;/li&gt;
&lt;li&gt;UK (48.90) - Large market, 504% growth&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Technical stats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;181,915 data points processed&lt;/li&gt;
&lt;li&gt;52 countries analyzed&lt;/li&gt;
&lt;li&gt;156 months of time series&lt;/li&gt;
&lt;li&gt;8,033 predictions/second (batch)&lt;/li&gt;
&lt;li&gt;89.4 KB model (ONNX edge deployment)&lt;/li&gt;
&lt;li&gt;R² = 0.948&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Actually Surprised Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Preprocessing Takes Most of the Time
&lt;/h3&gt;

&lt;p&gt;In C++, optimization takes most of the time. In ML, data cleaning and feature engineering dominated. Good preprocessing makes modeling straightforward. Bad preprocessing makes it impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Production Deployment Shows Problems Fast
&lt;/h3&gt;

&lt;p&gt;Jupyter notebooks hide issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Path dependencies&lt;/li&gt;
&lt;li&gt;Environment differences&lt;/li&gt;
&lt;li&gt;Feature scaling synchronization&lt;/li&gt;
&lt;li&gt;Input validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploying early forced me to deal with these.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The README Matters
&lt;/h3&gt;

&lt;p&gt;I spent 30 minutes writing a professional README:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business problem clearly stated&lt;/li&gt;
&lt;li&gt;Technical approach explained&lt;/li&gt;
&lt;li&gt;Setup instructions&lt;/li&gt;
&lt;li&gt;Screenshots&lt;/li&gt;
&lt;li&gt;Live demo URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project looks more complete with good documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. End-to-End Matters More Than Depth
&lt;/h3&gt;

&lt;p&gt;I could've spent three days optimizing model accuracy from 0.948 to 0.952. Instead I built a complete pipeline: data → model → deployment → documentation.&lt;/p&gt;

&lt;p&gt;When it comes to actual job hunting, I'm betting the complete pipeline matters more than the extra accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Bugs I Hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bug 1:&lt;/strong&gt; Streamlit Cloud couldn't find plotly module&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; &lt;code&gt;requirements.txt&lt;/code&gt; in wrong directory&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Moved to repo root, specified &lt;code&gt;plotly&amp;gt;=5.0.0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2:&lt;/strong&gt; Model files not loading&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Relative paths broken in cloud environment&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Used &lt;code&gt;os.path.dirname(__file__)&lt;/code&gt; for portable paths&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3:&lt;/strong&gt; "Random Forest" truncated in UI columns&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Text too long for column width&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Made it a subheader instead of metric in column&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 4:&lt;/strong&gt; Predictions looked weird&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Forgot to scale input features&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Applied scaler before model.predict()&lt;/p&gt;

&lt;p&gt;Claude caught most of these during code review. I understand the patterns now—scoping issues, path management, feature preprocessing flow. I'm delegating implementation details and focusing on understanding architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This was Portfolio Project 1 of 6. Each project adds new capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 (Done):&lt;/strong&gt; Data analysis dashboard, traditional ML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2:&lt;/strong&gt; Traditional ML pipeline with feature engineering
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3:&lt;/strong&gt; Deep learning computer vision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4:&lt;/strong&gt; Generative AI with LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 5:&lt;/strong&gt; MLOps with CI/CD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 6:&lt;/strong&gt; ML systems engineering specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal isn't just learning ML—it's building a portfolio proving I can deliver production ML systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools That Helped
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; (dashboard framework)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plotly&lt;/strong&gt; (interactive viz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scikit-learn&lt;/strong&gt; (Random Forest, preprocessing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; (data manipulation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit Cloud&lt;/strong&gt; (deployment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; (course design, code review, debugging partner)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Live Demo &amp;amp; Code
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;strong&gt;Live Dashboard:&lt;/strong&gt; &lt;a href="https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/" rel="noopener noreferrer"&gt;https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
💻 &lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;https://github.com/bean2778/ai_learning_2025&lt;/a&gt;&lt;br&gt;&lt;br&gt;
📊 &lt;strong&gt;Dataset:&lt;/strong&gt; Global Renewable Energy Production (2010-2022)&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About this series:&lt;/strong&gt; I'm a software engineer learning machine learning with Claude designing my curriculum. Week 3 done: EDA, problem formulation, first portfolio project deployed. More posts coming on traditional ML, deep learning, and production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="http://www.linkedin.com/in/bean2778" rel="noopener noreferrer"&gt;www.linkedin.com/in/bean2778&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;https://github.com/bean2778/ai_learning_2025&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Previous: &lt;a href="https://dev.to/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam"&gt;blog 2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Next:&lt;/strong&gt; Traditional ML fundamentals—supervised learning, evaluation metrics, bias-variance tradeoff.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Time: 3 days (Days 19-21 of 270-day roadmap)&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Status: Portfolio Project 1 complete ✅&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Coffee consumed: Enough&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Blog Post 2: NumPy Through a C++ Programmer's Eyes</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Sat, 11 Oct 2025 01:54:30 +0000</pubDate>
      <link>https://forem.com/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam</link>
      <guid>https://forem.com/dave_bean/blog-post-2-numpy-through-a-c-programmers-eyes-3fam</guid>
      <description>&lt;h1&gt;
  
  
  Blog Post 2: NumPy Through a C++ Programmer's Eyes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Week Two: Finally Writing Code That Feels Fast
&lt;/h2&gt;

&lt;p&gt;Week two of my ML learning journey, and I'm starting to see why Python dominates machine learning despite being "slow."&lt;/p&gt;

&lt;p&gt;The secret? Most of the time, you're not actually running Python.&lt;/p&gt;

&lt;p&gt;This week was all about NumPy and pandas - the foundations of pretty much every ML library. And as someone who's written a lot of C++ code focused on performance, watching NumPy operations run was genuinely satisfying. These aren't slow Python loops. They're compiled C code operating on contiguous arrays, using SIMD instructions where possible.&lt;/p&gt;

&lt;p&gt;It's basically everything I love about C++ performance, wrapped in Python's convenience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 8: Building Image Transformations Without Image Libraries
&lt;/h2&gt;

&lt;p&gt;The first challenge: implement image transformations (rotate, flip, crop, brightness adjustment) using &lt;strong&gt;only NumPy&lt;/strong&gt;. No OpenCV, no PIL for the actual transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rotation algorithm was the fun part.&lt;/strong&gt; I knew I needed to rotate 90° clockwise, but which operations exactly? After some debugging with test patterns (red left half, blue right half), I figured it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rotate_90&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Transpose (swap rows and columns)
&lt;/span&gt;    &lt;span class="n"&gt;transposed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Flip vertically
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transposed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transpose alone doesn't give rotation - you need transpose + flip. I only really understood this after printing intermediate steps and tracing through what should happen to each quadrant.&lt;/p&gt;

&lt;p&gt;When Claude suggested &lt;code&gt;np.transpose(image, (1, 0, 2))&lt;/code&gt;, I made myself stop and ask: what does that tuple actually mean? Turns out &lt;code&gt;(1, 0, 2)&lt;/code&gt; means "put axis 1 first, axis 0 second, keep axis 2 third." So columns become rows, rows become columns, color channels stay unchanged. The debugging process of creating test patterns and visualizing transformations taught me more than just reading documentation would have.&lt;/p&gt;
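&lt;p&gt;NumPy also ships a reference you can check a hand-rolled rotation against: &lt;code&gt;np.rot90&lt;/code&gt;, where &lt;code&gt;k=-1&lt;/code&gt; means one clockwise quarter-turn. A quick self-check on a random image:&lt;/p&gt;

```python
import numpy as np

def rotate_90_cw(image):
    # Transpose swaps rows and columns; a left-right flip then turns
    # that into a clockwise quarter-turn (an up-down flip would give
    # counter-clockwise instead).
    return np.flip(np.transpose(image, (1, 0, 2)), axis=1)

img = np.random.default_rng(0).integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
assert np.array_equal(rotate_90_cw(img), np.rot90(img, k=-1))
print("matches np.rot90(img, k=-1)")
```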

&lt;p&gt;&lt;strong&gt;The performance difference is wild.&lt;/strong&gt; Every operation works on entire arrays at once. No loops over millions of pixels. &lt;code&gt;image * brightness_factor&lt;/code&gt; multiplies every single pixel value in one vectorized operation. This is the SIMD parallelism I'm used to from C++, but I didn't have to write it myself.&lt;/p&gt;
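&lt;p&gt;One caveat I'd flag with &lt;code&gt;image * brightness_factor&lt;/code&gt; (my note, not part of the original exercise): on &lt;code&gt;uint8&lt;/code&gt; arrays you have to clip before converting back, or values past 255 wrap around:&lt;/p&gt;

```python
import numpy as np

img = np.full((2, 2, 3), 200, dtype=np.uint8)

# Multiplying by a float promotes to float64; the cast back to uint8
# wraps values past 255 unless you clip first (200 * 1.5 = 300 would
# otherwise come back as 44).
brightened = np.clip(img * 1.5, 0, 255).astype(np.uint8)
print(brightened[0, 0])   # [255 255 255]
```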

&lt;h2&gt;
  
  
  Days 9-10: Pandas Element-Wise Operators Are Not Python Operators
&lt;/h2&gt;

&lt;p&gt;Pandas threw me for a loop because it looks like regular Python but behaves completely differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The element-wise operator confusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I kept trying to write conditionals like normal Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This doesn't work:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# ERROR!
&lt;/span&gt;
&lt;span class="c1"&gt;# You need element-wise operators:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  &lt;span class="c1"&gt;# Works!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;|&lt;/code&gt; for OR, &lt;code&gt;&amp;amp;&lt;/code&gt; for AND, &lt;code&gt;~&lt;/code&gt; for NOT. Always. This tripped me up for a solid day until it finally clicked: these operators work on entire columns at once, not single values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The groupby-aggregate pattern is everywhere:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pattern appears constantly in ML preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate total spending per customer
&lt;/span&gt;&lt;span class="n"&gt;customer_totals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Map those totals back to every row
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_totals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Split the data into groups, apply some aggregation, combine the results back. Once I understood this pattern, tons of feature engineering operations made sense.&lt;/p&gt;
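&lt;p&gt;Worth knowing: pandas collapses the groupby-then-map-back dance into one step with &lt;code&gt;transform&lt;/code&gt;, which returns the aggregate already aligned to the original rows:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "amount": [10.0, 20.0, 5.0],
})

# transform('sum') broadcasts each group's total back onto every row of
# that group, so no separate map step is needed.
df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")
print(df["customer_total"].tolist())   # [30.0, 30.0, 5.0]
```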

&lt;p&gt;&lt;strong&gt;The CSV string conversion gotcha:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My favorite bug of the week: integration tests failed because the CSV columns came through my pipeline as strings, not numbers. My unit tests all passed (they used real Python numbers), but the moment the complete pipeline read from an actual file, everything broke.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CSV gives you strings:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# ['1', '2', '3'] - all strings!
&lt;/span&gt;
&lt;span class="c1"&gt;# Need explicit conversion:
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [1, 2, 3]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly why you need integration tests, not just unit tests. Different test types catch different bugs.&lt;/p&gt;
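&lt;p&gt;One way to catch this class of bug at the boundary is to force dtypes at read time, so a column that can't convert fails fast instead of silently arriving as strings. A small sketch (toy column, not my actual pipeline):&lt;/p&gt;

```python
import io
import pandas as pd

# Simulate the round-trip that unit tests skip: numbers serialized to CSV
# text, then read back the way the real pipeline reads them
csv_text = "amount\n1\n2\n3\n"

# Forcing the dtype at the boundary raises immediately on bad data
df = pd.read_csv(io.StringIO(csv_text), dtype={"amount": "float64"})

assert df["amount"].tolist() == [1.0, 2.0, 3.0]
```

&lt;p&gt;An integration test that exercises this read path would have caught my bug on day one.&lt;/p&gt;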

&lt;h2&gt;
  
  
  Day 12: The 150x Speedup
&lt;/h2&gt;

&lt;p&gt;This was the most satisfying day. I had a function that processed transactions using &lt;code&gt;.apply()&lt;/code&gt; with lambdas and some iterrows loops. It worked. It was slow. Claude challenged me to optimize it using vectorization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow version: 0.46 seconds for 10k rows (21,559 rows/second)&lt;/li&gt;
&lt;li&gt;Fast version: 0.003 seconds for 10k rows (3,249,635 rows/second)
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speedup: 150x faster&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same input. Same output (verified with &lt;code&gt;pd.testing.assert_frame_equal()&lt;/code&gt;). Just replaced Python loops with vectorized NumPy operations.&lt;/p&gt;

&lt;p&gt;The key transformations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SLOW - apply with lambda
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# FAST - vectorized multiplication  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SLOW - apply with if/elif/else function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categorize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# FAST - np.select with conditions
&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; &lt;code&gt;.apply()&lt;/code&gt; and &lt;code&gt;.iterrows()&lt;/code&gt; are 150x slower because they're Python loops in disguise. Every iteration has interpreter overhead. Vectorized operations run in compiled C code with no per-element overhead.&lt;/p&gt;

&lt;p&gt;This isn't "premature optimization." This is fundamental to how you write pandas code. You can't just "optimize later" - you need to think vectorized from the start.&lt;/p&gt;
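&lt;p&gt;One habit that made the rewrite safe: prove the slow and fast paths agree before deleting the slow one. A sketch with toy columns:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=1_000),
    "quantity": rng.integers(1, 10, size=1_000),
})

# Slow path: one Python function call per row
slow = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Fast path: a single vectorized operation over whole columns
fast = df["price"] * df["quantity"]

# Identical results means the optimization is safe to ship
pd.testing.assert_series_equal(slow, fast)
```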

&lt;h2&gt;
  
  
  Days 13-14: Making Data Problems Visible
&lt;/h2&gt;

&lt;p&gt;The weekend project was building a data quality dashboard. I took the matplotlib visualizations from Day 13 and wrapped them in a Streamlit app.&lt;/p&gt;

&lt;p&gt;The result: upload any CSV, instantly see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amount distribution (with outliers highlighted in red)&lt;/li&gt;
&lt;li&gt;Time series (with missing data periods shaded)&lt;/li&gt;
&lt;li&gt;Age distribution (valid vs impossible values)&lt;/li&gt;
&lt;li&gt;Category balance (class imbalance visualization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus automated detection of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Statistical outliers&lt;/li&gt;
&lt;li&gt;Invalid ages (negative or &amp;gt;120)&lt;/li&gt;
&lt;li&gt;Negative amounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each issue is reported with specific counts and a recommendation.&lt;/p&gt;
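&lt;p&gt;The detection logic itself is plain pandas, which keeps it testable outside Streamlit. A rough sketch of what such checks might look like (function names and thresholds here are hypothetical, not my exact code):&lt;/p&gt;

```python
import pandas as pd

def find_missing(df: pd.DataFrame) -> int:
    """Total count of missing cells across the frame."""
    return int(df.isna().sum().sum())

def find_invalid_ages(df: pd.DataFrame) -> int:
    """Ages that are negative or over 120 are physically impossible."""
    bad = (df["age"] < 0) | (df["age"] > 120)
    return int(bad.sum())

def find_negative_amounts(df: pd.DataFrame) -> int:
    """Transaction amounts below zero are flagged as suspect."""
    return int((df["amount"] < 0).sum())

df = pd.DataFrame({
    "age": [34, -1, 130, 52],
    "amount": [10.0, None, -5.0, 20.0],
})

print(find_missing(df), find_invalid_ages(df), find_negative_amounts(df))  # 1 2 1
```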

&lt;p&gt;&lt;strong&gt;What I learned about Streamlit:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's refreshingly simple. The entire script reruns on every user interaction, which sounds inefficient but makes the programming model dead simple. No state management, no callbacks, no frontend/backend separation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uploaded_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose a CSV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;uploaded_file&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Show visualizations...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Upload → Process → Display. No web development required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "calculate once, use twice" pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I caught myself calling the same detection functions multiple times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inefficient - calls function twice:
&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found missing values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My C++ performance instincts kicked in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Better - calculate once:
&lt;/span&gt;&lt;span class="n"&gt;missing_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;missing_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found missing values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a huge deal for small datasets, but good habits matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What My C++ Background Got Right and Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What transferred well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance awareness:&lt;/strong&gt; I instinctively noticed when operations might be slow and looked for vectorized alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory layout intuition:&lt;/strong&gt; Understanding that NumPy arrays are contiguous in memory made sense immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type thinking:&lt;/strong&gt; Python's type hints feel natural. When pandas operations convert uint8 to float64, I notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging mindset:&lt;/strong&gt; Add logging, test edge cases, isolate the problem systematically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I had to unlearn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loops are fine → Loops are death:&lt;/strong&gt; In C++, loops are normal. In pandas, they're 150x slower. This is a fundamental mental shift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control flow is explicit → Control flow is vectorized:&lt;/strong&gt; Can't use &lt;code&gt;if/elif/else&lt;/code&gt; on arrays. Must use &lt;code&gt;np.select()&lt;/code&gt; or &lt;code&gt;np.where()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build from scratch → Use the ecosystem:&lt;/strong&gt; C++ culture is "roll your own." Python ML culture is "there's definitely a library for that."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest surprise: &lt;strong&gt;NumPy gives me C++ performance without writing C++&lt;/strong&gt;. Most of the time. When I eventually need even more speed, the roadmap has me writing custom C++ extensions. But for now, vectorized NumPy is fast enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Discovery-Based Learning Struggle
&lt;/h2&gt;

&lt;p&gt;The hardest part of this week wasn't the code - it was staying curious instead of copying solutions.&lt;/p&gt;

&lt;p&gt;When Claude suggested using &lt;code&gt;np.transpose(image, (1, 0, 2))&lt;/code&gt; for rotation, I had to force myself to stop and ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does the &lt;code&gt;(1, 0, 2)&lt;/code&gt; tuple actually mean?&lt;/li&gt;
&lt;li&gt;Why those specific numbers?&lt;/li&gt;
&lt;li&gt;What happens if I change the order?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns a 5-minute "just make it work" into a 20-minute learning session where I actually understand axis manipulation.&lt;/p&gt;

&lt;p&gt;Same with &lt;code&gt;pd.to_numeric(..., errors='coerce')&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does 'coerce' do?&lt;/li&gt;
&lt;li&gt;What are the alternatives?&lt;/li&gt;
&lt;li&gt;When would I use 'raise' or 'ignore' instead?&lt;/li&gt;
&lt;/ul&gt;
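&lt;p&gt;Here's roughly how those modes differ in practice (worth noting: &lt;code&gt;errors='ignore'&lt;/code&gt;, which returned the input unchanged, is deprecated in recent pandas versions):&lt;/p&gt;

```python
import pandas as pd

s = pd.Series(["1", "2", "oops"])

# 'coerce' silently turns unparseable values into NaN
coerced = pd.to_numeric(s, errors="coerce")
assert coerced.isna().tolist() == [False, False, True]

# 'raise' (the default) fails loudly instead, which suits strict pipelines
try:
    pd.to_numeric(s, errors="raise")
except ValueError:
    print("raise mode surfaced the bad value")
```

&lt;p&gt;Which one you want depends on the pipeline: &lt;code&gt;coerce&lt;/code&gt; when downstream code can handle NaN, &lt;code&gt;raise&lt;/code&gt; when bad data should stop the run.&lt;/p&gt;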

&lt;p&gt;It's slower. Sometimes frustrating. But it's the difference between having code that works vs understanding why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Tripped Me Up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "Rumpelstiltskin problem" is real.&lt;/strong&gt; The hardest part of learning pandas isn't understanding concepts - it's knowing what operations exist and what they're called. &lt;/p&gt;

&lt;p&gt;I can't use &lt;code&gt;.mask()&lt;/code&gt; if I don't know it exists. I can't search for "how to do X" if I don't know X is called "broadcasting." This is where having Claude as a guide helps - it can suggest the right operation for the problem, then I go understand how it works.&lt;/p&gt;
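&lt;p&gt;For anyone else who didn't know &lt;code&gt;.mask()&lt;/code&gt; existed, a quick illustration (with its mirror, &lt;code&gt;.where()&lt;/code&gt;):&lt;/p&gt;

```python
import pandas as pd

s = pd.Series([10, -3, 25, -1])

# .mask() replaces values WHERE the condition is True...
no_negatives = s.mask(s < 0, 0)
assert no_negatives.tolist() == [10, 0, 25, 0]

# ...while .where() keeps values where the condition is True
kept = s.where(s > 0, 0)
assert kept.tolist() == [10, 0, 25, 0]
```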

&lt;p&gt;&lt;strong&gt;NaN propagation is weird.&lt;/strong&gt; Coming from languages where NULL works differently, pandas' NaN behavior took getting used to. It silently propagates through operations in ways that break boolean logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Without na=False, NaN breaks filtering:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns [True, False, NaN, True]
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  &lt;span class="c1"&gt;# ERROR!
&lt;/span&gt;
&lt;span class="c1"&gt;# Must handle explicitly:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Returns [True, False, False, True]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Week 2 vs Week 1
&lt;/h2&gt;

&lt;p&gt;Week 1 was about development practices (testing, error handling, packaging). Week 2 was about the actual data manipulation tools (NumPy, pandas, visualization).&lt;/p&gt;

&lt;p&gt;Both feel essential. You can't build production ML without both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean code that doesn't crash (Week 1)&lt;/li&gt;
&lt;li&gt;Fast data processing that scales (Week 2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination is what makes ML engineering work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Week 3 starts traditional machine learning - linear models, decision trees, ensemble methods. Still using Claude's discovery-based approach: here's the problem, here's the documentation, now figure it out.&lt;/p&gt;

&lt;p&gt;I'm getting more comfortable with this pattern. The first few days I wanted explicit instructions. Now I appreciate the struggle - it's where the learning happens.&lt;/p&gt;

&lt;p&gt;Also: I've told Claude to start writing most of my tests because I understand the patterns now. Learning to delegate to AI is part of learning with AI.&lt;/p&gt;

&lt;p&gt;Two weeks in. Still no neural networks. Just data engineering foundations. And honestly? I'm starting to understand why everyone says data engineering is 80% of ML work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About this series:&lt;/strong&gt; I'm a software engineer learning ML using a custom roadmap designed by Claude. The approach focuses on production skills and problem-solving over tutorials. Week 2 complete: NumPy, pandas, and an interactive data quality dashboard. All code and daily summaries on [GitHub link].&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Feedback welcome: Did the C++ perspective add value or just clutter? Should I include more code examples or keep it high-level?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why a C++ Systems Engineer is Learning Machine Learning</title>
      <dc:creator>David Bean</dc:creator>
      <pubDate>Fri, 03 Oct 2025 20:44:40 +0000</pubDate>
      <link>https://forem.com/dave_bean/why-a-c-systems-engineer-is-learning-machine-learning-3ffn</link>
      <guid>https://forem.com/dave_bean/why-a-c-systems-engineer-is-learning-machine-learning-3ffn</guid>
      <description>&lt;p&gt;&lt;em&gt;A senior systems programmer's journey into AI/ML - Week 1 reflections&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision
&lt;/h2&gt;

&lt;p&gt;After spending over a decade building high-performance C++ systems in defense and aerospace, I've made a decision: I'm learning machine learning. Not casually browsing tutorials on weekends, but committing to a structured 12-month roadmap with one hour of focused work every single day.&lt;/p&gt;

&lt;p&gt;Why? Because the intersection of systems engineering and ML represents one of the most valuable skill combinations in tech right now. MLOps engineers see 9.8× demand growth with salaries averaging $122k-$167k. More importantly, most ML practitioners lack deep systems knowledge, while most systems engineers don't understand ML. I'm betting that bridging this gap is worth the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Feels Different
&lt;/h2&gt;

&lt;p&gt;I've looked at ML courses before. They all seem to follow the same pattern: install Anaconda, run some scikit-learn examples, train a model on the Iris dataset, celebrate. That's fine for getting started, but it doesn't prepare you for production systems where models fail silently, data pipelines break, and performance matters.&lt;/p&gt;

&lt;p&gt;So I chose a different approach: a &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;discovery-based roadmap&lt;/a&gt; that prioritizes production skills from day one. Instead of copying tutorial code, I solve problems by reading documentation, debugging issues independently, and building understanding through experimentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mindset shift:&lt;/strong&gt; I'm not learning to run ML models. I'm learning to build ML systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 1: Building Something Real
&lt;/h2&gt;

&lt;p&gt;Most "Week 1 ML" tutorials have you print "Hello World" and maybe plot a graph. My Week 1 looked different.&lt;/p&gt;

&lt;p&gt;I built a &lt;a href="https://github.com/bean2778/ai_learning_2025/tree/main/day_02" rel="noopener noreferrer"&gt;data quality checker&lt;/a&gt;. Sounds boring, right? But here's the thing - I have no idea what makes good ML data. I'm literally learning this from an AI assistant (Claude) in real-time, using a roadmap designed to make me figure things out rather than copy-paste solutions.&lt;/p&gt;

&lt;p&gt;The framework analyzes numeric, categorical, and temporal data. It detects outliers, finds missing values, identifies data quality issues. It has 44 tests because I spent two full days just writing tests.&lt;/p&gt;

&lt;p&gt;But honestly? I don't know if these are the &lt;em&gt;right&lt;/em&gt; checks for ML. I'm a C++ guy who knows about memory management and thread safety. Data quality for machine learning? That's completely new territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 1: Just Make It Not Crash
&lt;/h3&gt;

&lt;p&gt;First day, I wrote a function to check data quality. Coming from C++, my instinct was to write something that handles edge cases without dying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_data_quality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;clean_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clean_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;no valid data points&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Continue with analysis...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI teaching me asked: "Why not just let it crash with an error?"&lt;/p&gt;

&lt;p&gt;Because in my world, if your distributed system crashes because someone passed it bad data, you've failed. You handle errors gracefully, you log what happened, you return something useful.&lt;/p&gt;

&lt;p&gt;Apparently that's also important for ML pipelines. Who knew? (Everyone who does ML, probably. But I didn't.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 2-3: Making It Installable
&lt;/h3&gt;

&lt;p&gt;While I was setting up proper Python packaging with &lt;code&gt;pyproject.toml&lt;/code&gt;, I kept thinking "this seems like overkill for a learning project."&lt;/p&gt;

&lt;p&gt;But the roadmap insisted: documentation, logging, proper module structure from day one. Not because the code is complex, but because production habits need to be habitual.&lt;/p&gt;

&lt;p&gt;Fine. I wrote docstrings. I set up logging. I made it pip installable.&lt;/p&gt;
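&lt;p&gt;For reference, a minimal &lt;code&gt;pyproject.toml&lt;/code&gt; is all it takes (the package name here is hypothetical, not my actual project):&lt;/p&gt;

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "data-quality-checker"   # hypothetical package name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["pandas"]
```

&lt;p&gt;After that, &lt;code&gt;pip install -e .&lt;/code&gt; makes the package importable from anywhere in the environment.&lt;/p&gt;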

&lt;p&gt;Two days later when I had to debug why my tests were failing, those logs saved me 30 minutes of confusion. The docstrings reminded me what I was trying to do. Point taken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 4-5: Testing Like My Career Depends On It
&lt;/h3&gt;

&lt;p&gt;I spent two days writing tests. Not "does it run" tests. Real tests. Unit tests, integration tests, property-based tests using a library called Hypothesis that generates random inputs to find bugs.&lt;/p&gt;

&lt;p&gt;Hypothesis found actual bugs I never would have caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Floating-point precision issues with large numbers&lt;/li&gt;
&lt;li&gt;Numerical overflow with extreme values
&lt;/li&gt;
&lt;li&gt;CSV type conversion errors where pandas read numbers as strings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where my C++ background actually helped. I know what edge cases look like. I know that "works on my machine" isn't good enough. I know that systems fail in weird ways when you least expect it.&lt;/p&gt;

&lt;p&gt;Turns out that's useful for ML too. Data is messy. Edge cases are everywhere. Tests catch problems before they break production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 6-7: The "I Have No Idea" Moment
&lt;/h3&gt;

&lt;p&gt;Weekend project: add temporal data analysis. Dates, timestamps, time series stuff.&lt;/p&gt;

&lt;p&gt;I built gap detection - finding missing dates in time series data. The algorithm calculates time deltas between dates, finds the most common one, flags anything bigger as a gap.&lt;/p&gt;
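&lt;p&gt;The core of the idea fits in a few lines (toy dates, not my actual implementation):&lt;/p&gt;

```python
import pandas as pd

# Infer the expected spacing from the most common delta,
# then flag anything larger as a gap
dates = pd.to_datetime([
    "2025-01-01", "2025-01-02", "2025-01-03",
    "2025-01-07",  # three days missing before this one
    "2025-01-08",
]).sort_values()

deltas = pd.Series(dates).diff().dropna()
expected = deltas.mode()[0]        # most common spacing: 1 day
gaps = deltas[deltas > expected]   # anything bigger is a gap

print(len(gaps))  # 1
```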

&lt;p&gt;Then Claude (the AI helping me learn, and write these blog posts) asked: "What temporal quality checks matter most for ML?"&lt;/p&gt;

&lt;p&gt;My answer: "I really have no idea. I'm doing this whole course to find that out."&lt;/p&gt;

&lt;p&gt;And you know what? That was the right answer.&lt;/p&gt;

&lt;p&gt;Claude's response: "Start simple, document your assumptions, make it observable, iterate later. This is how real ML engineering works. Even senior engineers build V1 without knowing all requirements."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That&lt;/strong&gt; was valuable. Not because I learned some ML best practice, but because I learned it's okay to not know. You build something reasonable, you see how it's used, you improve it.&lt;/p&gt;

&lt;p&gt;This actually feels familiar. People think defense/aerospace work is all upfront specs and formal requirements. Reality? You get dropped into a mess of legacy systems, vague requirements, and contradictory stakeholder demands, then you hack your way through until something works. ML engineering sounds similar, just with different tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Integer Problem
&lt;/h3&gt;

&lt;p&gt;Here's a fun debugging story. I wrote a dispatcher that automatically figures out if your data is numeric, categorical, or temporal (dates/times).&lt;/p&gt;

&lt;p&gt;Initial version routed &lt;code&gt;[1, 2, 3, 4, 5]&lt;/code&gt; to the temporal analyzer. Why? Because pandas happily interprets small integers as offsets from the Unix epoch (midnight on January 1, 1970). So &lt;code&gt;[1, 2, 3, 4, 5]&lt;/code&gt; parsed as a perfectly valid sequence of timestamps.&lt;/p&gt;

&lt;p&gt;That's... not what anyone would expect.&lt;/p&gt;

&lt;p&gt;Solution: Only test large integers (&amp;gt;946684800, roughly year 2000) as potential timestamps. Small integers default to numeric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;946684800&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Large integers: might be Unix timestamps
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raise&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;temp_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;span class="c1"&gt;# Small ints: skip temporal test, treat as numeric
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have no idea if this is how production ML systems handle this. But it makes sense, tests pass, and it solves the immediate problem. V2 can be smarter if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pandas is kind of amazing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coming from C++ where you manually manage everything, pandas feels like cheating:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Frequency distribution in one line
&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Date parsing with error handling
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What would be 20-30 lines of careful C++ becomes a method call. I can see why everyone uses this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not knowing is fine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That moment when Claude asked what ML engineers need for temporal data and I said "I have no idea" - that felt vulnerable. Like admitting I don't know what I'm doing.&lt;/p&gt;

&lt;p&gt;But it led to the best insight of the week: nobody knows everything upfront. You build something reasonable, document your assumptions, ship it, learn from how it's used, improve it later.&lt;/p&gt;

&lt;p&gt;That's actually freeing. I can stop trying to make perfect decisions with incomplete information and just... build something that works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Systems thinking transfers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My C++ experience helped with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture decisions (I used the Strategy pattern without even thinking about it)&lt;/li&gt;
&lt;li&gt;Understanding when to optimize vs. when good enough is fine&lt;/li&gt;
&lt;li&gt;Knowing that defensive programming matters&lt;/li&gt;
&lt;li&gt;Writing code that won't confuse me in six months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I'm learning entirely new patterns: how pandas works, why statistical validation matters, what makes data "good" for ML (still figuring this one out).&lt;/p&gt;

&lt;p&gt;It's weirdly complementary. Systems knowledge gives me structure. ML is teaching me to think about data differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built (In Plain English)
&lt;/h2&gt;

&lt;p&gt;The data quality framework has three analyzers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Numeric:&lt;/strong&gt; Checks numbers - calculates mean, standard deviation, finds outliers using a 2-sigma rule. I don't know if 2-sigma is the right threshold for ML, but it's what I learned in college and it seems reasonable.&lt;/p&gt;
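&lt;p&gt;A minimal sketch of the 2-sigma idea (hypothetical names, not the analyzer's actual code):&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch: flag values more than `sigmas` standard deviations
# from the mean. `find_outliers` is an illustrative name, not the
# project's real function.
def find_outliers(values, sigmas=2.0):
    s = pd.Series(values, dtype="float64")
    mean, std = s.mean(), s.std()
    return s[(s - mean).abs() > sigmas * std].tolist()

print(find_outliers([10, 11, 9, 10, 12, 100]))  # flags the 100
```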

&lt;p&gt;&lt;strong&gt;Categorical:&lt;/strong&gt; Checks text/category data - counts unique values, finds frequency distribution, identifies the most and least common items. Warns you if you accidentally passed it numbers.&lt;/p&gt;
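&lt;p&gt;Something like this, roughly (again, made-up names for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch of the categorical checks: unique count, frequency
# table, most common value, and a warning if the data looks numeric.
def analyze_categorical(values):
    s = pd.Series(values)
    counts = s.value_counts()
    return {
        "unique": int(s.nunique()),
        "frequencies": counts.to_dict(),
        "most_common": counts.index[0],
        "warning": ("looks numeric - wrong analyzer?"
                    if pd.api.types.is_numeric_dtype(s) else None),
    }

report = analyze_categorical(["wind", "solar", "wind", "hydro", "wind"])
```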

&lt;p&gt;&lt;strong&gt;Temporal:&lt;/strong&gt; Checks dates/times - finds the date range, detects gaps in time series (like missing days of sensor data), tries to figure out if data is regular (daily readings) or irregular (random events).&lt;/p&gt;

&lt;p&gt;Plus a dispatcher that looks at your data, figures out which type it probably is, and routes it to the right analyzer. Uses something called Yamane's formula for sampling so it doesn't have to look at every single item in huge datasets.&lt;/p&gt;
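&lt;p&gt;For reference, Yamane's formula is n = N / (1 + N * e^2), where N is the population size and e is the margin of error. How the dispatcher applies the resulting sample is my guess; this just shows the formula itself:&lt;/p&gt;

```python
import math

# Yamane's formula for sample size: n = N / (1 + N * e**2).
# N = population size, e = desired margin of error (0.05 = 5%).
def yamane_sample_size(population, margin_of_error=0.05):
    return math.ceil(population / (1 + population * margin_of_error ** 2))

print(yamane_sample_size(181_915))  # the dataset's row count -> 400
```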

&lt;p&gt;Is this what professional ML engineers use? I have literally no idea. But it works, it has tests, and it solves problems I can understand: don't let bad data silently break your stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality Check
&lt;/h2&gt;

&lt;p&gt;Here's what I don't know yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What actually makes data "good" for ML models&lt;/li&gt;
&lt;li&gt;When my outlier detection would help vs. hurt&lt;/li&gt;
&lt;li&gt;Whether these are the right data quality checks&lt;/li&gt;
&lt;li&gt;How real ML pipelines handle this stuff&lt;/li&gt;
&lt;li&gt;Literally anything about neural networks, transformers, or the AI stuff people talk about&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what I do know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to write code that handles errors gracefully&lt;/li&gt;
&lt;li&gt;How to test thoroughly&lt;/li&gt;
&lt;li&gt;How to structure projects so they don't become unmaintainable messes&lt;/li&gt;
&lt;li&gt;How to read documentation and figure stuff out&lt;/li&gt;
&lt;li&gt;That pandas is really handy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Week 1 taught me that systems engineering skills transfer to ML tooling, even when I don't know the ML part yet. The fundamentals are the same: handle errors, test thoroughly, document clearly, build things that won't break six months from now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Week: NumPy
&lt;/h2&gt;

&lt;p&gt;Week 2 is about NumPy - arrays, vectorization, memory layout, all that stuff. Coming from C++, this actually sounds interesting. Arrays and memory? That's my comfort zone.&lt;/p&gt;

&lt;p&gt;The roadmap says I'll be doing image transformations using only NumPy (no OpenCV). Not sure why yet, but I'm guessing it's about understanding how the low-level stuff works before using the high-level libraries.&lt;/p&gt;

&lt;p&gt;After that: actual machine learning. Linear models, decision trees, ensemble methods. The stuff that makes predictions.&lt;/p&gt;

&lt;p&gt;But first: arrays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Document This?
&lt;/h2&gt;

&lt;p&gt;A few reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accountability&lt;/strong&gt; - Harder to skip days when you've committed publicly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perspective&lt;/strong&gt; - I'm learning this as a complete ML beginner but an experienced systems engineer. Maybe that viewpoint helps someone else in the same boat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt; - Most learning blogs are polished success stories. I'm sharing the actual process: bugs, confusion, "I have no idea" moments, and figuring it out anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection&lt;/strong&gt; - If you're also transitioning into ML from systems/C++/infrastructure work, or if you're interested in the production/systems side of ML, let's talk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Commitment
&lt;/h2&gt;

&lt;p&gt;One hour per day. Seven days a week. For twelve months.&lt;/p&gt;

&lt;p&gt;That's what the roadmap promised, anyway. Reality? More like 1.5-2 hours most days. Turns out AIs are optimistic about how long things take. They're great at designing curricula but bad at estimating "figure out why your import statement doesn't work" time.&lt;/p&gt;

&lt;p&gt;Day 7 was supposed to include writing a report generator in 10 minutes. I know string formatting - I didn't need a lesson on that. So I just had the AI write that function. It was 120 lines long. I don't know why it thought that was a 10-minute task, but that's the way it is, I guess.&lt;/p&gt;

&lt;p&gt;Other things take longer because you hit a real problem. Type detection ambiguity. CSV parsing weirdness. Tests that fail for mysterious reasons. That's where the actual learning happens.&lt;/p&gt;

&lt;p&gt;Week 1: Probably 10-12 hours total, one complete portfolio project.&lt;/p&gt;

&lt;p&gt;If I keep this pace: more like 500-700 hours over the year instead of 365, but still very achievable. The consistency matters more than the exact hours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; ✅&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Weeks 2-8:&lt;/strong&gt; Traditional ML&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Months 3-6:&lt;/strong&gt; Deep Learning&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Months 7-12:&lt;/strong&gt; Specialization (probably ML Systems Engineering - combining C++ performance work with ML)&lt;/p&gt;

&lt;p&gt;One hour at a time. Or two. We'll see how optimistic Claude gets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me: &lt;a href="https://github.com/bean2778/ai_learning_2025" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devjournal</category>
      <category>career</category>
      <category>cpp</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
