Forem: Nyson Markus

Why the Line Between Data Engineer and ML Engineer Is Disappearing, And Why That's Your Cue to Cross It

Nyson Markus — Thu, 23 Apr 2026 04:51:28 +0000

The ML engineer role is changing. If you're a data engineer who's been watching from the sidelines, the window is opening wider than it's ever been.

Something is shifting in how the best technology companies think about machine learning engineering. And if you've been paying attention, you have probably noticed it too. The shift is not loud or sudden. It is showing up in job descriptions, in how ML teams are structured, in the postmortems of failed AI projects, and in the quiet frustration of engineering leaders who have spent years watching sophisticated models underperform because the data underneath them was never taken seriously enough.

This is not a story about roles disappearing or being replaced. It is a story about a profession in the middle of a significant transformation, and about why that transformation creates a genuine, time-sensitive opportunity for anyone with a data engineering background who has been curious about ML but unsure whether the leap is worth making.

The Version of ML Engineering That Existed Five Years Ago

For a long time, ML engineering meant something relatively narrow: take a model that a data scientist or researcher had built, figure out how to deploy it reliably, and keep it running in production. The research scientists did the exploratory work. The data scientists ran the experiments. The ML engineers were responsible for turning notebook code into something that could survive contact with the real world, which in practice meant containerizing models, setting up serving infrastructure, managing dependency conflicts, and writing the glue code that connected a trained artifact to an API endpoint.

The role was real and the work was genuinely difficult. But it did not require deep knowledge of how training data was assembled, where features came from, or what happened upstream in the data pipeline before a training job ever kicked off. The model was treated as a black box that arrived fully formed, and the ML engineer's job was to find somewhere to put it.

That version of the role is fading, and it is fading for reasons that are worth understanding in detail.

What the Role Is Becoming

The companies setting the pace today are asking ML engineers to own the entire lifecycle of a machine learning system. Not just deployment, not just inference infrastructure, but the complete arc from raw data to production prediction and everything that needs to happen reliably in between.

In practice, that means ML engineers are now expected to understand and often build the data pipelines that feed model training. They are expected to design and maintain feature stores, think carefully about feature freshness and consistency between training and serving environments, and catch subtle issues like training-serving skew before they compound into production failures. They are expected to instrument models for observability, build retraining pipelines that trigger on data drift or performance degradation, and reason about what happens to model behavior when upstream data sources change without notice. They are expected, in short, to treat the entire system as their responsibility rather than just the model-shaped portion of it.

This is a fundamentally different job from the ML engineering of five years ago, and it maps closely onto the skills that data engineers have been building throughout their careers.

Why the Data Layer Became the ML Engineer's Problem

The reason for this shift is not theoretical. It is the accumulated result of a large number of expensive failures that companies across the industry have experienced as they moved from ML experimentation to ML at scale.

The pattern repeats itself with depressing regularity. A team builds a model that performs well in offline evaluation. The model is deployed. Initial results are promising. Then, gradually or suddenly, performance degrades. An investigation begins. More often than not, the root cause is found not in the model architecture or the training procedure but in the data. A feature pipeline was computing aggregations over the wrong time window. An upstream schema change altered the distribution of a key input variable. A join condition introduced subtle leakage that inflated training metrics without improving real-world performance. The model was learning from data that did not accurately represent the problem it was supposed to solve.

These are not exotic failure modes. They are the normal failure modes of production ML systems, and they are data engineering problems at their core. The industry has slowly and painfully learned that you cannot build reliable ML systems without treating data infrastructure as a first-class concern, and that lesson is now showing up in how companies define and hire for ML engineering roles.

How Job Descriptions Are Actually Changing

The shift is visible in the concrete language of ML engineering job postings. Where these descriptions once led with requirements around model architectures, familiarity with recent research papers, or experience implementing specific neural network components, a growing number now foreground skills that data engineers will recognize immediately.

Requirements around data pipeline design, experience with workflow orchestration tools like Apache Airflow or Prefect, familiarity with distributed processing frameworks like Apache Spark, knowledge of feature store architecture and the practical challenges of feature serving at low latency, and comfort with data quality monitoring and validation frameworks are appearing with increasing regularity in ML engineering job descriptions at companies that are serious about production ML.

The MLOps skill set, which sits at the intersection of ML engineering and data engineering, has become one of the most sought-after profiles in the market. Tools like MLflow for experiment tracking, Feast or Tecton for feature management, Great Expectations or Soda for data quality validation, and Kubeflow or Metaflow for ML pipeline orchestration are appearing alongside the more traditional ML engineering stack of PyTorch, TensorFlow, and Kubernetes. Engineers who are comfortable across both layers are genuinely difficult to find.

The Technical Gaps That Still Need to Be Closed

None of this means the transition requires no effort. There are real technical gaps between data engineering and ML engineering, and it is worth being specific about what they are.

The most significant gap for most data engineers is statistical and mathematical foundations. Understanding how gradient descent works and why learning rate schedules matter, knowing the difference between L1 and L2 regularization and when each is appropriate, being able to reason about bias-variance tradeoff in practical terms, and understanding how evaluation metrics like precision, recall, AUC-ROC, and NDCG connect to real business objectives are all areas that require deliberate study. These are not optional. An ML engineer who cannot reason clearly about model evaluation and the ways evaluation metrics can be misleading will build systems that fail in ways that are hard to diagnose.
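To make the point about misleading metrics concrete, here is a minimal sketch in plain Python. The toy labels and the helper names (confusion_counts, summarize) are illustrative assumptions, not part of any library. It shows a model that never predicts the positive class scoring high accuracy while its recall is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, and recall, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data (made up for illustration): 2 positives in 100 rows,
# scored by a "model" that always predicts the negative class.
y_true = [1, 1] + [0] * 98
always_negative = [0] * 100
metrics = summarize(y_true, always_negative)
print(metrics)
```

Accuracy looks excellent here only because the negative class dominates, which is why the choice of metric has to follow the business objective rather than convenience.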

Beyond the mathematical foundations, data engineers moving into ML need to develop fluency with the model development workflow itself. This includes experiment tracking practices, the mechanics of cross-validation done correctly, hyperparameter optimization strategies, and the discipline of keeping experiments reproducible. It also includes an understanding of how different model families behave and what their failure modes look like in production, since a random forest and a deep neural network require different monitoring strategies and fail in qualitatively different ways.

These are learnable gaps. Many engineers with strong data backgrounds can close them in roughly six to twelve months of focused effort, though the timeline depends on how much of that time goes into hands-on projects. The reason the overall transition is more accessible than it appears from the outside is precisely that the foundational infrastructure instincts, which take years to develop and cannot be shortcut, are already in place.

The Supply and Demand Reality

The market reality right now is that there is a genuine and significant shortage of engineers who combine strong data systems thinking with ML engineering capability. The research-oriented candidates who dominated ML hiring in earlier years often bring impressive theoretical depth but limited experience with the production data infrastructure challenges that dominate real ML engineering work. The infrastructure engineers who can manage Kubernetes clusters and design distributed systems often lack the ability to reason about model behavior, evaluation methodology, and the specific failure modes of learning systems. The overlap between these two skill sets is the profile that companies most need and have the most difficulty finding.

Data engineers who invest in building the ML-specific layer of skills are positioning themselves directly in that overlap. And because the industry is still in the process of figuring out that this profile is what it actually needs, there is currently a gap between the value these engineers bring and the premium the market is placing on their background. That gap will close as more people recognize the opportunity. The engineers who move early are the ones who will benefit most from it.

For a detailed and practical breakdown of what this transition actually involves, including which skills transfer directly, which ones need to be built from scratch, how to sequence the learning, and how to position yourself for the roles that are emerging at this intersection, this comprehensive guide to transitioning from Data Engineer to ML Engineer is one of the most thorough resources available on the subject.

The Engineers Who Will Define the Next Decade of ML

There is a version of ML engineering that is emerging as the field matures, one that looks less like applied research and more like rigorous systems engineering with a learning component at its center. The engineers who are doing the most impactful work in this version of the field are not necessarily the ones who can derive the attention mechanism from first principles or reproduce a recent NeurIPS paper from memory. They are the ones who can look at a production ML system that is behaving unexpectedly and know exactly where to start the investigation. They are the ones who built the feature pipeline to be auditable in the first place.

They treat data as infrastructure rather than input. They build training pipelines with the same discipline they would apply to any other production system, including testing, versioning, monitoring, and documentation. They ask what happens when the upstream data changes before they ever run a training job, because they have seen enough times what happens when nobody asks that question.

This profile has always existed at the edges of both data engineering and ML engineering. What is different now is that the industry is actively looking for it, building job descriptions around it, and paying accordingly for people who can genuinely fill it.

The Window Is Open, But Trends Do Not Wait

Industry role definitions do not stay in flux indefinitely. There is a window between the moment a role begins to change and the moment the new profile becomes fully codified and intensely competitive. During that window, engineers who recognize the shift early and position themselves accordingly have significant advantages over those who wait until the pattern is obvious.

The data is pointing in a clear direction. ML engineering is expanding to encompass the data infrastructure layer that was previously treated as someone else's problem. Data engineers who have spent years building that infrastructure are sitting on a foundation of skills that the market is beginning to value in a new and more prominent way. The question for any data engineer who has been curious about this transition is not whether the opportunity is real. The question is whether to move while the conditions are still this favorable or to wait until the window has narrowed.

Written for data and ML professionals navigating the evolving landscape of AI engineering roles.

PySpark to Pandas/scikit-learn: A Practical Migration Guide for Data Engineers Learning ML

Nyson Markus — Fri, 10 Apr 2026 11:09:34 +0000

If you've spent years writing PySpark pipelines, the first time you open a Jupyter notebook full of pd.DataFrame and model.fit() calls, it can feel like you've switched languages entirely.

You haven't. The concepts are the same: transformations, aggregations, pipelines, model evaluation. But the execution model, API design, and idioms are different enough to cause real friction when you're trying to learn ML fast.

This guide is not a beginner tutorial. It's a translation layer, a direct mapping from what you already know in PySpark to its equivalent in Pandas and scikit-learn, with side-by-side code, gotchas, and practical advice for anyone making the shift from data engineer to machine learning engineer.

The Single Biggest Mental Model Shift

Before any code, understand this: PySpark uses lazy evaluation. Pandas and scikit-learn do not.

In PySpark, transformations like .filter(), .select(), and .groupBy() build a logical execution plan. Nothing runs until you call an action like .collect() or .show(). This is what enables distributed optimization across a cluster.

In Pandas, every operation executes immediately on the data in memory. When you write df['col'].mean(), it computes right then.

# PySpark: lazy, nothing computed yet
df_filtered = spark_df.filter(spark_df['age'] > 30)

# Pandas: eager, executes immediately
df_filtered = pandas_df[pandas_df['age'] > 30]

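This difference has downstream consequences. Debugging is easier in Pandas because an error surfaces on the line that caused it, whereas a PySpark error often appears only when an action finally runs the plan. But Pandas cannot handle datasets that exceed your machine's RAM. PySpark, for its part, adds scheduling and serialization overhead that often makes it slower than Pandas on data that fits comfortably in memory; as a rough rule of thumb, that means datasets under 5 to 10 GB, depending on the machine.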

Part 1: Core DataFrame Operations Side-by-Side

Most of your day-to-day DE work maps cleanly. Here's the translation table for the operations you use most.

Filtering Rows

# PySpark
df.filter(df['salary'] > 100000)
df.filter("salary > 100000")           # SQL-style string also works

# Pandas
df[df['salary'] > 100000]
df.query("salary > 100000")            # equivalent SQL-style

Selecting Columns

# PySpark
df.select('name', 'salary', 'department')

# Pandas
df[['name', 'salary', 'department']]

GroupBy and Aggregation

# PySpark
from pyspark.sql import functions as F
df.groupBy('department').agg(
    F.mean('salary').alias('avg_salary'),
    F.count('*').alias('headcount')
)

# Pandas
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count')
).reset_index()

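Note the small but important differences. PySpark uses groupBy (camelCase) while Pandas uses groupby (lowercase). PySpark's .agg() typically takes column expressions such as F.mean('salary'), while Pandas named aggregation uses (column, function) tuples. The two headcounts are also not strictly equivalent: F.count('*') counts every row, whereas the Pandas 'count' function skips rows where salary is null. Use 'size' in Pandas when you need a true row count.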
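One subtlety worth checking when you translate a headcount: in Pandas, 'count' skips nulls in the chosen column, while 'size' counts every row in the group. The sketch below uses a made-up three-row DataFrame to show the two side by side.

```python
import pandas as pd

# Illustrative data: one engineer has a missing salary.
df = pd.DataFrame({
    "department": ["eng", "eng", "ops"],
    "salary": [100.0, None, 80.0],
})

summary = (
    df.groupby("department")
      .agg(
          avg_salary=("salary", "mean"),
          non_null_salaries=("salary", "count"),  # skips null salaries
          headcount=("salary", "size"),           # counts every row in the group
      )
      .reset_index()
)
print(summary)
```

If your PySpark job used F.count('*'), 'size' is the closer match; 'count' corresponds to F.count('salary').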

Joins

# PySpark
df1.join(df2, on='user_id', how='left')

# Pandas
pd.merge(df1, df2, on='user_id', how='left')

Adding and Transforming Columns

# PySpark
df.withColumn('salary_k', df['salary'] / 1000)

# Pandas
df['salary_k'] = df['salary'] / 1000

# or the non-mutating version
df = df.assign(salary_k=df['salary'] / 1000)

The .withColumn() pattern in PySpark has no single Pandas equivalent. You can use direct assignment or .assign(). Prefer .assign() in chained operations because it returns a new DataFrame without modifying the original.

Handling Nulls

# PySpark
df.dropna(subset=['age', 'salary'])
# .collect() returns a list of Row objects, so index into it to get the scalar mean
df.fillna({'age': 0, 'salary': df.agg(F.mean('salary')).collect()[0][0]})

# Pandas
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df['salary'].mean()})

This is one of the cleanest translations. The APIs are nearly identical, except Pandas lets you call .mean() directly on a Series without a .collect() call.

Part 2: What `.toPandas()` Actually Costs

As a data engineer, you've probably used .toPandas() to pull Spark results into a notebook. It's tempting to treat this as a bridge that makes the two worlds equivalent. It is not, or at least not for free.

# This works, but has important implications
pandas_df = spark_df.toPandas()

When you call .toPandas():

All data is collected from every executor back to the driver node
It must fit entirely in driver memory
Arrow-based transfer (enabled by spark.sql.execution.arrow.pyspark.enabled = true) makes this significantly faster but does not change the memory constraint

For ML work on training datasets, your feature table often fits comfortably in memory. Feature stores, for example, typically serve pre-aggregated feature vectors that are far smaller than raw event data. But for very large training sets, this is where tools like Ray, Dask, or Spark's native MLlib become relevant.
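Before pulling a full table to the driver, it can help to estimate its in-memory size from a small sample. The sketch below is one way to do that, under the assumption that the sample is representative of the whole table; estimate_full_size_mb is a hypothetical helper name, and the sample here is synthetic.

```python
import numpy as np
import pandas as pd


def estimate_full_size_mb(sample: pd.DataFrame, total_rows: int) -> float:
    """Extrapolate the in-memory size of a full table from a sample of its rows."""
    if len(sample) == 0:
        raise ValueError("sample must contain at least one row")
    # deep=True also measures the contents of object (string) columns
    bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)
    return bytes_per_row * total_rows / 1024 ** 2


# Synthetic sample: one int64 and one float64 column, 16 bytes per row.
sample = pd.DataFrame({
    "user_id": np.arange(1000, dtype="int64"),
    "score": np.zeros(1000, dtype="float64"),
})
estimated_mb = estimate_full_size_mb(sample, total_rows=1_000_000)
print(f"{estimated_mb:.1f} MB")

# With Spark, the sample and row count could come from something like
# spark_df.limit(1000).toPandas() and spark_df.count().
```

String-heavy tables usually grow much more than numeric ones when they land in Pandas, so treat the estimate as a lower bound with some headroom.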

Part 3: Feature Engineering: MLlib vs. scikit-learn Transformers

This is where the API divergence is most significant, and where most data engineers hit the steepest learning curve.

The Fundamental Difference: Vectors vs. Native Columns

PySpark MLlib requires all numerical features to be assembled into a single dense or sparse vector column before training. scikit-learn works on native NumPy arrays and Pandas DataFrames directly.

# PySpark MLlib: you MUST assemble features into a vector first
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['age', 'salary', 'tenure'],
    outputCol='features'
)
df_assembled = assembler.transform(df)
# df_assembled now has a 'features' column holding a vector [age, salary, tenure] per row

# scikit-learn: no assembly needed, just pass the DataFrame directly
from sklearn.linear_model import LogisticRegression

X = df[['age', 'salary', 'tenure']]
y = df['label']
model = LogisticRegression()
model.fit(X, y)

The VectorAssembler requirement is a common source of confusion when coming from PySpark. In scikit-learn, you skip it entirely and work with 2D arrays directly.

Encoding Categorical Variables

# PySpark MLlib: two-step process
from pyspark.ml.feature import StringIndexer, OneHotEncoder

indexer = StringIndexer(inputCol='city', outputCol='city_index')
encoder = OneHotEncoder(inputCols=['city_index'], outputCols=['city_ohe'])

# scikit-learn: single step
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
city_encoded = encoder.fit_transform(df[['city']])

PySpark requires a two-step process (StringIndexer then OneHotEncoder) because it operates in a distributed setting where integer indices must be assigned consistently across partitions. scikit-learn handles this in a single step.
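The handle_unknown='ignore' setting matters more than it looks, because serving data regularly contains categories that never appeared in training. A small sketch with made-up city values shows what the encoder does in that case.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["Lagos", "Paris", "Lagos"]})
serve = pd.DataFrame({"city": ["Paris", "Tokyo"]})  # "Tokyo" was never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert to a dense array for display.
encoded = encoder.transform(serve).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen category becomes an all-zero row instead of raising an error. With the default handle_unknown='error', the same transform call would fail at inference time.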

Scaling Features

# PySpark MLlib
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

# scikit-learn
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
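One habit to build early: fit the scaler on the training split only, then reuse its statistics on the test split, so nothing about the test data leaks into preprocessing. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(scaler.mean_, X_test_scaled.mean(axis=0))
```

The scaled test set will not have exactly zero mean, and that is expected. It mirrors the fit/transform split in MLlib, where scaler.fit() returns a model that you then apply to new DataFrames.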

Part 4: Building Pipelines: The Most Transferable Concept

This is the good news for data engineers: both PySpark and scikit-learn use the Pipeline abstraction, and the mental model translates almost directly.

A pipeline chains transformers and a final estimator into a single reusable object that can be fit, serialized, and applied to new data.

# PySpark MLlib Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

pipeline = Pipeline(stages=[
    StringIndexer(inputCol='city', outputCol='city_idx'),
    OneHotEncoder(inputCols=['city_idx'], outputCols=['city_ohe']),
    VectorAssembler(inputCols=['age', 'salary', 'city_ohe'], outputCol='features'),
    StandardScaler(inputCol='features', outputCol='scaled_features'),
    RandomForestClassifier(featuresCol='scaled_features', labelCol='label')
])

model = pipeline.fit(train_df)
predictions = model.transform(test_df)

# scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

model = pipeline.fit(X_train, y_train)
predictions = model.predict(X_test)

The key structural difference: scikit-learn's ColumnTransformer lets you apply different transformations to different columns in parallel, while PySpark's Pipeline applies stages sequentially. You must explicitly order every step in PySpark. The scikit-learn approach is more concise for heterogeneous feature sets.
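Because a fitted scikit-learn pipeline is a single object, preprocessing and model can be serialized together. The sketch below trains a tiny pipeline on made-up rows, writes it with joblib to a temporary directory, reloads it, and checks that the restored copy predicts the same thing. The column names and data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["Lagos", "Paris", "Lagos", "Tokyo", "Paris", "Tokyo"],
    "label": [0, 0, 1, 1, 1, 0],
})

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("classifier", LogisticRegression()),
])
pipeline.fit(train[["age", "city"]], train["label"])

# Round-trip the fitted pipeline through disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_rows = pd.DataFrame({"age": [30, 55], "city": ["Paris", "Berlin"]})
same = bool((restored.predict(new_rows) == pipeline.predict(new_rows)).all())
print(same)
```

Shipping the whole pipeline, rather than the model alone, is a simple guard against training-serving skew: the serving side cannot accidentally apply different preprocessing.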

Part 5: Model Evaluation: A Familiar Concept, Cleaner API

PySpark's evaluation API requires instantiating a separate evaluator object. scikit-learn bundles everything into sklearn.metrics.

# PySpark
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)

acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = acc_evaluator.evaluate(predictions)

# scikit-learn
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions))  # precision, recall, F1 in one call

scikit-learn's classification_report() is significantly more informative than PySpark's single-metric evaluators. It gives precision, recall, F1, and support for every class in a single call.

Part 6: When PySpark Still Wins

Making the shift to Pandas/scikit-learn for ML does not mean PySpark becomes irrelevant. As an ML engineer, you'll continue using PySpark for:

Feature computation at scale: computing rolling aggregations, user behavior features, or entity embeddings over billions of rows before they land in your feature store
Training data generation pipelines: joining raw event tables, filtering time windows, and assembling labeled datasets
Batch inference at scale: applying trained sklearn or PyTorch models inside a Spark UDF to score millions of records
Large-scale data validation: running data quality checks (Great Expectations, Deequ) on incoming training data

The common production pattern is: PySpark for data preparation, then Pandas/scikit-learn/PyTorch for model training, then PySpark again for batch scoring. Your DE background gives you a direct advantage in both the first and last mile of this pipeline.
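The batch-scoring leg of that pattern can be sketched with a function shaped for Spark's DataFrame.mapInPandas, which passes an iterator of Pandas DataFrames and expects an iterator back. The helper name make_scorer, the feature columns, and the toy model are assumptions. The function is exercised locally here, and the Spark call appears only as a comment.

```python
from typing import Iterator

import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "salary"]


def make_scorer(model, feature_cols):
    """Build a function with the iterator-in, iterator-out shape mapInPandas expects."""
    def score_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for batch in batches:
            scored = batch.copy()
            scored["score"] = model.predict_proba(batch[feature_cols])[:, 1]
            yield scored
    return score_batches


# Toy training data and model, for illustration only
train = pd.DataFrame({
    "age": [25, 30, 45, 50],
    "salary": [40.0, 50.0, 90.0, 100.0],
    "label": [0, 0, 1, 1],
})
model = LogisticRegression().fit(train[FEATURES], train["label"])
scorer = make_scorer(model, FEATURES)

# Local check without a cluster: feed two Pandas batches through the scorer
results = list(scorer(iter([train.iloc[:2], train.iloc[2:]])))

# On a cluster, assuming a Spark DataFrame `spark_df` with the same columns:
# scored_df = spark_df.mapInPandas(
#     scorer, schema="age long, salary double, label long, score double")
```

Keeping the scoring logic in a plain function like this means it can be unit tested on a laptop before it ever runs on executors.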

Part 7: Practical Learning Path

If you're working through this transition, here's a sequenced approach based on the operations data engineers use most.

Week 1: Core Pandas fluency

Rebuild 3 of your existing PySpark ETL jobs in Pandas
Master: filtering, groupby/agg, merge, apply, pivot_table
Read: the "Comparison with SQL" page in the pandas documentation, which maps closely onto the Spark SQL operations you already know

Week 2: NumPy foundations

Understand NumPy arrays and broadcasting; NumPy is the engine under most Pandas operations
Practice vectorized operations instead of loops
This directly maps to understanding tensor operations in PyTorch later
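As a small illustration of what broadcasting buys you, the sketch below centers each column of a made-up matrix twice: once with a vectorized expression and once with explicit loops.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = X.mean(axis=0)   # shape (2,)
centered = X - col_means     # the (2,) vector is broadcast across all 3 rows

# Equivalent loop version, shown only for comparison
centered_loop = np.empty_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        centered_loop[i, j] = X[i, j] - col_means[j]

print(centered)
```

Both produce the same array, but the vectorized form runs in compiled code and reads like the math it implements.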

Week 3: scikit-learn core API

The estimator interface: .fit(), .transform(), .predict()
Build your first Pipeline with at least one ColumnTransformer
Implement cross-validation with cross_val_score()
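A minimal sketch of the Week 3 items working together, on synthetic data from make_classification. Passing the whole pipeline to cross_val_score, rather than pre-scaled data, means the scaler is refit inside each fold, so no validation rows influence preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler is fit on the training folds only, once per split
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())
```

Reporting the spread across folds alongside the mean gives a rough sense of how much the metric depends on which rows happened to land in validation.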

Week 4: End-to-end project

Pick a Kaggle tabular dataset, preferably structured like a transaction dataset that resembles your DE work
Build a full pipeline: EDA, feature engineering, model training, evaluation, and serialization with joblib
This becomes your first ML portfolio project, exactly the kind of hands-on work that signals readiness for an ML engineering role

Quick Reference: PySpark to Pandas/scikit-learn Cheat Sheet

| Operation | PySpark | Pandas / scikit-learn |
| --- | --- | --- |
| Filter rows | `df.filter(condition)` | `df[condition]` or `df.query()` |
| Select columns | `df.select('a', 'b')` | `df[['a', 'b']]` |
| Add column | `df.withColumn('c', expr)` | `df.assign(c=expr)` |
| GroupBy agg | `df.groupBy('x').agg(F.mean('y'))` | `df.groupby('x').agg(avg_y=('y','mean'))` |
| Join | `df1.join(df2, on='id', how='left')` | `pd.merge(df1, df2, on='id', how='left')` |
| Drop nulls | `df.dropna(subset=['col'])` | `df.dropna(subset=['col'])` |
| Fill nulls | `df.fillna({'col': val})` | `df.fillna({'col': val})` |
| Encode categoricals | `StringIndexer` + `OneHotEncoder` | `OneHotEncoder` (single step) |
| Scale features | `StandardScaler(inputCol='features')` | `StandardScaler()` on array directly |
| Build pipeline | `Pipeline(stages=[...])` | `Pipeline([('step', transformer)])` |
| Evaluate model | `BinaryClassificationEvaluator` | `roc_auc_score()`, `classification_report()` |
| Lazy evaluation | Yes, actions trigger execution | No, eager by default |

Key Takeaways

The shift from PySpark to Pandas/scikit-learn is a surface-level API change over a familiar conceptual foundation. Every concept you already know, including transformations, aggregations, joins, pipelines, and train/test splits, exists in both worlds. What changes is:

Execution model: eager vs. lazy; in-memory vs. distributed
Feature representation: native columns vs. assembled vectors
Pipeline structure: sequential stages vs. parallel ColumnTransformer
Evaluation API: single-metric evaluators vs. a comprehensive metrics module

Your data engineering instincts around schema validation, null handling, data quality, and pipeline reproducibility are directly applicable to ML work. The transition is about building on top of what you have, not starting over.

8 Machine Learning Projects for Software Engineers to Build in 2026

Nyson Markus — Mon, 06 Apr 2026 01:32:53 +0000

Most ML project lists are built for data science students. This one is built for software engineers who already know how to ship production code and want to demonstrate ML competence to hiring teams, not just familiarity with scikit-learn.

Every project here is chosen for one reason: it forces you to solve problems that show up in real ML engineering roles, not just in Kaggle notebooks. The stack choices are opinionated and current. The "what it actually demonstrates" notes are written from the perspective of what a hiring manager at a product company looks for, not what makes a clean tutorial.

Projects are ordered from foundational to advanced. Each builds on patterns from the one before it.

1. Text Classification Pipeline With Drift Monitoring

What you build: A sentiment or topic classifier trained on a public dataset (Amazon reviews, AG News), wrapped in a FastAPI endpoint, with a basic drift detection layer that flags when incoming text starts diverging from the training distribution.

Stack: Python, scikit-learn or HuggingFace, FastAPI, Evidently AI, Docker

The production element most people skip: The drift monitor. Most engineers build the classifier and stop. Adding Evidently to track feature drift over time and log alerts when distribution shifts exceed a threshold is what turns this from a tutorial into an ML system.

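# Written against the legacy Evidently Report API; newer releases reorganized
# these imports, so check the version you have installed.
from evidently.report import Report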
from evidently.metric_preset import DataDriftPreset
import pandas as pd

def check_drift(reference_data: pd.DataFrame, current_data: pd.DataFrame) -> dict:
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)
    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    return {"drift_detected": drift_detected, "report": result}
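For intuition about what a drift monitor computes, here is a dependency-light sketch of the Population Stability Index (PSI) on a single numeric feature, such as text length or model confidence. This is a swapped-in illustration, not Evidently's implementation. The function name, the bin count, and the synthetic data are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between two samples, using bins defined by reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from the reference distribution (bins - 1 cut points)
    interior = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(
        np.searchsorted(interior, reference, side="right"), minlength=bins)
    cur_counts = np.bincount(
        np.searchsorted(interior, current, side="right"), minlength=bins)

    # Floor the fractions so empty bins do not produce log(0)
    eps = 1e-6
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic example: the "current" window is shifted by one standard deviation
rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right alert threshold depends on the feature and should be tuned against real traffic.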

What it demonstrates: Model serving, containerization, and the monitoring mindset that separates MLEs from notebook practitioners.

2. Feature Store From Scratch

What you build: A lightweight feature store that computes, stores, and serves features for a tabular ML problem (churn prediction, loan default). Features are computed offline, stored in a database, and retrieved at inference time via a point-in-time correct query that prevents future leakage.

Stack: Python, PostgreSQL or Redis, Feast (or hand-rolled), FastAPI

The production element most people skip: Point-in-time correctness. Most engineers join features naively on entity ID, which leaks future data into training. A real feature store retrieves the feature value that existed at the time of the label, not the latest value.

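from datetime import datetime

def get_features_at_timestamp(
    entity_id: str,
    timestamp: datetime,
    feature_names: list[str],
    conn
) -> dict:
    # Assumes a connection object whose execute() accepts %s placeholders
    # and returns a cursor, such as a psycopg 3 connection.
    query = """
        SELECT feature_name, feature_value
        FROM feature_store
        WHERE entity_id = %s
          AND feature_name = ANY(%s)
          AND computed_at <= %s
        ORDER BY computed_at DESC
    """
    rows = conn.execute(query, (entity_id, feature_names, timestamp)).fetchall()
    # Rows arrive newest first, so the first value seen for each feature is the
    # latest one computed at or before the requested timestamp.
    seen = {}
    for name, value in rows:
        if name not in seen:
            seen[name] = value
    return seen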
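The same point-in-time rule applies when you assemble a training set offline. One way to sketch it in Pandas is pd.merge_asof, which matches each label row to the most recent feature row at or before the label's timestamp. The tables and column names below are made up for illustration.

```python
import pandas as pd

labels = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "label_time": pd.to_datetime(["2026-01-10", "2026-02-10", "2026-01-15"]),
    "churned": [0, 1, 0],
})
features = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "computed_at": pd.to_datetime(
        ["2026-01-01", "2026-02-01", "2026-01-05", "2026-01-20"]),
    "orders_30d": [5, 1, 7, 2],
})

# Both frames must be sorted by their time key before an as-of merge
training_set = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("computed_at"),
    left_on="label_time",
    right_on="computed_at",
    by="entity_id",
    direction="backward",  # only look at feature rows at or before the label time
)
print(training_set[["entity_id", "label_time", "computed_at", "orders_30d"]])
```

Note that the u2 label dated 2026-01-15 picks up the value computed on 2026-01-05, not the later one from 2026-01-20. A naive join on entity_id alone would have leaked that later value into training.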

What it demonstrates: Understanding of training-serving skew, data leakage, and production feature pipelines — one of the most commonly tested concepts in MLE system design interviews.

3. Fine-Tuned LLM With Evaluation Harness

What you build: A domain-specific fine-tuned model using LoRA/QLoRA on a task like legal clause classification, medical note summarization, or code review comment generation. The evaluation harness runs the model against a golden test set on every training run and logs results to an experiment tracker.

Stack: Python, HuggingFace PEFT, QLoRA, Weights & Biases, PyTorch

The production element most people skip: The evaluation harness. Most engineers fine-tune, check loss curves, and call it done. Building a golden set of 50-100 human-labeled examples and writing automated evaluation that runs on every checkpoint is what makes this a system.

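from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def apply_lora(model_name: str, r: int = 8, lora_alpha: int = 16):
    # Load the base model in 4-bit precision (the "Q" in QLoRA).
    # Requires the bitsandbytes package and a supported GPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto"
    )
    # Standard preparation step before training adapters on a quantized model
    model = prepare_model_for_kbit_training(model)
    config = LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        # Attention projection names used by LLaMA-style models; other
        # architectures may name these modules differently.
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    return get_peft_model(model, config)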
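The evaluation harness itself can start very small. The sketch below is framework-agnostic: it takes any predict function that maps a prompt to a string, runs it over a golden set, and returns an accuracy plus the failing cases. GoldenExample, evaluate_on_golden_set, and the keyword baseline standing in for a fine-tuned model are all hypothetical names and data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    prompt: str
    expected: str


def evaluate_on_golden_set(
    predict: Callable[[str], str],
    golden_set: list[GoldenExample],
) -> dict:
    """Run predict over every golden example and collect exact-match failures."""
    failures = []
    for example in golden_set:
        output = predict(example.prompt).strip().lower()
        if output != example.expected.strip().lower():
            failures.append(
                {"prompt": example.prompt, "expected": example.expected, "got": output}
            )
    total = len(golden_set)
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"accuracy": accuracy, "n": total, "failures": failures}


# Stand-in for a model checkpoint: a trivial keyword rule
def keyword_baseline(prompt: str) -> str:
    return "indemnity" if "indemnify" in prompt.lower() else "other"


golden = [
    GoldenExample("The supplier shall indemnify the buyer.", "indemnity"),
    GoldenExample("This agreement is governed by Dutch law.", "governing_law"),
    GoldenExample("Each party shall indemnify the other.", "indemnity"),
    GoldenExample("Payment is due within 30 days.", "other"),
]
report = evaluate_on_golden_set(keyword_baseline, golden)
print(report["accuracy"], len(report["failures"]))
```

In a real run, predict would wrap the checkpoint's generation call, and the returned dictionary would be logged to the experiment tracker once per checkpoint. Exact match suits classification tasks; summarization or generation tasks would need a different scoring function.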

What it demonstrates: Modern LLM adaptation techniques, experiment tracking discipline, and evaluation methodology — all directly relevant to applied ML roles in 2026.

4. Real-Time Fraud Detection System

What you build: A streaming fraud detection pipeline that consumes transaction events from Kafka, computes real-time features (time since last transaction, rolling spend deviation), runs a trained classifier, and logs decisions with confidence scores for auditing.

Stack: Python, Apache Kafka, Redis (for real-time feature retrieval), XGBoost or LightGBM, FastAPI

The production element most people skip: Handling class imbalance correctly in both training and threshold selection. Fraud datasets are typically 0.1-1% positive. Training without addressing this can produce a model that predicts "not fraud" for everything and still achieves 99% accuracy or better. The threshold for flagging fraud should be tuned on business cost, not F1.

import lightgbm as lgb

def train_fraud_model(X_train, y_train):
    # class_weight='balanced' already reweights classes inversely to their frequency.
    # Do not also pass balanced sample weights to fit(): LightGBM multiplies
    # class_weight by sample_weight, so the correction would be applied twice.
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
        class_weight='balanced'
    )
    model.fit(X_train, y_train)
    return model

def select_threshold_by_cost(
    y_true, y_proba,
    cost_fn: float = 10,
    cost_fp: float = 1
) -> float:
    best_threshold, best_cost = 0.5, float('inf')
    for t in [i / 100 for i in range(1, 100)]:
        preds = (y_proba >= t).astype(int)
        fn = ((preds == 0) & (y_true == 1)).sum()
        fp = ((preds == 1) & (y_true == 0)).sum()
        total_cost = fn * cost_fn + fp * cost_fp
        if total_cost < best_cost:
            best_cost, best_threshold = total_cost, t
    return best_threshold
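
The two real-time features named in the project description (time since last transaction, rolling spend deviation) can be sketched without any infrastructure. This is an illustration only: the class name, the window size, and the in-memory dictionaries are assumptions standing in for whatever you keep in Redis.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class TransactionFeatureState:
    """Per-user rolling state. An in-memory stand-in for a Redis-backed store."""

    def __init__(self, window: int = 20):
        self.window = window
        # Keep only the most recent `window` amounts per user.
        self._amounts = defaultdict(lambda: deque(maxlen=window))
        self._last_ts = {}

    def update_and_compute(self, user_id: str, amount: float, ts: float) -> dict:
        history = self._amounts[user_id]
        last_ts = self._last_ts.get(user_id)
        seconds_since_last = None if last_ts is None else ts - last_ts

        # Compute the deviation from history *before* this event, so the
        # current amount does not leak into its own baseline.
        if len(history) >= 2:
            mu, sigma = mean(history), pstdev(history)
            zscore = (amount - mu) / sigma if sigma > 0 else 0.0
        else:
            zscore = 0.0

        # Only now fold the current event into the state.
        history.append(amount)
        self._last_ts[user_id] = ts
        return {"seconds_since_last_txn": seconds_since_last, "spend_zscore": zscore}
```

The ordering is the point: features are read from state first and the state is updated afterward, which is the same discipline that keeps training features point-in-time correct.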

What it demonstrates: Streaming data pipelines, imbalanced classification, business-aware threshold tuning, and real-time serving — a complete production ML system.

5. RAG System With Retrieval Evaluation

What you build: A retrieval-augmented generation system over a document corpus (company docs, research papers, a Wikipedia subset). The system chunks documents, generates embeddings, stores them in a vector database, retrieves context at query time, and passes it to an LLM. Critically, it includes retrieval evaluation that measures whether the right chunks are being retrieved.

Stack: Python, LangChain or LlamaIndex, Pinecone or ChromaDB, OpenAI or open-source LLM, RAGAS

The production element most people skip: Retrieval evaluation. Most engineers build the RAG pipeline and eyeball a few outputs. RAGAS gives you automated metrics for context precision, context recall, and answer faithfulness. Without these, you have no way to know if chunking strategy or embedding model changes actually improved the system.

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)
from datasets import Dataset

def evaluate_rag_pipeline(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str]
) -> dict:
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    }
    dataset = Dataset.from_dict(data)
    result = evaluate(
        dataset,
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy]
    )
    return result
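
If you want a cross-check that does not depend on an LLM judge, you can also score retrieval directly. The sketch below assumes each golden question is labeled with the IDs of the chunks that should come back, which is a hypothetical labeling scheme and not what RAGAS computes. It reports recall@k and mean reciprocal rank.

```python
from typing import Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set is empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    runs: List[Tuple[List[str], Set[str]]], k: int = 5
) -> Dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs, one per question."""
    if not runs:
        raise ValueError("no runs to evaluate")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Running this once per chunking strategy on the same labeled questions gives you a like-for-like comparison before any generation happens.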

What it demonstrates: The full LLM application stack, embedding and retrieval systems, and evaluation discipline for generative systems — directly aligned with what most AI product teams are hiring for in 2026.

6. ML Pipeline With Full CI/CD

What you build: A complete ML pipeline where every commit triggers automated tests, the model is retrained on new data if tests pass, evaluation metrics are compared against the currently deployed model, and deployment only proceeds if the new model wins on a held-out test set. No manual steps.

Stack: Python, GitHub Actions, DVC (data version control), MLflow, Docker, any cloud (AWS/GCP/Azure)

The production element most people skip: The model promotion gate. Most CI/CD tutorials for ML cover training automation but stop before the comparison step. A real pipeline only deploys if the challenger model outperforms the champion on the evaluation set.

# .github/workflows/ml_pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'data/**'
      - 'src/**'
      - 'params.yaml'

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run data validation tests
        run: pytest tests/test_data.py

      - name: Train model
        run: python src/train.py

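      - name: Evaluate and compare vs champion
        # The id is required so the deploy step can read steps.evaluate.outputs.
        # evaluate.py must append challenger_wins=true or challenger_wins=false
        # to the file named by $GITHUB_OUTPUT.
        id: evaluate
        run: python src/evaluate.py --compare-champion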

      - name: Deploy if challenger wins
        if: ${{ steps.evaluate.outputs.challenger_wins == 'true' }}
        run: python src/deploy.py
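
The workflow calls `src/evaluate.py` but the gate logic itself is not shown. Here is a minimal sketch of what that comparison might look like, assuming a single higher-is-better metric. The function names and the `min_gain` margin are hypothetical; appending `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable is how GitHub Actions steps publish outputs.

```python
import os
from typing import Optional


def challenger_wins(
    champion_score: float, challenger_score: float, min_gain: float = 0.0
) -> bool:
    """Higher is better. Ties (and gains below min_gain) go to the champion."""
    return challenger_score > champion_score + min_gain


def write_step_output(name: str, value: str, path: Optional[str] = None) -> None:
    """Publish a step output for later workflow steps to read."""
    path = path or os.environ.get("GITHUB_OUTPUT")
    if not path:
        # Not running inside GitHub Actions: print so local runs stay debuggable.
        print(f"{name}={value}")
        return
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def run_gate(champion_score: float, challenger_score: float, min_gain: float = 0.0) -> bool:
    wins = challenger_wins(champion_score, challenger_score, min_gain)
    # Lowercase string, because the workflow compares against the literal 'true'.
    write_step_output("challenger_wins", "true" if wins else "false")
    return wins
```

Requiring a small positive `min_gain` is a design choice worth considering: it keeps evaluation noise from promoting a model that is not meaningfully better.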

What it demonstrates: MLOps maturity, reproducible training pipelines, and automated model governance — the skills that hiring managers consistently say separate production-ready candidates from notebook-only practitioners.

7. Computer Vision Inference Service With Batching

What you build: An object detection or image classification model (fine-tuned on a custom dataset using a YOLO or EfficientNet backbone) served behind an API that supports dynamic request batching — grouping individual inference requests together and processing them as a batch to maximize GPU throughput.

Stack: Python, PyTorch, Ultralytics YOLO or timm, FastAPI, NVIDIA Triton Inference Server or custom batching logic

The production element most people skip: Dynamic batching. Most engineers serve one image per request, which can leave GPU utilization as low as 10-20% under real load. A batching layer collects requests over a short time window and processes them together, which raises throughput substantially. The latency cost is bounded: each request waits at most the length of the window before its batch runs, and at moderate traffic that is usually a small fraction of total request time.

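import asyncio

import torch

class DynamicBatcher:
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()

    async def infer(self, image_tensor):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((image_tensor, future))
        return await future

    async def process_batches(self):
        # Run once as a background task, e.g. asyncio.create_task(batcher.process_batches()).
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the batching window.
            # This avoids waking up every max_wait_ms when the queue is idle.
            tensor, future = await self.queue.get()
            batch, futures = [tensor], [future]
            deadline = loop.time() + self.max_wait_ms / 1000

            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    tensor, future = await asyncio.wait_for(
                        self.queue.get(), timeout=timeout
                    )
                except asyncio.TimeoutError:
                    break
                batch.append(tensor)
                futures.append(future)

            try:
                # Note: the forward pass blocks the event loop while it runs.
                with torch.no_grad():
                    results = self.model(torch.stack(batch))
            except Exception as exc:
                # Fail every waiting caller instead of leaving them hanging.
                for future in futures:
                    if not future.done():
                        future.set_exception(exc)
                continue

            for future, result in zip(futures, results):
                if not future.done():  # the caller may have been cancelled
                    future.set_result(result)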
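
To get a feel for the latency versus throughput tradeoff before tuning anything, a back-of-the-envelope estimate helps. The sketch below is not a queueing model. It assumes requests arrive at a steady rate and ignores inference time, and the function name and return keys are made up for illustration.

```python
def estimate_batching(
    arrival_rate_rps: float,
    max_wait_ms: float = 10.0,
    max_batch_size: int = 32,
) -> dict:
    """Rough batch size and added wait for a time-windowed batcher.

    Assumes evenly spaced arrivals; real traffic is burstier than this.
    """
    if arrival_rate_rps <= 0 or max_wait_ms <= 0 or max_batch_size < 1:
        raise ValueError("rate, wait and batch size must be positive")

    # How many requests arrive during one full batching window.
    arrivals_per_window = arrival_rate_rps * max_wait_ms / 1000.0
    # A batch holds at least one request and at most max_batch_size.
    expected_batch = min(float(max_batch_size), max(1.0, arrivals_per_window))

    # If a batch fills before the window ends it is dispatched early,
    # so the worst-case added wait is the smaller of the two.
    fill_time_ms = max_batch_size / arrival_rate_rps * 1000.0
    worst_case_wait_ms = min(max_wait_ms, fill_time_ms)

    return {
        "expected_batch_size": expected_batch,
        "worst_case_wait_ms": worst_case_wait_ms,
    }
```

Under these assumptions, low traffic pays the full window for batches of one, while high traffic fills batches early and waits less. That is why the window and batch size are worth tuning against your actual request rate.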

What it demonstrates: GPU-aware serving, latency vs throughput tradeoffs, and inference optimization — all of which appear in MLE system design interviews and on-the-job performance reviews.

8. End-to-End Recommendation System

What you build: A two-tower retrieval and ranking system. The retrieval tower generates user and item embeddings and uses approximate nearest neighbor search to retrieve candidates. A separate ranking model scores the candidates using additional features. Both stages are served via API and the full system logs impressions and clicks for future retraining.

Stack: Python, PyTorch, Faiss (ANN search), FastAPI, PostgreSQL (interaction logging), Airflow (retraining schedule)

The production element most people skip: The two-stage architecture itself. Most engineers build a single model that scores every item for every request, which stops being practical once the catalog grows past a few thousand items. The retrieval-then-ranking split is the pattern that large-scale recommenders at companies like Netflix, Spotify, and YouTube are built around.

import torch
import torch.nn as nn
import faiss
import numpy as np

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, embedding_dim: int = 64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim)
        )

    def forward(self, user_features, item_features):
        user_emb = self.user_tower(user_features)
        item_emb = self.item_tower(item_features)
        return torch.sum(user_emb * item_emb, dim=1)

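def build_faiss_index(item_embeddings: np.ndarray) -> faiss.Index:
    # Faiss expects contiguous float32. np.array copies by default, which also
    # keeps normalize_L2 (an in-place operation) from mutating the caller's array.
    items = np.array(item_embeddings, dtype=np.float32, order="C")
    dim = items.shape[1]
    # IndexFlatIP is exact brute-force search; swap in an IVF or HNSW index
    # for approximate search on large catalogs.
    index = faiss.IndexFlatIP(dim)
    # Inner product on unit vectors equals cosine similarity. The model above
    # trains on raw dot products, so normalize there too if the two should agree.
    faiss.normalize_L2(items)
    index.add(items)
    return index

def retrieve_candidates(
    user_embedding: np.ndarray,
    index: faiss.Index,
    k: int = 100
) -> np.ndarray:
    # Accept a 1-D vector or a (1, dim) array; Faiss needs a 2-D float32 query.
    query = np.array(user_embedding, dtype=np.float32, order="C").reshape(1, -1)
    faiss.normalize_L2(query)
    distances, indices = index.search(query, k)
    return indices[0]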
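
The code above covers the retrieval tower and the index but not how the two stages fit together. Here is a dependency-free sketch of that composition. Brute-force dot products stand in for Faiss, and `rank_score` is a hypothetical callable standing in for the separate ranking model.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    user_emb: Sequence[float],
    item_embs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Stage 1: cheap similarity over the whole catalog, keep the top k item IDs."""
    return heapq.nlargest(
        k, item_embs, key=lambda item_id: dot(user_emb, item_embs[item_id])
    )


def recommend(
    user_emb: Sequence[float],
    item_embs: Dict[str, Sequence[float]],
    rank_score: Callable[[str], float],
    k_retrieve: int = 100,
    k_final: int = 10,
) -> List[str]:
    """Stage 2: run the expensive ranker only on the retrieved candidates."""
    candidates = retrieve(user_emb, item_embs, k_retrieve)
    ranked = sorted(candidates, key=rank_score, reverse=True)
    return ranked[:k_final]
```

Note the consequence of the split: an item the retrieval stage drops is never seen by the ranker, however highly the ranker would have scored it. That is why retrieval recall deserves its own metric.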

What it demonstrates: The most commonly asked ML system design question in interviews ("design a recommendation system"), implemented end-to-end with the architecture that actually scales — retrieval, ranking, logging, and retraining loop included.

What Separates These Projects From Tutorial Clones

Every project above has one thing in common: it includes the part that tutorials skip. Drift monitoring, point-in-time correct features, retrieval evaluation, model promotion gates, dynamic batching, two-stage retrieval. These are the elements that show up in production ML systems and almost never in beginner resources.

Building these projects also changes how you talk about your work in interviews. Compare a candidate who says "I built a RAG system" with one who says "I built a RAG system and measured context recall and faithfulness using RAGAS across three chunking strategies." The difference is not wording. The second answer shows the project was treated as a system to be measured, not a demo to be finished.

The other thing worth knowing is that project selection matters as much as project execution. Engineers who are making the transition from software engineer to machine learning engineer often overbuild in one area and neglect others.

A portfolio with three NLP projects and nothing on serving or monitoring reads differently to a hiring team than one that covers the full ML lifecycle. For a detailed breakdown of how to structure and present these projects, this ML engineer portfolio guide covers what hiring managers actually look for beyond the GitHub link.