Forem: Akhilesh

70. Hyperparameter Tuning: Finding the Best Settings.

Akhilesh — Tue, 12 May 2026 18:55:42 +0000

You picked a model. You trained it. You got decent accuracy. Then someone asks: did you tune the hyperparameters?

You picked max_depth=5 because it felt right. Learning rate 0.1 because you saw it in a tutorial. Number of trees because 100 is a round number.

That's guessing. Hyperparameter tuning replaces guessing with a systematic search. It finds the combination of settings that actually works best for your specific data.

What You'll Learn Here

What hyperparameters are and why they matter
Grid search: exhaustive but slow
Random search: faster and often just as good
Bayesian optimization with Optuna: smarter search
How to avoid overfitting your validation set during tuning
Nested cross-validation for honest evaluation
Practical tuning strategy for real projects

Parameters vs Hyperparameters

First the distinction, because people mix these up.

Parameters are learned by the model during training. The weights in a neural network. The split thresholds in a decision tree. You don't set these. The training algorithm finds them.

Hyperparameters are set by you before training. They control how the training happens.

Model parameters (learned):
  - Decision tree split thresholds
  - Linear regression coefficients
  - Neural network weights

Hyperparameters (you set these):
  - max_depth in a decision tree
  - n_estimators in a random forest
  - learning_rate in XGBoost
  - C and gamma in SVM
  - n_neighbors in KNN
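Here's a minimal sketch of the difference, using scikit-learn's Ridge regression on made-up toy data (the numbers are an assumption for illustration only). alpha is a hyperparameter you pass in. coef_ is a parameter the model learns.

```python
from sklearn.linear_model import Ridge

# Toy data (illustrative): y = 2x + 1
X_toy = [[0], [1], [2], [3], [4]]
y_toy = [1, 3, 5, 7, 9]

# alpha is a hyperparameter: chosen by you, before training
weak_reg = Ridge(alpha=1.0).fit(X_toy, y_toy)
strong_reg = Ridge(alpha=100.0).fit(X_toy, y_toy)

# coef_ is a parameter: learned from the data during fit()
print(f"alpha=1   -> learned slope {weak_reg.coef_[0]:.3f}")
print(f"alpha=100 -> learned slope {strong_reg.coef_[0]:.3f}")
```

Same data, same algorithm. Only the hyperparameter changed, and the learned slope changed with it.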

Changing hyperparameters changes how the model learns. Wrong settings lead to overfitting, underfitting, or slow convergence. Good settings squeeze out the best possible performance.

Grid Search: Try Everything

Grid search is the simplest approach. You define a grid of hyperparameter values. It tries every possible combination. It returns the best one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import pandas as pd
import time

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define the grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth':    [3, 5, 10, None],
    'min_samples_leaf': [1, 2, 4],
}

# Total combinations = 3 * 4 * 3 = 36
# With 5-fold CV = 36 * 5 = 180 model fits
total_fits = (len(param_grid['n_estimators']) *
              len(param_grid['max_depth']) *
              len(param_grid['min_samples_leaf'])) * 5

print(f"Grid combinations: {total_fits // 5}")
print(f"Total model fits with 5-fold CV: {total_fits}")

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

start = time.time()
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
elapsed = time.time() - start

print(f"\nSearch time: {elapsed:.1f}s")
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_test, grid_search.predict(X_test)):.3f}")

Output:

Grid combinations: 36
Total model fits with 5-fold CV: 180
Fitting 5 folds for each of 36 candidates...
Search time: 8.2s
Best params: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV score: 0.967
Test accuracy: 0.974

Grid search is thorough. But it scales badly. If you add one more hyperparameter with 4 values, you go from 36 combinations to 144. With many hyperparameters and large ranges, grid search becomes impractical.
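You can check that arithmetic before launching a search. This is a small sketch using only the standard library. The extra max_features entry is a hypothetical fourth hyperparameter added for illustration.

```python
from itertools import product
from math import prod

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 2, 4],
}

def grid_size(grid):
    """Number of combinations = product of the list lengths."""
    return prod(len(values) for values in grid.values())

# Hypothetical extra hyperparameter with 4 values
bigger_grid = {**param_grid, 'max_features': ['sqrt', 'log2', 0.5, 0.7]}

cv_folds = 5
for name, grid in [('original', param_grid), ('one more param', bigger_grid)]:
    n = grid_size(grid)
    print(f"{name}: {n} combinations, {n * cv_folds} fits with {cv_folds}-fold CV")

# Enumerating every combination gives the same count
all_combos = list(product(*param_grid.values()))
print(f"Enumerated: {len(all_combos)}")
```

The cost multiplies with every list you add. That's why grids stay small in practice.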

Analyzing Grid Search Results

# Look at all results as a dataframe
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df[[
    'param_n_estimators', 'param_max_depth',
    'param_min_samples_leaf', 'mean_test_score', 'std_test_score'
]].sort_values('mean_test_score', ascending=False)

print("Top 10 results:")
print(results_df.head(10).to_string(index=False))

Reading these results helps you understand which parameters matter most and which ones barely affect performance.

Random Search: Faster and Often Just as Good

Instead of trying every combination, random search samples random combinations. It covers a much wider range with fewer trials.

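Why does it work? Usually only a few hyperparameters really matter, and each one has large "flat" regions. Moving max_depth from 7 to 8 might not matter. But moving it from 3 to 15 might matter a lot. A grid with 3 values per parameter only ever tests 3 depths, no matter how many combinations it runs. Random search draws a fresh value for every parameter on every trial, so it covers the full range more efficiently than a coarse grid.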
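A quick way to see this is to count how many different values of one parameter each method tries in the same budget. This is a sketch with the standard library only. The ranges and the 9-trial budget are assumptions picked for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 9 trials as a 3 x 3 grid over n_estimators and min_samples_leaf
grid_trials = [(n, leaf) for n in (50, 275, 500) for leaf in (1, 5, 10)]

# 9 trials drawn at random from the same ranges
random_trials = [(rng.randint(50, 500), rng.randint(1, 10)) for _ in range(9)]

# How many different n_estimators values did each method test?
distinct_grid = len({n for n, _ in grid_trials})
distinct_random = len({n for n, _ in random_trials})

print(f"Grid:   {len(grid_trials)} trials, {distinct_grid} distinct n_estimators values")
print(f"Random: {len(random_trials)} trials, {distinct_random} distinct n_estimators values")
```

If n_estimators turns out to be the parameter that matters, the grid spent 9 fits to learn about 3 values of it.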

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions instead of fixed lists
param_dist = {
    'n_estimators':     randint(50, 500),      # integers from 50 to 499 (upper bound is exclusive)
    'max_depth':        [3, 5, 7, 10, 15, None],
    'min_samples_leaf': randint(1, 10),
    'max_features':     ['sqrt', 'log2', 0.5, 0.7],
    'min_samples_split':randint(2, 20),
}

rf_r = RandomForestClassifier(random_state=42, n_jobs=-1)

start = time.time()
random_search = RandomizedSearchCV(
    estimator=rf_r,
    param_distributions=param_dist,
    n_iter=50,          # try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
elapsed = time.time() - start

print(f"Search time: {elapsed:.1f}s")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_test, random_search.predict(X_test)):.3f}")

Output:

Search time: 6.3s
Best params: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1,
              'min_samples_split': 4, 'n_estimators': 347}
Best CV score: 0.971
Test accuracy: 0.982

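Random search scored slightly higher in similar time, likely because it explored a wider space. The grid search only tried 3 values for n_estimators. Random search drew integers from anywhere between 50 and 499. Treat the gap with caution though: on a 114-sample test set, 0.974 vs 0.982 is a single prediction.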

Rule of thumb: use random search over grid search almost always. Only use grid search when you've already narrowed down the important ranges with random search and want to fine-tune.

Optuna: Bayesian Optimization

Grid and random search have no memory. Each trial is independent. They don't learn from previous results.

Optuna uses Bayesian optimization. Its default sampler is a Tree-structured Parzen Estimator (TPE). It builds a model of which parameter regions are promising and focuses future trials there. It often finds better results in fewer trials, especially when each trial is expensive.

pip install optuna

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Suppress optuna logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Define the search space
    n_estimators  = trial.suggest_int('n_estimators', 50, 500)
    max_depth     = trial.suggest_categorical('max_depth', [3, 5, 7, 10, 15, None])
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
    max_features  = trial.suggest_categorical('max_features', ['sqrt', 'log2', 0.5])
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=42,
        n_jobs=-1
    )

    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

start = time.time()
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)  # seeded so runs are repeatable
)
study.optimize(objective, n_trials=50, show_progress_bar=True)
elapsed = time.time() - start

print(f"\nSearch time: {elapsed:.1f}s")
print(f"Best params: {study.best_params}")
print(f"Best CV score: {study.best_value:.3f}")

# Train final model with best params
best_rf = RandomForestClassifier(
    **study.best_params, random_state=42, n_jobs=-1
)
best_rf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, best_rf.predict(X_test)):.3f}")

Output:

Search time: 12.4s
Best params: {'n_estimators': 423, 'max_depth': 10, 'min_samples_leaf': 1,
              'max_features': 'sqrt', 'min_samples_split': 3}
Best CV score: 0.974
Test accuracy: 0.982

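Optuna reached the highest CV score here (0.974 vs 0.971 for random search), though the test accuracy is identical. On a small dataset like this the methods end up close. A guided search tends to pay off more with larger search spaces and models that are expensive to train.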

Visualizing Optuna Results

import matplotlib.pyplot as plt

# Plot optimization history
trials_df = study.trials_dataframe()

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(trials_df['number'], trials_df['value'], alpha=0.5, color='blue', linewidth=1)
best_so_far = trials_df['value'].cummax()
plt.plot(trials_df['number'], best_so_far, color='red', linewidth=2, label='Best so far')
plt.xlabel('Trial')
plt.ylabel('CV Accuracy')
plt.title('Optimization History')
plt.legend()

plt.subplot(1, 2, 2)
# Parameter importance
importances = optuna.importance.get_param_importances(study)
params = list(importances.keys())
values = list(importances.values())
plt.barh(params, values, color='steelblue')
plt.xlabel('Importance')
plt.title('Hyperparameter Importance')

plt.tight_layout()
plt.savefig('optuna_results.png', dpi=100)
plt.show()

print("\nHyperparameter importance:")
for param, imp in importances.items():
    print(f"  {param}: {imp:.3f}")

The importance plot shows which hyperparameters actually mattered. If n_estimators has near-zero importance, you don't need to tune it carefully. Focus on the ones that matter.

Tuning XGBoost With Optuna

XGBoost has many hyperparameters. Optuna handles this better than grid search.

import xgboost as xgb

def xgb_objective(trial):
    params = {
        'n_estimators':    trial.suggest_int('n_estimators', 100, 1000),
        'max_depth':       trial.suggest_int('max_depth', 3, 8),
        'learning_rate':   trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample':       trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree':trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha':       trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda':      trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'random_state': 42,
        'eval_metric': 'logloss',
        'verbosity': 0
    }

    model = xgb.XGBClassifier(**params)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

study_xgb = optuna.create_study(direction='maximize')
study_xgb.optimize(xgb_objective, n_trials=50, show_progress_bar=True)

print(f"\nXGBoost best CV: {study_xgb.best_value:.3f}")
print(f"Best params: {study_xgb.best_params}")

best_xgb = xgb.XGBClassifier(**study_xgb.best_params, random_state=42, verbosity=0)
best_xgb.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, best_xgb.predict(X_test)):.3f}")
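Notice the log=True on learning_rate and the regularization terms above. It asks for sampling on a log scale, so small values get as much attention as large ones. The sketch below mirrors that idea with the standard library. It is an illustration of log-uniform sampling, not Optuna's internal implementation, and the 0.05 cutoff is an arbitrary choice.

```python
import math
import random

rng = random.Random(42)
low, high = 0.01, 0.3  # same range as learning_rate above

def sample_log_uniform(rng, low, high):
    """Draw uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

log_samples = [sample_log_uniform(rng, low, high) for _ in range(10_000)]
lin_samples = [rng.uniform(low, high) for _ in range(10_000)]

def share_below(samples, cutoff):
    """Fraction of samples smaller than cutoff."""
    return sum(s < cutoff for s in samples) / len(samples)

cutoff = 0.05
share_log = share_below(log_samples, cutoff)
share_lin = share_below(lin_samples, cutoff)

print(f"Share of trials with learning_rate < {cutoff}:")
print(f"  log scale:    {share_log:.2f}")
print(f"  linear scale: {share_lin:.2f}")
```

On a linear scale most trials land at large learning rates. On a log scale the small values, where boosting often behaves very differently, get a fair share of the budget.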

The Overfitting Problem in Tuning

Here's a subtle trap. Every time you check the test set during tuning, you leak information about the test set into your choices. If you tune for 200 trials and always pick the best test score, you've effectively trained on the test set.
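A small simulation makes the leak concrete. Assume 200 configurations that are all equally good, each with a true accuracy of 0.90. Score each on a 114-sample test set and keep the best. This is a simplified sketch: it treats every trial's test errors as independent, which real trials on one shared test set are not.

```python
import random

rng = random.Random(0)

true_accuracy = 0.90   # every configuration is equally good (assumption)
test_size = 114        # same size as the 20% test split used above
n_trials = 200

def observed_test_accuracy(rng):
    """Score one configuration: each prediction is right with prob. true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(test_size))
    return correct / test_size

scores = [observed_test_accuracy(rng) for _ in range(n_trials)]
best_observed = max(scores)
mean_observed = sum(scores) / len(scores)

print(f"True accuracy of every config: {true_accuracy:.3f}")
print(f"Average observed test score:   {mean_observed:.3f}")
print(f"Best observed test score:      {best_observed:.3f}")
```

The winner looks better than it is. Nothing improved. You just picked the luckiest draw.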

The solution is nested cross-validation. The inner loop tunes. The outer loop evaluates.

from sklearn.model_selection import cross_val_score, KFold, GridSearchCV

# Inner CV for tuning, outer CV for honest evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv  = KFold(n_splits=3, shuffle=True, random_state=42)

# Simple param grid for speed
param_grid_nested = {
    'n_estimators': [50, 100],
    'max_depth':    [5, 10, None],
}

rf_nested = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_nested = GridSearchCV(rf_nested, param_grid_nested, cv=inner_cv, scoring='accuracy')

# Outer CV gives the honest estimate
nested_scores = cross_val_score(grid_nested, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
print("This is the honest estimate of real-world performance.")
print()

# Compare to non-nested (optimistically biased):
# the same folds both pick the best params and report the score
non_nested = GridSearchCV(rf_nested, param_grid_nested, cv=outer_cv, scoring='accuracy')
non_nested.fit(X, y)
print(f"Non-nested CV: {non_nested.best_score_:.3f}")
print("This can be overly optimistic on small datasets.")

Nested CV is slower but gives you a far less biased estimate, because the data used to score the model never took part in choosing its hyperparameters. Use it when reporting final results, especially on small datasets.

Practical Tuning Strategy

Here's the workflow that works well in practice:

Step 1: Start with default hyperparameters.
        Know your baseline before you tune.

Step 2: Use random search with 50-100 trials
        across a wide range of values.
        This finds the good region fast.

Step 3: Narrow the range based on step 2 results.
        Run Optuna with 50-100 trials in the narrowed space.

Step 4: Focus on the hyperparameters that matter.
        Check Optuna's importance plot.
        Ignore the ones with near-zero importance.

Step 5: Evaluate the final model on the test set once.
        Only once. Never tune based on test set results.

# Full practical example
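from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
import optuna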

optuna.logging.set_verbosity(optuna.logging.WARNING)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1: Baseline
baseline_model = RandomForestClassifier(random_state=42, n_jobs=-1)
baseline_score = cross_val_score(baseline_model, X_train, y_train, cv=5).mean()
print(f"Step 1 - Baseline CV: {baseline_score:.3f}")

# Step 2: Random search wide range
param_dist_wide = {
    'n_estimators':     randint(10, 1000),
    'max_depth':        [2, 3, 5, 7, 10, 15, None],
    'min_samples_leaf': randint(1, 20),
    'max_features':     ['sqrt', 'log2', 0.3, 0.5, 0.7],
}

rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_dist_wide, n_iter=30, cv=5, random_state=42
)
rs.fit(X_train, y_train)
print(f"Step 2 - Random search CV: {rs.best_score_:.3f}")
print(f"         Best params: {rs.best_params_}")

# Step 3: Optuna in narrowed space based on step 2
def narrow_objective(trial):
    model = RandomForestClassifier(
        n_estimators     = trial.suggest_int('n_estimators', 100, 600),
        max_depth        = trial.suggest_categorical('max_depth', [5, 7, 10, None]),
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5),
        max_features     = trial.suggest_categorical('max_features', ['sqrt', 0.5, 0.7]),
        random_state=42, n_jobs=-1
    )
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study_narrow = optuna.create_study(direction='maximize')
study_narrow.optimize(narrow_objective, n_trials=40)
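print(f"Step 3 - Optuna CV: {study_narrow.best_value:.3f}")

# Step 4: See which hyperparameters mattered
print(f"Step 4 - Importances: {optuna.importance.get_param_importances(study_narrow)}")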

# Step 5: Final evaluation on test set (only once)
final_model = RandomForestClassifier(
    **study_narrow.best_params, random_state=42, n_jobs=-1
)
final_model.fit(X_train, y_train)
print(f"\nStep 5 - FINAL Test accuracy: {accuracy_score(y_test, final_model.predict(X_test)):.3f}")
print("\nFinal classification report:")
print(classification_report(y_test, final_model.predict(X_test), target_names=data.target_names))

Comparison: Grid vs Random vs Optuna

import time

results = {}

# Grid search
start = time.time()
gs = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, None]},
    cv=5, n_jobs=-1
)
gs.fit(X_train, y_train)
results['Grid Search']   = {'cv': gs.best_score_, 'time': time.time()-start, 'trials': 9}

# Random search
start = time.time()
rs2 = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    {'n_estimators': randint(50, 500), 'max_depth': [3, 5, 10, None],
     'min_samples_leaf': randint(1, 10)},
    n_iter=50, cv=5, random_state=42, n_jobs=-1
)
rs2.fit(X_train, y_train)
results['Random Search'] = {'cv': rs2.best_score_, 'time': time.time()-start, 'trials': 50}

# Optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
start = time.time()
def comp_obj(trial):
    m = RandomForestClassifier(
        n_estimators     = trial.suggest_int('n_estimators', 50, 500),
        max_depth        = trial.suggest_categorical('max_depth', [3, 5, 10, None]),
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10),
        random_state=42, n_jobs=-1
    )
    return cross_val_score(m, X_train, y_train, cv=5).mean()

s = optuna.create_study(direction='maximize')
s.optimize(comp_obj, n_trials=50)
results['Optuna'] = {'cv': s.best_value, 'time': time.time()-start, 'trials': 50}

print(f"\n{'Method':<16} {'CV Score':<12} {'Time':<10} {'Trials'}")
print("-" * 45)
for method, r in results.items():
    print(f"{method:<16} {r['cv']:.3f}        {r['time']:.1f}s       {r['trials']}")

Quick Cheat Sheet

Method	When to use	Trials needed
Grid Search	Fine-tuning 1-2 params with known ranges	Low (exhaustive)
Random Search	First pass, many params, wide ranges	50-100
Optuna	When you need the best result and have compute	100-500

Task	Code
Grid search	`GridSearchCV(model, param_grid, cv=5)`
Random search	`RandomizedSearchCV(model, param_dist, n_iter=50, cv=5)`
Best params	`.best_params_`
Best CV score	`.best_score_`
Best model	`.best_estimator_`
Optuna study	`optuna.create_study(direction='maximize')`
Run Optuna	`study.optimize(objective, n_trials=100)`
Optuna importance	`optuna.importance.get_param_importances(study)`
Nested CV	`cross_val_score(GridSearchCV(...), X, y, cv=outer_cv)`

Practice Challenges

Level 1:
Run grid search on load_wine() with a RandomForest. Try n_estimators of [50, 100, 200] and max_depth of [3, 5, None]. Print the full results table. Which parameter matters more?

Level 2:
Compare random search with 30 trials to Optuna with 30 trials on the breast cancer dataset. Run each 3 times with different random_state values. Which method is more consistent across runs?

Level 3:
Use Optuna to tune XGBoost on the California housing dataset (regression). Tune at least 5 hyperparameters including learning_rate, max_depth, subsample, reg_alpha, and n_estimators. Plot the optimization history and the hyperparameter importance chart. What are the two most important hyperparameters?

Next up, Post 71: End-to-End ML Project: Predict Something Real. We take everything from Phase 6 and build one complete project from raw data to final predictions. Data cleaning, feature engineering, model selection, tuning, and evaluation all in one place.

69. Feature Engineering: Building Better Inputs

Akhilesh — Tue, 12 May 2026 06:34:06 +0000

You've tried three different algorithms. None of them break 78% accuracy. You add dropout, tune hyperparameters, try XGBoost. Still stuck.

Then you create one new feature from the existing data. Accuracy jumps to 86%.

That's feature engineering. And it's the part of ML that makes the biggest difference in practice. Not the algorithm. Not the hyperparameters. The features.

This post covers the core techniques you'll actually use on real datasets.

What You'll Learn Here

Why features matter more than algorithms
Handling categorical variables: label encoding vs one-hot encoding
Scaling and transformation: when and why
Creating new features from existing ones
Interaction features and polynomial features
Handling dates and times
Domain-specific feature ideas
Feature selection: dropping what doesn't help

Why Features Beat Algorithms

Here's a concrete example. You're predicting house prices. You have:

bedrooms: 3
bathrooms: 2
square_feet: 1800

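A few derived features give you:

bed_bath_ratio: 1.5 (bedrooms per bathroom)
sqft_per_bedroom: 600 (square feet per bedroom)
total_rooms: 5 (bedrooms + bathrooms)

Avoid anything computed from the sale price itself, such as price per square foot. The price is what you're predicting, so that feature leaks the target.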

That ratio might tell the model something neither raw number could. A house with 5 bedrooms and 1 bathroom signals something completely different from a house with 5 bedrooms and 4 bathrooms. The ratio captures that relationship.

Good features compress domain knowledge into numbers the model can use. No algorithm can discover what it was never told.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Baseline score
baseline = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=5, scoring='r2'
)
print(f"Baseline R2: {baseline.mean():.3f}")

# Add engineered features
X_eng = X.copy()
X_eng['rooms_per_person']  = X['AveRooms']  / X['AveOccup']
X_eng['beds_per_room']     = X['AveBedrms'] / X['AveRooms']
X_eng['n_households']      = X['Population'] / X['AveOccup']  # people / people per household

engineered = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X_eng, y, cv=5, scoring='r2'
)
print(f"With features R2: {engineered.mean():.3f}")
print(f"Improvement: {(engineered.mean() - baseline.mean()):+.3f}")

Output:

Baseline R2: 0.789
With features R2: 0.806
Improvement: +0.017

Three new features. R2 up by 0.017, about 1.7 points. No algorithm change.

Encoding Categorical Variables

Most ML algorithms need numbers. When you have text categories, you need to convert them.

Label Encoding
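Assigns an integer to each category. Fine for tree-based models. Bad for linear models because it implies an order and spacing that don't exist (green=1 is not "halfway" between blue=0 and red=2). Note that scikit-learn's LabelEncoder is meant for target labels. For input features, OrdinalEncoder does the same job on 2D data.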

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

print(df)
print(f"\nMapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Output:

   color  color_encoded
0    red              2
1   blue              0
2  green              1
3   blue              0
4    red              2

Mapping: {'blue': 0, 'green': 1, 'red': 2}

One-Hot Encoding
Creates a binary column for each category. No false ordering. Works for all models. Can create many columns if there are many categories.

# dtype=int gives 0/1 columns (pandas 2.x returns True/False by default)
df_onehot = pd.get_dummies(df['color'], prefix='color', dtype=int)
print(df_onehot)

Output:

   color_blue  color_green  color_red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1

Ordinal Encoding
For categories with a real order: Small < Medium < Large.

from sklearn.preprocessing import OrdinalEncoder

size_data = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small', 'Large']})

oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_data['size_encoded'] = oe.fit_transform(size_data[['size']])
print(size_data)

Output:

     size  size_encoded
0   Small           0.0
1   Large           2.0
2  Medium           1.0
3   Small           0.0
4   Large           2.0

High-cardinality categories: Target encoding

When a category has 500+ unique values (like zip codes), one-hot creates 500 columns. Target encoding replaces each category with the mean of the target for that category.

# Target encoding example
df_target = pd.DataFrame({
    'city':       ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'house_price': [800, 600, 850, 400, 650, 780]
})

# Replace city with mean price per city
city_means = df_target.groupby('city')['house_price'].mean()
df_target['city_encoded'] = df_target['city'].map(city_means)
print(df_target)

Output:

      city  house_price  city_encoded
0      NYC          800        810.0
1       LA          600        625.0
2      NYC          850        810.0
3  Chicago          400        400.0
4       LA          650        625.0
5      NYC          780        810.0

Warning: target encoding can leak information if done before the train/test split. Always fit encoding on training data only.
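Here's a minimal sketch of what "fit on training data only" looks like. The tiny train and test frames are made up for illustration, and falling back to the global training mean for unseen cities is one common choice, not the only one.

```python
import pandas as pd

# Illustrative split: the test rows never contribute to the means
train = pd.DataFrame({
    'city':        ['NYC', 'LA', 'NYC', 'Chicago'],
    'house_price': [800, 600, 850, 400],
})
test = pd.DataFrame({'city': ['LA', 'NYC', 'Boston']})

# Learn the mapping from the training rows only
city_means = train.groupby('city')['house_price'].mean()
global_mean = train['house_price'].mean()

train['city_encoded'] = train['city'].map(city_means)

# Apply the learned mapping to test data.
# 'Boston' was never seen in training, so it gets the global mean.
test['city_encoded'] = test['city'].map(city_means).fillna(global_mean)

print(test)
```

Even inside the training set, each row's own target feeds into its encoding. With rare categories that can overfit, so in practice people often compute the means out-of-fold or add smoothing.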

Scaling and Transformations

Some features need to be transformed before they're useful.

Log transformation for skewed features

Many real-world features are heavily right-skewed. Income. House prices. Population. Taking the log makes the distribution more symmetric and helps linear models.

import matplotlib.pyplot as plt
import numpy as np

# Skewed data
incomes = np.random.exponential(scale=50000, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(incomes, bins=50, color='steelblue')
axes[0].set_title('Raw Income (skewed)')
axes[0].set_xlabel('Income')

axes[1].hist(np.log1p(incomes), bins=50, color='orange')
axes[1].set_title('Log(Income + 1) (more symmetric)')
axes[1].set_xlabel('log(Income)')

plt.tight_layout()
plt.savefig('log_transform.png', dpi=100)
plt.show()

# In a real pipeline
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X[['Population']])
print(f"Before: mean={X['Population'].mean():.0f}, std={X['Population'].std():.0f}")
print(f"After:  mean={X_log.mean():.2f}, std={X_log.std():.2f}")
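You can also put a number on "more symmetric" by measuring skewness before and after. This sketch uses only the standard library and simulated incomes. The lognormal shape and its parameters are assumptions for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)

# Simulated incomes (assumption: lognormal, a common model for income)
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(5000)]

def skewness(values):
    """Sample skewness: the mean cubed z-score. 0 means symmetric."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

raw_skew = skewness(incomes)
log_skew = skewness([math.log1p(v) for v in incomes])

print(f"Skewness before log: {raw_skew:.2f}")
print(f"Skewness after log:  {log_skew:.2f}")
```

A value near zero after the transform tells you the long right tail is gone. If the skewness flips strongly negative, the log was too aggressive for that feature.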

Power transformation for normalizing distributions

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')  # handles negative values too
X_transformed = pt.fit_transform(X[['MedInc', 'Population', 'AveRooms']])
print("Distributions after power transform are more Gaussian-like")

Creating New Features From Existing Ones

This is the creative part. You combine, divide, subtract, and multiply features to capture relationships the model might miss.

import pandas as pd
import numpy as np

# Simulated customer dataset
np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'total_spend':    np.random.exponential(200, n),
    'n_orders':       np.random.randint(1, 50, n),
    'days_since_join':np.random.randint(30, 1000, n),
    'last_purchase':  np.random.randint(1, 365, n),
    'n_returns':      np.random.randint(0, 10, n),
    'n_complaints':   np.random.randint(0, 5, n),
})

# Ratio features
customers['avg_order_value']   = customers['total_spend'] / customers['n_orders']
customers['return_rate']       = customers['n_returns'] / customers['n_orders']
customers['spend_per_day']     = customers['total_spend'] / customers['days_since_join']

# Difference features
customers['recency_frequency_gap'] = customers['last_purchase'] - (365 / customers['n_orders'])

# Aggregation features
customers['problem_score'] = customers['n_returns'] + customers['n_complaints'] * 2

# Binary flag features
customers['is_high_value']    = (customers['total_spend'] > 500).astype(int)
customers['is_recent_buyer']  = (customers['last_purchase'] < 30).astype(int)
customers['has_complained']   = (customers['n_complaints'] > 0).astype(int)

# Binning continuous features
customers['spend_bucket'] = pd.cut(
    customers['total_spend'],
    bins=[0, 100, 300, 600, np.inf],
    labels=['low', 'medium', 'high', 'premium']
)

print(customers.head())
print(f"\nOriginal features: 6, New total: {len(customers.columns)}")

Date and Time Features

Dates carry a lot of information that models can't use in raw form. You need to extract it.

import numpy as np
import pandas as pd

# Sample transaction log
df_dates = pd.DataFrame({
    'transaction_date': pd.date_range('2023-01-01', periods=10, freq='13D'),
    'amount': [120, 45, 380, 90, 210, 55, 430, 175, 310, 88]
})

# Extract useful components
df_dates['year']         = df_dates['transaction_date'].dt.year
df_dates['month']        = df_dates['transaction_date'].dt.month
df_dates['day']          = df_dates['transaction_date'].dt.day
df_dates['day_of_week']  = df_dates['transaction_date'].dt.dayofweek  # 0=Monday
df_dates['is_weekend']   = (df_dates['day_of_week'] >= 5).astype(int)
df_dates['quarter']      = df_dates['transaction_date'].dt.quarter
df_dates['week_of_year'] = df_dates['transaction_date'].dt.isocalendar().week.astype(int)

# Time since a reference point
reference_date = pd.Timestamp('2023-01-01')
df_dates['days_since_start'] = (df_dates['transaction_date'] - reference_date).dt.days

print(df_dates[['transaction_date', 'month', 'day_of_week', 'is_weekend',
                 'quarter', 'days_since_start']].to_string())

Cyclical encoding for time features

Month 12 is close to month 1. But if you use raw month numbers, the model sees 12 and 1 as far apart. Cyclical encoding fixes this using sine and cosine.

df_dates['month_sin'] = np.sin(2 * np.pi * df_dates['month'] / 12)
df_dates['month_cos'] = np.cos(2 * np.pi * df_dates['month'] / 12)

df_dates['dow_sin'] = np.sin(2 * np.pi * df_dates['day_of_week'] / 7)
df_dates['dow_cos'] = np.cos(2 * np.pi * df_dates['day_of_week'] / 7)

print("\nCyclical encoding example:")
print(df_dates[['month', 'month_sin', 'month_cos']].head())

Now January (month=1) and December (month=12) are numerically close in the sine/cosine space. The model can learn seasonal patterns correctly.
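To check that claim, measure the distance between months after encoding. A small sketch with the standard library (the helper names are made up for this example):

```python
import math

def encode_month(month):
    """Map month 1..12 onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * month / 12
    return (math.sin(angle), math.cos(angle))

def month_distance(a, b):
    """Straight-line distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))

dec_jan = month_distance(12, 1)
jan_feb = month_distance(1, 2)
jan_jul = month_distance(1, 7)

print(f"Dec-Jan: raw gap {abs(12 - 1)}, encoded distance {dec_jan:.3f}")
print(f"Jan-Feb: raw gap {abs(1 - 2)}, encoded distance {jan_feb:.3f}")
print(f"Jan-Jul: raw gap {abs(1 - 7)}, encoded distance {jan_jul:.3f}")
```

Every pair of neighboring months ends up the same distance apart, including December and January. Months half a year apart sit on opposite sides of the circle.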

Interaction Features

When two features combine to mean something neither means alone.

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Simple example
df_int = pd.DataFrame({
    'study_hours':  [2, 5, 8, 1, 6],
    'sleep_hours':  [8, 6, 5, 4, 7],
    'exam_score':   [70, 82, 75, 55, 88]
})

# study_hours * sleep_hours = well-prepared AND well-rested
df_int['study_x_sleep'] = df_int['study_hours'] * df_int['sleep_hours']

print("Correlation with exam score:")
print(df_int.corr()['exam_score'].sort_values(ascending=False))

# Automated polynomial and interaction features

X_small = df_int[['study_hours', 'sleep_hours']].values

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_small)

feature_names = poly.get_feature_names_out(['study_hours', 'sleep_hours'])
print(f"\nOriginal features: 2")
print(f"After degree-2 polynomial: {X_poly.shape[1]}")
print(f"New features: {list(feature_names)}")

Output:

Original features: 2
After degree-2 polynomial: 5
New features: ['study_hours', 'sleep_hours', 'study_hours^2', 'study_hours sleep_hours', 'sleep_hours^2']

Be careful with high-degree polynomial features. With 20 original features and degree=2 you already get 210 new features on top of the original 20. With degree=3 the total grows to 1,770 columns. Only use this with few features.
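You can work out the column count before you run the transform. With include_bias=False, the number of output columns for n features and degree d is C(n + d, d) - 1. A quick sketch (the function name is made up for this example):

```python
from math import comb

def n_poly_features(n_features, degree):
    """Output columns of a full polynomial expansion, without the bias column."""
    return comb(n_features + degree, degree) - 1

for n_features, degree in [(2, 2), (20, 2), (20, 3)]:
    total = n_poly_features(n_features, degree)
    print(f"{n_features} features, degree {degree}: "
          f"{total} columns ({total - n_features} new)")
```

The first row matches the 5 columns from the study_hours and sleep_hours example above.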

Feature Selection: Dropping What Doesn't Help

Adding many features can hurt. Noisy features add dimensions and confuse the model. Use selection to keep only what matters.

Method 1: Correlation filter

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')

# Drop features with low correlation to target
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print("Correlations with target:")
print(correlations)

# Keep features with correlation > 0.1
keep_features = correlations[correlations > 0.1].index.tolist()
print(f"\nKeeping {len(keep_features)} of {len(X.columns)} features")
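One caveat with a correlation filter: Pearson correlation only measures straight-line relationships. A feature can determine the target completely and still score zero. The sketch below shows this with made-up data and a hand-written pearson helper, using only the standard library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-10, 11))
linear_target = [2 * x + 1 for x in xs]   # straight-line relationship
u_shaped_target = [x ** 2 for x in xs]    # fully determined by x, but not linear

r_linear = pearson(xs, linear_target)
r_u_shaped = pearson(xs, u_shaped_target)

print(f"Linear target:   r = {r_linear:.3f}")
print(f"U-shaped target: r = {abs(r_u_shaped):.3f}")
```

A 0.1 cutoff would throw the U-shaped feature away. That's the gap mutual information and tree-based importance are meant to cover.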

Method 2: SelectKBest

from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# F-statistic based selection
selector_f = SelectKBest(score_func=f_regression, k=5)
selector_f.fit(X, y)
selected_f = X.columns[selector_f.get_support()]
print(f"SelectKBest (F-stat) top 5: {list(selected_f)}")

# Mutual information selection (catches non-linear relationships too)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=5)
selector_mi.fit(X, y)
selected_mi = X.columns[selector_mi.get_support()]
print(f"SelectKBest (MI) top 5: {list(selected_mi)}")

Method 3: Tree-based feature importance

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importance_df = pd.DataFrame({
    'Feature':    X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nRandom Forest Feature Importance:")
print(importance_df.to_string(index=False))

# Drop features with near-zero importance
threshold   = 0.01
keep_rf     = importance_df[importance_df['Importance'] >= threshold]['Feature'].tolist()
print(f"\nKeeping features with importance >= {threshold}: {keep_rf}")

Method 4: Recursive Feature Elimination (RFE)

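from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# RFE ranks features by coefficient size, so put them on the same scale first
X_std = StandardScaler().fit_transform(X)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X_std, y)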

selected_rfe = X.columns[rfe.support_]
print(f"RFE selected features: {list(selected_rfe)}")

Putting It All Together: A Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Simulate dataset with mixed types
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age':        np.random.randint(18, 80, n).astype(float),  # float so NaN can be assigned below
    'income':     np.random.exponential(50000, n),
    'city':       np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
    'education':  np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'experience': np.random.randint(0, 40, n),
})

# Add some missing values
df.loc[np.random.choice(n, 30), 'income']  = np.nan
df.loc[np.random.choice(n, 20), 'age']     = np.nan

# Create target
df['buys'] = ((df['income'].fillna(df['income'].median()) > 60000) &
              (df['age'].fillna(30) > 25)).astype(int)

X_df = df.drop('buys', axis=1)
y_df = df['buys']

# Define column types
numeric_cols     = ['age', 'income', 'experience']
categorical_cols = ['city', 'education']

# Preprocessing pipeline for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols),
])

# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(full_pipeline, X_df, y_df, cv=5)
print(f"Pipeline CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

This ColumnTransformer pattern is what real ML engineers use in production. Numeric and categorical features get different treatments, everything is done safely inside a pipeline, and there's no data leakage.

The Things Everyone Gets Wrong

Mistake 1: Engineering features before the train/test split

If you compute target encoding or fill missing values using the entire dataset, information from the test set leaks into training. Always put feature engineering inside a pipeline or do it only on training data.
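A small sketch of the difference, using only NumPy. The income values are made up for illustration, and the "last 20 rows are the test set" split is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature with missing values; the last 20 rows act as the test set.
income = rng.exponential(50_000, size=100)
income[rng.choice(100, size=10, replace=False)] = np.nan
train, test = income[:80], income[80:]

# Leaky: the fill value is computed from train and test together.
leaky_fill = np.nanmedian(income)

# Safe: the fill value is learned from the training rows only,
# then reused unchanged on the test rows.
safe_fill = np.nanmedian(train)
train_filled = np.where(np.isnan(train), safe_fill, train)
test_filled = np.where(np.isnan(test), safe_fill, test)

print(f"Fill value from all rows (leaky): {leaky_fill:,.0f}")
print(f"Fill value from train only:       {safe_fill:,.0f}")
```

The two fill values are usually close, which is exactly why this kind of leakage goes unnoticed. A Pipeline does the "fit on train, reuse on test" step for you inside every cross-validation fold.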

Mistake 2: One-hot encoding high-cardinality features

A city column with 500 cities creates 500 binary columns. Most will be sparse and useless. Use target encoding or embeddings for high-cardinality categories.
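Target encoding replaces each category with the average target seen for it in the training data. The sketch below uses a tiny made-up table; the column name city_te and the fallback to the global training mean for unseen cities are choices of this example, not a fixed recipe.

```python
import pandas as pd

# Hypothetical toy data: 'city' stands in for a high-cardinality column.
train = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "LA", "Chicago"],
    "buys": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"city": ["NYC", "LA", "Houston"]})  # Houston never appears in train

# Learn the per-city target mean from the training rows only.
city_means = train.groupby("city")["buys"].mean()
global_mean = train["buys"].mean()

# Map onto both sets; unseen categories fall back to the global training mean.
train["city_te"] = train["city"].map(city_means)
test["city_te"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

One column replaces what would have been hundreds of one-hot columns. On real data, consider computing the training-set encoding out-of-fold or with smoothing, because rare categories otherwise memorize their own targets.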

Mistake 3: Ignoring domain knowledge

The best features come from understanding the business. A data scientist who knows that revenue / headcount is a key business metric will create better features than one blindly generating all combinations.

Mistake 4: Adding too many features and not selecting

More features are not automatically better. Irrelevant features add noise and slow training. Always run feature selection after engineering.

Quick Cheat Sheet

Technique	When to use	Code
Label encoding	Tree models, ordinal data	`LabelEncoder()`
One-hot encoding	Linear/NN models, low cardinality	`pd.get_dummies()` or `OneHotEncoder()`
Target encoding	High cardinality categories	`groupby().mean()` on train only
Log transform	Right-skewed features	`np.log1p(X)`
Power transform	Normalize distributions	`PowerTransformer()`
Interaction features	Known domain relationships	`X1 * X2` or `PolynomialFeatures`
Cyclical encoding	Time/date features	`sin`, `cos` of period
Binning	Non-linear bucket effects	`pd.cut()`
Feature selection	Too many features	`SelectKBest`, RF importance, RFE

Practice Challenges

Level 1:
Take the California housing dataset. Add three engineered features: rooms per person, population density per block, and a flag for unusually large households. Does cross-val R2 improve?

Level 2:
Load a Kaggle dataset with dates (any sales or event dataset). Extract year, month, day of week, is_weekend, and cyclical month encoding. Check which extracted features correlate most with the target.

Level 3:
Build a full ColumnTransformer pipeline on a mixed dataset. Include numeric imputation, log-transform for skewed columns, one-hot for low-cardinality categorical, and ordinal encoding for an ordered category. Compare CV accuracy before and after the full preprocessing pipeline.

Next up, Post 70: Hyperparameter Tuning: Finding the Best Settings. Grid search, random search, and Optuna. Stop guessing your model's settings and start finding them systematically.

68. PCA: Shrinking Data Without Losing Information

Akhilesh — Mon, 11 May 2026 13:40:54 +0000

You have 100 features. Most of them are correlated. Training is slow. Visualization is impossible. KNN is useless (curse of dimensionality).

PCA is the tool that handles this. It takes your 100 features and builds a much smaller set of new features, often 10 to 20, that capture around 95% of the original variance. Training gets faster, visualization becomes possible, and your models sometimes get better too.

It's one of those techniques you'll use constantly once you understand it.

What You'll Learn Here

What PCA actually does in plain terms
What principal components and explained variance are
How to decide how many components to keep
PCA for visualization of high-dimensional data
PCA as a preprocessing step before ML models
What PCA can't do and when to skip it

The Core Idea: Find the Directions of Spread

Imagine you have data in 2D. A cloud of points that stretches more in one direction than another.

PCA finds the direction with the most spread (variance). That's the first principal component. Then it finds the direction with the second most spread that's perpendicular to the first. That's the second principal component.

If most of the spread is along PC1 and PC2, you can project your data onto just those two directions and keep most of the information. The other directions had little variance, meaning they contributed little signal.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create correlated 2D data
np.random.seed(42)
mean   = [0, 0]
cov    = [[3, 2], [2, 2]]   # correlated features
X_2d   = np.random.multivariate_normal(mean, cov, 300)

# Fit PCA
pca_2d = PCA(n_components=2)
pca_2d.fit(X_2d)

pc1 = pca_2d.components_[0]
pc2 = pca_2d.components_[1]

plt.figure(figsize=(7, 5))
plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.4, color='steelblue', s=25)

# Draw the principal components as arrows
origin = X_2d.mean(axis=0)
scale  = 2
plt.arrow(*origin, *(scale * pc1), head_width=0.15, head_length=0.1,
          color='red',    label='PC1 (most variance)')
plt.arrow(*origin, *(scale * pc2), head_width=0.15, head_length=0.1,
          color='orange', label='PC2 (2nd most variance)')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Principal Components Show Directions of Maximum Variance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pca_directions.png', dpi=100)
plt.show()

print(f"PC1 direction: {pc1.round(3)}")
print(f"PC2 direction: {pc2.round(3)}")
print(f"Variance explained by PC1: {pca_2d.explained_variance_ratio_[0]:.1%}")
print(f"Variance explained by PC2: {pca_2d.explained_variance_ratio_[1]:.1%}")

Output:

PC1 direction: [0.847 0.532]
PC2 direction: [-0.532  0.847]
Variance explained by PC1: 88.2%
Variance explained by PC2: 11.8%

PC1 captures 88% of the variance. If you only keep PC1, you keep 88% of the information. PC2 adds another 12%. Together they explain 100% because the original data was 2D.

In practice you start with 100+ dimensions and find that the first 10-20 components explain 95%+ of the variance.
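Under the hood, the principal components are the eigenvectors of the covariance matrix, and the explained variance comes from the eigenvalues. Here is a minimal NumPy sketch of that idea on the same kind of correlated 2D data. It uses a different random generator than the example above, so the exact numbers will not match the output shown there.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=300)

# 1. Center the data (PCA works on deviations from the mean).
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors are the principal directions, eigenvalues their variances.
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
projected = X_centered @ eigvecs            # coordinates along PC1 and PC2

print("PC1 direction:", eigvecs[:, 0].round(3))
print("Explained variance ratio:", explained_ratio.round(3))
```

The sign of each direction is arbitrary, so a library may report the same axis pointing the opposite way. scikit-learn's PCA gets the same directions through an SVD of the centered data rather than an explicit covariance matrix.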

PCA on Real Data: Digits Dataset

The digits dataset has 64 features (8x8 pixel images). Let's compress it.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data    # 1797 samples, 64 features
y = digits.target

print(f"Original shape: {X.shape}")

# Fit PCA to find all components
pca_full = PCA()
pca_full.fit(X)

# Explained variance ratio
evr = pca_full.explained_variance_ratio_
cumulative_evr = np.cumsum(evr)

# How many components to reach 95% variance?
n_95 = np.argmax(cumulative_evr >= 0.95) + 1
n_99 = np.argmax(cumulative_evr >= 0.99) + 1

print(f"Components for 95% variance: {n_95}")
print(f"Components for 99% variance: {n_99}")

# Plot the explained variance
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(range(1, 21), evr[:20], color='steelblue')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance per Component (first 20)')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(evr)+1), cumulative_evr, color='blue', linewidth=2)
plt.axhline(0.95, color='red',    linestyle='--', label='95%')
plt.axhline(0.99, color='orange', linestyle='--', label='99%')
plt.axvline(n_95, color='red',    linestyle=':',  alpha=0.7)
plt.axvline(n_99, color='orange', linestyle=':',  alpha=0.7)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Explained Variance')
plt.legend()

plt.tight_layout()
plt.savefig('pca_explained_variance.png', dpi=100)
plt.show()

Output:

Original shape: (1797, 64)
Components for 95% variance: 29
Components for 99% variance: 41

You can represent 64-dimensional digit images in just 29 dimensions and keep 95% of the information. That's a 55% reduction.

How to Decide How Many Components to Keep

Three strategies:

Strategy 1: Cumulative explained variance threshold
Keep enough components to explain 95% (or 99%) of the variance. Most common.

# Keep 95% of variance
pca_95 = PCA(n_components=0.95)   # pass float between 0 and 1
pca_95.fit(X)
print(f"Components kept: {pca_95.n_components_}")

Strategy 2: Fixed number
When you know what you want (e.g., 2 for visualization, 50 for a pipeline).

pca_50 = PCA(n_components=50)
pca_50.fit(X)

Strategy 3: The elbow in the scree plot
Plot variance per component. Pick the point where adding more components gives diminishing returns.

plt.figure(figsize=(8, 4))
plt.plot(range(1, 21), evr[:20], marker='o', color='blue', linewidth=2)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot: Look for the Elbow')
plt.grid(True, alpha=0.3)
plt.savefig('scree_plot.png', dpi=100)
plt.show()

The elbow (where the curve flattens) suggests how many components carry most of the signal. Components after the elbow add about as much noise as information.

PCA for Visualization

The most common use: compress any high-dimensional data to 2D so you can see it.

from sklearn.preprocessing import StandardScaler

# Scale first (always before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compress to 2D
pca_2d_digits = PCA(n_components=2, random_state=42)
X_2d_digits   = pca_2d_digits.fit_transform(X_scaled)

print(f"Original: {X_scaled.shape}")
print(f"After PCA: {X_2d_digits.shape}")
print(f"Variance explained: {pca_2d_digits.explained_variance_ratio_.sum():.1%}")

# Plot colored by digit class
plt.figure(figsize=(9, 7))
scatter = plt.scatter(
    X_2d_digits[:, 0], X_2d_digits[:, 1],
    c=y, cmap='tab10', s=15, alpha=0.7
)
plt.colorbar(scatter, label='Digit')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Digits Dataset in 2D via PCA')
plt.savefig('pca_digits_2d.png', dpi=100)
plt.show()

Even in 2D, you can see clusters of similar digits. 0s cluster together. 1s cluster together. Some digits (4, 7, 9) overlap because they look similar.

Try 3D for even more separation:

from mpl_toolkits.mplot3d import Axes3D

pca_3d_digits = PCA(n_components=3, random_state=42)
X_3d_digits   = pca_3d_digits.fit_transform(X_scaled)

fig = plt.figure(figsize=(9, 7))
ax  = fig.add_subplot(111, projection='3d')
sc  = ax.scatter(
    X_3d_digits[:, 0], X_3d_digits[:, 1], X_3d_digits[:, 2],
    c=y, cmap='tab10', s=10, alpha=0.6
)
plt.colorbar(sc, label='Digit')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.title('Digits in 3D via PCA')
plt.savefig('pca_digits_3d.png', dpi=100)
plt.show()

PCA as a Preprocessing Step

PCA can improve downstream model performance on noisy, high-dimensional data. It removes low-variance dimensions and speeds up training.

Always: scale first, then PCA, then model. Use a Pipeline so no leakage happens.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Compare: with and without PCA
pipeline_no_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42))
])

pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95, random_state=42)),
    ('model',  LogisticRegression(max_iter=1000, random_state=42))
])

scores_no_pca = cross_val_score(pipeline_no_pca, X, y, cv=5)
scores_pca    = cross_val_score(pipeline_pca,    X, y, cv=5)

print(f"Without PCA: {scores_no_pca.mean():.3f} +/- {scores_no_pca.std():.3f}")
print(f"With PCA:    {scores_pca.mean():.3f} +/- {scores_pca.std():.3f}")

Output:

Without PCA: 0.952 +/- 0.010
With PCA:    0.921 +/- 0.013

On this dataset, PCA slightly reduces accuracy. That's common with clean, well-structured data. PCA helps more on noisy datasets with many redundant features.

Reconstructing Data From Components

PCA is approximately reversible. You can compress data and then reconstruct an approximation of it, which is exact only if you keep every component. The reconstruction error tells you how much information was lost.

from sklearn.datasets import load_digits

digits = load_digits()
X_orig = digits.data

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_orig)

# Compress and reconstruct with different numbers of components
fig, axes = plt.subplots(3, 6, figsize=(14, 7))
sample_idx = 0  # show first digit

# Original (top-left panel); the rest of the top row stays empty, so hide its axes
axes[0, 0].imshow(X_orig[sample_idx].reshape(8, 8), cmap='gray')
axes[0, 0].set_title('Original')
for ax in axes[0]:
    ax.axis('off')

for i, n_comp in enumerate([1, 2, 5, 10, 20, 29]):
    pca_r = PCA(n_components=n_comp, random_state=42)
    X_comp = pca_r.fit_transform(X_scaled)
    X_recon_scaled = pca_r.inverse_transform(X_comp)
    X_recon = scaler.inverse_transform(X_recon_scaled)

    variance_kept = pca_r.explained_variance_ratio_.sum()

    axes[1, i].imshow(X_recon[sample_idx].reshape(8, 8), cmap='gray')
    axes[1, i].set_title(f'n={n_comp}\n({variance_kept:.0%})')
    axes[1, i].axis('off')

    # Reconstruction error
    mse = np.mean((X_orig[sample_idx] - X_recon[sample_idx]) ** 2)
    axes[2, i].bar(['Error'], [mse], color='steelblue')
    axes[2, i].set_ylim(0, 50)
    axes[2, i].set_title(f'MSE={mse:.1f}')

plt.tight_layout()
plt.savefig('pca_reconstruction.png', dpi=100)
plt.show()

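With 1 component you get a blurry mess. With 29 components the digit is clearly recognizable. (29 was the count that reached 95% variance on the unscaled data earlier; on the scaled data used here the share can differ, and each panel title shows the actual figure.) With more components the reconstruction gets closer to the original.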

This visualization is a great way to feel what "explaining 95% of variance" actually means.
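The same compress-and-reconstruct step can be written in a few lines of NumPy, which makes the link between dropped components and lost information explicit. This is a sketch on synthetic data (10 features driven by 3 hidden factors, an assumption of this example), not the digits code above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 10 features driven by 3 hidden factors plus noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

mean = X.mean(axis=0)
Xc = X - mean
# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruct(k):
    """Project onto the first k components, then map back to feature space."""
    W = Vt[:k]                 # (k, n_features)
    Z = Xc @ W.T               # compressed representation
    return Z @ W + mean        # approximate reconstruction

errors = {k: float(np.mean((X - reconstruct(k)) ** 2)) for k in (1, 2, 3, 5, 10)}
for k, mse in errors.items():
    print(f"k={k:>2}  reconstruction MSE={mse:.4f}")
```

The error drops sharply until k reaches the number of real factors and then flattens, and it hits zero (up to floating point) when all 10 components are kept. The squared error for a given k equals the sum of the squared singular values of the components you discarded.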

What PCA Assumes and When It Fails

PCA is powerful but has real limitations.

PCA assumes linear relationships.
It finds linear combinations of features. If the important structure in your data is non-linear (like a spiral or a manifold), PCA will miss it. Use t-SNE or UMAP for non-linear visualization.

PCA is unsupervised.
It finds directions of maximum variance, not directions that best separate classes. Sometimes the most variance in data has nothing to do with what you're trying to predict.

PCA components are not interpretable.
Each component is a linear combination of all original features. You can't easily say "PC1 means house size." It means: 0.23 * size + 0.18 * age - 0.14 * distance + ... This is a tradeoff for compression.

PCA requires scaling.
If one feature has values in millions and another in 0 to 1, PCA will find directions that mostly capture the variance of the big-scale feature. Always StandardScale before PCA.

# Proof that scaling matters
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA

housing = fetch_california_housing()
X_h = housing.data

# Without scaling
pca_unscaled = PCA(n_components=2)
pca_unscaled.fit(X_h)
print("Without scaling - variance explained by PC1:")
print(f"  {pca_unscaled.explained_variance_ratio_[0]:.1%}  <- one large-scale feature dominates")

# With scaling
from sklearn.preprocessing import StandardScaler
X_h_scaled = StandardScaler().fit_transform(X_h)
pca_scaled = PCA(n_components=2)
pca_scaled.fit(X_h_scaled)
print("With scaling - variance explained by PC1:")
print(f"  {pca_scaled.explained_variance_ratio_[0]:.1%}  <- more balanced")

Output:

Without scaling - variance explained by PC1:
  99.9%  <- one large-scale feature dominates
With scaling - variance explained by PC1:
  34.8%  <- more balanced

Without scaling, PC1 explains 99.9% of the variance, and it points almost entirely along the feature with the largest numeric range (Population in this dataset). That tells you almost nothing useful.

PCA for Noise Reduction

Another use: reduce noise in data by keeping only the top components and discarding the noisy ones.

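# Add noise to digits data and see if PCA helps denoise
np.random.seed(42)  # make the noise reproducible
X_noisy = X + np.random.normal(0, 4, X.shape)

# Scale: fit on the clean data, then reuse the same scaler for the noisy data
scaler_n = StandardScaler()
X_clean_s = scaler_n.fit_transform(X)
X_noisy_s = scaler_n.transform(X_noisy)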

# Compress and reconstruct to denoise
pca_denoise = PCA(n_components=29, random_state=42)
pca_denoise.fit(X_clean_s)

X_noisy_comp  = pca_denoise.transform(X_noisy_s)
X_denoised_s  = pca_denoise.inverse_transform(X_noisy_comp)
X_denoised    = scaler_n.inverse_transform(X_denoised_s)

# Compare original, noisy, denoised
fig, axes = plt.subplots(3, 5, figsize=(12, 7))
for i in range(5):
    axes[0, i].imshow(X[i].reshape(8, 8),          cmap='gray')
    axes[1, i].imshow(X_noisy[i].reshape(8, 8),    cmap='gray')
    axes[2, i].imshow(X_denoised[i].reshape(8, 8), cmap='gray')

axes[0, 0].set_ylabel('Original')
axes[1, 0].set_ylabel('Noisy')
axes[2, 0].set_ylabel('Denoised')
for ax in axes.flat:
    # Hide ticks only; axis('off') would also hide the row labels set above
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('PCA for Noise Reduction')
plt.tight_layout()
plt.savefig('pca_denoising.png', dpi=100)
plt.show()

The denoised images are much cleaner than the noisy ones. PCA projected the noisy data into a lower-dimensional clean space, then reconstructed it. The noise, which spreads across many components with low variance, gets discarded.

Quick Cheat Sheet

Task	Code
Fit PCA	`PCA(n_components=50).fit(X_scaled)`
Keep 95% variance	`PCA(n_components=0.95)`
Transform data	`pca.transform(X)`
Fit and transform	`pca.fit_transform(X)`
Reconstruct	`pca.inverse_transform(X_compressed)`
Variance explained	`pca.explained_variance_ratio_`
Cumulative variance	`np.cumsum(pca.explained_variance_ratio_)`
Components	`pca.components_` (shape: n_components x n_features)
Full pipeline	`Pipeline([('scaler', StandardScaler()), ('pca', PCA(0.95)), ('model', ...)])`

Practice Challenges

Level 1:
Load load_breast_cancer(). Scale it. Apply PCA keeping 95% variance. How many components does that require from the original 30? Train a LogisticRegression before and after PCA. Does accuracy change?

Level 2:
Load any dataset with many features. Plot the scree plot and cumulative explained variance. Find the elbow. Try three different component counts: at the elbow, 50% variance, and 99% variance. Compare classifier performance for each.

Level 3:
Load the digits dataset. Add Gaussian noise (std=5). Apply PCA with 10, 20, and 40 components. For each, reconstruct the images and calculate MSE vs the original clean images. Which component count gives the best denoising? Plot original, noisy, and all three reconstructions side by side.

Next up, Post 69: Feature Engineering: Building Better Inputs. The algorithm matters less than what you feed it. Here's how to create, transform, and select features that actually help your model learn.

67. DBSCAN: Clustering That Handles Messy Data

Akhilesh — Mon, 11 May 2026 03:55:47 +0000

Last post K-Means failed on crescent-shaped data. It cut across the natural curves instead of following them. You also had to tell it K upfront. And one outlier could drag a centroid completely off course.

DBSCAN fixes all three problems.

It finds clusters based on density, not distance to a centroid. It discovers K automatically. It labels outliers explicitly instead of forcing them into a cluster.

Different idea. Different use cases. Worth knowing.

What You'll Learn Here

How density-based clustering works
What eps and min_samples actually control
Core points, border points, and noise points explained
How to tune DBSCAN parameters properly
When DBSCAN wins and when K-Means is still better
Anomaly detection with DBSCAN
Full working code

The Core Idea: Density

K-Means asks: what's the nearest centroid?

DBSCAN asks: how many neighbors does this point have within radius epsilon?

If a point has at least min_samples points within distance eps (scikit-learn counts the point itself), it's a core point. Core points form the dense heart of a cluster.

Points that are within eps of a core point but don't have enough neighbors themselves are border points. They're on the edge of a cluster.

Points that are not within eps of any core point are noise. DBSCAN labels them -1. They don't belong to any cluster.

Core point:   has >= min_samples neighbors within eps
Border point: within eps of a core point, but < min_samples neighbors itself
Noise:        not within eps of any core point

Two clusters are distinct if there's no chain of core points connecting them.
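The three definitions translate almost directly into code. Below is a minimal NumPy sketch that only classifies points and does not assign cluster labels. The helper classify_points is hypothetical, and it assumes a point counts as its own neighbor, which is how scikit-learn documents min_samples.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point as 'core', 'border', or 'noise' using the DBSCAN definitions.

    Assumption: a point counts as its own neighbor.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small data; O(n^2) memory).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                       # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_samples
    # Border: not core, but within eps of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    return np.where(is_core, "core", np.where(is_border, "border", "noise"))

# Tiny example: a dense run of four points, one point on its edge, one far away.
pts = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0], [0.6, 0], [5.0, 0]])
types = classify_points(pts, eps=0.35, min_samples=4)
print(list(zip(pts[:, 0].tolist(), types.tolist())))
```

The four points in the dense run come out as core, the point at 0.6 is a border point because it is within eps of the core point at 0.3 but has only two points in its own neighborhood, and the point at 5.0 is noise.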

Visualizing the Three Point Types

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Data that K-Means can't handle
X_moons, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Run DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X_moons)

# Identify point types
core_mask   = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True
border_mask = (~core_mask) & (labels != -1)
noise_mask  = labels == -1

print(f"Core points:   {core_mask.sum()}")
print(f"Border points: {border_mask.sum()}")
print(f"Noise points:  {noise_mask.sum()}")
print(f"Clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")

Output:

Core points:   250
Border points: 45
Noise points:  5
Clusters found: 2

# Plot each type differently
plt.figure(figsize=(8, 5))

# Core points colored by cluster
plt.scatter(X_moons[core_mask, 0],   X_moons[core_mask, 1],
            c=labels[core_mask], cmap='bwr', s=40, alpha=0.8, label='Core')

# Border points same color, smaller
plt.scatter(X_moons[border_mask, 0], X_moons[border_mask, 1],
            c=labels[border_mask], cmap='bwr', s=15, alpha=0.5, label='Border')

# Noise in black
plt.scatter(X_moons[noise_mask, 0],  X_moons[noise_mask, 1],
            c='black', s=60, marker='x', linewidths=2, label='Noise')

plt.title('DBSCAN: Core, Border, and Noise Points')
plt.legend()
plt.savefig('dbscan_point_types.png', dpi=100)
plt.show()

Comparing DBSCAN to K-Means on Shapes K-Means Can't Handle

from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

fig, axes = plt.subplots(2, 3, figsize=(15, 9))
datasets = [
    ('Moons',   make_moons(n_samples=300,   noise=0.08, random_state=42)),
    ('Circles', make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=42)),
]

# Add a blob dataset with outliers
np.random.seed(42)
from sklearn.datasets import make_blobs
X_blob, _ = make_blobs(n_samples=280, centers=3, random_state=42)
X_outliers = np.random.uniform(-10, 10, (20, 2))
X_noise_data = np.vstack([X_blob, X_outliers])
datasets.append(('Blobs + Outliers', (X_noise_data, None)))

dbscan_params = [
    {'eps': 0.2,  'min_samples': 5},
    {'eps': 0.3,  'min_samples': 5},
    {'eps': 1.5,  'min_samples': 5},
]

for col, ((name, (X_d, _)), params) in enumerate(zip(datasets, dbscan_params)):

    # K-Means
    km = KMeans(n_clusters=2 if col < 2 else 3, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_d)
    axes[0, col].scatter(X_d[:, 0], X_d[:, 1], c=km_labels, cmap='tab10', s=20, alpha=0.7)
    axes[0, col].set_title(f'K-Means on {name}')

    # DBSCAN
    db_d = DBSCAN(**params)
    db_labels = db_d.fit_predict(X_d)
    n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
    axes[1, col].scatter(X_d[:, 0], X_d[:, 1], c=db_labels, cmap='tab10', s=20, alpha=0.7)
    axes[1, col].set_title(f'DBSCAN on {name} ({n_clusters} clusters)')

plt.tight_layout()
plt.savefig('dbscan_vs_kmeans.png', dpi=100)
plt.show()

DBSCAN correctly follows the crescent and circle shapes. K-Means cuts them up with straight boundaries.

The Two Parameters: eps and min_samples

These are the two knobs that matter most in DBSCAN. Getting them right is the main challenge.

eps (epsilon): the radius of the neighborhood around each point. If you can reach another point within eps distance, they're neighbors.

Too small: most points become noise. Clusters fragment.
Too large: everything merges into one big cluster.

min_samples: minimum number of neighbors within eps to be a core point.

Too small (like 2): noisy points accidentally become core points.
Too large: real cluster members get labeled as noise.

from sklearn.preprocessing import StandardScaler

# Always scale before DBSCAN
X_s = StandardScaler().fit_transform(X_moons)

print(f"{'eps':<8} {'min_s':<8} {'Clusters':<12} {'Noise %'}")
print("-" * 42)

for eps in [0.1, 0.2, 0.3, 0.5, 1.0]:
    for min_s in [3, 5, 10]:
        db_test = DBSCAN(eps=eps, min_samples=min_s)
        lbl = db_test.fit_predict(X_s)
        n_clusters = len(set(lbl)) - (1 if -1 in lbl else 0)
        noise_pct  = (lbl == -1).mean() * 100
        print(f"{eps:<8} {min_s:<8} {n_clusters:<12} {noise_pct:.1f}%")
    print()

Output:

eps      min_s    Clusters     Noise %
------------------------------------------
0.1      3        8            7.7%
0.1      5        5            19.0%
0.1      10       2            42.3%

0.2      3        2            0.3%
0.2      5        2            1.7%    <- good
0.2      10       2            11.3%

0.3      3        2            0.0%
0.3      5        2            0.0%
0.3      10       2            0.0%

0.5      3        1            0.0%
0.5      5        1            0.0%    <- everything merged
...

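At eps=0.2, min_samples=5 the algorithm finds 2 clusters with only 1.7% noise. At eps=0.3 it still finds both clusters with no noise at all, and at eps=0.5 everything merges. For this data the workable range is roughly 0.2 to 0.3.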

The K-Distance Graph: Finding eps Systematically

You shouldn't guess eps. There's a principled way to find a good starting value.

For each point, calculate the distance to its K-th nearest neighbor (where K = min_samples). Sort those distances and plot them. Look for the "knee" in the curve.

from sklearn.neighbors import NearestNeighbors

min_samples = 5

# Fit nearest neighbors
nbrs = NearestNeighbors(n_neighbors=min_samples)
nbrs.fit(X_s)

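# Get distances to the k-th nearest neighbor.
# Querying the training points returns each point as its own first neighbor
# (distance 0), which matches DBSCAN's min_samples counting the point itself.
distances, _ = nbrs.kneighbors(X_s)
kth_distances = distances[:, -1]  # distance to k-th neighbor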
kth_distances_sorted = np.sort(kth_distances)[::-1]

plt.figure(figsize=(8, 5))
plt.plot(kth_distances_sorted, color='blue', linewidth=2)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'Distance to {min_samples}-th nearest neighbor')
plt.title('K-Distance Graph: Look for the Knee')
plt.grid(True, alpha=0.3)
plt.savefig('k_distance_graph.png', dpi=100)
plt.show()

# The knee is where the curve bends sharply
# Points above the knee → noise
# Points below the knee → cluster members
print("Look for the knee in the plot to choose eps")
n_pts = len(kth_distances_sorted)
print(f"Suggested eps range: {kth_distances_sorted[int(n_pts * 0.1)]:.3f}"
      f" to {kth_distances_sorted[int(n_pts * 0.05)]:.3f}")

The knee in the K-distance graph is your suggested eps. Below that value, you start labeling too many points as noise. Above it, clusters start to merge.
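If you want a number instead of eyeballing the plot, one common heuristic is to pick the point on the sorted curve that lies farthest from the straight line joining its two ends. The sketch below is an assumption of this example and not part of DBSCAN or scikit-learn. The helper knee_eps is hypothetical, it sorts ascending (the plot above sorts descending, which gives the same knee), and it expects the distances not to be all identical.

```python
import numpy as np

def knee_eps(kth_distances):
    """Estimate the knee of a k-distance curve.

    Heuristic: sort the distances in ascending order, draw a straight line
    from the first to the last point, and pick the point that lies farthest
    below that line.
    """
    d = np.sort(np.asarray(kth_distances, dtype=float))
    n = len(d)
    x = np.linspace(0.0, 1.0, n)
    # Normalize y so both axes share the [0, 1] range.
    y = (d - d[0]) / (d[-1] - d[0])
    # On the normalized chord y = x, the vertical gap is simply x - y.
    knee_index = int(np.argmax(x - y))
    return d[knee_index]

# Hypothetical k-distances: 95 points in dense regions, 5 sparse outliers.
rng = np.random.default_rng(0)
k_dist = np.concatenate([rng.uniform(0.05, 0.20, 95), rng.uniform(1.0, 2.0, 5)])
print(f"Suggested eps: {knee_eps(k_dist):.3f}")
```

On this made-up input the estimate lands at the top of the dense group, just before the outliers' distances take off. Treat it as a starting value and still check the cluster count and noise percentage around it.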

Anomaly Detection With DBSCAN

DBSCAN's noise label (-1) is naturally useful for anomaly detection. Points that don't belong to any cluster are outliers.

import pandas as pd

# Simulate transaction data with some fraudulent transactions
np.random.seed(42)
n_normal  = 500
n_fraud   = 20

normal_transactions = pd.DataFrame({
    'amount':   np.random.normal(50, 15, n_normal),
    'hour':     np.random.randint(8, 20, n_normal),     # normal business hours
    'items':    np.random.randint(1, 10, n_normal),
})

fraud_transactions = pd.DataFrame({
    'amount':  np.random.normal(800, 100, n_fraud),     # unusually large amounts
    'hour':    np.random.randint(0, 5, n_fraud),         # odd hours
    'items':   np.random.randint(20, 50, n_fraud),       # many items
})

all_data = pd.concat([normal_transactions, fraud_transactions], ignore_index=True)
true_fraud = np.array([0]*n_normal + [1]*n_fraud)

# Scale and cluster
from sklearn.preprocessing import StandardScaler
X_trans = StandardScaler().fit_transform(all_data)

db_fraud = DBSCAN(eps=0.5, min_samples=10)
labels_fraud = db_fraud.fit_predict(X_trans)

# Points labeled -1 are anomalies
predicted_fraud = (labels_fraud == -1).astype(int)

from sklearn.metrics import classification_report
print("Anomaly Detection Results:")
print(classification_report(true_fraud, predicted_fraud,
                              target_names=['normal', 'fraud']))

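The fraud transactions sit far from the dense mass of normal ones, and with only 20 of them spread over a wide range they are too sparse to form a cluster of their own. DBSCAN labels them as noise, and that noise label becomes your fraud flag. Normal transactions in sparse regions can get flagged too, so check the precision in the report.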

DBSCAN on Real Data: Iris

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score

iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target

# Try different parameters
print(f"{'eps':<8} {'min_s':<8} {'K':<6} {'Noise':<8} {'ARI':<8} {'Sil'}")
print("-" * 50)

for eps in [0.5, 0.8, 1.0, 1.2, 1.5]:
    for min_s in [3, 5, 8]:
        db_iris = DBSCAN(eps=eps, min_samples=min_s)
        lbl = db_iris.fit_predict(X_iris)
        n_k    = len(set(lbl)) - (1 if -1 in lbl else 0)
        n_noise = (lbl == -1).sum()

        if n_k >= 2:  # silhouette needs at least 2 clusters
            mask = lbl != -1  # score the real clusters only, not the noise label
            sil = silhouette_score(X_iris[mask], lbl[mask])
            ari = adjusted_rand_score(y_iris, lbl)
            print(f"{eps:<8} {min_s:<8} {n_k:<6} {n_noise:<8} {ari:.3f}    {sil:.3f}")

DBSCAN vs K-Means: When to Use Which

Use DBSCAN when:
- You don't know K upfront
- Clusters have irregular shapes (crescents, rings, blobs of any form)
- You want outlier detection built in
- Clusters have very different sizes (very different densities are harder, since a single eps has to fit all of them)
- Data has meaningful noise that shouldn't be forced into a cluster

Use K-Means when:
- Clusters are roughly spherical/convex
- You know approximately how many clusters you want
- Dataset is very large (DBSCAN is slower on millions of points)
- You need fast, predictable results
- All data should be assigned to a cluster (no noise points)

# Quick runtime comparison on larger data
import time
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_large, _ = make_blobs(n_samples=50000, centers=5, random_state=42)
X_large_s  = StandardScaler().fit_transform(X_large)

# K-Means
start = time.time()
KMeans(n_clusters=5, random_state=42, n_init=3).fit(X_large_s)
print(f"K-Means (50k points): {time.time()-start:.2f}s")

# DBSCAN
start = time.time()
DBSCAN(eps=0.3, min_samples=10).fit(X_large_s)
print(f"DBSCAN  (50k points): {time.time()-start:.2f}s")

Output:

K-Means (50k points): 0.38s
DBSCAN  (50k points): 2.14s

K-Means is significantly faster on large datasets. For very large data, consider HDBSCAN or Approximate Nearest Neighbor versions of DBSCAN.

Complete Workflow With Best Practices

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

def run_dbscan(X, min_samples=5, plot=True):

    # Step 1: Scale
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Step 2: Compute the k-distance (distance to the min_samples-th neighbor)
    nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
    distances, _ = nbrs.kneighbors(X_scaled)
    kth_dist = distances[:, -1]

    # Step 3: Fit DBSCAN with suggested eps
    # Using the 95th percentile of the k-distance as eps
    suggested_eps = np.percentile(kth_dist, 95)
    print(f"Suggested eps: {suggested_eps:.3f}")

    db = DBSCAN(eps=suggested_eps, min_samples=min_samples)
    labels = db.fit_predict(X_scaled)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise    = (labels == -1).sum()

    print(f"Clusters found: {n_clusters}")
    print(f"Noise points:   {n_noise} ({n_noise/len(X)*100:.1f}%)")

    if n_clusters >= 2:
        sil = silhouette_score(X_scaled, labels)
        print(f"Silhouette score: {sil:.3f}")

    if plot and X.shape[1] == 2:
        plt.figure(figsize=(7, 5))
        unique_labels = set(labels)
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))

        for label, color in zip(sorted(unique_labels), colors):
            if label == -1:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            c='black', s=60, marker='x', label='Noise')
            else:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            color=color, s=30, alpha=0.7, label=f'Cluster {label}')

        plt.title(f'DBSCAN Result: {n_clusters} clusters')
        plt.legend()
        plt.savefig('dbscan_result.png', dpi=100)
        plt.show()

    return labels

X_demo, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
labels_demo = run_dbscan(X_demo, min_samples=5)
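
The eps heuristic inside the workflow is easy to sanity-check by hand. The sketch below recomputes the k-distance with plain NumPy on a made-up toy set (one tight blob plus two isolated points) and compares a high percentile with a low one. The helper name `k_distances` and the toy data are assumptions for illustration. The brute-force distance matrix only suits small inputs.

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor (the point itself counts as the first)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # full pairwise distance matrix
    dists.sort(axis=1)                         # ascending per row; column 0 is the point itself
    return dists[:, k - 1]

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(50, 2))      # one tight blob
outliers = np.array([[5.0, 5.0], [-5.0, 5.0]])  # two isolated points
X_toy = np.vstack([dense, outliers])

kd = k_distances(X_toy, k=5)
eps_95 = np.percentile(kd, 95)  # covers almost all blob points, still excludes the outliers
eps_5 = np.percentile(kd, 5)    # so small that most blob points would fail the density test

print(f"eps at 95th percentile: {eps_95:.3f}")
print(f"eps at 5th percentile:  {eps_5:.3f}")
print(f"k-distance of the two outliers: {kd[-2]:.2f}, {kd[-1]:.2f}")
```

A high percentile keeps eps just below the k-distance of the isolated points. A low percentile picks a radius that only the very densest points satisfy, so most of the data ends up as noise.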

Quick Cheat Sheet

Task	Code
Basic DBSCAN	`DBSCAN(eps=0.5, min_samples=5).fit_predict(X)`
Always scale first	`StandardScaler().fit_transform(X)`
Find eps	K-distance graph: plot sorted distances to k-th neighbor
Get noise points	`labels == -1`
Get core points	`db.core_sample_indices_`
Count clusters	`len(set(labels)) - (1 if -1 in labels else 0)`
Evaluate quality	`silhouette_score(X, labels)`
Compare to truth	`adjusted_rand_score(true_labels, labels)`

Practice Challenges

Level 1:
Run DBSCAN on make_circles(noise=0.05) with K-Means also on the same data. Plot both results side by side. Which one correctly identifies the two circles?

Level 2:
On the iris dataset, use the k-distance graph to find a good eps. Then tune min_samples between 3 and 15. Report the ARI and silhouette score for each combination. Which parameters give the closest match to the true species?

Level 3:
Generate a dataset with 3 clusters of very different sizes (100, 500, 1000 points) plus 50 outliers. Run both K-Means and DBSCAN. Compare how many outliers each one correctly identifies. Does DBSCAN handle the size difference well?

Next up, Post 68: PCA: Shrinking Data Without Losing Information. Too many features slow everything down and cause the curse of dimensionality. PCA finds the directions of maximum variance and lets you keep 95% of the information in far fewer dimensions.

66. K-Means Clustering: Find Groups Without Labels

Akhilesh — Sun, 10 May 2026 16:09:34 +0000

Everything we've done so far in Phase 6 is supervised learning. You give the model examples with correct answers. It learns to predict.

K-Means is different. There are no correct answers. You hand it raw data and say: find me the groups.

It does. And those groups are often surprisingly useful.

Customer segments. Document topics. Image compression. Anomaly detection. All of these use clustering. None of them have labels.

What You'll Learn Here

How K-Means finds groups step by step
What centroids and inertia are
The elbow method and silhouette score for picking K
Why initialization matters and what K-Means++ does
When K-Means fails and what the signs look like
Full working code with real datasets

How K-Means Works Step by Step

The algorithm is simple. You tell it K (the number of clusters you want). It does this:

Step 1: Pick K random points as starting centroids.

Step 2: Assign every data point to the nearest centroid. Distance is Euclidean by default.

Step 3: Recalculate each centroid as the mean of all points assigned to it.

Step 4: Repeat steps 2 and 3 until the centroids stop moving (or barely move).

That's it. No labels needed. The algorithm finds structure purely from distances between points.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create data with 3 natural clusters
X, true_labels = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.8,
    random_state=42
)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.6, s=30)
plt.title('Raw Data - No Labels')
plt.savefig('kmeans_raw.png', dpi=100)
plt.show()

Now let's implement K-Means manually before using scikit-learn:

def kmeans_manual(X, k, n_iter=10, random_state=42):
    np.random.seed(random_state)

    # Step 1: Random initialization
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx].copy()

    for iteration in range(n_iter):
        # Step 2: Assign points to nearest centroid
        distances = np.array([
            np.sqrt(((X - c) ** 2).sum(axis=1))
            for c in centroids
        ])
        labels = np.argmin(distances, axis=0)

        # Step 3: Update centroids
        new_centroids = np.array([
            X[labels == k_].mean(axis=0) if (labels == k_).sum() > 0 else centroids[k_]
            for k_ in range(k)
        ])

        # Check for convergence
        shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
        centroids = new_centroids

        if shift < 1e-6:
            print(f"Converged at iteration {iteration + 1}")
            break

    return labels, centroids

labels_manual, centroids_manual = kmeans_manual(X, k=3)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels_manual, cmap='viridis', alpha=0.6, s=30)
plt.scatter(centroids_manual[:, 0], centroids_manual[:, 1],
            c='red', marker='X', s=200, zorder=5, label='Centroids')
plt.title('K-Means Manual (K=3)')
plt.legend()
plt.savefig('kmeans_manual.png', dpi=100)
plt.show()

Output:

Converged at iteration 5

Now With Scikit-learn

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

labels_sk   = kmeans.labels_
centroids_sk = kmeans.cluster_centers_
inertia      = kmeans.inertia_

print(f"Inertia: {inertia:.2f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

# How well does it match true clusters?
ari = adjusted_rand_score(true_labels, labels_sk)
print(f"Adjusted Rand Index: {ari:.3f}  (1.0 = perfect match)")

Output:

Inertia: 204.48
Iterations to converge: 3
Adjusted Rand Index: 1.000

It recovered the true clusters perfectly on this clean data.

Inertia is the sum of squared distances from each point to its assigned centroid. Lower inertia = tighter, more compact clusters. It is the objective K-Means minimizes.
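
Inertia is simple enough to compute by hand. A minimal sketch on four made-up points, with labels and centroids chosen for illustration:

```python
import numpy as np

X_small = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_small = np.array([0, 0, 1, 1])
centroids_small = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from each point to the centroid of its own cluster
sq_dists = ((X_small - centroids_small[labels_small]) ** 2).sum(axis=1)
inertia_manual = sq_dists.sum()

print(sq_dists)        # one value per point
print(inertia_manual)  # their sum is the inertia
```

Every point sits exactly 1 unit from its centroid, so each squared distance is 1 and the inertia is 4.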

Picking K: The Elbow Method

K-Means needs you to tell it K. But you usually don't know K in advance.

The elbow method runs K-Means for a range of K values and plots inertia. As K increases, inertia always decreases (more clusters = tighter fit). But at some point, adding clusters stops helping much. That kink in the curve is the "elbow."

inertias = []
k_range  = range(1, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, marker='o', color='blue', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.xticks(k_range)
plt.grid(True, alpha=0.3)
plt.savefig('elbow_method.png', dpi=100)
plt.show()

for k, inertia in zip(k_range, inertias):
    print(f"K={k}: Inertia={inertia:.2f}")

Output:

K=1: Inertia=2452.33
K=2: Inertia=730.81
K=3: Inertia=204.48   <- big drop stops here
K=4: Inertia=175.92
K=5: Inertia=148.79
K=6: Inertia=129.14
...

The drop from K=2 to K=3 is massive. From K=3 onwards it slows down. The elbow is at K=3.
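
If you want a number instead of eyeballing the plot, one possible heuristic is to compare the drop into each K with the drop out of it. This is an illustrative rule of my own choosing, not a scikit-learn feature. The function name `elbow_by_drop_ratio` is hypothetical, and the inertia values are copied from the output above.

```python
# Inertia values copied from the printed output above
inertia_by_k = {1: 2452.33, 2: 730.81, 3: 204.48, 4: 175.92, 5: 148.79, 6: 129.14}

def elbow_by_drop_ratio(inertia_by_k):
    """Return the K whose incoming drop is largest relative to its outgoing drop."""
    ks = sorted(inertia_by_k)
    best_k, best_ratio = None, 0.0
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_out = inertia_by_k[k] - inertia_by_k[next_k]
        if drop_out <= 0:
            continue  # inertia did not decrease; skip to avoid dividing by zero
        ratio = drop_in / drop_out
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k, best_ratio

k_elbow, ratio = elbow_by_drop_ratio(inertia_by_k)
print(f"Elbow at K={k_elbow} (drop in is {ratio:.1f}x the drop out)")
```

On these numbers the ratio peaks at K=3, matching what the plot shows. On fuzzier curves the ratio will be much flatter, which is itself a hint to switch to another method.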

The elbow is sometimes obvious, sometimes not. When it's fuzzy, use the silhouette score.

Silhouette Score: A Better Way to Pick K

The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1.

Close to 1: point is well matched to its cluster
Close to 0: point is on the border between clusters
Negative: point might be in the wrong cluster
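
For one point, the score is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A small sketch on made-up 1-D data (the helper name `silhouette_of_point` is hypothetical):

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for the point at index i."""
    dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    own = labels == labels[i]
    n_own_others = own.sum() - 1
    if n_own_others == 0:
        return 0.0  # common convention for a cluster with a single point
    # a: mean distance to the other members of the point's own cluster
    a = dists[own].sum() / n_own_others
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X_1d = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_1d = np.array([0, 0, 1, 1])

s0 = silhouette_of_point(X_1d, labels_1d, 0)
print(f"silhouette of first point: {s0:.3f}")
```

Here a = 1 and b = 10.5, so s = 9.5 / 10.5, close to 1. The overall silhouette score is just the mean of this value over all points.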

from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm

silhouette_scores = []

for k in range(2, 11):  # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    score = silhouette_score(X, labels_k)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score={score:.3f}")

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o', color='orange', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs K (higher is better)')
plt.grid(True, alpha=0.3)
plt.savefig('silhouette_scores.png', dpi=100)
plt.show()

Output:

K=2: Silhouette Score=0.588
K=3: Silhouette Score=0.749   <- highest
K=4: Silhouette Score=0.636
K=5: Silhouette Score=0.583
...

Silhouette peaks at K=3. Confirms the elbow result.

Use both methods together. If they agree, you have a solid answer. If they disagree, look at the actual cluster visualizations.

K-Means++ Initialization

The basic K-Means randomly picks starting centroids. If those starting points are unlucky, the algorithm can get stuck in a bad solution.

K-Means++ fixes this. Instead of pure random starts, it picks centroids that are spread out. The first centroid is random. Each subsequent one is chosen with probability proportional to its squared distance from the nearest already-chosen centroid.
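
The seeding rule fits in a few lines. The sketch below weights each candidate by its squared distance to the nearest centroid chosen so far (the D² weighting of k-means++). It is a simplified illustration, and scikit-learn's own implementation adds refinements on top. The function name `kmeans_pp_init` and the toy points are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k starting centroids using D^2-weighted sampling."""
    n = len(X)
    centroids = [X[rng.integers(n)]]  # first centroid: uniformly at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()         # far-away points are more likely to be picked
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X_far = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
seeds = kmeans_pp_init(X_far, k=3, rng=rng)
print(seeds)
```

A point that is already a centroid has distance 0, so its probability of being picked again is 0. With three points and K=3, all three are always selected, whatever the first random pick was.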

# random init - might get bad results
km_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=0)
km_random.fit(X)

# k-means++ init - almost always better
km_plus = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=0)
km_plus.fit(X)

print(f"Random init inertia:    {km_random.inertia_:.2f}")
print(f"K-Means++ init inertia: {km_plus.inertia_:.2f}")

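In scikit-learn, init='k-means++' is already the default. n_init=10 means it runs 10 times from different starting points and keeps the best result. Note that since scikit-learn 1.4 the default is n_init='auto', which runs only once when init='k-means++', so set n_init=10 explicitly if you want the extra restarts.

# Best practice: k-means++ with several restarts
km_best = KMeans(
    n_clusters=3,
    init='k-means++',  # default
    n_init=10,         # set explicitly; newer versions default to 'auto'
    max_iter=300,      # default
    random_state=42
)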

Real Example: Customer Segmentation

Clustering without labels, on simulated business data.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Simulate customer data
np.random.seed(42)
n_customers = 500

customers = pd.DataFrame({
    'age':              np.random.randint(18, 70, n_customers),
    'annual_income':    np.random.randint(20000, 150000, n_customers),
    'purchase_freq':    np.random.randint(1, 52, n_customers),   # times per year
    'avg_order_value':  np.random.randint(10, 500, n_customers),
    'loyalty_years':    np.random.randint(0, 15, n_customers),
})

print(customers.head())
print(f"\nShape: {customers.shape}")

# Scale features (critical for K-Means)
scaler = StandardScaler()
X_cust = scaler.fit_transform(customers)

# Find optimal K
sil_scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_cust)
    sil_scores.append(silhouette_score(X_cust, labels))

best_k = range(2, 9)[np.argmax(sil_scores)]
print(f"Best K: {best_k}")

# Cluster with best K
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customers['cluster'] = km_final.fit_predict(X_cust)

# Profile each cluster
print("\nCluster profiles (mean values):")
print(customers.groupby('cluster').mean().round(1))

Output:

Cluster profiles (mean values):
         age  annual_income  purchase_freq  avg_order_value  loyalty_years
cluster
0       43.7       85432.0           26.2            254.3            7.4
1       27.3       35614.0           12.8             89.6            1.8
2       55.1      120847.0            8.4            421.7           11.2

Cluster 0: Middle-aged, decent income, buys frequently. Engaged regular customers.
Cluster 1: Young, lower income, buys occasionally. New or casual customers.
Cluster 2: Older, high income, buys rarely but spends a lot per order. Premium buyers.

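On real customer data, profiles like these are real business insights, and no labels are needed to find them. Keep in mind that the simulated features above are drawn independently and uniformly at random, so that data has no genuine group structure; treat the profiles as an illustration of the workflow.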

When K-Means Fails

K-Means has real limitations. Know them.

Problem 1: Non-spherical clusters

K-Means assumes clusters are roughly circular (spherical in higher dimensions). If your clusters are elongated, crescent-shaped, or nested, K-Means will cut them up wrong.

from sklearn.datasets import make_moons, make_circles

# Crescent-shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

km_moons = KMeans(n_clusters=2, random_state=42)
labels_moons = km_moons.fit_predict(X_moons)

plt.figure(figsize=(6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap='bwr', alpha=0.7, s=30)
plt.title('K-Means Fails on Crescent Shapes')
plt.savefig('kmeans_fail.png', dpi=100)
plt.show()

K-Means splits the data with a straight boundary that cuts through both crescents. It can't follow their curved shape.

Problem 2: Clusters with very different sizes or densities

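K-Means assigns each point to its nearest centroid and minimizes total squared distance, which favors clusters of similar size and spread. If one cluster has 1000 points and another has 10, the algorithm often splits the large cluster and absorbs the small one instead of keeping them separate.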

Problem 3: Sensitive to outliers

Centroids are means. One extreme outlier can pull a centroid far from the actual cluster center. Always check for and handle outliers before clustering.
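
A quick sketch shows how hard a single outlier can pull. The four cluster points and the outlier are made-up values. The median is included only for contrast.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
outlier = np.array([[50.0, 50.0]])
with_outlier = np.vstack([cluster, outlier])

centroid_clean = cluster.mean(axis=0)             # centroid of the clean cluster
centroid_with_outlier = with_outlier.mean(axis=0) # centroid once the outlier is assigned to it
median_with_outlier = np.median(with_outlier, axis=0)

print("clean centroid:       ", centroid_clean)
print("centroid with outlier:", centroid_with_outlier)
print("median with outlier:  ", median_with_outlier)
```

One point out of five moves the centroid from (1.5, 1.5) to (11.2, 11.2), far outside the actual cluster. The median barely moves, which is why removing or capping outliers before K-Means matters.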

Problem 4: You must specify K

Unlike DBSCAN (next post), K-Means requires you to tell it the number of clusters upfront. If you don't know K, you have to experiment.

Image Compression With K-Means

One fun application: K-Means can compress images by reducing the number of colors.

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Create a simple colorful test image
np.random.seed(42)
image = np.random.randint(0, 256, (50, 50, 3), dtype=np.uint8)
# Add some structure
image[:25, :25] = [200, 50, 50]   # red quadrant
image[:25, 25:] = [50, 200, 50]   # green quadrant
image[25:, :25] = [50, 50, 200]   # blue quadrant
image[25:, 25:] = [200, 200, 50]  # yellow quadrant

# Add some noise
image = image + np.random.randint(-30, 30, image.shape)
image = np.clip(image, 0, 255).astype(np.uint8)

# Reshape to list of pixels
pixels = image.reshape(-1, 3).astype(float)

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
axes[0].imshow(image)
axes[0].set_title('Original (256 colors)')
axes[0].axis('off')

for i, n_colors in enumerate([16, 8, 4], start=1):
    km_img = KMeans(n_clusters=n_colors, random_state=42, n_init=5)
    km_img.fit(pixels)

    # Replace each pixel with its centroid color
    compressed_pixels = km_img.cluster_centers_[km_img.labels_]
    compressed_image  = compressed_pixels.reshape(image.shape).astype(np.uint8)

    axes[i].imshow(compressed_image)
    axes[i].set_title(f'{n_colors} colors')
    axes[i].axis('off')

plt.tight_layout()
plt.savefig('image_compression.png', dpi=100)
plt.show()

With 16 colors the image looks nearly identical. With 4 colors you can see the compression clearly. Storage shrinks too: each pixel needs only a small palette index instead of three 8-bit channel values.
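
A rough way to see the saving is to count bits. The sketch assumes a simple palette layout: one index per pixel plus a table of RGB colors. That layout is an assumption for illustration, and real image formats add their own headers and compression. The function name `palette_size_bits` is hypothetical.

```python
import math

def palette_size_bits(height, width, n_colors, bits_per_channel=8):
    """Bits needed for a palette image: one index per pixel plus the color table."""
    index_bits = math.ceil(math.log2(n_colors))     # bits to address one palette entry
    pixel_bits = height * width * index_bits
    palette_bits = n_colors * 3 * bits_per_channel  # one RGB entry per color
    return pixel_bits + palette_bits

raw_bits = 50 * 50 * 3 * 8  # uncompressed 50x50 RGB image, 8 bits per channel

sizes = {n: palette_size_bits(50, 50, n) for n in (16, 8, 4)}
for n_colors, bits in sizes.items():
    print(f"{n_colors:>2} colors: {bits} bits ({bits / raw_bits:.1%} of raw)")
```

For the 50x50 test image, the raw pixels take 60,000 bits. With 16 colors, the indices plus palette take 10,384 bits under this layout.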

Evaluating Clustering Without Labels

In supervised learning you check accuracy against true labels. In clustering there are no true labels (usually). So how do you evaluate?

Internal metrics (no true labels needed):

Inertia: lower is better but always decreases with more K
Silhouette score: higher is better, works across K values

External metrics (if you have true labels for comparison):

Adjusted Rand Index (ARI): 1.0 = perfect match, around 0 = random labeling (it can dip slightly below 0)
Normalized Mutual Information (NMI): 0 to 1, higher is better
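
Cluster ids are arbitrary, so you can't just compare predicted and true labels position by position. ARI sidesteps that by counting pairs of points that are grouped together in both labelings. Here is a pure-Python sketch of the standard pair-counting formula. The toy labelings are made up, and the function assumes at least two points.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index computed from pair counts in the contingency table."""
    n = len(labels_a)
    # Pairs placed together in both labelings
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each labeling on its own
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (pairs_joint - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
same_groups_renamed = [1, 1, 1, 0, 0, 0]  # identical grouping, ids swapped
one_mistake = [0, 0, 1, 1, 1, 1]          # one point in the wrong group

ari_renamed = adjusted_rand_index(truth, same_groups_renamed)
ari_mistake = adjusted_rand_index(truth, one_mistake)
print(f"renamed ids: {ari_renamed:.3f}")
print(f"one mistake: {ari_mistake:.3f}")
```

Swapping the ids still gives 1.0 because the grouping is identical. One misplaced point out of six already drops the score to about 0.32, so ARI is strict on small datasets.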

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# We have true labels here so we can check
km = KMeans(n_clusters=3, random_state=42, n_init=10)
pred_labels = km.fit_predict(X)

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
sil = silhouette_score(X, pred_labels)

print(f"Adjusted Rand Index:        {ari:.3f}")
print(f"Normalized Mutual Info:     {nmi:.3f}")
print(f"Silhouette Score:           {sil:.3f}")

Output:

Adjusted Rand Index:        1.000
Normalized Mutual Info:     1.000
Silhouette Score:           0.749

In practice you won't have true labels. Use silhouette score as your primary guide, combined with visual inspection of the clusters.

Quick Cheat Sheet

Task	Code
Train K-Means	`KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)`
Get labels	`km.labels_` or `km.fit_predict(X)`
Get centroids	`km.cluster_centers_`
Get inertia	`km.inertia_`
Elbow method	plot inertia for K=1 to 10
Silhouette score	`silhouette_score(X, labels)`
Compare to true	`adjusted_rand_score(true, pred)`
Always scale	`StandardScaler().fit_transform(X)` before K-Means
Predict new data	`km.predict(X_new)`

Practice Challenges

Level 1:
Load the iris dataset. Drop the labels. Run K-Means with K=3. Plot the silhouette score. Then compare predicted clusters to true labels using adjusted_rand_score. How well did K-Means recover the true flower species?

Level 2:
On the make_blobs dataset, try K=2, 3, 4, 5. For each K, plot the data colored by cluster and draw the centroids. At what K does the elbow appear? Does the silhouette score agree?

Level 3:
Load a real dataset like the Mall Customer dataset from Kaggle (or simulate one with income and spending score features). Run the elbow method and silhouette method to find optimal K. Profile each cluster with mean feature values. Write a 2-sentence description of each customer segment you find.

Next up, Post 67: DBSCAN: Clustering That Handles Messy Data. K-Means fails on weird shapes and outliers. DBSCAN doesn't need you to specify K, it finds clusters of any shape, and it labels outliers automatically.

65. ROC Curves and AUC: Comparing Models Fairly

Akhilesh — Sun, 10 May 2026 12:09:47 +0000

You have two models. Model A has F1 of 0.82. Model B has F1 of 0.79.

Model A wins, right?

Not necessarily. F1 is calculated at one specific threshold. Maybe Model B is much better at other thresholds. Maybe on your actual deployment threshold, B beats A.

ROC curves show you the full picture. They plot model performance across every possible threshold at once. AUC collapses that into one number you can compare.

It's the right way to compare classifiers when you haven't committed to a threshold yet.

What You'll Learn Here

What the ROC curve actually plots and how to read it
What AUC means in plain language
How to build and compare multiple ROC curves
When to use ROC-AUC vs precision-recall
Multi-class ROC with one-vs-rest
The things people get wrong about AUC

The Two Axes of a ROC Curve

ROC stands for Receiver Operating Characteristic. It comes from signal detection theory in the 1940s. The name is not helpful. The chart is.

A ROC curve plots two things as the threshold changes from 0 to 1:

Y-axis: True Positive Rate (TPR) = Recall

TPR = TP / (TP + FN)

Of all actual positives, what fraction did you catch? Higher is better.

X-axis: False Positive Rate (FPR)

FPR = FP / (FP + TN)

Of all actual negatives, what fraction did you wrongly flag? Lower is better.

As you lower the threshold, you catch more positives (TPR goes up) but you also flag more negatives as positive (FPR goes up). The ROC curve traces that tradeoff.
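
You can trace that tradeoff by hand on four points. The labels and scores below are made-up toy values, and `tpr_fpr` is a hypothetical helper that predicts positive whenever the score is at or above the threshold.

```python
def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

roc_points = []
for threshold in (0.9, 0.5, 0.3, 0.0):
    tpr, fpr = tpr_fpr(y_toy, scores_toy, threshold)
    roc_points.append((fpr, tpr))
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At 0.9 nothing is flagged, so both rates are 0. At 0.5 you catch one positive with no false alarms. At 0.3 you catch both positives but also flag one negative. Each threshold is one point on the ROC curve.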

Perfect model: goes straight up then right. Hits top-left corner.
Random model:  diagonal line from (0,0) to (1,1).
Your model:    somewhere between those two.

The closer your curve hugs the top-left corner, the better your model.

Building Your First ROC Curve

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Train two models to compare
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)

rf.fit(X_train, y_train)
lr.fit(X_train_s, y_train)

# Get probability scores (not class labels)
rf_proba = rf.predict_proba(X_test)[:, 1]
lr_proba = lr.predict_proba(X_test_s)[:, 1]

# Calculate ROC curve points
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf_proba)
lr_fpr, lr_tpr, lr_thresholds = roc_curve(y_test, lr_proba)

# Calculate AUC
rf_auc = roc_auc_score(y_test, rf_proba)
lr_auc = roc_auc_score(y_test, lr_proba)

print(f"Random Forest AUC: {rf_auc:.3f}")
print(f"Logistic Reg  AUC: {lr_auc:.3f}")

# Plot
plt.figure(figsize=(8, 6))
plt.plot(rf_fpr, rf_tpr, color='blue',   linewidth=2, label=f'Random Forest (AUC={rf_auc:.3f})')
plt.plot(lr_fpr, lr_tpr, color='orange', linewidth=2, label=f'Logistic Reg  (AUC={lr_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roc_curve.png', dpi=100)
plt.show()

Output:

Random Forest AUC: 0.997
Logistic Reg  AUC: 0.995

Both are excellent. The ROC curve shows Random Forest edges out Logistic Regression slightly, especially at low false positive rates.

What AUC Actually Means

AUC = Area Under the ROC Curve. It ranges from 0 to 1, and a sensible model lands between 0.5 and 1.0.

The number has a really nice interpretation that most people don't know:

AUC = the probability that your model ranks a random positive example higher than a random negative example.

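If AUC = 0.97, it means: pick one random positive case and one random negative case from your dataset. There's a 97% chance your model assigned a higher probability score to the positive one. (In scikit-learn's breast cancer dataset, label 1, the positive class here, is benign and label 0 is malignant.)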

# Manually verify the AUC interpretation
def manual_auc(y_true, y_scores):
    positives = y_scores[y_true == 1]
    negatives = y_scores[y_true == 0]

    count = 0
    total = 0
    for pos in positives:
        for neg in negatives:
            total += 1
            if pos > neg:
                count += 1
            elif pos == neg:
                count += 0.5  # tie counts as half

    return count / total

manual_result = manual_auc(y_test, rf_proba)
sklearn_result = roc_auc_score(y_test, rf_proba)

print(f"Manual AUC calculation: {manual_result:.3f}")
print(f"Sklearn AUC:            {sklearn_result:.3f}")

Output:

Manual AUC calculation: 0.997
Sklearn AUC:            0.997

Same number. That loop is slow but it proves what AUC actually computes.
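
The nested loop can be replaced by one NumPy comparison. The sketch below does that on made-up toy scores and checks the result against the literal area under the ROC points for the same data. The function name `pairwise_auc` is hypothetical. It builds a positives-by-negatives matrix, so it is meant for small inputs.

```python
import numpy as np

def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = y_scores[y_true == 1][:, None]  # column of positive scores
    neg = y_scores[y_true == 0][None, :]  # row of negative scores
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])
auc_pairs = pairwise_auc(y_toy, scores_toy)

# ROC points for this toy data, from the highest threshold to the lowest
fpr_pts = np.array([0.0, 0.0, 0.5, 0.5, 1.0])
tpr_pts = np.array([0.0, 0.5, 0.5, 1.0, 1.0])
# Trapezoidal rule: width of each step times the average height
auc_area = float(np.sum(np.diff(fpr_pts) * (tpr_pts[1:] + tpr_pts[:-1]) / 2))

print(f"pairwise AUC: {auc_pairs:.2f}")
print(f"area AUC:     {auc_area:.2f}")
```

Three of the four positive-negative pairs are ranked correctly, so the pairwise AUC is 0.75. The area under the curve gives the same 0.75.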

AUC score interpretation:

1.00: perfect model
0.90 to 0.99: excellent
0.80 to 0.90: good
0.70 to 0.80: fair
0.60 to 0.70: poor
0.50: random guessing (no better than a coin flip)
below 0.50: worse than random (your labels might be flipped)

Finding the Best Threshold From the ROC Curve

The ROC curve gives you every possible threshold. How do you pick one?

Option 1: Youden's J statistic
Maximize TPR - FPR. Finds the point on the curve that's furthest from the diagonal.

# Find the optimal threshold using Youden's J
fpr, tpr, thresholds = roc_curve(y_test, rf_proba)

# Youden's J = TPR - FPR
j_scores = tpr - fpr
best_idx  = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

print(f"Best threshold (Youden's J): {best_threshold:.3f}")
print(f"At this threshold:")
print(f"  TPR (Recall): {tpr[best_idx]:.3f}")
print(f"  FPR:          {fpr[best_idx]:.3f}")

# Apply this threshold
y_pred_best = (rf_proba >= best_threshold).astype(int)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_best, target_names=data.target_names))

Option 2: Closest to top-left corner
Minimize the distance from the point (0, 1).

# Distance from top-left corner (0, 1)
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best_idx_dist = np.argmin(distances)
print(f"Best threshold (closest to corner): {thresholds[best_idx_dist]:.3f}")

Option 3: Business-driven threshold
Use domain knowledge. If catching 90% of fraud cases is required, find the threshold that gives TPR >= 0.90 with the lowest FPR.

# Find threshold that achieves at least 90% recall
min_recall = 0.90
valid_idx = np.where(tpr >= min_recall)[0]
best_business_idx = valid_idx[np.argmin(fpr[valid_idx])]

print(f"Threshold for recall >= 90%: {thresholds[best_business_idx]:.3f}")
print(f"  Actual TPR: {tpr[best_business_idx]:.3f}")
print(f"  FPR:        {fpr[best_business_idx]:.3f}")

Comparing Many Models at Once

import xgboost as xgb
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    'Random Forest':   (RandomForestClassifier(n_estimators=100, random_state=42), X_train,   X_test),
    'Logistic Reg':    (LogisticRegression(max_iter=1000, random_state=42),          X_train_s, X_test_s),
    'XGBoost':         (xgb.XGBClassifier(n_estimators=100, random_state=42,
                                            eval_metric='logloss', verbosity=0),      X_train,   X_test),
    'Gaussian NB':     (GaussianNB(),                                                 X_train,   X_test),
    'KNN':             (KNeighborsClassifier(n_neighbors=7),                          X_train_s, X_test_s),
    'SVM':             (SVC(kernel='rbf', probability=True, random_state=42),         X_train_s, X_test_s),
}

plt.figure(figsize=(9, 7))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, label='Random')

for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    proba = model.predict_proba(X_te)[:, 1]
    fpr_m, tpr_m, _ = roc_curve(y_test, proba)
    auc_m = roc_auc_score(y_test, proba)
    plt.plot(fpr_m, tpr_m, linewidth=2, label=f'{name} (AUC={auc_m:.3f})')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: All Models Compared')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.savefig('roc_all_models.png', dpi=100)
plt.show()

# Print AUC table
print(f"\n{'Model':<18} {'AUC'}")
print("-" * 28)
for name, (model, X_tr, X_te) in models.items():
    proba = model.predict_proba(X_te)[:, 1]
    auc_m = roc_auc_score(y_test, proba)
    print(f"{name:<18} {auc_m:.3f}")

Output:

Model              AUC
----------------------------
Random Forest      0.997
Logistic Reg       0.995
XGBoost            0.997
Gaussian NB        0.993
KNN                0.989
SVM                0.996

All strong models here. On messier datasets, the gaps between them grow significantly.

ROC vs Precision-Recall: When to Use Which

This is something a lot of people get wrong.

Use ROC-AUC when:

Your dataset is roughly balanced
You want a single number to compare models independently of threshold
You care about overall ranking ability of the model

Use Precision-Recall when:

Your dataset is heavily imbalanced (fraud, rare disease, anomaly detection)
You care more about performance on the positive class
The negative class is not interesting (it's just background)

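The reason: on imbalanced data, ROC-AUC can look great even when most of the model's positive predictions are wrong. FPR divides false positives by the number of actual negatives. When negatives are plentiful, TN is so large that FPR stays low even when false positives far outnumber true positives. Precision divides by the number of flagged cases instead, so it exposes the problem.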
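
A few lines of arithmetic show the effect. The counts below are made up for illustration: 1% positives, and a model that catches 90 of them while raising 500 false alarms.

```python
# Hypothetical confusion-matrix counts at one threshold (made-up numbers)
n_pos, n_neg = 100, 9_900  # 1% positive class
tp, fp = 90, 500           # 90 positives caught, 500 false alarms
fn, tn = n_pos - tp, n_neg - fp

tpr = tp / (tp + fn)       # recall: share of positives caught
fpr = fp / (fp + tn)       # share of negatives wrongly flagged
precision = tp / (tp + fp) # share of flagged cases that are truly positive

print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  precision={precision:.3f}")
```

On the ROC curve this point looks great: 90% recall at about 5% FPR. But only about 15% of the flagged cases are real positives. The precision-recall view shows that. The ROC view hides it.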

from sklearn.metrics import average_precision_score, roc_auc_score
import numpy as np

# Demonstrate on imbalanced data
np.random.seed(42)
n = 10000
y_imbal = np.array([0]*9800 + [1]*200)  # 2% positive

# Bad model: purely random scores
scores_bad  = np.random.rand(n)
# Better model: random scores, with positives shifted up by 0.3
scores_good = np.random.rand(n)
scores_good[y_imbal == 1] += 0.3

print("Imbalanced dataset (2% positive):")
print(f"\nBad model:")
print(f"  ROC-AUC:         {roc_auc_score(y_imbal, scores_bad):.3f}")
print(f"  Avg Precision:   {average_precision_score(y_imbal, scores_bad):.3f}")

print(f"\nBetter model:")
print(f"  ROC-AUC:         {roc_auc_score(y_imbal, scores_good):.3f}")
print(f"  Avg Precision:   {average_precision_score(y_imbal, scores_good):.3f}")

Output:

Imbalanced dataset (2% positive):

Bad model:
  ROC-AUC:         0.501
  Avg Precision:   0.021

Better model:
  ROC-AUC:         0.753
  Avg Precision:   0.143

Both metrics show the better model is better. But look at the bad model's ROC-AUC: 0.501. That's almost random. Average Precision: 0.021. Also terrible. They agree here.

The problem shows up when models look decent on ROC but terrible on PR. That happens when there are so many true negatives that FPR stays low even though false positives far outnumber true positives.

Rule of thumb: if the positive class is less than 10% of your data, trust precision-recall more than ROC.

Multi-class ROC: One vs Rest

ROC is defined for binary problems. For multi-class, you use one-vs-rest: build a separate ROC curve for each class treating it as positive and all others as negative.
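
The "treat one class as positive" step is just a one-hot encoding of the labels. A minimal NumPy sketch on made-up labels, showing the kind of indicator matrix that one-vs-rest works with:

```python
import numpy as np

y_multi = np.array([0, 2, 1, 2, 0])
classes_multi = np.array([0, 1, 2])

# Column j is 1 where the sample belongs to class j, and 0 for all other classes
y_onehot = (y_multi[:, None] == classes_multi[None, :]).astype(int)
print(y_onehot)
```

Each column is now an ordinary binary target. Pair column j with the model's score for class j and you can compute a normal binary ROC curve for that class.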

from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

iris = load_iris()
X_i, y_i = iris.data, iris.target
classes = iris.target_names

# Binarize labels for one-vs-rest
y_bin = label_binarize(y_i, classes=[0, 1, 2])

X_train_i, X_test_i, y_train_b, y_test_b = train_test_split(
    X_i, y_bin, test_size=0.2, random_state=42
)

# Train OvR classifier
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
clf.fit(X_train_i, y_train_b)
y_score = clf.predict_proba(X_test_i)

plt.figure(figsize=(8, 6))

for i, class_name in enumerate(classes):
    fpr_i, tpr_i, _ = roc_curve(y_test_b[:, i], y_score[:, i])
    auc_i = auc(fpr_i, tpr_i)
    plt.plot(fpr_i, tpr_i, linewidth=2, label=f'{class_name} (AUC={auc_i:.3f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multi-class ROC (One vs Rest) - Iris')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiclass_roc.png', dpi=100)
plt.show()

Each class gets its own curve and AUC. The class that's hardest to separate from the others will have the lowest AUC.

Cross-validated AUC: More Reliable Than a Single Split

from sklearn.model_selection import cross_val_score

models_to_compare = {
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Reg':   LogisticRegression(max_iter=1000, random_state=42),
    'XGBoost':        xgb.XGBClassifier(n_estimators=100, random_state=42,
                                         eval_metric='logloss', verbosity=0),
}

print(f"{'Model':<18} {'CV AUC Mean':<14} {'CV AUC Std'}")
print("-" * 45)

for name, m in models_to_compare.items():
    # scoring='roc_auc' ranks with continuous scores (decision_function or predict_proba), not hard labels
    scores = cross_val_score(m, X, y, cv=5, scoring='roc_auc')
    print(f"{name:<18} {scores.mean():.3f}          {scores.std():.4f}")

Output:

Model              CV AUC Mean    CV AUC Std
---------------------------------------------
Random Forest      0.996          0.0037
Logistic Reg       0.994          0.0051
XGBoost            0.996          0.0037

Cross-validated AUC is more trustworthy than AUC on a single test set. The std tells you how consistent the model is across different data subsets.

Plotting the ROC Curve With Confidence

from sklearn.model_selection import StratifiedKFold

# Plot ROC with variance across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_cv = RandomForestClassifier(n_estimators=100, random_state=42)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

plt.figure(figsize=(8, 6))

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    rf_cv.fit(X[train_idx], y[train_idx])
    proba_cv = rf_cv.predict_proba(X[test_idx])[:, 1]

    fpr_cv, tpr_cv, _ = roc_curve(y[test_idx], proba_cv)
    auc_cv = roc_auc_score(y[test_idx], proba_cv)
    aucs.append(auc_cv)

    interp_tpr = np.interp(mean_fpr, fpr_cv, tpr_cv)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)

    plt.plot(fpr_cv, tpr_cv, alpha=0.2, color='blue', linewidth=1)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = np.mean(aucs)
std_auc  = np.std(aucs)

plt.plot(mean_fpr, mean_tpr, color='blue', linewidth=2,
         label=f'Mean ROC (AUC = {mean_auc:.3f} +/- {std_auc:.3f})')

std_tpr = np.std(tprs, axis=0)
plt.fill_between(mean_fpr, mean_tpr - std_tpr, mean_tpr + std_tpr,
                 alpha=0.15, color='blue', label='Standard deviation')

plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve With Cross-Validation Variance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roc_with_variance.png', dpi=100)
plt.show()

The shaded area shows how much the ROC curve varies across folds. A narrow band means your model is consistent. A wide band means it's sensitive to which data it trains on.

The Things Everyone Gets Wrong

Mistake 1: Using AUC on heavily imbalanced data and calling it done

A model that always predicts negative gets AUC = 0.5. But a model that ranks most positives above most negatives can reach AUC = 0.85 while its Avg Precision is only 0.10. AUC looks good. Yet when negatives vastly outnumber positives, even a small false positive rate buries the real positives under false alarms. Use PR curve on imbalanced problems.

Mistake 2: Thinking high AUC means the model is ready to deploy

AUC is threshold-independent. Deployment requires a threshold. Pick a threshold based on your actual business requirements, not just the math.
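
One way to turn a requirement into a threshold, sketched under assumptions: suppose the business says "catch at least 95% of positives". The helper name threshold_for_min_recall, the 0.95 target, and the toy data are all hypothetical. Only roc_curve is the real sklearn call.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_min_recall(y_true, y_score, min_recall=0.95):
    """Return (threshold, tpr, fpr) for the highest threshold whose recall meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # roc_curve returns thresholds in decreasing order, so tpr never decreases
    # along the arrays. The first index that meets the target has the lowest FPR.
    idx = int(np.argmax(tpr >= min_recall))
    return thresholds[idx], tpr[idx], fpr[idx]

# Toy data (assumed): 200 negatives, 50 positives that score higher on average
rng = np.random.default_rng(1)
y_toy = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])
s_toy = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 50)])

thr, tpr_at, fpr_at = threshold_for_min_recall(y_toy, s_toy, min_recall=0.95)
print(f"threshold={thr:.3f}  recall={tpr_at:.3f}  FPR={fpr_at:.3f}")
```

The same pattern works for a cap on FPR instead: search for the last index where fpr is still under the cap.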

Mistake 3: Comparing AUC across different datasets

AUC = 0.90 on one problem doesn't mean the same thing as AUC = 0.90 on another. Easy problems have high AUC even for weak models. Hard problems have lower AUC even for strong ones. Compare models on the same data only.

Mistake 4: Using predict instead of predict_proba for ROC

ROC curves need probability scores, not hard class labels. Always use predict_proba(X)[:, 1] or decision_function(X) as input to roc_curve.

# WRONG
roc_curve(y_test, model.predict(X_test))     # binary 0/1 output, useless for ROC

# RIGHT
roc_curve(y_test, model.predict_proba(X_test)[:, 1])  # probability scores

Quick Cheat Sheet

Task	Code
ROC curve	`roc_curve(y_test, y_proba)` returns fpr, tpr, thresholds
AUC score	`roc_auc_score(y_test, y_proba)`
Cross-val AUC	`cross_val_score(model, X, y, cv=5, scoring='roc_auc')`
Best threshold (Youden)	`thresholds[np.argmax(tpr - fpr)]`
Best threshold (corner)	`thresholds[np.argmin(fpr**2 + (1-tpr)**2)]`
Plot ROC	`plt.plot(fpr, tpr)` after getting from roc_curve
Multi-class ROC	`OneVsRestClassifier` + `label_binarize`

Practice Challenges

Level 1:
Train three different models on load_breast_cancer(). Plot all three ROC curves on the same graph. Which model has the highest AUC? At FPR=0.05, which model has the highest TPR?

Level 2:
Create a heavily imbalanced dataset (1% positive). Train a RandomForest. Plot both the ROC curve and the Precision-Recall curve side by side. Which one better reveals that the model struggles?

Level 3:
On the iris dataset, build a full one-vs-rest ROC analysis. Compute the macro-average AUC (average of per-class AUC). Then compute it using roc_auc_score(y, proba, multi_class='ovr', average='macro'). Verify both give the same number.

Next up, Post 66: K-Means Clustering: Find Groups Without Labels. We move into unsupervised learning. No correct answers. The algorithm groups similar data by itself and you decide if the groups make sense.

64. Precision and Recall: Beyond Accuracy

Akhilesh — Sun, 10 May 2026 07:53:39 +0000

Last post you saw that accuracy can be 95% while your model catches zero fraud.

Precision and recall are the fix. They measure different things, they pull in opposite directions, and picking the right one for your problem is one of the most important decisions you'll make in ML.

Most people know the definitions but don't know when to use which one. That's what this post is really about.

What You'll Learn Here

Precision and recall in plain words with real examples
Why improving one usually hurts the other
The precision-recall curve and how to read it
F1 score: what it is and when it's the right choice
F-beta score: when one error costs more than the other
Average precision for imbalanced problems
How to pick the right metric for any problem

Precision: When You Say Yes, Are You Right?

Precision answers: of all the times my model predicted positive, what fraction were actually positive?

Precision = TP / (TP + FP)

High precision means when you raise the alarm, it's almost always real. Low precision means lots of false alarms.

Real example: A spam filter with 99% precision blocks 99 real spam emails for every 1 legit email it blocks. Very few false alarms. Users trust it.

When precision matters most: when false positives are expensive or damaging.

Spam filter blocking legit emails from your boss is bad
Hiring tool falsely rejecting good candidates is bad
Recommending a product someone hates destroys trust

Recall: Did You Find All the Real Positives?

Recall answers: of all the actual positives that existed, what fraction did my model find?

Recall = TP / (TP + FN)

High recall means you caught almost everything real. Low recall means you're missing a lot.

Real example: A cancer screening tool with 99% recall catches 99 out of 100 actual cancer cases. It might have some false alarms, but it misses almost nothing.

When recall matters most: when false negatives are expensive or dangerous.

Missing cancer is catastrophic
Missing fraud means real money lost
Missing a structural defect in a bridge is deadly
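
Both formulas are easy to check by hand. A small sketch on made-up labels (the arrays and variable names are assumptions for illustration) counts TP, FP, and FN directly and compares the result with sklearn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives
y_true_small = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_small = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_pred_small == 1) & (y_true_small == 1)))  # said yes, was yes
fp = int(np.sum((y_pred_small == 1) & (y_true_small == 0)))  # said yes, was no
fn = int(np.sum((y_pred_small == 0) & (y_true_small == 1)))  # said no, was yes

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"Precision: manual={precision_manual:.3f}  sklearn={precision_score(y_true_small, y_pred_small):.3f}")
print(f"Recall:    manual={recall_manual:.3f}  sklearn={recall_score(y_true_small, y_pred_small):.3f}")
```

Here the model raised 3 alarms and 2 were real, so precision is 2/3. There were 4 real positives and it found 2, so recall is 1/2.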

They Pull Against Each Other

Here's the core tension. You can't just maximize both at once.

Think of a net catching fish. You want to catch all the right fish (high recall) and catch nothing else (high precision).

If you make the net bigger, you catch more right fish but also more wrong ones. Recall goes up, precision goes down.

If you make the net smaller and more selective, you catch only the ones you're sure about. Precision goes up, recall goes down.

The threshold on your model's probability output is that net size.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # prob of benign (class 1)

print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<10} {'F1'}")
print("-" * 47)

for thresh in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    y_pred_t = (proba >= thresh).astype(int)
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    rec  = recall_score(y_test, y_pred_t)
    f1   = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{thresh:<12} {prec:<12.3f} {rec:<10.3f} {f1:.3f}")

Output:

Threshold    Precision    Recall     F1
-----------------------------------------------
0.2          0.938        1.000      0.968
0.3          0.945        1.000      0.972
0.4          0.959        1.000      0.979
0.5          0.973        0.986      0.979
0.6          0.986        0.972      0.979
0.7          1.000        0.944      0.971
0.8          1.000        0.931      0.964
0.9          1.000        0.903      0.949

As threshold rises: Precision goes up, Recall goes down. Classic tradeoff.

F1 peaks in the middle around 0.4 to 0.6. That's usually a sign you've found a reasonable balance.

The Precision-Recall Curve

Instead of picking one threshold, plot precision and recall across all thresholds. This gives you the full picture.

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_test, proba)
avg_precision = average_precision_score(y_test, proba)

plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {avg_precision:.3f})')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1.05])

# Mark the threshold=0.5 point
idx = np.argmin(np.abs(thresholds - 0.5))
plt.scatter(recalls[idx], precisions[idx],
            color='red', s=100, zorder=5, label='threshold=0.5')
plt.legend()
plt.savefig('precision_recall_curve.png', dpi=100)
plt.show()

print(f"Average Precision (AP): {avg_precision:.3f}")

Reading the curve:

A perfect model has a curve that goes to the top-right corner. Precision = 1.0 and Recall = 1.0 at the same time.

A random model produces a flat horizontal line at the baseline class frequency.

The area under the curve is called Average Precision (AP). Values closer to 1.0 are better. On imbalanced datasets, AP is a better summary than AUC-ROC.
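
The random baseline is worth checking once yourself. A minimal sketch, assuming a 5% positive rate and scores that have nothing to do with the labels (both are made-up choices for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 20_000
prevalence = 0.05                      # assumed positive rate

y_rand = (rng.random(n) < prevalence).astype(int)
s_rand = rng.random(n)                 # scores unrelated to the labels

ap_rand = average_precision_score(y_rand, s_rand)
base_rate = y_rand.mean()

print(f"Positive rate:     {base_rate:.3f}")
print(f"AP of random model: {ap_rand:.3f}")
```

The two numbers should land close to each other. That's why an AP of 0.30 can be a strong result when only 2% of the data is positive, and a weak one when 40% is.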

F1 Score: The Harmonic Mean

F1 is the most common way to balance precision and recall into one number.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It's the harmonic mean, not the regular average. The harmonic mean punishes extreme imbalance. If precision is 1.0 but recall is 0.0, the regular average is 0.5. The harmonic mean (F1) is 0.0.

That's intentional. A model that catches nothing shouldn't get 50%.

from sklearn.metrics import f1_score

# Compare: high precision low recall vs balanced
p1, r1 = 0.95, 0.50
p2, r2 = 0.80, 0.80

regular_avg_1 = (p1 + r1) / 2
regular_avg_2 = (p2 + r2) / 2

f1_1 = 2 * (p1 * r1) / (p1 + r1)
f1_2 = 2 * (p2 * r2) / (p2 + r2)

print("Model 1: Precision=0.95, Recall=0.50")
print(f"  Regular average: {regular_avg_1:.3f}")
print(f"  F1 score:        {f1_1:.3f}")

print("\nModel 2: Precision=0.80, Recall=0.80")
print(f"  Regular average: {regular_avg_2:.3f}")
print(f"  F1 score:        {f1_2:.3f}")

Output:

Model 1: Precision=0.95, Recall=0.50
  Regular average: 0.725
  F1 score:        0.655

Model 2: Precision=0.80, Recall=0.80
  Regular average: 0.800
  F1 score:        0.800

F1 correctly penalizes Model 1 for its terrible recall. The regular average would say they're close. F1 tells the truth.

F-Beta: When One Error Costs More

F1 treats precision and recall equally. Real problems often don't.

F-beta lets you put more weight on recall (when FN is expensive) or precision (when FP is expensive).

F-beta = (1 + beta^2) * (Precision * Recall)
         ─────────────────────────────────────
         (beta^2 * Precision) + Recall

beta > 1: recall matters more (catching positives is critical)
beta < 1: precision matters more (avoiding false alarms is critical)
beta = 1: F1 (equal weight)

from sklearn.metrics import fbeta_score

y_true_example = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # catches 3/5, 1 false alarm

print(f"F1  (beta=1.0): {fbeta_score(y_true_example, y_pred_example, beta=1.0):.3f}")
print(f"F2  (beta=2.0): {fbeta_score(y_true_example, y_pred_example, beta=2.0):.3f}")
print(f"F0.5 (beta=0.5): {fbeta_score(y_true_example, y_pred_example, beta=0.5):.3f}")

Output:

F1  (beta=1.0): 0.667
F2  (beta=2.0): 0.625
F0.5 (beta=0.5): 0.714

Precision here is 0.75 (3 of the 4 alarms were real) and recall is 0.60 (3 of the 5 positives were caught). F2 (beta=2) leans toward recall, the weaker of the two, so it drops below F1.

F0.5 (beta=0.5) leans toward precision, the stronger of the two, so it rises above F1.

When to use F2: cancer detection, fraud detection, safety systems. Missing real positives is the bigger sin.

When to use F0.5: spam filters, content moderation. False alarms are the bigger sin.
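
If you only have a precision and a recall number, the formula is short enough to write yourself. A sketch, where the helper f_beta and the values 0.90 and 0.60 are assumptions picked for illustration:

```python
def f_beta(precision, recall, beta):
    """F-beta from a precision/recall pair, following the formula above."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention: no precision and no recall scores zero
    return (1 + b2) * precision * recall / denom

p_demo, r_demo = 0.90, 0.60   # assumed: strong precision, weaker recall

f_half = f_beta(p_demo, r_demo, beta=0.5)
f_one = f_beta(p_demo, r_demo, beta=1.0)
f_two = f_beta(p_demo, r_demo, beta=2.0)

print(f"F0.5: {f_half:.3f}")  # 0.818
print(f"F1:   {f_one:.3f}")   # 0.720
print(f"F2:   {f_two:.3f}")   # 0.643
```

The pattern to remember: the score moves toward whichever metric beta favors. Recall is the weak side here, so the recall-heavy F2 is the lowest of the three.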

Multi-class Precision and Recall

When you have more than two classes, you need to decide how to average across classes.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

iris = load_iris()
X_i, y_i = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)

model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)

# Three averaging strategies
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test_i, y_pred_i, average=avg)
    r = recall_score(y_test_i, y_pred_i, average=avg)
    f = f1_score(y_test_i, y_pred_i, average=avg)
    print(f"average='{avg}':  Precision={p:.3f}  Recall={r:.3f}  F1={f:.3f}")

Output:

average='macro':    Precision=0.968  Recall=0.967  F1=0.967
average='weighted': Precision=0.968  Recall=0.967  F1=0.967
average='micro':    Precision=0.967  Recall=0.967  F1=0.967

macro: calculates metric for each class separately, then takes the unweighted average. Each class counts equally regardless of size.

weighted: calculates metric for each class, then averages weighted by class size. Larger classes influence the score more.

micro: aggregates TP, FP, FN across all classes first, then calculates. For single-label multi-class problems, micro precision, recall, and F1 all equal accuracy, balanced or not.

Use macro when all classes matter equally (even the rare ones).
Use weighted when class size reflects real-world importance.
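
On iris the three averages look almost identical because the classes are the same size. A small sketch with made-up, imbalanced labels (6, 3, and 1 examples per class, chosen only for illustration) shows how far apart they can drift:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 0 is common, class 2 is rare and never predicted
y_true_mc = [0] * 6 + [1] * 3 + [2] * 1
y_pred_mc = [0] * 6 + [1, 1, 0] + [0]

acc_mc = accuracy_score(y_true_mc, y_pred_mc)
f1_by_avg = {
    avg: f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    for avg in ("macro", "weighted", "micro")
}

print(f"accuracy: {acc_mc:.3f}")
for avg, value in f1_by_avg.items():
    print(f"F1 ({avg}): {value:.3f}")
```

Macro F1 drops hardest because the missed rare class counts as a full third of the score. Weighted F1 barely notices it. Micro F1 matches accuracy here.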

Choosing the Right Metric: A Decision Guide

Is your dataset balanced?
│
├── YES: Accuracy is fine. Also report F1.
│
└── NO (imbalanced):
      │
      ├── What costs more?
      │     │
      │     ├── Missing real positives (FN) costs more:
      │     │     → Optimize Recall
      │     │     → Use F2 score
      │     │     → Lower your threshold
      │     │
      │     ├── False alarms (FP) cost more:
      │     │     → Optimize Precision
      │     │     → Use F0.5 score
      │     │     → Raise your threshold
      │     │
      │     └── Both matter equally:
      │           → Use F1 score
      │           → Look at Average Precision (AP)
      │
      └── Need to compare models without picking a threshold?
            → Use Average Precision (AP)
            → Use ROC-AUC (next post)

Real Example: Fraud Detection With Right Metrics

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    fbeta_score, average_precision_score, classification_report
)

# Simulate imbalanced fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud   = 200  # 2% fraud

X_legit = np.random.randn(n_samples - n_fraud, 10)
X_fraud = np.random.randn(n_fraud, 10) + 1.5  # fraud has shifted features
y_legit = np.zeros(n_samples - n_fraud)
y_fraud = np.ones(n_fraud)

X_all = np.vstack([X_legit, X_fraud])
y_all = np.hstack([y_legit, y_fraud])

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

scaler = StandardScaler()
X_train_fs = scaler.fit_transform(X_train_f)
X_test_fs  = scaler.transform(X_test_f)

# Train with class weight to handle imbalance
lr = LogisticRegression(class_weight='balanced', random_state=42)
lr.fit(X_train_fs, y_train_f)

y_pred_f  = lr.predict(X_test_fs)
y_proba_f = lr.predict_proba(X_test_fs)[:, 1]

print("Fraud Detection Model Evaluation")
print("=" * 45)
print(f"Accuracy:          {(y_pred_f == y_test_f).mean():.3f}")
print(f"Precision:         {precision_score(y_test_f, y_pred_f):.3f}")
print(f"Recall:            {recall_score(y_test_f, y_pred_f):.3f}")
print(f"F1:                {f1_score(y_test_f, y_pred_f):.3f}")
print(f"F2 (recall focus): {fbeta_score(y_test_f, y_pred_f, beta=2):.3f}")
print(f"Avg Precision:     {average_precision_score(y_test_f, y_proba_f):.3f}")
print()
print(classification_report(y_test_f, y_pred_f, target_names=['legit', 'fraud']))

Plotting Precision and Recall Together Across Thresholds

import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_test_f, y_proba_f)

plt.figure(figsize=(10, 4))

# Left: precision-recall curve
plt.subplot(1, 2, 1)
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)

# Right: both vs threshold
plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1],    label='Recall',    color='orange')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('precision_recall_threshold.png', dpi=100)
plt.show()

The right plot is the most useful for picking your threshold. You can see exactly what happens to both metrics as you slide the decision boundary. The point where the lines cross (precision equals recall) is a reasonable starting point if you want balance, though the threshold that maximizes F1 can sit slightly to either side. Push the threshold left if recall matters more. Push right if precision matters more.
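
If you'd rather compute the F1-maximizing threshold than eyeball it, the arrays from precision_recall_curve are enough. A sketch, where the helper name best_f1_threshold and the toy scores are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) for the threshold with the highest F1."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it
    p, r = precisions[:-1], recalls[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    idx = int(np.argmax(f1))
    return thresholds[idx], f1[idx]

# Toy data (assumed): 10% positives with higher scores on average
rng = np.random.default_rng(3)
y_pr = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
s_pr = np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)])

thr_f1, f1_best = best_f1_threshold(y_pr, s_pr)
print(f"Best threshold: {thr_f1:.3f}  F1 at that threshold: {f1_best:.3f}")
```

Treat the result as a candidate, not a verdict. It was picked on the same data it is scored on, so confirm it on a separate validation split before relying on it.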

The Things Everyone Gets Wrong

Mistake 1: Optimizing F1 when the errors have very different costs

F1 assumes precision and recall matter equally. Most real problems don't work that way. Know your problem. Use F2 or F0.5 when appropriate.

Mistake 2: Using macro average on severely imbalanced data

Macro average treats a class with 5 examples the same as a class with 5000. On imbalanced data that gives you a misleading picture. Report per-class metrics separately.

Mistake 3: Not reporting the metric you actually optimized for

If you tuned your threshold to maximize recall, report recall as your primary metric. Don't report accuracy and hide that your precision is low.

Mistake 4: Picking a threshold without business context

The math doesn't tell you the right threshold. The business problem does. A fraud team can review 100 false alarms per day but can't review 1000. That constraint picks your threshold.
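
That review-capacity constraint translates into code directly. A minimal sketch, assuming one day's worth of model scores and a budget of 100 alerts (the helper threshold_for_alert_budget and the random scores are hypothetical):

```python
import numpy as np

def threshold_for_alert_budget(scores, max_alerts):
    """Lowest threshold that flags at most max_alerts items (assuming no tied scores)."""
    scores = np.asarray(scores)
    if max_alerts <= 0:
        raise ValueError("max_alerts must be positive")
    if max_alerts >= scores.size:
        return scores.min()
    # The max_alerts-th highest score: everything at or above it gets flagged
    return np.sort(scores)[-max_alerts]

# Stand-in for one day of fraud scores (assumed: 5000 transactions)
rng = np.random.default_rng(7)
daily_scores = rng.random(5_000)

thr_budget = threshold_for_alert_budget(daily_scores, max_alerts=100)
n_alerts = int((daily_scores >= thr_budget).sum())

print(f"Threshold: {thr_budget:.4f}  Alerts raised: {n_alerts}")
```

Once the threshold is fixed this way, precision and recall at that threshold tell you what the team's 100 reviews actually buy.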

Quick Cheat Sheet

Metric	Formula	Use when
Precision	TP / (TP+FP)	FP is expensive
Recall	TP / (TP+FN)	FN is expensive
F1	harmonic mean P and R	both matter equally
F2	beta=2, recall focused	FN >> FP cost
F0.5	beta=0.5, precision focused	FP >> FN cost
AP	area under PR curve	compare models without threshold

Task	Code
Precision	`precision_score(y_test, y_pred)`
Recall	`recall_score(y_test, y_pred)`
F1	`f1_score(y_test, y_pred)`
F-beta	`fbeta_score(y_test, y_pred, beta=2)`
Full report	`classification_report(y_test, y_pred)`
PR curve	`precision_recall_curve(y_test, y_proba)`
Average Precision	`average_precision_score(y_test, y_proba)`
Multi-class avg	add `average='macro'` or `average='weighted'`

Practice Challenges

Level 1:
Train a LogisticRegression on an imbalanced dataset (use make_classification with weights=[0.95, 0.05]). Print accuracy, precision, recall, and F1. Which metric hides the problem? Which one reveals it?

Level 2:
On the breast cancer dataset, plot precision, recall, and F1 against threshold from 0.1 to 0.9. Find the threshold that maximizes F2. What does the model look like at that threshold?

Level 3:
Compare three models on an imbalanced dataset using Average Precision instead of accuracy: LogisticRegression, RandomForest, and XGBoost. Rank them by AP. Does the ranking change compared to ranking by F1?

Next up, Post 65: ROC Curves and AUC: Comparing Models Fairly. We visualize how every threshold performs at once, understand what AUC actually means, and learn when ROC beats precision-recall and when it doesn't.

63. Confusion Matrix: What Your Model Got Wrong and Why

Akhilesh — Sun, 10 May 2026 07:51:27 +0000

Your model has 95% accuracy. You ship it.

Three weeks later someone tells you it hasn't caught a single fraud case.

You check. The dataset had 95% legit transactions and 5% fraud. Your model just learned to say "not fraud" every single time. 95% accuracy. Zero fraud caught.

That's what happens when you trust accuracy alone. The confusion matrix is the tool that would have caught this immediately.

What You'll Learn Here

What the four cells of a confusion matrix mean
TP, TN, FP, FN with real-world examples not textbook ones
How to build and read a confusion matrix in Python
Why class imbalance makes accuracy useless
How to visualize it properly
Multi-class confusion matrices

The Four Outcomes of Every Prediction

Every prediction your model makes falls into one of four buckets. Let's use a disease test as the example because the stakes are obvious.

                     PREDICTED
                  Positive  Negative
ACTUAL  Positive |   TP    |   FN   |
        Negative |   FP    |   TN   |

True Positive (TP): Model said positive. Actually positive. Correct.
The test said "has disease." Person has the disease. Good catch.

True Negative (TN): Model said negative. Actually negative. Correct.
The test said "no disease." Person is healthy. Also good.

False Positive (FP): Model said positive. Actually negative. Wrong.
The test said "has disease." Person is actually healthy. A false alarm.
Also called a Type I error.

False Negative (FN): Model said negative. Actually positive. Wrong.
The test said "no disease." Person actually has the disease. Missed it.
Also called a Type II error.

In most real problems, FP and FN have very different costs. Missing cancer (FN) is catastrophic. Flagging a legit transaction as fraud (FP) is annoying but fixable.

That difference is exactly why you need more than accuracy.

Building Your First Confusion Matrix

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

data = load_breast_cancer()
X, y = data.data, data.target

# 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print()

# Label what each cell is
tn, fp, fn, tp = cm.ravel()
print(f"True Positives  (TP): {tp}  <- predicted benign,   actually benign")
print(f"True Negatives  (TN): {tn}  <- predicted malignant, actually malignant")
print(f"False Positives (FP): {fp}   <- predicted benign,   actually malignant")
print(f"False Negatives (FN): {fn}   <- predicted malignant, actually benign")

Output:

Confusion Matrix:
[[40  2]
 [ 1 71]]

True Positives  (TP): 71  <- predicted benign,   actually benign
True Negatives  (TN): 40  <- predicted malignant, actually malignant
False Positives (FP): 2   <- predicted benign,   actually malignant
False Negatives (FN): 1   <- predicted malignant, actually benign

The model got 71 + 40 = 111 correct out of 114. 97.4% accuracy.

But look at the mistakes. It missed 2 malignant tumors (FP here means it called them benign). That's the dangerous type of error in cancer detection.
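
A layout detail that trips people up: sklearn sorts labels in ascending order, so its binary matrix is [[TN, FP], [FN, TP]], while the diagram earlier in this post puts the positive class first. A small sketch on made-up labels shows both layouts. Passing labels=[1, 0] reproduces the diagram's order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 actual positives, 5 actual negatives
y_true_cm = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_cm = [1, 1, 0, 1, 0, 0, 0, 0]

# Default: labels sorted ascending -> [[TN, FP], [FN, TP]]
cm_default = confusion_matrix(y_true_cm, y_pred_cm)

# Positive class first -> [[TP, FN], [FP, TN]], same as the diagram
cm_diagram = confusion_matrix(y_true_cm, y_pred_cm, labels=[1, 0])

print("sklearn default layout:")
print(cm_default)
print("diagram layout (labels=[1, 0]):")
print(cm_diagram)
```

The tn, fp, fn, tp = cm.ravel() unpacking only works with the default layout, so keep the default when you unpack and reorder only for display.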

Visualizing It Properly

Raw numbers are fine. A heatmap is better.

# Clean visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Without normalization - raw counts
disp1 = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp1.plot(ax=axes[0], colorbar=False, cmap='Blues')
axes[0].set_title('Raw Counts')

# With normalization - proportions
cm_normalized = confusion_matrix(y_test, y_pred, normalize='true')
disp2 = ConfusionMatrixDisplay(
    confusion_matrix=cm_normalized,
    display_labels=data.target_names
)
disp2.plot(ax=axes[1], colorbar=False, cmap='Blues')
axes[1].set_title('Normalized (row %)')

plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=100)
plt.show()

The normalized version shows recall per class. Each row sums to 1.0. You can instantly see what percentage of each actual class was correctly identified.
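
You can confirm the "recall per class" reading yourself, and see what the other normalization gives. A sketch on made-up labels (the arrays are assumptions for illustration): normalize='true' divides each row by its total, normalize='pred' divides each column by its total.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 of class 0, 6 of class 1
y_true_n = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_n = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm_by_row = confusion_matrix(y_true_n, y_pred_n, normalize='true')  # rows sum to 1
cm_by_col = confusion_matrix(y_true_n, y_pred_n, normalize='pred')  # columns sum to 1

recall_per_class = recall_score(y_true_n, y_pred_n, average=None)
precision_per_class = precision_score(y_true_n, y_pred_n, average=None)

print("Diagonal of normalize='true':", np.diag(cm_by_row))
print("Recall per class:            ", recall_per_class)
print("Diagonal of normalize='pred':", np.diag(cm_by_col))
print("Precision per class:         ", precision_per_class)
```

Row-normalized diagonals are recall. Column-normalized diagonals are precision. Plot whichever one matches the question you're asking.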

Why Accuracy Lies on Imbalanced Data

Let's prove the fraud example from the intro with real code.

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Imbalanced dataset: 950 legit, 50 fraud
np.random.seed(42)
y_true = np.array([0]*950 + [1]*50)  # 0=legit, 1=fraud

# Model A: always predicts "not fraud"
y_pred_lazy = np.zeros(1000, dtype=int)

# Model B: actually tries to catch fraud
# Catches 35 out of 50 frauds, but has 20 false alarms
y_pred_smart = np.zeros(1000, dtype=int)
fraud_indices = np.where(y_true == 1)[0]
y_pred_smart[fraud_indices[:35]] = 1   # catches 35 real frauds
y_pred_smart[:20] = 1                  # 20 false alarms on legit transactions

print("=" * 50)
print("MODEL A: Always predicts Not Fraud")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_pred_lazy):.3f}")
cm_a = confusion_matrix(y_true, y_pred_lazy)
print(f"Confusion Matrix:\n{cm_a}")
tn, fp, fn, tp = cm_a.ravel()
print(f"Fraud caught: {tp} out of 50")

print()
print("=" * 50)
print("MODEL B: Actually tries to detect fraud")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_pred_smart):.3f}")
cm_b = confusion_matrix(y_true, y_pred_smart)
print(f"Confusion Matrix:\n{cm_b}")
tn, fp, fn, tp = cm_b.ravel()
print(f"Fraud caught: {tp} out of 50")

Output:

==================================================
MODEL A: Always predicts Not Fraud
==================================================
Accuracy: 0.950
Confusion Matrix:
[[950   0]
 [ 50   0]]
Fraud caught: 0 out of 50

==================================================
MODEL B: Actually tries to detect fraud
==================================================
Accuracy: 0.965
Confusion Matrix:
[[930  20]
 [ 15  35]]
Fraud caught: 35 out of 50

Model A: 95% accuracy. Catches zero fraud. Completely useless.
Model B: 96.5% accuracy. Catches 35 out of 50 frauds. Actually useful.

Accuracy said Model A was nearly as good. The confusion matrix told the truth.

Reading Every Number From a Confusion Matrix

Once you have the four numbers, you can calculate all the important metrics by hand.

# After getting tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

total = tn + fp + fn + tp

accuracy  = (tp + tn) / total
precision = tp / (tp + fp)        # of all predicted positive, how many were right
recall    = tp / (tp + fn)        # of all actual positive, how many did we catch
f1        = 2 * (precision * recall) / (precision + recall)
specificity = tn / (tn + fp)      # of all actual negative, how many did we get right

print(f"Total samples:  {total}")
print(f"TP: {tp}  TN: {tn}  FP: {fp}  FN: {fn}")
print()
print(f"Accuracy:    {accuracy:.3f}   <- overall correct %")
print(f"Precision:   {precision:.3f}   <- when I say positive, am I right?")
print(f"Recall:      {recall:.3f}   <- did I catch all actual positives?")
print(f"F1 Score:    {f1:.3f}   <- balance of precision and recall")
print(f"Specificity: {specificity:.3f}   <- did I correctly identify negatives?")

Output:

Total samples:  114
TP: 71  TN: 40  FP: 2  FN: 1

Accuracy:    0.974   <- overall correct %
Precision:   0.973   <- when I say positive, am I right?
Recall:      0.986   <- did I catch all actual positives?
F1 Score:    0.979   <- balance of precision and recall
Specificity: 0.952   <- did I correctly identify negatives?

All of these come from the same four numbers. Memorizing the formulas is less important than understanding what each one means in context.

Choosing Which Error Is Worse

The right metric depends on your problem. You need to decide which error costs more.

Problem: Cancer detection
  FN (missed cancer) > FP (false alarm)
  → Optimize for Recall. Catch everything, even if some are false alarms.

Problem: Spam filter
  FP (blocking legit email) > FN (letting spam through)
  → Optimize for Precision. Only block what you're sure about.

Problem: Fraud detection
  FN (missed fraud) > FP (flagging legit transaction)
  → Optimize for Recall on fraud class.

Problem: Hiring tool
  FP (hiring wrong person) ≈ FN (missing good candidate)
  → Optimize for F1. Balance both.

# See how threshold change affects TP, FP, FN, TN
from sklearn.ensemble import RandomForestClassifier

model_prob = RandomForestClassifier(n_estimators=100, random_state=42)
model_prob.fit(X_train, y_train)
proba = model_prob.predict_proba(X_test)[:, 1]  # probability of benign

print(f"{'Threshold':<12} {'TP':<6} {'TN':<6} {'FP':<6} {'FN':<6} {'Recall':<10} {'Precision'}")
print("-" * 60)

for thresh in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    y_pred_t = (proba >= thresh).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    rec  = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    print(f"{thresh:<12} {tp_t:<6} {tn_t:<6} {fp_t:<6} {fn_t:<6} {rec:<10.3f} {prec:.3f}")

Output:

Threshold    TP     TN     FP     FN     Recall     Precision
------------------------------------------------------------
0.3          72     38     4      0      1.000      0.947
0.4          72     39     3      0      1.000      0.960
0.5          71     40     2      1      0.986      0.973
0.6          70     41     1      2      0.972      0.986
0.7          68     42     0      4      0.944      1.000
0.8          65     42     0      7      0.903      1.000

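Remember that the positive class here is benign. At threshold 0.3 every benign tumor is labeled benign (Recall=1.0), but 4 malignant tumors are also called benign (FP=4). Those are 4 missed cancers.
At threshold 0.7 no malignant tumor is called benign (FP=0, Precision=1.0), but 4 benign tumors get flagged as malignant (FN=4). Those are 4 false alarms.

Which is better? In cancer detection, threshold 0.7 is better, because a false alarm leads to a follow-up test while a missed cancer goes untreated. If follow-up tests are costly and the stakes are lower, maybe 0.6.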
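
One thing to keep straight: in load_breast_cancer, malignant is class 0, so the Recall and Precision columns above are computed for the benign class. If malignant is the class you care about, you can ask sklearn to score it directly with pos_label=0. A minimal sketch on toy labels (the arrays are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (assumed), same coding as load_breast_cancer: 0 = malignant, 1 = benign
y_true_pl = [0, 0, 0, 1, 1, 1, 1]
y_pred_pl = [0, 0, 1, 1, 1, 1, 1]   # one malignant case was called benign

recall_benign = recall_score(y_true_pl, y_pred_pl)                  # default pos_label=1
recall_malignant = recall_score(y_true_pl, y_pred_pl, pos_label=0)  # score class 0 instead
precision_malignant = precision_score(y_true_pl, y_pred_pl, pos_label=0)

print(f"Recall (benign as positive):       {recall_benign:.3f}")
print(f"Recall (malignant as positive):    {recall_malignant:.3f}")
print(f"Precision (malignant as positive): {precision_malignant:.3f}")
```

Benign recall is perfect here, yet a third of the malignant cases slipped through. Always check which class your "positive" actually is before reading a recall number.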

Multi-class Confusion Matrix

With more than two classes, the matrix grows but the same logic applies.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

iris = load_iris()
X_i, y_i = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)

model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)

cm_i = confusion_matrix(y_test_i, y_pred_i)

print("Multi-class Confusion Matrix:")
print(cm_i)
print()

# Visualize
fig, ax = plt.subplots(figsize=(7, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm_i,
    display_labels=iris.target_names
)
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title('Iris - 3 Class Confusion Matrix')
plt.tight_layout()
plt.savefig('multiclass_cm.png', dpi=100)
plt.show()

Output:

Multi-class Confusion Matrix:
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]

Reading this: rows are actual classes, columns are predicted.

Row "versicolor": 9 correctly identified as versicolor, 1 incorrectly called virginica. That 1 is a false negative for versicolor and a false positive for virginica.

The diagonal is always your correct predictions. Off-diagonal cells are errors.

Per-class Metrics From the Matrix

print(classification_report(
    y_test_i, y_pred_i,
    target_names=iris.target_names
))

Output:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30

Every class gets its own precision, recall, and F1. You can see exactly which classes are problematic. Versicolor has lower recall because one example got misclassified as virginica.
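
You can recover those per-class numbers straight from the matrix. A small sketch in plain Python, using the 3x3 matrix printed above (the helper name per_class_metrics is made up for this example): recall reads along a row, precision reads down a column.

```python
cm = [
    [10, 0, 0],   # actual setosa
    [0, 9, 1],    # actual versicolor
    [0, 0, 10],   # actual virginica
]
names = ["setosa", "versicolor", "virginica"]

def per_class_metrics(cm):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        row_total = sum(cm[k])                       # everything that is actually class k
        col_total = sum(cm[r][k] for r in range(n))  # everything predicted as class k
        recall = tp / row_total if row_total else 0.0
        precision = tp / col_total if col_total else 0.0
        out.append((precision, recall))
    return out

metrics = per_class_metrics(cm)
for name, (p, r) in zip(names, metrics):
    print(f"{name:<12} precision={p:.2f} recall={r:.2f}")
```

Versicolor's recall is 9/10 because one row entry is off the diagonal. Virginica's precision is 10/11 because that same error lands in its column.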

A Complete Diagnostic Workflow

from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, roc_auc_score
)
import numpy as np

def diagnose_model(model, X_test, y_test, class_names, threshold=0.5):
    y_pred  = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    print("=" * 55)
    print("MODEL DIAGNOSIS REPORT")
    print("=" * 55)

    # Overall accuracy
    acc = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy: {acc:.3f}")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"\nConfusion Matrix:")
    print(cm)

    # Per-class report
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # For binary: show TP/TN/FP/FN breakdown
    if len(class_names) == 2:
        tn, fp, fn, tp = cm.ravel()
        print(f"True Positives:  {tp}")
        print(f"True Negatives:  {tn}")
        print(f"False Positives: {fp}  <- wrong positive predictions")
        print(f"False Negatives: {fn}  <- missed actual positives")
        print(f"\nROC-AUC: {roc_auc_score(y_test, y_proba[:, 1]):.3f}")

    print("=" * 55)

# Use it
diagnose_model(model, X_test, y_test, data.target_names)

Quick Cheat Sheet

Term	Formula	Meaning
Accuracy	(TP+TN) / total	Overall correct %
Precision	TP / (TP+FP)	When I say positive, am I right?
Recall	TP / (TP+FN)	Did I catch all actual positives?
F1 Score	2 * P * R / (P + R)	Balance of precision and recall
Specificity	TN / (TN+FP)	Did I correctly identify negatives?
FPR	FP / (FP+TN)	How often did I false alarm?
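
If you want to check the formulas yourself, here is a minimal sketch that applies them to the counts from the threshold 0.5 row of the earlier table. The function name binary_metrics is hypothetical, and it does not guard against zero denominators.

```python
def binary_metrics(tp, tn, fp, fn):
    """Apply the cheat-sheet formulas to raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
    }

# Counts from the threshold 0.5 row of the earlier table
m = binary_metrics(tp=71, tn=40, fp=2, fn=1)
for name, value in m.items():
    print(f"{name:<12} {value:.3f}")
```

Notice that specificity and FPR always add up to 1. They are two views of the same row of the matrix.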

Task	Code
Build matrix	`confusion_matrix(y_test, y_pred)`
Visualize	`ConfusionMatrixDisplay(cm).plot()`
Normalize	`confusion_matrix(y_test, y_pred, normalize='true')`
Full report	`classification_report(y_test, y_pred)`
Extract TP/TN/FP/FN	`tn, fp, fn, tp = cm.ravel()` (binary only)

Practice Challenges

Level 1:
Train any classifier on load_breast_cancer(). Print the confusion matrix. Calculate precision and recall by hand from the TP, TN, FP, FN values. Then verify with classification_report.

Level 2:
Create an extremely imbalanced dataset (99% class 0, 1% class 1). Train a LogisticRegression on it. What is the accuracy? What does the confusion matrix look like? Now add class_weight='balanced' and retrain. How does the confusion matrix change?

Level 3:
On the fraud-like imbalanced dataset from this post, sweep thresholds from 0.1 to 0.9. For each threshold, compute FN and FP. Plot FN on one axis and FP on the other as the threshold changes. This is the same tradeoff a precision-recall curve describes, expressed in raw error counts. Where would you put the threshold if missing fraud costs 10x more than a false alarm?

Next up, Post 64: Precision and Recall: Beyond Accuracy. We go deep on the tradeoff between catching everything and being right when you do. The F1 score, when to use each metric, and how to pick the right one for your problem.

62. Naive Bayes: Fast, Simple, Surprisingly Effective

Akhilesh — Sun, 10 May 2026 07:49:58 +0000

Your email spam filter makes a decision in milliseconds. Thousands of words. Instant classification.

Most of the algorithms we've covered so far would struggle with that. KNN needs to compute distances across thousands of features. SVM slows down on high dimensions. Even tree-based models take time.

Naive Bayes does it in one pass. It counts words, multiplies probabilities, picks the class with the highest probability. Done.

It's been doing this since the 1990s and it still works.

What You'll Learn Here

What Bayes theorem is in plain words, not symbols
Why the naive assumption works even when it is wrong
The three variants: Gaussian, Multinomial, Bernoulli
Building a text classifier from scratch
When Naive Bayes wins and when it loses
Full working code with scikit-learn

Bayes Theorem in Plain English

You want to know: given that this email contains the word "casino", what is the probability it is spam?

That's a conditional probability. Written as:

P(spam | word="casino")

Bayes theorem says you can calculate this using things you already know from training data:

P(spam | casino) = P(casino | spam) * P(spam)
                   ─────────────────────────────
                          P(casino)

In words:

P(casino | spam): how often does the word "casino" appear in spam emails? You know this from training data.
P(spam): what fraction of all emails are spam? You know this too.
P(casino): how often does "casino" appear in any email? Also known.

So you can calculate the probability that an email is spam, given that it contains "casino", using just counts from your training data.

For classification, you don't even need the denominator P(casino) because it's the same for all classes. You just compare:

P(spam | casino)     vs     P(not spam | casino)

Whichever is bigger wins.
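
Here is the same calculation with numbers plugged in. The counts are hypothetical, invented only to make the arithmetic visible.

```python
# Hypothetical training counts (not from any real dataset)
n_emails = 100
n_spam = 40
n_ham = n_emails - n_spam
spam_with_casino = 20
ham_with_casino = 3

p_spam = n_spam / n_emails
p_ham = n_ham / n_emails
p_casino_given_spam = spam_with_casino / n_spam
p_casino_given_ham = ham_with_casino / n_ham
p_casino = (spam_with_casino + ham_with_casino) / n_emails

# Bayes theorem for each class
p_spam_given_casino = p_casino_given_spam * p_spam / p_casino
p_ham_given_casino = p_casino_given_ham * p_ham / p_casino

print(f"P(spam | casino)     = {p_spam_given_casino:.3f}")
print(f"P(not spam | casino) = {p_ham_given_casino:.3f}")

# Dropping the shared denominator keeps the same winner
print(p_casino_given_spam * p_spam > p_casino_given_ham * p_ham)
```

With these counts the email is about 87% likely to be spam. The last line shows why the denominator can be skipped: both sides are divided by the same P(casino), so the comparison comes out the same.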

Why It's Called Naive

Real emails have many words. You need:

P(spam | word1, word2, word3, ..., word1000)

Calculating the joint probability of all those words together is nearly impossible. The data would never be enough.

The naive assumption: treat every word as independent of every other word. Pretend that seeing the word "casino" tells you nothing about whether "free" also appears.

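P(spam | word1, word2, ..., wordN)
  ∝ P(word1 | spam) * P(word2 | spam) * ... * P(wordN | spam) * P(spam)

The ∝ means "proportional to": the shared denominator is dropped because it's the same for every class. Now you just multiply individual word probabilities. Those you can estimate easily from training data.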

Is this assumption true? Absolutely not. Words in emails are not independent at all. "Free money" tends to appear together in spam.

Does it work anyway? Yes. Shockingly well.

The reason it still works is that even wrong independence assumptions lead to the right class comparison most of the time. The relative ordering of class probabilities tends to be preserved even when the absolute probabilities are wrong.
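
A minimal sketch of that multiplication, with made-up word probabilities and priors. It adds logs instead of multiplying raw probabilities, because a product of a thousand small numbers underflows to zero in floating point.

```python
import math

# Hypothetical per-class word probabilities and class priors
p_word = {
    "spam": {"free": 0.20, "money": 0.15, "meeting": 0.01},
    "ham":  {"free": 0.02, "money": 0.03, "meeting": 0.20},
}
prior = {"spam": 0.4, "ham": 0.6}

def log_score(words, label):
    """log P(label) + sum of log P(word | label): the naive product in log space."""
    score = math.log(prior[label])
    for w in words:
        score += math.log(p_word[label][w])
    return score

def classify(words):
    """Pick the class with the highest log score."""
    return max(prior, key=lambda label: log_score(words, label))

print(classify(["free", "money"]))   # spam
print(classify(["meeting"]))         # ham
```

The scores are not real probabilities. Only their ordering matters, which is exactly the point made above.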

Three Variants of Naive Bayes

Different variants handle different types of features.

Gaussian Naive Bayes
For continuous features. Assumes each feature follows a normal (Gaussian) distribution within each class.

Multinomial Naive Bayes
For count data. Most common for text classification using word counts or TF-IDF.

Bernoulli Naive Bayes
For binary features. Good for text where you only care whether a word appears, not how many times.

Gaussian Naive Bayes on Numeric Data

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Output:

Gaussian NB Accuracy: 0.967

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30

What the model actually learned: for each feature and each class, it calculated the mean and variance. At prediction time, it checks how likely each feature value is under each class's distribution.

# What GaussianNB learned
import pandas as pd
import numpy as np

print("Class means for each feature:")
means_df = pd.DataFrame(
    gnb.theta_,
    columns=iris.feature_names,
    index=iris.target_names
)
print(means_df.round(2))

Output:

Class means for each feature:
            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
setosa                   5.01              3.43               1.46              0.25
versicolor               5.94              2.77               4.26              1.33
virginica                6.59              2.97               5.55              2.03

These means tell most of the story (the per-class variances, stored in gnb.var_, supply the rest). Virginica has the longest petals. Setosa has the shortest. When a new flower comes in, the model checks which class's distribution it fits best.
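
To see what "fits best" means, here is a one-feature sketch. The petal-length means come from the table above. The variances are assumed values for illustration, and equal class priors are assumed so only the likelihood decides.

```python
import math

# Petal-length means from the table above; variances are assumed for illustration
petal_length = {
    "setosa":     {"mean": 1.46, "var": 0.03},
    "versicolor": {"mean": 4.26, "var": 0.22},
    "virginica":  {"mean": 5.55, "var": 0.30},
}

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def most_likely_class(x):
    # Equal class priors assumed, so only the likelihood matters
    return max(
        petal_length,
        key=lambda c: gaussian_log_pdf(x, petal_length[c]["mean"], petal_length[c]["var"]),
    )

print(most_likely_class(1.5))   # setosa
print(most_likely_class(4.5))   # versicolor
print(most_likely_class(6.0))   # virginica
```

The real model does this for all four features and adds the log densities together, which is the naive independence assumption again.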

Multinomial Naive Bayes for Text Classification

This is where Naive Bayes really shines. Let's build a spam classifier.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np

# Simple spam dataset
emails = [
    # Spam
    ("Get rich quick! Free money! Click here now!", 1),
    ("You won a prize! Claim your free casino chips!", 1),
    ("Cheap meds online! No prescription needed!", 1),
    ("URGENT: Your account needs verification. Click now!", 1),
    ("Make money from home! Easy income guaranteed!", 1),
    ("Free Viagra! Cialis! Lowest prices online!", 1),
    ("Congratulations you have been selected for a prize!", 1),
    ("Win big today! Limited time casino offer!", 1),
    # Not spam
    ("Hey, are we still meeting for lunch tomorrow?", 0),
    ("The quarterly report is ready for your review.", 0),
    ("Can you send me the project files?", 0),
    ("Meeting rescheduled to 3pm on Thursday.", 0),
    ("Your order has been shipped. Track it here.", 0),
    ("Thanks for the presentation today, great work!", 0),
    ("Please review the attached document and let me know.", 0),
    ("Team lunch is on Friday at noon, see you there!", 0),
]

texts, labels = zip(*emails)
texts  = list(texts)
labels = list(labels)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Convert text to word counts
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts  = vectorizer.transform(X_test)

# Train Multinomial NB
mnb = MultinomialNB(alpha=1.0)  # alpha=1 is Laplace smoothing
mnb.fit(X_train_counts, y_train)

y_pred = mnb.predict(X_test_counts)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=['not spam', 'spam']))

What the Model Learned About Words

This is the most interesting part. You can see exactly which words push toward spam and which toward not-spam.

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Log probabilities for each class
log_probs = mnb.feature_log_prob_  # shape: (n_classes, n_features)

# Top spam words
spam_log_probs   = log_probs[1]
notspam_log_probs = log_probs[0]

# Words most associated with spam
spam_word_scores = pd.DataFrame({
    'Word':      feature_names,
    'Spam prob': np.exp(spam_log_probs),
    'Ham prob':  np.exp(notspam_log_probs),
    'Diff':      spam_log_probs - notspam_log_probs
}).sort_values('Diff', ascending=False)

print("Top words that scream SPAM:")
print(spam_word_scores.head(10)[['Word', 'Diff']].to_string(index=False))

print("\nTop words that scream NOT SPAM:")
print(spam_word_scores.tail(10)[['Word', 'Diff']].to_string(index=False))

Classify New Emails

new_emails = [
    "Free money! You won! Click here!",
    "Can we schedule a call for next week?",
    "Exclusive casino offer just for you, free chips!",
    "The project deadline has been moved to Friday.",
]

new_counts = vectorizer.transform(new_emails)
predictions = mnb.predict(new_counts)
probabilities = mnb.predict_proba(new_counts)

for email, pred, proba in zip(new_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"[{label}] (confidence: {max(proba):.1%})")
    print(f"  '{email[:60]}...' " if len(email) > 60 else f"  '{email}'")
    print()

Output:

[SPAM] (confidence: 99.8%)
  'Free money! You won! Click here!'

[NOT SPAM] (confidence: 94.2%)
  'Can we schedule a call for next week?'

[SPAM] (confidence: 99.1%)
  'Exclusive casino offer just for you, free chips!'

[NOT SPAM] (confidence: 91.7%)
  'The project deadline has been moved to Friday.'

TF-IDF Instead of Raw Counts

Raw word counts give too much weight to common words. TF-IDF (Term Frequency-Inverse Document Frequency) adjusts for this. Words that appear in many documents get lower weight.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline with TF-IDF
tfidf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('nb',    MultinomialNB(alpha=0.1))
])

tfidf_pipeline.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipeline.predict(X_test)
print(f"TF-IDF + Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_tfidf):.3f}")
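
To see why common words get lower weight, here is a hand-rolled IDF on three tiny hypothetical documents. It uses the smoothed formula that scikit-learn's documentation describes for smooth_idf=True. Treat it as a sketch of the idea, not a replacement for TfidfVectorizer, which also applies term frequency and normalization.

```python
import math

# Hypothetical tokenized documents
docs = [
    ["free", "money", "click"],
    ["free", "casino", "offer"],
    ["free", "lunch", "meeting"],
]

def smoothed_idf(term, docs):
    """ln((1 + n_docs) / (1 + doc_freq)) + 1"""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + doc_freq)) + 1

for term in ["free", "casino"]:
    print(f"{term:<8} idf={smoothed_idf(term, docs):.3f}")
```

"free" appears in every document, so its IDF bottoms out at 1.0. "casino" appears in only one, so it gets a higher weight.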

Bernoulli Naive Bayes

When you only care if a word appears at all, not how many times:

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# BernoulliNB works with binary features (word present or not)
bin_vectorizer = CountVectorizer(binary=True, stop_words='english')
X_train_bin = bin_vectorizer.fit_transform(X_train)
X_test_bin  = bin_vectorizer.transform(X_test)

bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_bin, y_train)

y_pred_b = bnb.predict(X_test_bin)
print(f"Bernoulli NB Accuracy: {accuracy_score(y_test, y_pred_b):.3f}")

When to use which:

Gaussian NB: continuous numeric features
Multinomial NB: word counts, TF-IDF, frequency data
Bernoulli NB: binary features, short text, word presence/absence

Laplace Smoothing: Handling Unseen Words

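What if a word shows up in an email you're classifying but never appeared in the training emails of one class? Its estimated probability for that class would be 0. And 0 multiplied by anything is 0. The whole prediction for that class collapses, no matter what the other words say. (Words that never appeared in training at all are simply dropped by the vectorizer.)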

Laplace smoothing fixes this by adding a small count to every word, even unseen ones.

# alpha controls smoothing
# alpha=1.0 is classic Laplace smoothing
# alpha=0.1 is lighter smoothing - better when you have lots of data

for alpha in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]:
    mnb_a = MultinomialNB(alpha=alpha)
    mnb_a.fit(X_train_counts, y_train)
    acc = accuracy_score(y_test, mnb_a.predict(X_test_counts))
    print(f"alpha={alpha:<5} accuracy={acc:.3f}")

Alpha=1.0 is the safe default. On larger datasets try smaller values like 0.1.
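
Here is what alpha does to the numbers, using the standard additive-smoothing formula on hypothetical spam word counts. The vocabulary and counts are invented for the example.

```python
# Hypothetical word counts observed in spam training emails
spam_counts = {"free": 3, "casino": 2, "money": 1}
vocabulary = ["free", "casino", "money", "meeting", "lunch"]

def smoothed_prob(word, counts, vocabulary, alpha):
    """(count + alpha) / (total + alpha * vocabulary size)"""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * len(vocabulary))

for alpha in [0.0, 0.1, 1.0]:
    p_meeting = smoothed_prob("meeting", spam_counts, vocabulary, alpha)
    p_free = smoothed_prob("free", spam_counts, vocabulary, alpha)
    print(f"alpha={alpha:<4} P(meeting|spam)={p_meeting:.4f}  P(free|spam)={p_free:.4f}")
```

With alpha=0, "meeting" gets probability 0 under spam and would wipe out the whole product. With alpha=1 it gets 1/11, small but not zero, while "free" drops a little from 0.5 to 4/11 to pay for it.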

Real Dataset: 20 Newsgroups

Let's test on a real text classification dataset with 20 categories. We'll load 4 of them to keep it fast.

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import time

# Load 4 categories for speed
categories = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']

train_data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers'))
test_data  = fetch_20newsgroups(subset='test',  categories=categories, remove=('headers', 'footers'))

print(f"Training documents: {len(train_data.data)}")
print(f"Testing documents:  {len(test_data.data)}")
print(f"Categories: {categories}")

# Build and train pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb',    MultinomialNB(alpha=0.1))
])

start = time.time()
pipeline.fit(train_data.data, train_data.target)
train_time = time.time() - start

start = time.time()
y_pred = pipeline.predict(test_data.data)
predict_time = time.time() - start

acc = accuracy_score(test_data.target, y_pred)
print(f"\nAccuracy:     {acc:.3f}")
print(f"Train time:   {train_time:.3f}s")
print(f"Predict time: {predict_time:.3f}s")

Output:

Training documents: 2169
Testing documents:  1444
Categories: ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']

Accuracy:     0.941
Train time:   0.043s
Predict time: 0.008s

94.1% accuracy. Trained in 0.04 seconds. Predicted 1444 documents in 0.008 seconds.

That speed is the whole point. Neural networks will usually get higher accuracy on text. But if you need something fast, interpretable, and good enough, Naive Bayes is hard to beat.

When Naive Bayes Wins and When to Skip It

Use Naive Bayes when:

Text classification: spam, sentiment, topic classification
Dataset is small. NB needs very little data to work well.
You need fast training and prediction at scale
You want a quick, solid baseline before trying complex models
Features are truly or mostly independent (rare but happens)

Skip Naive Bayes when:

Features are strongly correlated. The naive assumption causes big problems.
You need very high accuracy and have enough data for complex models
Numeric features with complex non-linear relationships
You need probability estimates to be accurate, not just the class ranking

Comparing All Three Variants

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler  # MultinomialNB needs non-negative input

data = load_breast_cancer()
X_bc, y_bc = data.data, data.target

# MinMaxScaler for MultinomialNB (needs non-negative features)
X_scaled = MinMaxScaler().fit_transform(X_bc)

models = {
    'GaussianNB':     (GaussianNB(),           X_bc),
    'MultinomialNB':  (MultinomialNB(alpha=1), X_scaled),
    'BernoulliNB':    (BernoulliNB(alpha=1),   X_bc),
}

print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)

for name, (model, X_use) in models.items():
    scores = cross_val_score(model, X_use, y_bc, cv=5)
    print(f"{name:<18} {scores.mean():.3f}      {scores.std():.3f}")

Output:

Model              CV Mean    CV Std
--------------------------------------
GaussianNB         0.939      0.020
MultinomialNB      0.898      0.022
BernoulliNB        0.627      0.033

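GaussianNB wins on numeric data as expected. MultinomialNB is mediocre on numeric data but excellent on text. BernoulliNB binarizes its input (the default is binarize=0.0), so almost every positive measurement turns into 1 and there is little left to learn from. Its 0.627 is roughly what you get by always predicting the majority class: 357 of the 569 samples are benign.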

Quick Cheat Sheet

Task	Code
Numeric features	`GaussianNB()`
Word counts / TF-IDF	`MultinomialNB(alpha=1.0)`
Binary features	`BernoulliNB(alpha=1.0)`
Text vectorization	`TfidfVectorizer(stop_words='english')`
Full text pipeline	`Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])`
Get probabilities	`.predict_proba(X)`
See word probs	`np.exp(model.feature_log_prob_)`
Tune smoothing	try `alpha` values 0.01 to 5.0

Practice Challenges

Level 1:
Load the 20 Newsgroups dataset with all 20 categories. Train a TF-IDF + MultinomialNB pipeline. Print overall accuracy and the classification report. Which categories does it confuse most?

Level 2:
On the breast cancer dataset, compare GaussianNB to LogisticRegression and KNN. Where does NB fall short? Is the gap large or small?

Level 3:
Build a sentiment classifier. Use any small movie review or product review dataset (the movie_reviews corpus from NLTK works). Compare CountVectorizer vs TfidfVectorizer with MultinomialNB. Which gives better accuracy? Try tuning alpha with cross-validation.

Next up, Post 63: Confusion Matrix: What Your Model Got Wrong and Why. TP, TN, FP, FN explained properly with real examples. The one tool that tells you exactly where your model is failing.

61. K-Nearest Neighbors: Judge by Your Company

Akhilesh — Sat, 09 May 2026 07:03:35 +0000

Every other algorithm we've covered so far actually learns something during training. It builds a model, adjusts weights, grows a tree.

KNN does none of that.

It stores the entire training dataset. When a new example comes in, it finds the K most similar training examples and lets them vote. Majority wins.

That's it. No training. No model. Just memory and distance.

And yet it works surprisingly well on many problems. Understanding why it works, and more importantly when it fails, will make you a better ML practitioner.

What You'll Learn Here

How KNN classifies using distance and voting
The three most common distance metrics and when to use each
How K affects the bias-variance tradeoff
Why scaling is critical for KNN
KNN for regression
The curse of dimensionality and why KNN struggles with many features
Weighted KNN and when it helps

How KNN Actually Works

You have a new point. You want to classify it.

KNN does three things:

Calculate the distance from the new point to every training example
Find the K training examples with the smallest distance
Take a vote. The majority class among those K neighbors wins.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Simple 2D example to visualize
X, y = make_classification(
    n_samples=50, n_features=2, n_redundant=0,
    n_informative=2, random_state=42
)

# New point to classify
new_point = np.array([[0.5, 0.5]])

# Calculate distances from new_point to all training examples
distances = np.sqrt(np.sum((X - new_point) ** 2, axis=1))

# Find 5 nearest neighbors
K = 5
nearest_indices = np.argsort(distances)[:K]
nearest_labels  = y[nearest_indices]

print(f"5 nearest neighbors labels: {nearest_labels}")
print(f"Votes: class 0 = {(nearest_labels == 0).sum()}, "
      f"class 1 = {(nearest_labels == 1).sum()}")
print(f"Prediction: class {np.bincount(nearest_labels).argmax()}")

Output:

5 nearest neighbors labels: [1 1 0 1 0]
Votes: class 0 = 2, class 1 = 3
Prediction: class 1

Three neighbors said class 1. Two said class 0. Class 1 wins.

Distance Metrics

How you measure distance changes which neighbors get picked. Three main options:

Euclidean Distance (default)
Straight-line distance. What you'd measure with a ruler.

d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)

Works well when features are continuous and have similar scales.

Manhattan Distance
Sum of absolute differences along each axis. Like navigating a city grid.

d = |x1-y1| + |x2-y2| + ... + |xn-yn|

Often works better in high dimensions and when features have outliers, because large differences are not squared.

Minkowski Distance
Generalization of both. Parameter p controls which one you get.

d = (|x1-y1|^p + |x2-y2|^p + ... )^(1/p)

p=1: Manhattan
p=2: Euclidean
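
The formula fits in a few lines of plain Python. A small sketch (the function name minkowski is just for this example) showing that p=1 and p=2 give back Manhattan and Euclidean:

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))            # 7.0  (Manhattan: 3 + 4)
print(minkowski(a, b, p=2))            # 5.0  (Euclidean: sqrt(9 + 16))
print(round(minkowski(a, b, p=3), 3))  # somewhere between 4 and 5
```

As p grows, the largest single difference dominates more and more.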

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Compare distance metrics
metrics = [
    ('euclidean', 2),
    ('manhattan', 1),
    ('minkowski_p3', 3),
]

print(f"{'Metric':<18} {'CV Accuracy'}")
print("-" * 32)

for name, p in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, p=p)
    score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
    print(f"{name:<18} {score:.3f}")

Output:

Metric             CV Accuracy
--------------------------------
euclidean          0.956
manhattan          0.958
minkowski_p3       0.955

On this dataset the differences are small. Try both euclidean and manhattan and pick whichever performs better on your validation set.

Choosing K: The Most Important Decision

K is the number of neighbors that vote. It directly controls the bias-variance tradeoff.

Small K (K=1): very flexible boundary. Memorizes training data. High variance. Overfits easily.
Large K: very smooth boundary. Makes the same prediction for large regions. High bias. Underfits.

from sklearn.metrics import accuracy_score

train_scores = []
test_scores  = []
k_values     = range(1, 51)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    train_scores.append(accuracy_score(y_train, knn.predict(X_train_s)))
    test_scores.append(accuracy_score(y_test,  knn.predict(X_test_s)))

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(k_values, train_scores, label='Train accuracy', color='blue')
plt.plot(k_values, test_scores,  label='Test accuracy',  color='orange')
plt.xlabel('K (number of neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Accuracy vs K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('knn_k_vs_accuracy.png', dpi=100)
plt.show()

best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Test accuracy: {max(test_scores):.3f}")

Output:

Best K: 7, Test accuracy: 0.974

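The sweet spot is usually somewhere between 3 and 20. The loop above picks K by test accuracy only to keep the plot simple. In a real project, choose K with cross-validation on the training data and save the test set for the final check.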

A quick rule of thumb: start with K = sqrt(n_training_samples). It's not perfect but gives you a reasonable starting point.

import math
starting_k = math.isqrt(len(X_train))
print(f"Suggested starting K: {starting_k}")

Why Scaling Is Not Optional for KNN

This is the most important thing to know about KNN. Without scaling, features with large ranges completely dominate the distance calculation.

import pandas as pd

# Example: two features with very different scales
data_example = pd.DataFrame({
    'age':    [25, 30, 35],   # range 0-100
    'salary': [30000, 50000, 80000]  # range 0-200000
})

point_a = np.array([26, 31000])  # close to row 0 in both features
point_b = np.array([26, 55000])  # same age as a, different salary

# Distance from row 0 to point_a
dist_a = np.sqrt((25-26)**2 + (30000-31000)**2)
# Distance from row 0 to point_b
dist_b = np.sqrt((25-26)**2 + (30000-55000)**2)

print(f"Distance to point_a (close age, close salary):   {dist_a:.1f}")
print(f"Distance to point_b (close age, far salary):     {dist_b:.1f}")
print()
print("Salary difference of 25k completely dominates the distance.")
print("Age difference of 1 year is invisible.")

Output:

Distance to point_a (close age, close salary):   1000.0
Distance to point_b (close age, far salary):     25000.0

Salary difference of 25k completely dominates the distance.
Age difference of 1 year is invisible.

After scaling, both features contribute on a comparable footing. Always apply StandardScaler (or another scaler) before KNN.
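
Here is that scaling done by hand on the same three rows, as a sketch of what StandardScaler computes: subtract the mean, divide by the population standard deviation. The helper name standardize is made up for this example.

```python
import statistics

ages = [25, 30, 35]
salaries = [30000, 50000, 80000]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages_s = standardize(ages)
salaries_s = standardize(salaries)

# Per-feature gap between row 0 and row 1, before and after scaling
print(f"raw gaps:    age={abs(ages[0] - ages[1])}, salary={abs(salaries[0] - salaries[1])}")
print(f"scaled gaps: age={abs(ages_s[0] - ages_s[1]):.2f}, "
      f"salary={abs(salaries_s[0] - salaries_s[1]):.2f}")
```

Before scaling the gaps are 5 versus 20000. After scaling they are about 1.22 versus 0.97, so age finally gets a say in who counts as a neighbor.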

Weighted KNN: Closer Neighbors Vote More

By default, all K neighbors get equal votes. With weights='distance', closer neighbors have more influence.

# Compare uniform vs distance weighting
for weights in ['uniform', 'distance']:
    scores = []
    for k in [3, 5, 7, 10, 15]:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
        scores.append(score)

    print(f"weights='{weights}': best CV = {max(scores):.3f} at K={[3,5,7,10,15][np.argmax(scores)]}")

Output:

weights='uniform':  best CV = 0.965 at K=7
weights='distance': best CV = 0.967 at K=5

Distance weighting often helps slightly, especially when K is larger. It's worth trying both.
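
A tiny example where the two schemes disagree. The neighbors are hypothetical, and the sketch weights each vote by 1/distance, which is what weights='distance' does. It does not handle a distance of exactly zero.

```python
from collections import defaultdict

# Hypothetical neighbors: (distance to the new point, class label)
neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]

def vote(neighbors, weighted):
    """Return the winning class under uniform or inverse-distance voting."""
    totals = defaultdict(float)
    for dist, label in neighbors:
        totals[label] += 1.0 / dist if weighted else 1.0
    return max(totals, key=totals.get)

print(vote(neighbors, weighted=False))  # 0: two votes against one
print(vote(neighbors, weighted=True))   # 1: 1/0.5 = 2.0 beats 1/2.0 + 1/2.5 = 0.9
```

One very close neighbor outvotes two distant ones. Whether that's what you want depends on how noisy your nearest points are.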

KNN for Regression

Instead of voting, neighbors average their values.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs  = scaler_h.transform(X_test_h)

# Try different K values for regression
print(f"{'K':<6} {'R2':<8} {'RMSE'}")
print("-" * 25)

for k in [3, 5, 10, 20, 50]:
    knn_r = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_r.fit(X_train_hs, y_train_h)
    y_pred_h = knn_r.predict(X_test_hs)
    r2   = r2_score(y_test_h, y_pred_h)
    rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
    print(f"{k:<6} {r2:.3f}    {rmse:.3f}")

Output:

K      R2       RMSE
-------------------------
3      0.692    0.632
5      0.706    0.621
10     0.718    0.608
20     0.717    0.608
50     0.697    0.630

K=10 looks best here. KNN regression is rarely competitive with Random Forest or XGBoost on large datasets but works fine on small ones.

The Curse of Dimensionality

This is why KNN breaks down with many features.

In low dimensions, points cluster naturally. Your nearest neighbors are genuinely similar to you.

In high dimensions (50+, 100+, 500+ features), something strange happens. The distances between all points start becoming similar. Every point is roughly the same distance from every other point. The concept of "nearest neighbor" breaks down.

import numpy as np

# Show how distance behavior changes with dimensions
np.random.seed(42)

for n_dims in [2, 10, 50, 100, 500, 1000]:
    # Generate 1000 random points in n_dims dimensions
    points = np.random.rand(1000, n_dims)
    # Calculate distances from point 0 to every other point
    distances = np.sqrt(np.sum((points[1:] - points[0]) ** 2, axis=1))

    print(f"Dims={n_dims:<6}  "
          f"Min dist={distances.min():.3f}  "
          f"Max dist={distances.max():.3f}  "
          f"Ratio={distances.max()/distances.min():.2f}")

Output:

Dims=2      Min dist=0.041  Max dist=1.321  Ratio=32.46
Dims=10     Min dist=0.744  Max dist=1.680  Ratio=2.26
Dims=50     Min dist=1.843  Max dist=2.398  Ratio=1.30
Dims=100    Min dist=2.683  Max dist=3.195  Ratio=1.19
Dims=500    Min dist=6.149  Max dist=6.671  Ratio=1.08
Dims=1000   Min dist=8.799  Max dist=9.268  Ratio=1.05

In 2D, the nearest neighbor is 32x closer than the farthest. In 1000 dimensions, it's only 1.05x closer. Nearest neighbors become meaningless.

This is called the curse of dimensionality. KNN degrades badly when you have many features. If you have 50+ features, use dimensionality reduction (PCA, covered in Post 68) before KNN, or switch to a different algorithm.

Full Pipeline With Best Practices

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Use a Pipeline so scaling is always applied correctly
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn',    KNeighborsClassifier())
])

# Tune K, weighting, and distance metric together
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 20],
    'knn__weights':     ['uniform', 'distance'],
    'knn__p':           [1, 2],  # manhattan vs euclidean
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)

print(f"Best params:    {grid.best_params_}")
print(f"Best CV score:  {grid.best_score_:.3f}")
print(f"Test accuracy:  {accuracy_score(y_test, grid.predict(X_test)):.3f}")
print()
print(classification_report(y_test, grid.predict(X_test),
                              target_names=data.target_names))

Using a Pipeline is cleaner and prevents data leakage. The scaler inside the pipeline only fits on training data during cross-validation, automatically.

KNN vs Other Algorithms

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb

models = {
    'KNN (tuned)':     grid.best_estimator_,
    'Random Forest':   RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'SVM (rbf)':       Pipeline([('s', StandardScaler()),
                                  ('m', SVC(kernel='rbf', gamma='scale', random_state=42))]),
    'XGBoost':         xgb.XGBClassifier(n_estimators=100, random_state=42,
                                          eval_metric='logloss', verbosity=0),
}

print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)

for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5)
    print(f"{name:<18} {scores.mean():.3f}      {scores.std():.3f}")

Output:

Model              CV Mean    CV Std
--------------------------------------
KNN (tuned)        0.967      0.015
Random Forest      0.962      0.014
SVM (rbf)          0.974      0.013
XGBoost            0.967      0.016

Properly tuned KNN is competitive here, though its score is slightly flattering because K was tuned on this same training data. On larger, messier datasets it tends to fall behind. On small, clean datasets it's a solid choice.

The Things Everyone Gets Wrong

Mistake 1: Not scaling features
Covered above but worth repeating. KNN without scaling is broken. No exceptions.

Mistake 2: Using K=1
K=1 overfits to every noise point in training data. Always try K > 1.

Mistake 3: Using KNN on high-dimensional data without dimensionality reduction
Curse of dimensionality kills KNN performance on 50+ features. Apply PCA first.

Mistake 4: Using KNN on large datasets
With brute-force search, prediction time is O(n * d) for each new example, where n is the training size and d is the number of features. With 1 million training examples, every single prediction requires computing 1 million distances. scikit-learn's KD-tree and ball-tree indexes help in low dimensions. For large-scale KNN, use approximate nearest neighbors (like Faiss or Annoy).

Mistake 5: Ignoring class imbalance
KNN votes by majority. If 90% of your neighbors are class 0 just because class 0 is dominant in your dataset, predictions will be biased. Use weights='distance' or oversample the minority class.
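
To make the cost in Mistake 4 concrete, here is a rough brute-force KNN in plain NumPy. The function name and toy data are made up for illustration. This is not how scikit-learn implements it internally, just the naive version of the idea.

```python
import numpy as np

def knn_predict_brute(X_train, y_train, x_query, k=3):
    """Classify one query point by brute force."""
    # One distance per training row: O(n * d) work for a single prediction
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest rows
    votes = np.bincount(y_train[nearest])     # count labels among the neighbors
    return int(np.argmax(votes))

# Hypothetical toy data: two tight clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict_brute(X_toy, y_toy, np.array([0.05, 0.05]))
pred_b = knn_predict_brute(X_toy, y_toy, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```

Every call touches every training row. That is fine for six points and painful for a million.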

Quick Cheat Sheet

Task	Code
Basic classifier	`KNeighborsClassifier(n_neighbors=5)`
Distance weighted	`KNeighborsClassifier(n_neighbors=5, weights='distance')`
Manhattan distance	`KNeighborsClassifier(p=1)`
Regressor	`KNeighborsRegressor(n_neighbors=5)`
Always scale first	`StandardScaler()` before KNN
Use pipeline	`Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])`
Tune K	`GridSearchCV` with `n_neighbors` range
Starting K guess	`math.isqrt(n_training_samples)`

Practice Challenges

Level 1:
Load load_digits(). Apply StandardScaler. Train KNN with K=5. Print accuracy. Then try K=1 and K=50. See the difference.

Level 2:
On the breast cancer dataset, plot accuracy vs K (1 to 50) for both weights='uniform' and weights='distance'. Which weighting wins at higher K values? Why?

Level 3:
Take the wine dataset (13 features). Apply PCA to reduce to 2, 5, and 10 components. Train KNN on each version. Compare accuracy. At what number of components does KNN perform best? Does dimensionality reduction actually help here?

Next up, Post 62: Naive Bayes: Fast, Simple, Surprisingly Effective. We use probability and Bayes theorem to classify text and other data in milliseconds. The math is simple once you see it.

60. Support Vector Machines: Drawing the Perfect Boundary

Akhilesh — Sat, 09 May 2026 07:00:52 +0000

Most classification algorithms find a boundary that separates classes. SVM finds the boundary that is as far away from both classes as possible.

That extra distance is called the margin. And maximizing it is what makes SVM so good, especially when you don't have a lot of data.

It sounds simple. The math underneath is not. But you don't need the math to use it well. You need to understand the ideas.

What You'll Learn Here

What a hyperplane and margin actually are
What support vectors are and why only a few points matter
The C parameter: how to control the margin vs mistakes tradeoff
The kernel trick: handling non-linear data without changing features
When SVM works well and when to skip it
Full working code for classification and when to scale

The Problem With Just Any Boundary

Imagine two groups of points on a 2D graph. Many different lines could separate them.

Logistic regression finds one that separates them. Decision trees find one. But there are infinitely many valid lines.

SVM asks a different question: which line is the most confident separator?

The answer is the line that sits exactly in the middle of the gap between the two classes, as far from both sides as possible. That gap is the margin.

Class A (circles):   o  o  o
                              |  <- margin boundary
                     - - - - + - - - - <- decision boundary (hyperplane)
                              |  <- margin boundary
Class B (crosses):            x  x  x

The points closest to the decision boundary are called support vectors. They're the ones that define where the boundary sits. If you removed all other points and kept only the support vectors, the boundary wouldn't change.

That's a powerful property. SVM only cares about the hard cases at the border, not the easy ones far away.
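
A small sketch of that property, assuming two well-separated synthetic blobs (the centers and variable names are made up for the demo). It fits one SVM on all the points, a second one on the support vectors only, and checks that both classify the full dataset identically.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical data: two clearly separated clusters
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=0)

full = SVC(kernel='linear', C=1.0).fit(X_demo, y_demo)
sv_idx = full.support_                      # row indices of the support vectors

# Refit using ONLY the support vectors
sv_only = SVC(kernel='linear', C=1.0).fit(X_demo[sv_idx], y_demo[sv_idx])

same_predictions = bool(np.array_equal(full.predict(X_demo),
                                       sv_only.predict(X_demo)))
print(f"Training points:  {len(X_demo)}")
print(f"Support vectors:  {len(sv_idx)}")
print(f"Same predictions: {same_predictions}")
```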

Hyperplanes in Higher Dimensions

In 2D, the decision boundary is a line. In 3D, it's a plane. In 100 dimensions, it's called a hyperplane.

The math works the same way regardless of dimensions:

hyperplane equation: w · x + b = 0

w = weight vector (perpendicular to the hyperplane)
x = input features
b = bias (shifts the hyperplane)

Points where w · x + b > 0 go to one class. Points where w · x + b < 0 go to the other. The margin is the distance between the two parallel hyperplanes w · x + b = +1 and w · x + b = -1.
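
Here is the same arithmetic in NumPy, using a made-up weight vector and bias rather than a fitted model. Dividing by the length of w turns the raw score into a geometric distance, which is why the gap between the +1 and -1 hyperplanes comes out as 2 / ||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical weight vector, ||w|| = 5
b = -5.0                   # hypothetical bias

points = np.array([[3.0, 4.0],
                   [0.0, 0.0]])

scores = points @ w + b                          # w · x + b for each point
labels = np.where(scores > 0, 1, -1)             # which side of the hyperplane
distances = np.abs(scores) / np.linalg.norm(w)   # distance to the hyperplane
margin_width = 2 / np.linalg.norm(w)             # gap between the +1 and -1 planes

print(scores, labels, distances, margin_width)
```

A smaller ||w|| means a wider margin, which is why SVM training minimizes the size of w.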

Your First SVM

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# SVM needs scaling - this is non-negotiable
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Linear SVM
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_s, y_train)

y_pred = svm.predict(X_test_s)
print(f"SVM (linear) Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))

Output:

SVM (linear) Accuracy: 0.982

              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        42
      benign       0.99      0.99      0.99        72

    accuracy                           0.98       114

Notice we always scale before SVM. Always. Without scaling, features with large ranges dominate the distance and margin calculations, and the model usually performs much worse.

The C Parameter: Margin vs Mistakes

Real data is messy. Classes overlap. A perfect margin might not exist.

The C parameter controls how much you penalize the model for making mistakes on training data.

Small C: allow more mistakes, prefer a wider margin. Simpler boundary. May underfit.
Large C: allow fewer mistakes, accept a narrower margin. Complex boundary. May overfit.

from sklearn.model_selection import cross_val_score
import numpy as np

print(f"{'C value':<10} {'CV Mean':<10} {'CV Std'}")
print("-" * 32)

for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm_c = SVC(kernel='linear', C=C, random_state=42)
    scores = cross_val_score(svm_c, X_train_s, y_train, cv=5)
    print(f"{C:<10} {scores.mean():.3f}      {scores.std():.3f}")

Output:

C value    CV Mean    CV Std
--------------------------------
0.001      0.934      0.021
0.01       0.956      0.018
0.1        0.967      0.016
1          0.974      0.013
10         0.974      0.015
100        0.974      0.015
1000       0.967      0.017

CV accuracy plateaus at 0.974 from C=1 through C=100. Very small C underfits. Very large C (1000) starts to overfit. C=1 is a safe default to start with.

The Kernel Trick: Handling Non-Linear Data

Here's the problem. SVM draws a straight hyperplane. But a lot of real data isn't linearly separable.

Look at this case: you have two concentric circles of points. No straight line separates the inner circle from the outer one.

The kernel trick says: instead of drawing a curved boundary in 2D, map the data to a higher dimension where a straight hyperplane does work.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Non-linearly separable data
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)

# Plot the raw data
plt.figure(figsize=(5, 4))
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='bwr', alpha=0.7)
plt.title('Raw Data - Not Linearly Separable')
plt.axis('equal')
plt.show()

# Linear SVM - will fail
scaler = StandardScaler()
X_c_s = scaler.fit_transform(X_circles)

svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_c_s, y_circles)
print(f"Linear kernel accuracy: {svm_linear.score(X_c_s, y_circles):.3f}")

# RBF kernel - works
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_c_s, y_circles)
print(f"RBF kernel accuracy:    {svm_rbf.score(X_c_s, y_circles):.3f}")

Output:

Linear kernel accuracy: 0.503
RBF kernel accuracy:    0.990

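Linear SVM is basically guessing on circular data. RBF SVM gets 99%. (Both numbers are training accuracy, since the models are scored on the same points they were fit on.)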

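The RBF (Radial Basis Function) kernel behaves as if the data were mapped into a higher-dimensional space where the classes become linearly separable. You don't manually transform features, and the mapping is never computed explicitly: the kernel only measures similarity between pairs of points, which is all the SVM needs.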
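
To see why a higher dimension helps, here is a sketch that does the lifting by hand. It adds one extra feature, the squared distance from the origin, and then fits a plain linear SVM. This explicit feature is an illustration chosen for this dataset, not what the RBF kernel literally computes.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_ring, y_ring = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)

# Hand-made third feature: squared distance from the origin
r2 = (X_ring ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X_ring, r2])          # shape (300, 3)

flat_acc = SVC(kernel='linear', C=1.0).fit(X_ring, y_ring).score(X_ring, y_ring)
lifted_acc = SVC(kernel='linear', C=1.0).fit(X_lifted, y_ring).score(X_lifted, y_ring)

print(f"Linear SVM, 2 features:          {flat_acc:.3f}")
print(f"Linear SVM, 3 features (lifted): {lifted_acc:.3f}")
```

In the lifted space the inner circle sits low on the new axis and the outer circle sits high, so a flat plane can slice between them.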

The Four Main Kernels

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)
X_m_s = StandardScaler().fit_transform(X_m)

kernels = ['linear', 'poly', 'rbf', 'sigmoid']

print(f"{'Kernel':<10} {'CV Accuracy'}")
print("-" * 25)

for k in kernels:
    svm_k = SVC(kernel=k, C=1.0, random_state=42)
    score = cross_val_score(svm_k, X_m_s, y_m, cv=5).mean()
    print(f"{k:<10} {score:.3f}")

Output:

Kernel     CV Accuracy
-------------------------
linear     0.873
poly       0.947
rbf        0.977
sigmoid    0.873

linear: works when data is linearly separable. Fast. Interpretable.
poly: fits polynomial boundaries. degree parameter controls complexity.
rbf: most flexible. Works for most non-linear problems. Best default choice.
sigmoid: less commonly used. Based on tanh, so it resembles a neural network activation.

When in doubt, start with rbf.

The Gamma Parameter for RBF

When using the RBF kernel, gamma controls how far the influence of a single training example reaches.

Small gamma: far reach. Smooth boundary. May underfit.
Large gamma: close reach. Complex boundary. May overfit.

print(f"{'Gamma':<12} {'CV Mean':<10} {'CV Std'}")
print("-" * 35)

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    svm_g = SVC(kernel='rbf', C=1.0, gamma=gamma)
    scores = cross_val_score(svm_g, X_train_s, y_train, cv=5)
    print(f"{gamma:<12} {scores.mean():.3f}      {scores.std():.3f}")

In practice, use gamma='scale' as your default. It automatically sets gamma based on the number of features and the variance of the data. Much better than guessing.

svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
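
For the curious, gamma='scale' in scikit-learn resolves to 1 / (n_features * X.var()). The sketch below computes that by hand and checks that it produces the same model. Variable names here are made up for the demo.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_s = StandardScaler().fit_transform(X_bc)

# What gamma='scale' works out to: 1 / (n_features * X.var())
gamma_manual = 1.0 / (X_bc_s.shape[1] * X_bc_s.var())

svm_scale = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_bc_s, y_bc)
svm_manual = SVC(kernel='rbf', C=1.0, gamma=gamma_manual).fit(X_bc_s, y_bc)

same_model = bool(np.allclose(svm_scale.decision_function(X_bc_s),
                              svm_manual.decision_function(X_bc_s)))
print(f"gamma from formula: {gamma_manual:.4f}")
print(f"Matches gamma='scale': {same_model}")
```

On standardized data the overall variance is 1, so gamma='scale' is simply 1 / n_features. That makes a handy center point when you build a gamma grid.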

Getting Probabilities From SVM

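By default, SVC only gives you class labels (plus raw margin scores from decision_function). If you need probabilities, set probability=True. It uses Platt scaling internally, fitted with an extra 5-fold cross-validation, so training gets noticeably slower on larger datasets.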

svm_prob = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_prob.fit(X_train_s, y_train)

# Now you can get probabilities
proba = svm_prob.predict_proba(X_test_s)
print("Sample probabilities (malignant, benign):")
preds = svm_prob.predict(X_test_s)  # predict once, not once per loop iteration
for i in range(5):
    print(f"  P(malignant)={proba[i][0]:.3f}  P(benign)={proba[i][1]:.3f}  "
          f"Predicted: {data.target_names[preds[i]]}")

SVM for Regression: SVR

SVM also works for regression with a slightly different setup. Instead of finding a margin between classes, it fits a tube of half-width epsilon around the predictions and penalizes only the errors that fall outside the tube.

from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
import numpy as np

housing = fetch_california_housing()
X_h = housing.data[:2000]  # use subset - SVR is slow on large datasets
y_h = housing.target[:2000]

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs  = scaler_h.transform(X_test_h)

svr = SVR(kernel='rbf', C=10, gamma='scale', epsilon=0.1)
svr.fit(X_train_hs, y_train_h)

y_pred_h = svr.predict(X_test_hs)
print(f"SVR R2: {r2_score(y_test_h, y_pred_h):.3f}")
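
The "tube" idea is easier to see as a loss function. This is a minimal sketch of the epsilon-insensitive loss that SVR is built around. The helper name and the numbers are made up for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the +/- epsilon tube cost nothing; outside, cost grows linearly."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true_demo = np.array([2.00, 2.00, 2.00])
y_pred_demo = np.array([2.05, 2.10, 2.50])   # inside the tube, on its edge, outside

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(np.round(losses, 3))
```

A larger epsilon means a wider tube, more predictions that count as "good enough", and usually fewer support vectors.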

When SVM Shines and When to Skip It

SVM works well when:

You have a small to medium dataset (under 100k samples)
You have more features than samples (text classification, genomics)
The data is nearly linearly separable with some noise
You need a strong baseline before trying complex models

Skip SVM when:

Dataset is very large (100k+ samples). Kernel SVM training time grows at least quadratically with the number of samples, so training becomes very slow.
You need fast predictions on millions of examples. SVM prediction is slower than trees.
You need easy feature importance. SVM with RBF kernel is a black box.
You need to retrain frequently. SVM training doesn't scale to streaming data.

# Quick benchmark: compare SVM to XGBoost and Random Forest
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import time

models = {
    'SVM (rbf)':      SVC(kernel='rbf', C=1, gamma='scale', random_state=42),
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost':        xgb.XGBClassifier(n_estimators=100, random_state=42,
                                         eval_metric='logloss', verbosity=0),
}

print(f"{'Model':<18} {'Accuracy':<12} {'Train Time'}")
print("-" * 45)

for name, m in models.items():
    start = time.perf_counter()  # perf_counter is the better clock for short timings
    m.fit(X_train_s, y_train)
    elapsed = time.perf_counter() - start

    acc = accuracy_score(y_test, m.predict(X_test_s))
    print(f"{name:<18} {acc:.3f}        {elapsed:.3f}s")

Output:

Model              Accuracy     Train Time
---------------------------------------------
SVM (rbf)          0.982        0.021s
Random Forest      0.974        0.312s
XGBoost            0.974        0.183s

On this small dataset, SVM is actually fastest and most accurate. On larger datasets the ranking usually reverses, because kernel SVM training cost grows much faster with sample count than tree ensembles do.

The Complete Workflow

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# 1. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Scale (always)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# 3. Find best C and gamma with grid search
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1],
}

grid = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_train_s, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")

# 4. Evaluate final model
best_svm = grid.best_estimator_
print(f"\nTest accuracy: {accuracy_score(y_test, best_svm.predict(X_test_s)):.3f}")
print()
print(classification_report(y_test, best_svm.predict(X_test_s),
                              target_names=data.target_names))
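
One variation worth knowing: in the workflow above the scaler is fit on the whole training set before grid search, so each CV fold's validation rows have already influenced the scaling. Here is a sketch of the same search with the scaler inside a Pipeline, which refits it per fold. The step names 'scaler' and 'svm' are arbitrary labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(kernel='rbf', random_state=42)),
])

# Parameter names are prefixed with the pipeline step name
pipe_grid = GridSearchCV(
    svm_pipe,
    {'svm__C': [0.1, 1, 10, 100], 'svm__gamma': ['scale', 0.01, 0.1]},
    cv=5,
    scoring='accuracy',
)
pipe_grid.fit(X_tr, y_tr)

test_acc = pipe_grid.score(X_te, y_te)   # raw X_te: the pipeline scales it itself
print(f"Best params:   {pipe_grid.best_params_}")
print(f"Test accuracy: {test_acc:.3f}")
```

On a dataset this clean the difference is small, but the pipeline version is the safer habit.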

Quick Cheat Sheet

Task	Code
Linear SVM	`SVC(kernel='linear', C=1.0)`
RBF SVM (default choice)	`SVC(kernel='rbf', C=1.0, gamma='scale')`
Get probabilities	`SVC(probability=True)` then `.predict_proba()`
Regression	`SVR(kernel='rbf', C=1.0, gamma='scale')`
Scale features	always use `StandardScaler` before SVM
Tune C and gamma	`GridSearchCV` with `param_grid`
Speed up grid search	`n_jobs=-1` in `GridSearchCV`
Large datasets	use `LinearSVC` instead of `SVC(kernel='linear')`

Practice Challenges

Level 1:
Train an SVM with kernel='rbf' on load_digits() (handwritten digits, 10 classes). Scale the data. Print accuracy. SVM handles this surprisingly well.

Level 2:
On make_moons(noise=0.3), compare linear, poly, and rbf kernels. Plot the decision boundaries for each. Which one fits the moon shape correctly?

Level 3:
Use GridSearchCV to tune both C and gamma on the breast cancer dataset. Plot a heatmap of CV accuracy for different C and gamma combinations. Which region of the grid gives the best results?

Next up, Post 61: K-Nearest Neighbors: Judge by Your Company. The laziest algorithm in ML stores all training data and classifies by similarity. Simple, no training phase, surprisingly effective.

59. XGBoost: The Algorithm That Wins Competitions

Akhilesh — Sat, 09 May 2026 06:37:16 +0000

If you've spent any time on Kaggle, you've seen XGBoost win. Over and over. Structured data competition? XGBoost. Tabular data problem? XGBoost. Real-world ML pipeline? XGBoost.

It's not hype. It genuinely is that good on most problems with structured data.

But a lot of people use it without understanding why it works. They just copy the code, tune a few numbers, and hope for the best. This post fixes that.

What You'll Learn Here

The difference between bagging and boosting
How gradient boosting works step by step
What makes XGBoost faster and better than basic gradient boosting
How to train XGBoost for classification and regression
The most important hyperparameters and what they actually do
Early stopping so you never have to guess the right number of trees

Bagging vs Boosting: The Core Difference

Random Forest uses bagging. Trees are built independently, in parallel, on bootstrap samples of the data. Final answer = average (or majority vote) of all trees.

XGBoost uses boosting. Trees are built one at a time, in sequence. Each new tree is fit to the errors left by the trees before it. Final answer = sum of all trees' outputs, each scaled by the learning rate.

Bagging (Random Forest):
  Tree 1 ──┐
  Tree 2 ──┤──> Average ──> Prediction
  Tree 3 ──┘

Boosting (XGBoost):
  Tree 1 ──> finds errors ──> Tree 2 fixes them ──> finds errors ──> Tree 3 fixes those ──> ...

Boosting is often more accurate because every tree learns from the specific failures of the previous ones. But it's also more prone to overfitting if you're not careful.

How Gradient Boosting Works Step by Step

Let's say you're predicting house prices. Here's what happens inside a gradient boosting model:

Step 1: Start with a simple prediction. Usually the mean of all target values.

Initial prediction for everyone: $300,000 (the mean)

Step 2: Calculate the residuals. How wrong was that prediction for each house?

House A: actual $350k, predicted $300k → residual = +$50k
House B: actual $250k, predicted $300k → residual = -$50k
House C: actual $420k, predicted $300k → residual = +$120k

Step 3: Train a small tree to predict those residuals.

Tree 1 learns: "when bedrooms > 3, predict residual = +$60k"

Step 4: Update predictions by adding a fraction of tree 1's output.

learning_rate = 0.1
New prediction = $300k + 0.1 * $60k = $306k

Step 5: Calculate new residuals based on updated predictions. Train tree 2 on those.

Step 6: Repeat for as many trees as you specify.

Each tree is small and weak on its own. But 100 or 500 of them stacked together get very precise. That's why boosting is called an ensemble of weak learners.
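
The six steps above fit in about a dozen lines. This is a bare-bones sketch of plain gradient boosting with squared error, using scikit-learn's DecisionTreeRegressor on made-up data. It is not XGBoost's actual algorithm, which adds regularization and second-order information on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression problem: noisy sine wave
rng = np.random.default_rng(42)
X_gb = rng.uniform(0, 10, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Step 1: start with the mean for everyone
pred = np.full_like(y_gb, y_gb.mean())
mse_start = np.mean((y_gb - pred) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y_gb - pred                      # Step 2: how wrong are we?
    tree = DecisionTreeRegressor(max_depth=2)    # Step 3: small tree on residuals
    tree.fit(X_gb, residuals)
    pred += learning_rate * tree.predict(X_gb)   # Step 4: add a fraction
    trees.append(tree)                           # Steps 5-6: repeat

mse_end = np.mean((y_gb - pred) ** 2)
print(f"Training MSE with mean only:  {mse_start:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_end:.4f}")
```

To predict on new data you would start from the same mean and add learning_rate times each stored tree's prediction.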

What Makes XGBoost Special

Plain gradient boosting existed before XGBoost. So why did XGBoost take over?

A few reasons:

Speed: XGBoost uses parallelism within each tree (not between trees). It can also use approximate, histogram-based split finding instead of checking every possible split point exactly. Much faster than vanilla gradient boosting.

Regularization built in: It adds L1 and L2 penalties on the leaf weights directly into the training objective. This controls overfitting better than basic gradient boosting.

Handling missing values: XGBoost learns the best direction to go when a value is missing. You don't need to impute first.

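Pruning: It grows each tree to max_depth, then prunes backwards, removing splits whose gain falls below gamma. Smarter than refusing a split the moment it looks weak, because a weak split can set up a strong one beneath it.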
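
The regularization and pruning points come down to two small formulas from the XGBoost objective. Below is a simplified sketch covering L2 (lambda) and the split penalty (gamma) only, with L1 left out. The node statistics are made-up numbers: G is the sum of gradients in a node and H is the sum of hessians.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value for summed gradients G and summed hessians H."""
    return -G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; the split is worth keeping only if positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

# Hypothetical node: left child has gradient sum -4 over 4 rows, right has +6 over 6 rows.
# With squared-error loss each row's hessian is 1, so H is just the row count.
w_left = leaf_weight(-4.0, 4.0)
gain_free = split_gain(-4.0, 4.0, 6.0, 6.0)
gain_penalized = split_gain(-4.0, 4.0, 6.0, 6.0, gamma=5.0)

print(f"Left leaf weight:     {w_left:.3f}")
print(f"Gain with gamma=0:    {gain_free:.3f}")
print(f"Gain with gamma=5:    {gain_penalized:.3f}")
```

Raising reg_lambda shrinks every leaf weight toward zero. Raising gamma turns a split that was worth keeping into one that gets pruned.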

Installing XGBoost

pip install xgboost

Your First XGBoost Classifier

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train XGBoost classifier
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
    eval_metric='logloss',
    verbosity=0
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"XGBoost Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))

Output:

XGBoost Accuracy: 0.974

              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114

Comparing XGBoost to Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model = xgb.XGBClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=4,
    random_state=42, eval_metric='logloss', verbosity=0
)

rf_scores  = cross_val_score(rf, X, y, cv=5)
xgb_scores = cross_val_score(xgb_model, X, y, cv=5)

print(f"Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
print(f"XGBoost:       {xgb_scores.mean():.3f} +/- {xgb_scores.std():.3f}")

Output:

Random Forest: 0.962 +/- 0.014
XGBoost:       0.967 +/- 0.016

Very close on this dataset. XGBoost tends to win more clearly on larger, messier datasets with many features.

Early Stopping: Never Guess the Right Number of Trees

One of XGBoost's best features. Instead of guessing how many trees to use, you set a high number and let the model stop automatically when validation performance stops improving.

from sklearn.model_selection import train_test_split

# Need a validation set for early stopping
X_train_es, X_val, y_train_es, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model_es = xgb.XGBClassifier(
    n_estimators=1000,      # set high, early stopping will find the right number
    learning_rate=0.05,
    max_depth=4,
    random_state=42,
    eval_metric='logloss',
    verbosity=0,
    early_stopping_rounds=20  # stop if no improvement for 20 rounds
)

model_es.fit(
    X_train_es, y_train_es,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best number of trees: {model_es.best_iteration}")
print(f"Test accuracy: {accuracy_score(y_test, model_es.predict(X_test)):.3f}")

Output:

Best number of trees: 47
Test accuracy: 0.982

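The best validation score came at boosting round 47 even though you allowed up to 1000. (best_iteration is 0-based, and training actually ran 20 more rounds before giving up.) It found the sweet spot automatically. This is one of the most practical features in XGBoost.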
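
The stopping rule itself is simple. Here is a plain-Python sketch of the patience logic, run on a made-up validation curve. The function name is hypothetical and this is not XGBoost's internal code, just the idea behind early_stopping_rounds.

```python
def early_stopping_round(val_losses, patience=20):
    """Return (best_round, stopped_round), both 0-based, for a list of validation losses."""
    best_loss = float('inf')
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            return best_round, i              # no improvement for `patience` rounds
    return best_round, len(val_losses) - 1    # ran out of rounds without triggering

# Hypothetical validation curve: improves until round 3, then drifts upward
curve = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39]
best, stopped = early_stopping_round(curve, patience=3)
print(f"Best round: {best}, stopped at round: {stopped}")
```

The model keeps training for `patience` rounds past the best one before giving up, so the stop round is always later than the best round.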

XGBoost for Regression

Works exactly the same way, just swap XGBClassifier for XGBRegressor. Its default objective is squared error.

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

housing = fetch_california_housing()
X_h = pd.DataFrame(housing.data, columns=housing.feature_names)
y_h = housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

X_train_h2, X_val_h, y_train_h2, y_val_h = train_test_split(
    X_train_h, y_train_h, test_size=0.2, random_state=42
)

reg = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    random_state=42,
    eval_metric='rmse',
    verbosity=0,
    early_stopping_rounds=20
)

reg.fit(
    X_train_h2, y_train_h2,
    eval_set=[(X_val_h, y_val_h)],
    verbose=False
)

y_pred_h = reg.predict(X_test_h)
print(f"Best trees: {reg.best_iteration}")
print(f"R2:   {r2_score(y_test_h, y_pred_h):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_h, y_pred_h)):.3f}")

Output:

Best trees: 284
R2:   0.836
RMSE: 0.462

Compare that to:

Linear Regression: R2 = 0.576
Random Forest: R2 = 0.805
XGBoost: R2 = 0.836

XGBoost wins on this dataset without much tuning at all.

The Key Hyperparameters

These are the ones that actually matter. You don't need to tune all of them.

model = xgb.XGBClassifier(
    # Tree structure
    n_estimators=500,       # max trees (use early stopping with this)
    max_depth=4,            # depth of each tree. 3 to 6 is typical. Lower = less overfit.
    min_child_weight=1,     # minimum sum of instance weights in a leaf. Higher = less overfit.

    # Learning
    learning_rate=0.05,     # how much each tree contributes. Lower = need more trees but better.
    subsample=0.8,          # fraction of training data used per tree. Adds randomness.
    colsample_bytree=0.8,   # fraction of features used per tree. Like max_features in RF.

    # Regularization
    reg_alpha=0,            # L1 regularization on weights. Makes some weights exactly 0.
    reg_lambda=1,           # L2 regularization on weights. Shrinks all weights.
    gamma=0,                # minimum loss reduction to make a split. Higher = more conservative.

    random_state=42,
    eval_metric='logloss',
    verbosity=0
)

Where to start when tuning:

Set learning_rate=0.05 and n_estimators=1000 with early stopping
Tune max_depth between 3 and 7
Tune subsample and colsample_bytree between 0.6 and 1.0
If still overfitting, increase reg_alpha or reg_lambda

Feature Importance in XGBoost

import matplotlib.pyplot as plt

# Train on breast cancer data
model_fi = xgb.XGBClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=4,
    random_state=42, eval_metric='logloss', verbosity=0
)
model_fi.fit(X_train, y_train)

# Plot feature importance (control the figure size through a matplotlib Axes)
fig, ax = plt.subplots(figsize=(9, 7))
xgb.plot_importance(model_fi, max_num_features=15, ax=ax)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.savefig('xgb_feature_importance.png', dpi=100)
plt.show()

# Or get as a dict
importance = model_fi.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame(
    list(importance.items()), columns=['Feature', 'Gain']
).sort_values('Gain', ascending=False)

print("Top 10 features by gain:")
print(importance_df.head(10).to_string(index=False))

XGBoost has three main types of feature importance (total_gain and total_cover also exist as summed variants):

weight: how many times a feature was used to split
gain: average improvement in loss from splits using this feature
cover: average number of samples affected by splits using this feature

gain is usually the most meaningful. It tells you how much each feature actually helped reduce error.

Handling Missing Values Automatically

This is a real advantage over most other algorithms.

import numpy as np

# Introduce some missing values
X_missing = X_train.copy()
mask = np.random.default_rng(42).random(X_missing.shape) < 0.1  # 10% of values missing, seeded
X_missing[mask] = np.nan

# XGBoost handles this directly, no imputation needed
model_nan = xgb.XGBClassifier(
    n_estimators=100, learning_rate=0.1,
    random_state=42, eval_metric='logloss', verbosity=0
)
model_nan.fit(X_missing, y_train)

X_test_missing = X_test.copy()
mask_test = np.random.default_rng(0).random(X_test_missing.shape) < 0.1  # seeded for repeatability
X_test_missing[mask_test] = np.nan

print(f"Accuracy with 10% missing values: {accuracy_score(y_test, model_nan.predict(X_test_missing)):.3f}")

XGBoost learns which direction to send missing values at each split. It doesn't just impute with the mean. It makes an informed decision based on which direction reduces error more.
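
Here is a simplified sketch of that "informed decision" for a single split. It tries sending the missing rows left, then right, and keeps whichever gives the lower squared error. XGBoost does this with gradient statistics rather than raw squared error, so treat this as the idea only. The function name and data are made up.

```python
import numpy as np

def best_default_direction(x, y, threshold):
    """Try sending NaN rows left, then right; keep the direction with lower squared error."""
    missing = np.isnan(x)
    goes_left = np.zeros_like(missing)              # boolean array, all False
    goes_left[~missing] = x[~missing] < threshold   # known values follow the threshold

    def sse(values):
        return float(((values - values.mean()) ** 2).sum()) if values.size else 0.0

    results = {}
    for direction in ('left', 'right'):
        left_mask = (goes_left | missing) if direction == 'left' else goes_left
        results[direction] = sse(y[left_mask]) + sse(y[~left_mask])
    return min(results, key=results.get), results

# Hypothetical feature with two missing values whose targets look like the "high" group
x_demo = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y_demo = np.array([10.0, 11.0, 30.0, 29.0, 31.0, 32.0])

direction, errors = best_default_direction(x_demo, y_demo, threshold=5.0)
print(direction, errors)
```

Mean imputation would have placed those two rows by their filled-in value (5.0 here), regardless of what their targets look like. Learning the direction from the loss lets missingness itself carry signal.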

The Things Everyone Gets Wrong

Mistake 1: Using a high learning rate with few trees

learning_rate=0.3 with 50 trees is usually worse than learning_rate=0.05 with 500 trees and early stopping. A lower learning rate usually gives better results. It just needs more trees.

Mistake 2: Ignoring early stopping

Setting n_estimators=100 and guessing is a beginner move. Use early stopping and let the data tell you the right number.

Mistake 3: Over-tuning on small datasets

XGBoost has many hyperparameters. On small datasets, the random variation in a 5-fold CV is often larger than the improvement you get from tuning. Don't over-engineer it. Tune max_depth, learning_rate, and subsample. That's usually enough.

Mistake 4: Thinking XGBoost works well on everything

It dominates on tabular/structured data. For images, audio, and text, deep learning is usually better. XGBoost is not a universal answer.

Quick Cheat Sheet

Task	Code
Classification	`xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=4)`
Regression	`xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)`
Early stopping	`early_stopping_rounds=20` + `eval_set=[(X_val, y_val)]`
Best iteration	`model.best_iteration`
Feature importance	`xgb.plot_importance(model)`
Reduce overfitting	lower `max_depth`, increase `reg_lambda`, lower `subsample`
Speed up	`tree_method='hist'` for large datasets
Missing values	handled automatically, no code needed

Practice Challenges

Level 1:
Train XGBoost on load_wine(). Use early stopping with a validation set. Print how many trees were actually used. Compare accuracy to Random Forest.

Level 2:
On the California housing dataset, try learning_rate values of 0.3, 0.1, 0.05, 0.01 with early stopping each time. See how the best iteration count changes. Plot final R2 for each learning rate.

Level 3:
Intentionally introduce 20% missing values into the breast cancer dataset. Compare accuracy of XGBoost (no imputation), XGBoost (with SimpleImputer), and Random Forest (with SimpleImputer). Which handles missing values best?

Next up, Post 60: Support Vector Machines: Drawing the Perfect Boundary. We learn about hyperplanes, margins, and the kernel trick that lets SVMs handle non-linear problems without changing your features.