Forem: Julie Fisher

Fitting KNN: From Overfit to Underfit and Everything Between

Julie Fisher — Thu, 30 Oct 2025 16:00:00 +0000

Machine learning models, like clothes, are all about fit — too tight, and they can’t move; too loose, and they lose their shape.

TLDR;

Fit Type	Characteristics	Model Behavior
Overfit	High variance, low bias, jagged boundaries	Memorizes noise
Generalizable	Balanced bias/variance	Follows data boundaries
Underfit	Low variance, high bias, smooth boundaries	Overgeneralizes

Finding the Right Fit

Just like choosing clothes that flatter the shape of the wearer, a well-fit model captures the underlying pattern of data without clinging too tightly to its quirks or hanging too loosely from its actual structure. In this post, we’ll explore what it means for a KNN model to fit “just right” — not too tight, not too loose — and how we can visualize that balance in action.

There are three terms to be familiar with when discussing model fit:

Generalization: a model is able to make accurate predictions on unseen data, i.e. it is able to generalize from the training set to the test set
Overfit: a model is fit too strictly to the training set, including its noise and outliers, making it perform poorly on new data
Underfit: a model is fit too loosely / simply and can't capture the underlying patterns in the training data, so it also performs poorly on new data
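To make these three terms concrete, here's a minimal sketch that compares training and test accuracy at a few neighbor counts. It assumes synthetic data from scikit-learn's make_moons rather than the dataset used in this post, and the values of k (1, 15, 200) are arbitrary picks for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (an assumption, not the post's dataset)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

scores = {}
for k in (1, 15, 200):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With k=1, every training point is its own nearest neighbor, so training accuracy is perfect by construction. The gap between the training and test columns is the thing to watch: a large gap hints at overfitting, while low scores in both columns hint at underfitting.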

A model that is generalizable is the goal of model training. You want a trained model that can correctly predict some outcome.

There are many reasons that a model may not be able to generalize to new, unseen data. The two reasons we'll explore in this post are overfitting and underfitting.

The outcome of overfitting and underfitting is the same: the model isn't able to generalize to unseen data. However, the reasons for this failure are different and the methods to fix it are different.

Taking Model Measurements

Before you tailor anything, you need good measurements. Here, those “measurements” come from our dataset, our preprocessing, and our choice of model parameters. We’ll prepare the data, define our helper functions, and run KNN models across a range of neighbor values to see how the fit changes.

Just like in the last post, we'll train KNN models using a number of neighbors ranging from 2 - 100. Why 100? Because this range covers the full spectrum from overfit to generalizable to underfit.

I picked accuracy as the single performance metric to use. From the Evaluating KNN: From Training Field to Scoreboard post, you'll recognize this as the "impress stakeholders" (i.e., you the reader) metric. I'm (kinda) joking. We'll explore all the metrics in a later post once we've tackled model fit, variance, and bias, but accuracy is a very common metric, so we'll develop the code using accuracy.

We'll use the same helper functions from the last post and add a couple more. We'll be doing a lot of visualizations in this post, so I threw the accuracy plots and fit plots into functions as well.

I updated the fit plots to display a grid of results at selected neighbor values for each random_state. This helps us confirm that observed patterns hold across different data splits rather than being artifacts of one specific split.

# Load in our libraries
# These should always be at the top of your notebook/script
import os
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import kagglehub

# Helper functions
def split_data(df, random_state=52, train_size=0.7, target_col='Purchased'):
    train_data, test_data = train_test_split(df, train_size=train_size, random_state=random_state, stratify=df[target_col])

    trainX = train_data.drop(target_col, axis=1)
    trainY = train_data[target_col]

    testX = test_data.drop(target_col, axis=1)
    testY = test_data[target_col]
    return trainX, trainY, testX, testY

def train_eval(model, trainX, trainY, testX, testY):
    model.fit(trainX, trainY)
    test_preds = model.predict(testX)
    accuracy = accuracy_score(testY, test_preds)
    return accuracy

def plot_accuracy_vs_neighbors(df, nbr_neighbors=range(2, 101), random_states=[52, 9, 130, 404, 20, 119]):
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
    axes = axes.flatten()

    all_accuracies = []

    for idx, random_state in enumerate(random_states):
        trainX, trainY, testX, testY = split_data(df, random_state=random_state)

        scaler = StandardScaler()
        trainX_scaled = scaler.fit_transform(trainX)
        testX_scaled = scaler.transform(testX)


        accuracies = []
        for x in nbr_neighbors:
            model = KNeighborsClassifier(n_neighbors=x)
            metrics = train_eval(model, trainX_scaled, trainY, testX_scaled, testY)
            accuracies.append(metrics)

        all_accuracies.extend(accuracies)

        ax = axes[idx]
        ax.plot(nbr_neighbors, accuracies, marker='o')
        ax.set_title(f'Random state: {random_state}')
        ax.set_xlabel('Number of Neighbors (k)')
        ax.set_ylabel('Accuracy')
        ax.grid(True)

    # Set consistent y-axis limits
    y_min = min(all_accuracies) - 0.01
    y_max = max(all_accuracies) + 0.01
    for ax in axes:
        ax.set_ylim([y_min, y_max])

    plt.tight_layout()
    plt.show()

def plot_fit_at_fixed_neighbors(df, fixed_n_neighbors, random_states = [52, 9, 130, 404, 20, 119]):
    # Create subplots
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
    axes = axes.flatten()

    for idx, random_state in enumerate(random_states):
        # Split and scale the data
        trainX, trainY, testX, testY = split_data(df, random_state=random_state)

        scaler = StandardScaler()
        trainX_scaled = scaler.fit_transform(trainX)
        testX_scaled = scaler.transform(testX)

        # Create mesh grid for decision boundary
        x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
        y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                            np.linspace(y_min, y_max, 100))

        # Scale mesh grid
        mesh_scaled = scaler.transform(np.c_[xx.ravel(), yy.ravel()])

        # Fit KNN model
        model = KNeighborsClassifier(n_neighbors=fixed_n_neighbors)
        model.fit(trainX_scaled, trainY)

        # Predict on mesh grid
        Z = model.predict(mesh_scaled)
        Z = Z.reshape(xx.shape)

        # Plot decision boundary and scatter plot
        ax = axes[idx]
        ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
        sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary',
                        hue='Purchased', palette='bright', ax=ax)
        ax.set_title(f'Random State: {random_state}, Nbr Neighbors {fixed_n_neighbors}')
        ax.set_xlabel('Age')
        ax.set_ylabel('EstimatedSalary')
        ax.grid(True)

    plt.tight_layout()
    plt.show()

The code to load the data and build a model should look familiar now from the previous posts.

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

df = df.drop(columns=["User ID", "Gender"])

As a reminder from the last post, here is the set of plots showing model performance for different random_states.

plot_accuracy_vs_neighbors(df)

I see three distinct areas in each of these plots:

Few neighbors: unstable, lower performance: overfit
Moderate neighbors: stable, higher performance: generalizable
Many neighbors: declining performance: underfit

Overfit: The Restrictive Fit

Sometimes a model clings to the data so closely that it captures every wrinkle and crease, even the ones that shouldn’t matter. That’s overfitting: when a KNN model memorizes the training data instead of learning its general shape. The result looks impressive on known data but feels uncomfortable and restrictive when faced with something new.

Let's zoom in on the region with few neighbors to see this in action.

plot_accuracy_vs_neighbors(df, range(2, 15))

Each of the random_state plots is slightly different, but we see increasing performance that stabilizes somewhere between 5 - 12 neighbors.

I'm interested in how fit changes at these low values depending on the train/test split, specifically whether a KNN model really can find a generalizable fit with as few as 5 neighbors.

For a baseline that we know is overfit, let's first take a look at the fit at 2 neighbors for each of these random_states.

plot_fit_at_fixed_neighbors(df, 2)

At n_neighbors == 2 the overfitting is obvious for all random_states. The decision boundary is jagged and even has islands of fit for single points that are mixed in with the opposing class. With such a small number of neighbors, each prediction is heavily influenced by just one or two nearby points, leading to decision boundaries that perfectly trace the training data but fail to generalize.
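Here's a small sketch of why so few neighbors produce those islands. It uses a hand-rolled majority vote in NumPy (a simplification, not scikit-learn's implementation) and made-up points: a cluster of class 0 with one stray class 1 point sitting inside it.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(train_y[nearest], minlength=2)
    return int(np.argmax(votes))  # a tied vote goes to the lower label here

# Hypothetical points: five near the origin (one of them class 1), three far away
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.15, 0.15],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

query = np.array([0.16, 0.16])  # right next to the stray class 1 point
for k in (1, 5):
    print(f"k={k}: predicted class {knn_predict(train_X, train_y, query, k)}")
```

With k=1 the stray point decides the prediction on its own, which is exactly how a single-point island forms. With k=5 the four surrounding class 0 points outvote it.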

Now let's take a look at 5 neighbors. This is where the plots for random_state values 20, 119, and 9 seemed like they might have started generalizing.

plot_fit_at_fixed_neighbors(df, 5)

These plots show improved ability to generalize to the patterns we can see. However, random_state values 9 and 119 still have islands of prediction mixed in and all of the boundaries are still pretty jagged. These models all still look overfit to me.

Generalizable: The Perfect Fit

A perfectly tailored model moves with the data: it's flexible enough to adapt, but structured enough to hold its shape. In this middle zone, the KNN model generalizes well: it captures the key relationships without being distorted by noise. Here, we’ll look at what that balance looks like in both accuracy plots and decision boundaries.

Let's jump to a number of neighbors of 12. By this point, all the accuracy plots show stabilization in performance, indicating that the model has reached a more generalizable state.

plot_fit_at_fixed_neighbors(df, 12)

At this point we can see that the decision boundary has become much more stable and does a better job of separating the regions for our two classes.

For most of our data splits, this stability lasts until around 50 neighbors. Let's zoom in on the 12 - 50 neighbor region of the performance.

plot_accuracy_vs_neighbors(df, range(12, 51))

The random_states of 52 and 20 see a sharp decline in performance around 45 neighbors, while random_states 9 and 130 look like they continue to enjoy stability beyond 50 neighbors.

Let's look at 40 neighbors. This number of neighbors should show good results for all of our random_states and give us a comparison against the beginning of our stable range of 12 neighbors.

plot_fit_at_fixed_neighbors(df, 40)

The decision boundaries here all look pretty clean, and still very similar to the plots from n_neighbors=12. There are no extreme attempts at trying to include or exclude any particular point.

Underfit: The Baggy Fit

At the far end of the spectrum, an underfit model is like clothing that’s too baggy: it smooths over every detail, losing definition and shape. In KNN terms, this happens when we use too many neighbors. The model becomes overly simple, predicting broad averages instead of meaningful distinctions.

Let's zoom in on our third region, and see what happens during declining accuracy.

plot_accuracy_vs_neighbors(df, range(45, 101))

The decline in accuracy is obvious in all of our random_states. Since we already know what a good fit looks like, let's jump right to the underfit extreme of n_neighbors=100.

plot_fit_at_fixed_neighbors(df, 100)

All of our plots for the different random_states show that we've lost the ability to predict 1 along a wide band where our previous decision boundaries had existed.

We’ve lost too much detail in the decision boundary. It becomes overly smooth and shifts toward the upper-right region of the plots, showing that the model is averaging across both classes rather than distinguishing between them.

If we look back at the count of purchased 0 vs 1, we can see that 0 makes up about two-thirds (64%) of our observations/rows. As such, our model will default more and more toward the majority value as it gets more underfit.
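The limiting case of that drift is easy to sketch. Assuming made-up data with a similar 64/36 class split (not the post's dataset), setting n_neighbors to the full size of the training set means every prediction counts votes from every training point, so the majority class always wins.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 64 points of class 0 and 36 points of class 1
n_zero, n_one = 64, 36
X = np.vstack([rng.normal(0.0, 1.0, size=(n_zero, 2)),
               rng.normal(3.0, 1.0, size=(n_one, 2))])
y = np.array([0] * n_zero + [1] * n_one)

# k equals the number of training rows, so every query sees all of them
model = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)

# Predict on random query points spread across the feature space
grid = rng.uniform(-3, 6, size=(500, 2))
preds = model.predict(grid)
print(np.unique(preds))  # only the majority class appears
```

This relies on the default uniform weighting. With distance weighting the result could differ, since closer points would count for more.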

We often use the class distribution itself as a baseline, sometimes called a "naive" or "majority class" model. If our trained model performs better than simply predicting the majority class, we’ve successfully improved beyond baseline.

p_counts = df['Purchased'].value_counts()
p_ratios = df['Purchased'].value_counts(normalize=True)

frequency_df = pd.DataFrame({
    'Count': p_counts,
    'Ratio': p_ratios
})
frequency_df

Purchased	Count	Ratio
0	257	0.6425
1	143	0.3575
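The majority class baseline can be worked out directly from those counts. This is a minimal sketch using the numbers from the table; it describes the full dataset, so the baseline on any particular test split may differ slightly.

```python
# Class counts copied from the frequency table
counts = {0: 257, 1: 143}

# A "naive" model always predicts the most common class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / sum(counts.values())

print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.4f}")
```

Any KNN accuracy near 0.64 is therefore no better than ignoring the features altogether.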

Tailoring Model Fit

Every good fit, whether in fashion or machine learning modeling, comes from iteration. We measure, test, adjust, and refine until the result balances structure and flexibility. By exploring overfitting and underfitting side by side, we’ve built an intuition for what “fit” really means in KNN and how to choose parameters that let the model move gracefully between precision and generalization.

By visualizing model performance and fit, we were able to see three distinct areas:

Overfit: 2 - 12 neighbors
Generalizable: 12 - 45 neighbors
Underfit: 45+ neighbors

These visualizations give us an intuitive understanding of fit, from overfit to generalizable to underfit, and a foundation for building a mental model of what’s happening under the hood.

Based on these visualizations, if I were going to choose a number of neighbors for a production model for this use case, I would pick something in the range of 20 - 40 neighbors.

However, this model was built with only two features. Two features are easy to visualize. As we add more features, which is common in machine learning, it gets harder and harder to visualize how the features interact and how that impacts model performance. In the next post, we'll explore how we can determine model fit based on performance metrics alone.

KNN: The Importance of Being Scaled

Julie Fisher — Fri, 24 Oct 2025 15:30:00 +0000

What is Scaling?

In short, scaling is a data preprocessing step that transforms features so they have similar ranges or distributions. This is especially important when features have vastly different units or magnitudes. For example, in our Social_Network_Ads dataset, the Age column has values ranging from 18 - 60, whereas the EstimatedSalary column has values ranging from 15,000 - 150,000. That's a pretty substantial difference in magnitude.

Without scaling, models that rely on distance (like KNN) will be biased toward features with larger numeric ranges. According to Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron: “With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.”

In my experience, that's not entirely true. Tree-based algorithms generally don't care about scale and those are the algorithms that tend to fit my use case best. The way I think about tree-based models is that they bin features, then split based on the bins. All to say, basically I've stopped bothering to scale features.

When using a distance-based algorithm such as KNN, it turns out that ignoring this difference in scale is a mistake. When calculating the distance between points, if one feature has a much larger range than another, it can dominate the distance calculation, skewing the results. We'll look at the different distance metrics in another post of the series. For now, let's prove that scaling matters.
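A quick back-of-the-envelope sketch shows the domination effect. The three people below are invented for illustration, and the rescaling simply divides by the approximate ranges quoted earlier (Age 18 - 60, EstimatedSalary 15,000 - 150,000). It is a rough min-max style adjustment, not StandardScaler.

```python
import math

# Hypothetical people as (Age, EstimatedSalary)
a = (25, 50_000)
b = (60, 50_500)   # 35 years older, nearly the same salary
c = (26, 52_000)   # 1 year older, slightly higher salary

# Raw Euclidean distances: salary differences swamp age differences
print(f"raw    a-b: {math.dist(a, b):.1f}   a-c: {math.dist(a, c):.1f}")

def rescale(person):
    """Map each feature to roughly 0-1 using the approximate ranges."""
    age, salary = person
    return ((age - 18) / (60 - 18), (salary - 15_000) / (150_000 - 15_000))

# After rescaling, both features contribute on a comparable footing
print(f"scaled a-b: {math.dist(rescale(a), rescale(b)):.3f}   "
      f"a-c: {math.dist(rescale(a), rescale(c)):.3f}")
```

On the raw values, the person 35 years older counts as the closer neighbor because a few hundred dollars of salary outweighs decades of age. After rescaling, the person one year older is the closer one.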

Want to know more about scaling? I covered it more thoroughly in an old blog post Entry 8: Centering and Scaling. Or feel free to turn to your favorite data science/machine learning book. They pretty much all cover scaling, as does section 7.3 Preprocessing Data of the scikit-learn documentation. It's a common data transformation.

Unscaled Features: KNN Baseline

We're going to train ~100 models using neighbor ranges from 2 - 100 and visualize their fit. We'll do this using the untransformed features, then compare it to a scaled version of the features.

The code to load the modules we'll need should look familiar by now. I turned the train/test split code into a function, as well as the code to fit a model and predict on our test values. These helper functions will let us easily run the ~200 different models we'll be training.

# Load in our libraries
# These should always be at the top of your notebook/script
import os
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import kagglehub

def split_data(df, random_state=52, train_size=0.7, target_col='Purchased'):
    train_data, test_data = train_test_split(df, train_size=train_size, random_state=random_state, stratify=df[target_col])

    trainX = train_data.drop(target_col, axis=1)
    trainY = train_data[target_col]

    testX = test_data.drop(target_col, axis=1)
    testY = test_data[target_col]
    return trainX, trainY, testX, testY

def train_eval(model, trainX, trainY, testX, testY):
    model.fit(trainX, trainY)
    test_preds = model.predict(testX)
    accuracy = accuracy_score(testY, test_preds)
    return accuracy

Next we load the data. This code should look familiar from the last two posts.

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

df = df.drop(columns=["User ID", "Gender"])

Here comes the fun part. (Yes, I'm aware of how that sounds. Don't worry, I already know that I'm a nerd.)

Now we'll iterate through the neighbor range of 2 - 100 and store the accuracies in a list. Just because we can, we'll also run it using 6 different random_state values. Using multiple random_state values helps us understand how sensitive our model is to different train/test splits. If performance varies wildly, it may indicate instability in the model or dataset.

random_states = [52, 9, 130, 404, 20, 119]
nbr_neighbors = range(2, 101)  # range() excludes the stop value, so 101 includes k=100

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 18))
axes = axes.flatten()

all_accuracies = []

for idx, random_state in enumerate(random_states):
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    accuracies = []
    for x in nbr_neighbors:
        model = KNeighborsClassifier(n_neighbors=x)
        metrics = train_eval(model, trainX, trainY, testX, testY)
        accuracies.append(metrics)

    all_accuracies.extend(accuracies)

    ax = axes[idx]
    ax.plot(nbr_neighbors, accuracies, marker='o')
    ax.set_title(f'Random state: {random_state}')
    ax.set_xlabel('Number of Neighbors (k)')
    ax.set_ylabel('Accuracy')
    ax.grid(True)

# Set consistent y-axis limits
y_min = min(all_accuracies) - 0.01
y_max = max(all_accuracies) + 0.01
for ax in axes:
    ax.set_ylim([y_min, y_max])

plt.tight_layout()
plt.show()

Across the different random_state values, there is high variability in the first 10 or so neighbors. After that, the pattern depends on the random_state we used. Accuracy across the different versions stays fairly consistently between 0.725 and 0.9, generally declining by an amount that depends on the number of neighbors used to fit the model. In short, in the current, unscaled version, KNN’s performance is sensitive to both the number of neighbors and the train/test split.

Visualizing Fit: An Unscaled Hot Mess

The real kicker in showing how scaling impacts our model, though, turns out to be the visualization of fit.

To create the below plot, I gave CoPilot the code for the scatter plot from the first post of the series (sns.scatterplot(data=df, x="Age", y="EstimatedSalary", hue="Purchased")) and asked it to give me code that would visualize the boundary.

When I first ran this code, I thought the strange result was because CoPilot gave me bad code. Take a look for yourself.

# random_state is whatever the accuracy loop above left behind (119 if the cells ran in order)
trainX, trainY, testX, testY = split_data(df, random_state=random_state)

# Create mesh grid for decision boundary
x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 22))
axes = axes.flatten()

for idx, nbr_neighbors in enumerate([2, 5, 7, 10, 20, 40, 70, 100]):
    model = KNeighborsClassifier(n_neighbors=nbr_neighbors)
    model.fit(trainX, trainY)

    # Predict on mesh grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and scatter plot
    ax = axes[idx]
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
    sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary', hue='Purchased', palette='bright', ax=ax)
    ax.set_title(f'KNN Decision Boundary for {nbr_neighbors} Neighbors')
    ax.set_xlabel('Age')
    ax.set_ylabel('EstimatedSalary')
    ax.grid(True)

plt.tight_layout()
plt.show()

What the what is that hot mess? The decision boundaries look more like the printout from an inkjet printer that was running out of ink. All of the boundaries are horizontal and don't follow our data points very well at all. It doesn't look anything like the nice, neat, highly intuitive plot that was in An Introduction to Statistical Learning with Applications in Python.

I took a screenshot of one of the above plots and fed it back into CoPilot with the highly restrained and professional remark of "Why am I not getting a nice decision boundary?" I 100% expected it to give me new code.

Instead I got a paragraph back on the "scale of your features" along with a suggestion to use sklearn.preprocessing.StandardScaler.

Visualizing Fit: Beautiful Scaling

Ah ha.

Turns out our models were weighting the EstimatedSalary feature so heavily that the boundaries were almost exclusively determined by that single feature. As CoPilot succinctly stated, "the model is overly influenced by the EstimatedSalary feature."

That's easy enough to test with some minor alterations to our early plotting code. We can just transform our features using StandardScaler.

StandardScaler doesn't map values into a fixed range (i.e. the scaled values won't always fall between 0 and 1). Technically, they could fall anywhere from negative infinity to positive infinity. However, it uses the mean and standard deviation to bring each feature to a generally useful magnitude. The actual equation is $z = \frac{x - u}{s}$ where:

z: standard score
x: the observed value
u: the mean of the training samples
s: the standard deviation of the training samples
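As a sanity check on that equation, here's a minimal sketch that computes the standard score by hand and compares it to StandardScaler. The five rows are made-up (Age, EstimatedSalary) pairs, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training values for (Age, EstimatedSalary)
train = np.array([[19, 19_000],
                  [35, 20_000],
                  [26, 43_000],
                  [47, 25_000],
                  [58, 144_000]], dtype=float)

u = train.mean(axis=0)      # per-column mean of the training samples
s = train.std(axis=0)       # per-column standard deviation (population form, ddof=0)
z_manual = (train - u) / s  # the equation applied column by column

z_sklearn = StandardScaler().fit_transform(train)
print(np.allclose(z_manual, z_sklearn))
```

After the transform, each column has a mean of 0 and a standard deviation of 1, so Age and EstimatedSalary end up on a comparable footing.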

trainX, trainY, testX, testY = split_data(df, random_state=random_state)

scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

# Create mesh grid for decision boundary
x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

xx_scaled, yy_scaled = scaler.transform(np.c_[xx.ravel(), yy.ravel()]).T

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 22))
axes = axes.flatten()

for idx, nbr_neighbors in enumerate([2, 5, 7, 10, 20, 40, 70, 100]):
    model = KNeighborsClassifier(n_neighbors=nbr_neighbors)
    model.fit(trainX_scaled, trainY)

    # Predict on mesh grid
    Z = model.predict(np.c_[xx_scaled.ravel(), yy_scaled.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and scatter plot
    ax = axes[idx]
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
    sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary', hue='Purchased', palette='bright', ax=ax)
    ax.set_title(f'KNN Decision Boundary for {nbr_neighbors} Neighbors')
    ax.set_xlabel('Age')
    ax.set_ylabel('EstimatedSalary')
    ax.grid(True)

plt.tight_layout()
plt.show()

Those plots look much better. We can now clearly see how the model is overfitting at 2 and 7 neighbors. We can also see underfitting once we get up to 70 and 100 neighbors.

Scaling and Model Performance

Let's do a quick test to see if this impacts our accuracy too.

random_states = [52, 9, 130, 404, 20, 119]
nbr_neighbors = range(2, 101)  # range() excludes the stop value, so 101 includes k=100

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 18))
axes = axes.flatten()

all_accuracies = []

for idx, random_state in enumerate(random_states):
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    scaler = StandardScaler()
    trainX_scaled = scaler.fit_transform(trainX)
    testX_scaled = scaler.transform(testX)

    accuracies = []
    for x in nbr_neighbors:
        model = KNeighborsClassifier(n_neighbors=x)
        metrics = train_eval(model, trainX_scaled, trainY, testX_scaled, testY)
        accuracies.append(metrics)

    all_accuracies.extend(accuracies)

    ax = axes[idx]
    ax.plot(nbr_neighbors, accuracies, marker='o')
    ax.set_title(f'Random state: {random_state}')
    ax.set_xlabel('Number of Neighbors (k)')
    ax.set_ylabel('Accuracy')
    ax.grid(True)

# Set consistent y-axis limits
y_min = min(all_accuracies) - 0.01
y_max = max(all_accuracies) + 0.01
for ax in axes:
    ax.set_ylim([y_min, y_max])

plt.tight_layout()
plt.show()

Our accuracy now ranges from around 0.77 to 0.95. This is better than when the features were unscaled. More importantly, we have a more stable trend in the change of accuracy across the different random_state and n_neighbors parameters:

Fewer than 10 neighbors gives low, or highly variable, results because the model has overfit to the data
Then there is a steady section where we mostly get the same accuracy due to the model fitting well
Finally we see a steady decline as the model underfits

We'll discuss what overfitting and underfitting mean in the next post. For now, the takeaway is that while tree-based models may shrug off unscaled features, distance-based models like KNN demand careful preprocessing. Scaling isn’t just a formality; it can make or break your model’s performance.

Evaluating KNN: From Training Field to Scoreboard

Julie Fisher — Wed, 22 Oct 2025 15:30:00 +0000

In the last post we looked at the Social_Network_Ads dataset and figured out what features we'll use. In this post, we'll build a KNN model and discuss how to figure out if it's any good.

TLDR;

To train a model and return performance metrics, follow these steps:

1. Load the data

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

2. Prepare the data

df = df.drop(columns=["User ID", "Gender"])

3. Split the data into train and test sets

train_data, test_data = train_test_split(df, train_size=.7, random_state=52, stratify=df['Purchased'])

4. Separate features from the target

trainX = train_data.drop('Purchased', axis=1)
trainY = train_data['Purchased']

testX = test_data.drop('Purchased', axis=1)
testY = test_data['Purchased']

5. Train/Fit the model

model = KNeighborsClassifier(n_neighbors=2)
model.fit(trainX, trainY)

6. Predict on the test set

test_preds = model.predict(testX)

7. Evaluate the model's performance

accuracy = accuracy_score(testY, test_preds)
precision = precision_score(testY, test_preds)
recall = recall_score(testY, test_preds)
f1 = f1_score(testY, test_preds)
roc_auc = roc_auc_score(testY, test_preds)
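One hedged note on step 7: roc_auc_score accepts hard 0/1 predictions, but it is normally given scores or probabilities so that it can evaluate more than one threshold. The sketch below compares the two on synthetic data from make_classification (an assumption, not this post's dataset), using predict_proba for the probability of class 1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature binary classification data for illustration only
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
trainX, testX, trainY, testY = train_test_split(
    X, y, train_size=0.7, random_state=52, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(trainX, trainY)

# AUC from hard labels reflects only a single decision threshold
auc_from_labels = roc_auc_score(testY, model.predict(testX))
# AUC from class-1 probabilities uses every threshold the model can express
auc_from_scores = roc_auc_score(testY, model.predict_proba(testX)[:, 1])

print(f"labels: {auc_from_labels:.3f}  probabilities: {auc_from_scores:.3f}")
```

The two numbers usually differ. Which one to report depends on whether you care about one operating point or about ranking quality overall.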

Pre-Train Prep: Loading and Splitting the Data

The first step is to load the data and reduce it to just the columns we'll be using. This code was created and explained in the last post. Review Exploring K-NN Data: A Beginner’s Guide to EDA and Feature Selection for more information on this process.

# import all the libraries
# imports always go at the top of the file so they're easy to find
import os
import pandas as pd
import kagglehub
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix)
from IPython.display import display, HTML

# load the data
path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
# drop unneeded columns
df = df.drop(columns=["User ID", "Gender"])

Draft Day: Picking the Training and Test Sets

Next we need to split the data into train and test datasets. Here are some terms I'll be using:

Dataset: the full set of observations
Training dataset or training set: a group of observations, usually comprising 70-90% of the dataset
Test dataset or test set: a group of observations, usually comprising 10-30% of the dataset

Why split the data, you ask? Surely we can get better results if we train with all the data, you say?

Using all the data to train a model is like a practice test that never changes the questions. Regardless of your intentions, you'll end up memorizing most of the answers without understanding how to solve the problem.

Models have the same problem. They'll memorize the answers to the training data they're given, but then fall flat when you put them into production and they start seeing observations that weren't in the training data. Don't believe me? Scroll down to the section titled "Practice Scores Don’t Win Championships" to see this phenomenon in action.

To get an accurate approximation for how a model will perform after being put into production, you need to test it on data it's never seen before. This test set is sometimes called a "hold-out" dataset because you keep it in reserve and never let the model see it during training.

The Draft Board: Choosing Players for Training and Testing

Splitting the dataset into training and test datasets is straightforward with scikit-learn's train_test_split function. Just feed it your dataset and tell it the portion of data you want to use for training by specifying the train_size parameter. Alternatively, you can tell it what portion to hold out for testing with test_size. You only need to set one of them; the other is set to its complement.

train_test_split takes care of other housekeeping too, like shuffling the data before splitting. This means that if your data was sorted on some value, that value won't be overrepresented in either the training or test dataset.

You can also include the optional random_state parameter. I recommend using this parameter as it allows you to reproduce the same results over and over. In the context of a tutorial like this, it also allows you to get the same results I did (assuming you use the same dataset).

The last parameter I recommend is stratify. The datasets I use are large and highly imbalanced. This means that the true observations are significantly outnumbered by the false observations. Only 2-5% of my data tends to be true. If I'm looking at 1 million observations and only 2% of them are true, then those true values could end up unevenly split between the training and test datasets. The stratify parameter makes sure that doesn't happen. It ensures that each class makes up the same proportion of both datasets.

If you need to ensure some other feature is evenly distributed between training and test datasets, you can specify this feature instead. I only ever use it to make sure my target values are evenly represented.

train_data, test_data = train_test_split(df, train_size=.7, random_state=52, stratify=df['Purchased'])
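To see what stratify buys you, here's a small sketch with made-up labels that mirror a 257/143 class split. The feature column is just a placeholder; only the class proportions matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 257 zeros and 143 ones
y = np.array([0] * 257 + [1] * 143)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

_, _, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=52, stratify=y
)

# The share of class 1 should be nearly identical in all three
print(f"full: {y.mean():.4f}  train: {y_train.mean():.4f}  test: {y_test.mean():.4f}")
```

Without stratify the train and test shares can drift apart by chance, and the drift matters more as the minority class gets rarer.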

Training Day: Fitting the KNN Model

Next we train the model. Pretty much all of scikit-learn's algorithms expect the features and the target to be in different dataframes. So first we'll split them into features (commonly referred to as "X") and target (commonly referred to as "y").

There are various naming schemas out there for what to call the X and y dataframes for the training and test datasets. I lean toward the "Readability counts" precept from the Zen of Python. Code is read much more often than it's written, so do your future self a favor and name all of your variables something that will still make sense to you six months from now.

trainX = train_data.drop('Purchased', axis=1)
trainY = train_data['Purchased']

testX = test_data.drop('Purchased', axis=1)
testY = test_data['Purchased']

Now comes the easy part: training our KNN model using scikit-learn's KNeighborsClassifier.

We're going to start simple with a single KNN model that sets n_neighbors to 2. This just means that the model uses the two closest neighbors to decide how to label the observation it's currently predicting on.

There are a bunch of options, called distance metrics, for how to calculate which neighbors are "closest." We'll get into how those metrics affect the results later in the series.

model = KNeighborsClassifier(n_neighbors=2)
model.fit(trainX, trainY)
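To make "the two closest neighbors decide the label" concrete, here's a minimal sketch on hypothetical one-feature toy data (toy_X, toy_y, and toy_model are made up for this illustration). It uses the classifier's kneighbors method to show which training rows were consulted for each prediction.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature toy data: class 0 sits near 0-2, class 1 near 10-12
toy_X = [[0], [1], [2], [10], [11], [12]]
toy_y = [0, 0, 0, 1, 1, 1]

toy_model = KNeighborsClassifier(n_neighbors=2)
toy_model.fit(toy_X, toy_y)

new_points = [[1.5], [10.5]]
predictions = toy_model.predict(new_points)

# kneighbors returns the distances to, and row positions of, the closest training points
distances, neighbor_rows = toy_model.kneighbors(new_points)
for point, rows, label in zip(new_points, neighbor_rows, predictions):
    neighbor_labels = [toy_y[i] for i in rows]
    print(point, "neighbors' labels:", neighbor_labels, "-> predicted:", label)
```

Both neighbors of 1.5 belong to class 0 and both neighbors of 10.5 belong to class 1, so the predictions follow the neighbors' votes.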

There's only so much I can cover in a blog post. If you want to know more about how KNN works or the math behind it, I recommend the following resources:

Introduction to Statistical Learning with Applications in Python: freely available online
The Elements of Statistical Learning: freely available online
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition
- All of the editions are good, but in a fast moving field like machine learning, the newer the better
- Not free, you'll have to buy this one
- But there is an accompanying GitHub repo that's free
K-Nearest Neighbors for Machine Learning: an older blog that I learned a lot from
scikit-learn's 1.6. Nearest Neighbors: for anyone that likes reading documentation

Game Time: Evaluating KNN’s Performance

There are a lot of metrics for measuring both regression models (the output prediction is a number) and classification models (the output prediction is whether the observation belongs to a category). The ones implemented in scikit-learn can be found on the sklearn.metrics page.

In my experience, the most useful metrics for classification are F1, precision, and recall - not necessarily in that order. When I use these metrics on real world data they allow me to pick a "best" model based on my needs:

Recall: targets models that are better at identifying true positives/reducing false negatives
Precision: targets models that reduce false positives
F1: targets models that are the best of both worlds (F1 is the harmonic mean of precision and recall)

Accuracy and ROC AUC are very common metrics, so I'm including them in our analysis.

I'd love to tell you about all of the metrics, but I already did that in my old blog:

Entry 16: Model Evaluation: an overview type post
Entry 21: Scoring Regression Models - Theory: concepts and equations
Entry 22: Scoring Regression Models - Implementation: some additional equations, plus an accompanying notebook that implements model evaluation
Entry 23: Scoring Classification Models - Theory: concepts and equations
Entry 24: Scoring Classification Models - Implementation: some additional equations, plus an accompanying notebook that implements model evaluation
Entry 26: Setting thresholds - precision, recall, and ROC: the accompanying notebooks have precision-recall curves and ROC AUC curves if you'd like to see those applied to data

Yes, I dedicated six blog posts to evaluation metrics. Evaluating how well a model performs has two very important uses:

Choose a "best" model for use/deployment
Approximate how well a model will perform once deployed

So it's extremely important to choose the right metric for your use case.

* Pro Tip *: Keep in mind that accuracy can be misleading. In my dataset example where only 2-5% has a positive value for the target, if I guess that everything is false I'll be right 95-98% of the time. This is super hard to beat as far as the score goes. However, it's totally useless as an actual predictive model.
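Here's that accuracy trap as a quick sketch. The labels are hypothetical (2% positives) and the "model" is just a list of all-negative guesses.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 20 positives out of 1,000 observations (2%)
y_true = [1] * 20 + [0] * 980

# A "model" that always guesses the negative class
y_pred = [0] * 1000

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.98 while recall is 0.00: the score looks great, but not a single positive was found.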

* Advanced Topic Introduction *: ROC AUC is usually the top choice for imbalanced data. In practice I've found that it performs similarly to F1, but I get slightly better results with F1 for my use case. If you have imbalanced data, I'd recommend looking at the results from both and figuring out which works best for your particular situation.

Detailed descriptions of the five metrics we'll be using as well as their equations are below. Most of these metrics are derived from the confusion matrix though, so we'll discuss that first.

The Confusion Matrix: Fundamentals First

Aptly named, the confusion matrix inspires many a glassy-eyed stare whenever I present it. Joking aside, the term was used by JT Townsend in 1971, spread through the literature, and was in popular use by the 1980s. It's purportedly called "confusion" because it shows the errors (or confused values).

In practice, it's pretty simple, and very helpful.

When making a prediction, there are two states that any single prediction can be: the prediction was correct or the prediction was incorrect.

In my experience, binary classification is the most common type of classification problem. This just means that there are only two different values for the target. For example, in our Social_Network_Ads dataset, the Purchased value can only be 1 or 0. This means that there are only two states that an observation can be: the target value or not the target value.

When we combine these states, we get a 2 x 2 grid:

	Predicted Values
		Negative	Positive
Actual Values	Negative	True Negative (TN)	False Positive (FP)
Actual Values	Positive	False Negative (FN)	True Positive (TP)

The true values are easiest to understand:

True Positive: the prediction was Positive and the observation was Positive, i.e. the model correctly identified a Positive observation
True Negative: the prediction was Negative and the observation was Negative, i.e. the model correctly identified a Negative observation

The false values are a little trickier to keep straight:

False Positive: the prediction was Positive and the observation was Negative
False Negative: the prediction was Negative and the observation was Positive

Let's say we're trying to predict whether a vegetable is a carrot or not a carrot and we have pictures that contain either a carrot or an eggplant. Here are the possibilities for our model's prediction:

True Positive: the model predicts that a carrot is a carrot
True Negative: the model predicts that an eggplant is not a carrot
False Positive: the model predicts that an eggplant is a carrot
False Negative: the model predicts that a carrot is not a carrot

For more details, explanations, and examples check out the Wikipedia Confusion Matrix entry. I also discuss it in a little more detail in Entry 23: Scoring Classification Models - Theory.
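The carrot example above can be sketched with scikit-learn's confusion_matrix. The labels below are hypothetical, with 1 standing for "carrot" and 0 for "not a carrot."

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = carrot, 0 = not a carrot (eggplant)
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]

# For binary 0/1 labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Three carrots were caught (TP), one carrot was missed (FN), three eggplants were correctly rejected (TN), and one eggplant was mistaken for a carrot (FP).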

Accuracy, Precision, Recall, and More — the Stats that Separate MVPs from Benchwarmers

The evaluation metric impacts what your model is good at, especially once we get into hyperparameter tuning. Choose the right metric for your use case. When in doubt, run multiple metrics and compare them to determine which best fits what you need out of your model.

Here's a quick reference guide to the most common classification metrics. For more information on each of these metrics, just click on the metric name, which links to the specific section of Entry 23: Scoring Classification Models - Theory for that metric.

Metric	Description	Use Case	Equation
Accuracy	How often predictions were correct, regardless of whether that correct prediction was for the positive or negative class	When you want to impress your boss, your boss's boss, or stakeholders	$\frac{TP + TN}{TP + TN + FP + FN}$
Precision	Of all positive predictions, how often was that prediction correct (rate of correctly identified positive predictions)	When you need to reduce false positives	$\frac{TP}{TP + FP}$
Recall	Of all positive observations, how often were they correctly identified by the model (true positive rate)	When you need to increase true positives	$\frac{TP}{TP + FN}$
F1	The harmonic mean of precision and recall.	When you need to find the sweet spot between increasing true positives and reducing false positives	$2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}$
ROC AUC	ROC plots TPR (recall) vs. FPR; AUC is the area under the curve (higher is better)	Recommended for imbalanced datasets	—
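The equations in the table are simple enough to check by hand. As a sketch, here they are applied to the counts from this post's test-set confusion matrix (TN=72, FP=5, FN=18, TP=25).

```python
# Counts from the test-set confusion matrix in this post
tn, fp, fn, tp = 72, 5, 18, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f}")
print(f"precision={precision:.6f}")
print(f"recall={recall:.6f}")
print(f"f1={f1:.6f}")
```

These work out to 0.808333, 0.833333, 0.581395, and 0.684932, which are the same test values scikit-learn reports in the results table.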

Practice Scores Don’t Win Championships

Now that we have our trained model and understand the metrics we'll use to evaluate it, let's compare how the model does on the training data and on the test data.

train_preds = model.predict(trainX)
test_preds = model.predict(testX)

train_conf_matrix = confusion_matrix(trainY, train_preds)
train_accuracy = accuracy_score(trainY, train_preds)
train_precision = precision_score(trainY, train_preds)
train_recall = recall_score(trainY, train_preds)
train_f1 = f1_score(trainY, train_preds)
train_roc_auc = roc_auc_score(trainY, train_preds)

test_conf_matrix = confusion_matrix(testY, test_preds)
test_accuracy = accuracy_score(testY, test_preds)
test_precision = precision_score(testY, test_preds)
test_recall = recall_score(testY, test_preds)
test_f1 = f1_score(testY, test_preds)
test_roc_auc = roc_auc_score(testY, test_preds)

For anyone that doubted the importance of evaluating a model on data it's never seen before, the proof is in the pudding. Even for our simple example, the model does substantially better on the training data than it did on the test data.

metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
train_values = [train_accuracy, train_precision, train_recall, train_f1, train_roc_auc]
test_values = [test_accuracy, test_precision, test_recall, test_f1, test_roc_auc]

# Use a new name so the original dataset stored in df isn't overwritten
metrics_df = pd.DataFrame({
    'Metric': metrics,
    'Training': train_values,
    'Testing': test_values
})

metrics_df

	Metric	Training	Testing
0	Accuracy	0.900000	0.808333
1	Precision	1.000000	0.833333
2	Recall	0.720000	0.581395
3	F1 Score	0.837209	0.684932
4	ROC AUC	0.860000	0.758230

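A precision score of 1 should always raise a red flag. It's extremely rare for a model to perfectly predict all positives without any false positives. This could indicate data leakage (where the model has access to information it shouldn't) or another issue in your pipeline. Here there's likely a more mundane explanation: the perfect score is on the training data, and when KNN predicts on the points it was trained on, each point counts as its own nearest neighbor. With n_neighbors=2, a tied vote goes to the lower class label (0) in scikit-learn as far as I can tell, so a training point only gets labeled positive when both it and its closest neighbor are positive. That leaves almost no room for false positives.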

Additionally, the large gap between our training and test metrics suggests that our model may be overfitting. For example, the drop in recall from 0.72 to 0.58 means it's missing more true positives in the test set. In the next post, we’ll discuss identifying overfitting and underfitting and what they mean for our trained model. Later in the series we'll explore how to tune our model to better generalize beyond the training set.
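One thing worth knowing about the ROC AUC numbers in the table: the code above passes the hard 0/1 predictions to roc_auc_score. That runs fine, but with only two distinct scores the curve has a single interior point, and the result works out to the average of recall and specificity. The more common approach is to pass a score for the positive class, such as the second column of predict_proba. Here's a sketch comparing the two on hypothetical synthetic data (X, y, and knn are made up for this illustration, not the model from this post).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for a real dataset
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=52)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

hard_labels = knn.predict(X)                  # 0/1 predictions
positive_scores = knn.predict_proba(X)[:, 1]  # share of neighbors voting for class 1

auc_from_labels = roc_auc_score(y, hard_labels)
auc_from_scores = roc_auc_score(y, positive_scores)

# With hard labels the "AUC" collapses to (recall + specificity) / 2,
# which is the same thing as balanced accuracy
print("AUC from hard labels:", auc_from_labels)
print("Balanced accuracy:   ", balanced_accuracy_score(y, hard_labels))
print("AUC from scores:     ", auc_from_scores)
```

The first two numbers should match each other. That also squares with the training row in the table: (0.72 recall + 1.00 specificity) / 2 = 0.86.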

We can also look at the confusion matrices, but they're harder to compare since the training and test datasets contain different numbers of observations.

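Training Confusion Matrix

	Predicted Negative	Predicted Positive
Actual Negative	180	0
Actual Positive	28	72

Testing Confusion Matrix

	Predicted Negative	Predicted Positive
Actual Negative	72	5
Actual Positive	18	25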

Recap

In this post, we trained a KNN model and saw how performance metrics like precision, recall, F1, and ROC AUC help us evaluate it. The key takeaway? Models often perform better on training data than on unseen data, and that gap is crucial to understand before deploying. In the next post we'll dive into overfitting and underfitting—what they mean, how to detect them, and why they matter when building robust models.

Exploring K-NN Data: A Beginner’s Guide to EDA and Feature Selection

Julie Fisher — Mon, 20 Oct 2025 15:30:00 +0000

TLDR;

The Gender feature is noise and the User ID column is a unique identifier that's unnecessary for model training purposes. These columns get dropped, leaving Age and EstimatedSalary as features to predict Purchased.

Abra-data-dabra: Data and EDA

To be perfectly honest, I considered skipping EDA entirely for this series of posts. I want to get to the good stuff, and we all know that means topics like overfitting and underfitting, variance, bias, performance metrics, and distance metrics. When it came to the data, I wanted to say "Abracadabra, here's our dataset." That's why curated datasets are so popular, right?

However, at a bare minimum, you really do need at least a basic understanding of any dataset you work with, especially in industry or production settings. Any job you land as a data scientist will have real world, messy data that you'll probably need the help of a domain expert to understand and that you'll be responsible for transforming into neatly ordered rows from the chaos you start with.

So I'll do a very brief exploratory data analysis (EDA) of the data and select the features we'll use. Data wrangling, cleaning, and transformation deserve their own series of posts, but this will get us started.

Smoke, Mirrors, and Mysterious Data Provenance

I was introduced to the Social_Network_Ads dataset in the first week of the first course of the University of Washington's Machine Learning Certificate. I like this dataset for reasons explained below, but having worked with network data, I'm still confused how this data is related to a social network.

The Social_Network_Ads dataset is available from several different users on Kaggle, but the main one seems to be this dataset loaded in 2017 by user "rakeshrau". The Data Card has no explanation of where the data originated, what its purpose was, who collected/created it, or how it was intended to be used. A Google search only produced more places to find it with no additional provenance information to be had.

While the dataset’s origin is unclear, for the purposes of exploring training a K-NN model, it serves as a useful example.

From the Kaggle page we can determine there are five columns with the following properties:

User ID: a unique identifier for each individual observation/row
Gender: a feature column
Age: a feature column
EstimatedSalary: a feature column
Purchased: the target we're trying to predict

Don't know how I determined those properties? At this point, don't worry about it. Those are the kinds of things I'll cover when discussing data wrangling, cleaning, and transformation. For now, let's keep moving so we can get to the good stuff I talked about earlier.

The Magic of Being Prepared

Before we look at the data, I have some housekeeping recommendations.

1. Keep It In The Code

I programmatically access the Kaggle data for this series using kagglehub.

Why access the data programmatically you ask? Why not just download it manually from Kaggle and drop it into our project's directory? Clicking the "Download" button is easier, you say?

Personally, I like to keep all aspects of a project in one place. If I have to do manual steps, I forget what they are when I come back to rerun the code or to reference some aspect of the project. Then the fantastic, portfolio worthy project that I spend hours and hours on becomes totally unusable because it's broken and I can't reproduce the results.

Don't get me wrong, programmatically loading the data isn't risk free. There's a danger that wherever I've loaded the data from will delete the dataset. However, in the 8 years I've been doing this, there's always somewhere that has the dataset stored online. I just update the data location and rerun my project.

2. Playing Well With Others

I recommend setting up a virtual environment to use for all of the posts in this series. I've created a requirements.txt with all of the packages you'll need and set the versions to the ones I used so they should all play well together.

You can find the requirements file, along with all of the notebooks and code for this series in my repo.

If you don't know how to set up a virtual environment, you can follow the instructions in my post Python Projects With Less Pain: Beginner's Guide to Virtual Environments.

If you don't want to do that, you can simply use this command in a Jupyter Notebook:

!pip install kagglehub

But I don't recommend the ad hoc approach.

* Advanced Concept Introduction *

In a production environment I'd never recommend using Jupyter Notebooks. You'll want to take the time to convert your code and logic to a Python script or a codebase containing a collection of scripts. The finished product will look much more like a software development project than the notebooks you see in tutorials.

However, in EDA or exploratory and experimental situations like this, I much prefer Jupyter Notebooks so that I can see my results, look at charts, and have the results easy to reference at a later date.

Eventually, I'd like to write a series on creating a production grade machine learning pipeline project. While I have a bunch of production code, it isn't exactly publicly sharable or extensible to public datasets. If this sounds interesting to you, keep your fingers crossed that I can find time to work up to that post series. In the meantime, just be aware that production grade workflows look radically different.

Behold: The Non-Network Social_Network_Ads Dataset

Now that all the housekeeping stuff is out of the way, let's load the data and take a look.

# import necessary libraries
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import kagglehub

# Download data from Kaggle and load into a DataFrame
path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
df.head()

	User ID	Gender	Age	EstimatedSalary
0	15624510	Male	19	19000
1	15810944	Male	35	20000
2	15668575	Female	26	43000
3	15603246	Female	27	57000
4	15804002	Male	19	76000

Feature Overview: Numbers Into Understanding

Earlier I mentioned that I really like this dataset. The reason is that none of the features are strong predictors on their own, but there is a strong relationship when you combine Age and EstimatedSalary.

How do we figure this out? First we visualize each feature on its own. We'll do simple counts (also called a frequency distribution) and color by our target value Purchased.

matplotlib tends to be the default visualization library in Python, but it doesn't have a great way to color by a variable out of the box. The seaborn package does, so that's the package we'll use for our exploratory data analysis.

* Pro Tip *: Packages like seaborn and the plotting capabilities natively available within pandas are handy for quick visualizations. If you get into more complex visualizations though, chances are high you'll end up using matplotlib.

Age

In the Age frequency distribution plot we can see that those under ~35 have Purchased==0. It also looks like there might be a positive relationship between Age and Purchased.

In the machine learning context positive and negative relationships refer to how the variables change in relation to each other:

Positive relationship: as one variable increases in value, so does the other
- Example: As the temperature increases, the number of ice cream sales increases
  - Temperature: 50; Ice Cream Sales: $20
  - Temperature: 85; Ice Cream Sales: $35,000
  - Temperature: 105; Ice Cream Sales: $100,000
Negative relationship: as one variable increases, the other one decreases
- Example: As the temperature decreases, the number of winter clothes sales increases
  - Temperature: 85; Winter Clothes Sales: $50
  - Temperature: 50; Winter Clothes Sales: $150,000
  - Temperature: 0; Winter Clothes Sales: $500,000
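One way to put a number on the direction of a relationship is a correlation coefficient. As a rough sketch, here's NumPy's corrcoef applied to the made-up temperature and sales figures from the examples above (three points is far too few for real analysis; this only shows the sign).

```python
import numpy as np

# Numbers from the ice cream and winter clothes examples above
temperature_ice_cream = np.array([50, 85, 105])
ice_cream_sales = np.array([20, 35_000, 100_000])

temperature_winter = np.array([85, 50, 0])
winter_clothes_sales = np.array([50, 150_000, 500_000])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
positive_r = np.corrcoef(temperature_ice_cream, ice_cream_sales)[0, 1]
negative_r = np.corrcoef(temperature_winter, winter_clothes_sales)[0, 1]

print(f"temperature vs ice cream sales: {positive_r:.2f}")
print(f"temperature vs winter clothes sales: {negative_r:.2f}")
```

A value above zero indicates a positive relationship and a value below zero indicates a negative one.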

sns.histplot(data=df, x='Age', hue='Purchased', bins=20, multiple='stack')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Frequency Distribution')
plt.show()

We can visualize the positive relationship more easily by turning our bars into percentages. When we do this, each bar fills the whole height of the plot, and the colors show what share of that bin belongs to each value of our target variable (based on the y-axis values, it's technically a ratio rather than a percentage).

sns.histplot(data=df, x='Age', hue='Purchased', bins=20, multiple='fill', stat="percent")
plt.xlabel('Age')
plt.ylabel('Percentage')
plt.title('Age Percentage Distribution')
plt.show()

While this relationship is interesting and weakly predictive of our target, we clearly can't separate purchased vs not purchased by Age alone. For the majority of our observations we'd simply be guessing.
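If you'd rather see numbers than read them off a plot, the same idea can be sketched with pandas: bucket the ages, then take the mean of the 0/1 target in each bucket. The toy DataFrame and the age bands below are hypothetical and only borrow the column names from our dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the post's dataset
toy = pd.DataFrame({
    "Age":       [19, 22, 25, 31, 38, 42, 47, 52, 58, 60],
    "Purchased": [0,  0,  0,  0,  0,  1,  0,  1,  1,  1],
})

# Bucket ages, then take the mean of the 0/1 target in each bucket
toy["age_band"] = pd.cut(toy["Age"], bins=[0, 30, 45, 100],
                         labels=["<=30", "31-45", "46+"])
purchase_rate = toy.groupby("age_band", observed=True)["Purchased"].mean()
print(purchase_rate)
```

Because Purchased is 0 or 1, the mean of each bucket is the share of purchasers in that age band.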

EstimatedSalary

From the plot we can see that EstimatedSalary is in about the same boat as Age. There is a decline in Purchased counts from around 40,000 - 60,000, but otherwise at best there's a weak relationship between EstimatedSalary and Purchased.

sns.histplot(data=df, x='EstimatedSalary', bins=20, hue='Purchased', multiple='stack')
plt.xlabel('EstimatedSalary')
plt.ylabel('Frequency')
plt.title('EstimatedSalary Frequency Distribution')
plt.show()

Just to confirm our findings, we can look at the distribution formatted as a percentage. This plot gives the same impression as the frequency distribution: there is some kind of weak relationship here. I wouldn't really call it a positive relationship though because of that weird decrease in the middle range.

sns.histplot(data=df, x='EstimatedSalary', bins=20, hue='Purchased', multiple='fill', stat="percent")
plt.xlabel('EstimatedSalary')
plt.ylabel('Percentage')
plt.title('EstimatedSalary Percentage Distribution')
plt.show()

It's tempting to try to come up with an explanation for this dip in the 40,000 - 80,000 salary range. It’s natural to want to explain patterns such as this one, but be cautious. Our brains are wired to find meaning, even when none exists. The explanation you come up with might make sense to you and your colleagues, but have nothing to do with the real world. If you absolutely have to explain something, always validate assumptions with data, not intuition.

Gender

This feature is a great example of noise.

In a machine learning context, "noise" is a feature that isn't correlated to the target value. After COVID and all the video calls we did, a good analogy for "noise" would be that person on the video call that didn't mute their mic and then proceeded to do dishes. All that extra sound made it hard to hear/understand what the presenter was saying.

Features that have no value add in predicting the target value are called "noise". They make it harder to find the patterns that explain our target.

In the Gender histogram plot we can see that our overall counts for both male and female are pretty equally distributed. And the counts of Purchased for each gender are very similar. There is no way to make an informed prediction on likelihood to purchase using this feature, i.e. this feature is noise and can be dropped.

sns.histplot(data=df, x='Gender', hue='Purchased', multiple='stack')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Frequency Distribution')
plt.show()
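A quick numeric companion to the plot is a normalized crosstab, which shows the purchase rate within each gender. This is a sketch on hypothetical toy data; run it against the real df to see the actual rates.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the post's dataset
toy = pd.DataFrame({
    "Gender":    ["Male"] * 5 + ["Female"] * 5,
    "Purchased": [0, 0, 0, 1, 1,   0, 0, 0, 1, 1],
})

# normalize="index" turns each row's counts into shares that sum to 1
rates = pd.crosstab(toy["Gender"], toy["Purchased"], normalize="index")
print(rates)
```

When the rows look nearly identical, as they do in this toy example, the feature isn't telling you anything about the target.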

The Power of Working Together

I've seen several tutorials combine two features to better explain a target's behavior. I think Andrew Ng covered it in his old Coursera Machine Learning course (I can't find that class on Coursera anymore, but it was probably replaced by this specialization), and obviously my professor in this class has an example. Even so, I generally don't look for features that become more predictive when combined.

The datasets I work with are messy real-world data that end up turning into 500+ features, so I rely on mathematical tools that determine whether each feature is individually useful. Now that I've been reminded that combined features can return better results, I might have to do more data analysis and feature engineering of my data.

Causation and Correlation

No data science blog is complete without at least a quick note on causation and correlation. In machine learning, we look for correlation: are two things related to each other. For example, do they have a positive or negative relationship, like we discussed earlier.

However, just because two things are correlated doesn't mean that one of the values causes the other.

In the examples I used to discuss positive and negative relationships, I'd be pretty comfortable saying that temperature does in fact cause changes in ice cream sales and winter clothes sales.

However, there are many, many examples of correlations that have no causation aspect. For example, as ice cream sales increase, so do shark attacks. Did the increase in ice cream sales cause the increase in shark attacks? Or maybe increased shark attacks cause increased ice cream sales? No, it's probably because both increase during summer when the temperatures are higher and people spend more time outside.

For more examples of correlations that have no causal relationship, including plots, I recommend the Spurious Correlations website.
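The ice cream and shark attack story is easy to simulate. In this sketch every number is made up: temperature drives both series, and neither series influences the other, yet they still come out correlated.

```python
import numpy as np

rng = np.random.default_rng(52)

# Hypothetical simulation: temperature drives both series; neither drives the other
temperature = rng.uniform(40, 100, size=365)
ice_cream_sales = 10 * temperature + rng.normal(0, 50, size=365)
shark_attacks = 0.1 * temperature + rng.normal(0, 1, size=365)

r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"correlation between ice cream sales and shark attacks: {r:.2f}")
```

The correlation is strongly positive even though the code never lets one series touch the other. The shared driver (temperature) does all the work.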

`Age` and `EstimatedSalary`

This is the interesting combination. In the plot we can clearly see that the Purchased values have separated out into clusters or groups. Try drawing a line between the two groups. If you're anything like me, there is a boundary that stands out as the best to get the cleanest groups.

sns.scatterplot(data=df, x="Age", y="EstimatedSalary", hue="Purchased")
plt.title("Age vs Estimated Salary by Purchase Status")
plt.show()

`Age` and `Gender`

This plot just shows that a purchase is more common at older ages, but we already knew that from the Age distribution plots. Gender doesn't add any new information.

sns.scatterplot(data=df, x="Age", y="Gender", hue="Purchased")
plt.title("Age vs Gender by Purchase Status")
plt.show()

`EstimatedSalary` and `Gender`

This plot also only shows us the decline in purchases in that 40,000 - 80,000 range. Adding the Gender dimension shows us that the decrease is more prominent in males at the lower end of the range and more prominent in females at the higher end of the range. But this still doesn't explain much of the Purchased pattern.

sns.scatterplot(data=df, x="EstimatedSalary", y="Gender", hue="Purchased")
plt.title("Estimated Salary vs Gender by Purchase Status")
plt.show()

`Age`, `EstimatedSalary` and `Gender`

Because of the slight difference between genders in the last EstimatedSalary and Gender plot, let's check that Gender doesn't add any value when we visualize all three features.

# Plot code written by CoPilot
from mpl_toolkits.mplot3d import Axes3D

df["Gender_num"] = df["Gender"].map({"Male": 0, "Female": 1})

# Create a color map for Purchased
colors = df["Purchased"].map({0: "red", 1: "green"})  # or 'No'/'Yes' depending on your data

# Create the 3D scatter plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection="3d")

ax.scatter(
    df["Age"],
    df["EstimatedSalary"],
    df["Gender_num"],
    c=colors,
    s=50,
    alpha=0.8
)

# Label axes
ax.set_xlabel("Age")
ax.set_ylabel("Estimated Salary")
ax.set_zlabel("Gender (0=Male, 1=Female)")
ax.set_title("3D Plot of Age, Estimated Salary, and Gender Colored by Purchase Status")

# Add legend
for label, color in {"Purchased": "green", "Not Purchased": "red"}.items():
    ax.scatter([], [], [], c=color, label=label)
ax.legend()

plt.show()

Recap

Of the five columns we started with, our EDA revealed:

We can get rid of User ID since it's a unique identifier and unnecessary for training purposes
We can get rid of Gender because it is neither predictive on its own, nor predictive in conjunction with the other features
Our target values are stored in the Purchased column
The useful features are Age and EstimatedSalary

To prepare our data for use with K-NN our code is:

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
df = df.drop(columns=["User ID", "Gender"])

Up Next

Next we'll use the dataset to train a KNN model.

Python Projects With Less Pain: Beginner's Guide to Virtual Environments

Julie Fisher — Thu, 16 Oct 2025 14:59:24 +0000

TLDR;

To create a virtual environment:

Navigate to your project directory's root folder
Create a virtual environment
- python -m venv .venv
Activate it
- Mac: source .venv/bin/activate
- PC: .venv\Scripts\activate
Install packages
- pip install -r requirements.txt
Deactivate when done
- deactivate

Dead in the Water

I don't know about you, but my computer still has Python 3.9.6 installed. And just like every October, an old version of Python is being deprecated. This year just happens to be, you guessed it, Python 3.9. Surprise, surprise, stuff is starting to break.

To Update or Not to Update, is that the Question?

Sure, I could update Python. As tempting as that sounds, it won't solve all my problems.

Have you ever tried running a tutorial only to get an endless string of vague and frustrating error messages? Most of those are caused by:

Dependency conflicts (e.g. a pandas update breaks compatibility)
Version conflict (a mismatch between your system and the tutorial)

Anyone remember pandas version 1.3.5? There was a breaking change around that version which wasn't backwards compatible with a critical piece of code I'd written. I'm almost embarrassed to admit how long it took me to figure it out. But when you have code that was written more than three years ago, this is pretty common.

That's just part of code updates and improvements, right?

I should periodically review and rewrite my code, you say?

Well, kind of.

The Real World: Cloud Edition

I mainly work in cloud environments, so I need to be familiar with setting up and using environments that are consistent across different cloud resources, as well as working within the constraints of a system that has been predefined.

Ever heard a software developer say "But it works on my computer"? This is the cloud, except for everyone.

AWS is a great example of this. Sagemaker Studio AI usually has kernels that are pretty up-to-date, but the pre-configured resources for running pipelines tend to be just ahead of (or sometimes just behind) deprecation. As of October 15, 2025 here is a common set of environments:

Stable Python release: Python 3.14
ml.t3.medium Sagemaker kernel: Python 3.12.9
Preconfigured pipeline resources: Python 3.9

The Price of Easy

Python is popular because it's easy, but also because there is a plethora of libraries that cover just about any use case or situation. The big libraries that most of us rely on (like pandas, numpy, scipy, and scikit-learn to name just a few) get updated regularly. And each update is only compatible with certain versions of Python, usually those available and active at the time the update was released.

What does this mean? It means that if I prototype in my local environment with a freshly updated version of Python (3.14) or even in a Sagemaker instance (Python 3.12), that same code is highly unlikely to run in my pipeline (Python 3.9).

Environments, Virtual Environments

Cloud environments are like Bond, everyone wants to do what they can do. And when it comes to data science, they're the de facto solution unless you're running tiny datasets and predicting manually and/or offline.

So just like Bond needs a license to kill, you need to use virtual environments.

Virtual environments make matching the pace of updates vs resources a breeze and give you these benefits:

Isolate Python and package versions per project/code base
Make your code portable across systems and cloud platforms
Avoid dependency purgatory

May The `venv` Be With You

It's easy to get started. I learned a handy workflow in the University of Washington's Certificate in Python Programming.

On the command line, navigate to the directory of the project you're working on. If we use the folder structure of my old blog's repo datascience_diaries it'd look something like this: cd datascience_diaries/devto_posts/00_resources
Create a virtual environment: python -m venv .venv
- In your environment the first command might be python or might be python3 depending on how you have python set up on your system
- -m indicates that we want to run a python module/library/package
- venv is the module we want to run. This is what does all the heavy lifting for us and creates the environment
- .venv is what we want to call the virtual environment. You can call it anything, even something like platypus or tree_beard. The standard is to call it .venv
  - I've started naming my .venv folder something related to my project so it's easy to differentiate from other environments. For ease of reference, I'll call this folder .venv throughout the rest of the post
  - To get VSCode to recognize your environment, you'll need to open it at the root level of your project where the virtual environment is stored
Activate the environment
- On Mac: source .venv/bin/activate
- On PC: .venv\Scripts\activate

That's it. At the front of your command line prompt you should see something like (.venv), which lets you know that it's active. To deactivate it just type deactivate.
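
If the prompt isn't visible (inside a notebook, for example), Python itself can report whether a virtual environment is active. A small sketch, assuming the environment was created with venv as described above; in_virtual_environment is a hypothetical helper name.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the Python used to create it.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter location:", sys.executable)
```

When the environment is active, the interpreter location should sit inside the .venv folder.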

If you type pip list, as of October 15, 2025, you should see only pip and, on Python versions older than 3.12, setuptools.

Mass `pip`duction

I recommend using a requirements.txt file to specify any/all packages you want to use in the environment so that it's easy to set up, break down, and change.

To populate your requirements.txt file, just type the packages you need and the version you're using. Freezing the package version is better for recreating the same environment over long periods of time and across different systems. Here's what it looks like:

matplotlib==3.9.4
numpy==2.0.2
pandas==2.3.3
scikit-learn==1.6.1
seaborn==0.13.2

If you want the latest and greatest version, just leave the version empty. This is better for recreating a set of packages you use regularly, but want to keep updated with the latest versions. Here's what it looks like:

matplotlib
numpy
pandas
scikit-learn
seaborn

* Pro Tip *: They're easier to find and update if you put them in alphabetical order.

* Pro Tip *: If you create a ton of virtual environments and worry about running out of space on your computer, you can simply delete the .venv folder when you're done using it. If you ever need that environment again, you can simply recreate it using the requirements.txt.

There are more comprehensive ways to set it up that involve commands like pip list, but to start, just include the libraries you need. When you need the more complicated version, google it.

To load the libraries to your active environment (you remembered to activate the virtual environment, right?) at the command line type:

pip install -r requirements.txt

The first command will probably be pip however it might be pip3 depending on the environment setup
-r tells pip install that we'll be supplying a file, not a package name
requirements.txt is the name of your requirements file. You can name this anything you want, but to make things easier for your future self and anyone else that reads your code, I suggest you stick with requirements.txt
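
After installing, it can be reassuring to confirm that the active environment really matches the pins. The sketch below is one possible check, not part of pip: parse_requirements and check_installed are hypothetical helpers, and they only understand the two line styles shown above (name==version and a bare name).

```python
from importlib import metadata


def parse_requirements(lines):
    """Map package name -> pinned version (None when unpinned)."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def check_installed(pins):
    """Compare each pin with what is installed in the active environment."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is None or installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"installed {installed}, pinned {wanted}"
    return report


example = ["numpy==2.0.2", "pandas", "", "# comment"]
print(check_installed(parse_requirements(example)))
```

To check a real file, pass open("requirements.txt").read().splitlines() in place of the example list.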

That's all there is to it. You now have an active virtual environment.

Just don't forget to deactivate it when you're done with it.

Next Steps

One of the major limitations with venv in my opinion is that you can't easily specify a Python version: venv uses the Python version in the runtime environment in which it was created.

You can control the Python version using venv; it's just a little more involved. You need to have the version of Python you want installed on your system. If you have multiple versions of Python installed, a common way to reference the version you want is to set it up so that you call it using the version number.

For example, in our python -m venv .venv example, you could have multiple versions of Python 3, such as 3.10, 3.11, and 3.14. To run the same call with the version you want, you'd update that command to python3.14 -m venv .venv.

Personally, I'd rather be able to just specify the version in the command itself and let the package I'm using handle the details. You can do this using Conda with the command conda create -n .venv python=3.14.

I haven't developed a workflow that I like with Conda yet. When I do, I'll look into creating an "intermediate's guide" version of this post.

In Case You Forgot

Navigate to your project directory's root folder
Create a virtual environment
- python -m venv .venv
Activate it
- Mac: source .venv/bin/activate
- PC: .venv\Scripts\activate
Install packages
- pip install -r requirements.txt
Deactivate when done
- deactivate

Cleaning Data: wrangling data for a SageMaker pipeline

Julie Fisher — Mon, 28 Nov 2022 18:20:56 +0000

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are generally true:

First, the information to understand and/or solve the proposed business problem is usually stored in multiple locations or files. Whether it's multiple excel files, different tables within the same database, or completely different systems, having to pull information from multiple sources and then combine them is common.

Second, the data is usually in its raw form. This means the data will need to be featurized. Free text needs to be turned into some kind of numeric representation, categories either need to be binned or encoded, numeric features may need to be standardized, lists of values turned into single number representations, or other custom transformations may need to be applied.

I chose the Insurance Company Benchmark (COIL 2000) dataset because it lets me demonstrate some of these real world problems:

It's stored in multiple files
It has both numeric and categorical features
It's a binary classification problem

Objectives

Like a production level problem, this dataset needs to be combined and put into a usable form. Here are the data wrangling steps I'll complete in this post:

Combine train, test, and label datasets into a single full dataset
- Production datasets don't usually come pre-split
- Figuring out when, where, and how to split the data is an important production consideration
Update column names to something human readable
- This makes my life easier when completing my exploratory data analysis to understand what the data is
- Blind data processing is generally a bad idea
Return categorical features to their textual representation
- Allows me to demonstrate both numeric and categorical handling as I move through creating a SageMaker Pipeline

Prerequisites

All prerequisites from the previous post, 1_read_from_s3.ipynb, still apply (SageMaker environment and applicable IAM roles are set up). The data that was loaded into S3 in the previous post will also be needed. Below is the code to pull the data from the UCI page and load it to S3 in case you deleted the files from the last post.

# Only run this if you deleted the output from the previous post or didn't run it at all
# If you run it accidentally, oh well, it'll just overwrite the previous files
import pandas as pd
import sagemaker

session = sagemaker.session.Session()
bucket = session.default_bucket()
prefix = '1_ins_dataset/raw'

train_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticdata2000.txt'
test_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticeval2000.txt'
gt_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/tictgts2000.txt'
cols_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/dictionary.txt'

train = pd.read_table(train_uri, header=None)
test = pd.read_table(test_uri, header=None)
ground_truth = pd.read_table(gt_uri, header=None)
columns = pd.read_table(cols_uri, encoding='latin-1')

train.to_csv(f's3://{bucket}/{prefix}/train.csv', index=False)
test.to_csv(f's3://{bucket}/{prefix}/test.csv', index=False)
ground_truth.to_csv(f's3://{bucket}/{prefix}/gt.csv', index=False)
columns.to_csv(f's3://{bucket}/{prefix}/metadata/col_info.csv', index=False)

Set AWS variables

This step needs to be completed pretty much any time I run anything in a SageMaker notebook. These variables are what allow me to talk to AWS resources and establish that I have permission to use said resources.

import sagemaker
import boto3
import pandas as pd

session = sagemaker.session.Session()
bucket = session.default_bucket()
prefix = '1_ins_dataset/raw'

Read in Data

First I read in all the data from S3. After trying multiple options in the last post, I decided I prefer using boto3.resource. For context on how this function was created, including information on why I used the objects.filter and split functions as well as why I put the data into a dictionary, please read the previous post.

def read_mult_txt(bucket, prefix):
    s3_resource = boto3.resource("s3")
    s3_bucket = s3_resource.Bucket(bucket)

    # Map a short name (e.g. 'train') to the S3 URI of each data file
    files = {}
    for object_summary in s3_bucket.objects.filter(Prefix=prefix):
        key = object_summary.key
        # Keep keys with a single '.' (one file extension) that sit directly in the prefix folder
        if len(key.rsplit('.')) == 2 and len(key.split('/')) <= 3:
            files[key.split('/')[-1].split('.')[0]] = f"s3://{bucket}/{key}"

    # Read each file into a dataframe, keyed by the same short name
    df_dict = {}
    for df_name, uri in files.items():
        df_dict[df_name] = pd.read_csv(uri)

    return df_dict

df_dict = read_mult_txt(bucket, prefix)
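
To see what the key filter inside read_mult_txt keeps and drops, the same two conditions can be tried on plain strings without touching AWS. This is an illustrative sketch: select_data_files is a hypothetical helper, "example-bucket" is a made-up bucket name, and the keys mirror the files written in the prerequisites step.

```python
def select_data_files(keys, bucket):
    """Apply the same key filter as read_mult_txt to a list of S3 keys."""
    files = {}
    for key in keys:
        has_one_extension = len(key.rsplit(".")) == 2  # exactly one "." in the key
        is_shallow = len(key.split("/")) <= 3          # at most folder/folder/file
        if has_one_extension and is_shallow:
            name = key.split("/")[-1].split(".")[0]    # "train.csv" -> "train"
            files[name] = f"s3://{bucket}/{key}"
    return files


# Made-up bucket name; the keys mirror the files written earlier in this post
sample_keys = [
    "1_ins_dataset/raw/gt.csv",
    "1_ins_dataset/raw/test.csv",
    "1_ins_dataset/raw/train.csv",
    "1_ins_dataset/raw/metadata/col_info.csv",
]
print(select_data_files(sample_keys, "example-bucket"))
```

The metadata file sits one folder deeper than the others, so the depth condition leaves it out and only gt, test, and train are returned.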

Combine data

As discussed earlier, when working on a real world dataset the data isn't generally split out into test and train for you. I want to account for this real world fact in my SageMaker Pipeline, so I need to add the test data labels to the test data, then join that with the training data. This will allow me to address when and how to split my train/test/validate data for myself.

Note, the original split only divided the data into train and test datasets. I'll be splitting the data into train, test, and validate. This will allow me to reserve data to evaluate my final model (validate data) after training (training data) and optimizing hyperparameters (test data). Splitting the data myself allows me to make these kinds of decisions.
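
As a rough picture of what a three-way split involves, here is a minimal sketch using only the standard library. The 60/20/20 proportions, the seed, and the three_way_split name are assumptions for illustration; the actual split used in the pipeline may differ.

```python
import random


def three_way_split(n_rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle row positions and cut them into train/test/validate lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is repeatable

    n_train = int(n_rows * train_frac)
    n_test = int(n_rows * test_frac)

    train = positions[:n_train]
    test = positions[n_train:n_train + n_test]
    validate = positions[n_train + n_test:]  # whatever is left over
    return train, test, validate


train_idx, test_idx, validate_idx = three_way_split(9822)
print(len(train_idx), len(test_idx), len(validate_idx))
```

Each list holds row positions, so they could be passed to something like df.iloc to pull out the matching rows. Every row lands in exactly one of the three groups.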

In order to join all of the data into a single dataset, I need to know what columns are where. The training dataset will be my template because it already holds all of the data. Standard practice is to include the target/label as either the first column or the last. I can determine what columns are what by bringing in the data dictionary and associating the column names to the data.

Data Dictionary

Bringing in the column names from the data dictionary isn't as easy as it sounds with this particular dataset. The column names are in a text file that isn't conducive to a dataframe. By reading it in as a table, I can get something usable. However, upon pulling it in I found that there is other information in addition to the column names.

pd.set_option('display.max_rows', 50)

col_info_uri = f"s3://{bucket}/{prefix}/metadata/col_info.csv"
data_info = pd.read_table(col_info_uri)
data_info

	DATA DICTIONARY
0	Nr Name Description Domain
1	1 MOSTYPE Customer Subtype see L0
2	2 MAANTHUI Number of houses 1 – 10
3	3 MGEMOMV Avg size household 1 – 6
4	4 MGEMLEEF Avg age see L1
...	...
165	5 f 500 – 999
166	6 f 1000 – 4999
167	7 f 5000 – 9999
168	8 f 10.000 - 19.999
169	9 f 20.000 - ?

170 rows × 1 columns

For brevity, I only returned the top and bottom five rows of data. Examining the full set shows that the data dictionary's 170 rows are made up of several different sets of data: the feature list plus the lookup tables L0 through L4. The rows also include headers for several of the datasets and a dataset name to separate the different sections of information.

To make it more usable, I'm going to clean this up by separating the datasets, splitting out columns, and giving them their appropriate headers. I manually reviewed the data for the row indexing to separate out the different datasets.

Next I split the rows into the appropriate number of columns. Fortunately, most of this can be done by splitting on the space character and limiting the number of splits. If the column names had been more than one word long, the solution would have been more complicated. However, a simple string split gets me what I need and the column names are isolated in data_dict['feat_info']['Name'].

data_dict = {}
data_dict['feat_info'] = data_info.iloc[1:87, 0].str.split(n=2, expand=True)
data_dict['feat_info'].columns = data_info.iloc[0, 0].split(maxsplit=2)

data_dict['L0'] = data_info.iloc[89:130, 0].str.split(n=1, expand=True)
data_dict['L0'].columns = data_info.iloc[88, 0].split()

data_dict['L1'] = data_info.iloc[131:137, 0].str.split(n=1, expand=True)
data_dict['L1'].columns = ['Value', 'Bin']

data_dict['L2'] = data_info.iloc[138:148, 0].str.split(n=1, expand=True)
data_dict['L2'].columns = ['Value', 'Bin']

data_dict['L3'] = data_info.iloc[149:159, 0].str.split(n=1, expand=True)
data_dict['L3'].columns = ['Value', 'Bin']

data_dict['L4'] = data_info.iloc[160:, 0].str.split(n=1, expand=True)
data_dict['L4'].columns = ['Value', 'Bin']

for key in data_dict.keys():
    print(key)
    display(data_dict[key].head())

feat_info

	Nr	Name	Description Domain
1	1	MOSTYPE	Customer Subtype see L0
2	2	MAANTHUI	Number of houses 1 – 10
3	3	MGEMOMV	Avg size household 1 – 6
4	4	MGEMLEEF	Avg age see L1
5	5	MOSHOOFD	Customer main type see L2

L0

	Value	Label
89	1	High Income, expensive child
90	2	Very Important Provincials
91	3	High status seniors
92	4	Affluent senior apartments
93	5	Mixed seniors

L1

	Value	Bin
131	1	20-30 years
132	2	30-40 years
133	3	40-50 years
134	4	50-60 years
135	5	60-70 years

L2

	Value	Bin
138	1	Successful hedonists
139	2	Driven Growers
140	3	Average Family
141	4	Career Loners
142	5	Living well

L3

	Value	Bin
149	0	0%
150	1	1 - 10%
151	2	11 - 23%
152	3	24 - 36%
153	4	37 - 49%

L4

	Value	Bin
160	0	f 0
161	1	f 1 – 49
162	2	f 50 – 99
163	3	f 100 – 199
164	4	f 200 – 499

Label column

Now I can determine which column holds the target label. This is the column I'll be trying to predict on for this machine learning problem. A peek at the first and last few rows gives me the column name and description (reminder, standard practice is to put the label as the first or last column).

display(data_dict['feat_info'].head(3))
display(data_dict['feat_info'].tail(3))

	Nr	Name	Description Domain
1	1	MOSTYPE	Customer Subtype see L0
2	2	MAANTHUI	Number of houses 1 – 10
3	3	MGEMOMV	Avg size household 1 – 6

	Nr	Name	Description Domain
84	84	AINBOED	Number of property insurance policies
85	85	ABYSTAND	Number of social security insurance policies
86	86	CARAVAN	Number of mobile home policies 0 - 1

* Quick tip *

Did you forget what the dataframes within the dictionary are called? I forget what I call things all the time. You could spend all day scrolling around your notebook looking them back up, or you could just call the .keys() method. It's much easier to delete a Jupyter cell than constantly scroll around the notebook trying to find where you named something, then trying to return to where you're currently working.

If you've forgotten the name of a variable but can recall how it starts, you can type the first few letters then hit "Tab". Jupyter will auto-complete the variable name. If there is more than one variable that starts with those letters, it will give you a list to choose from.

# df_dict.keys()

dict_keys(['gt', 'test', 'train'])

A review of the UCI Repo page for the target variable shows the following:

The training set contains over 5000 descriptions of customers, including the information of whether or not they have a caravan insurance policy. A test set contains 4000 customers of whom only the organisers know if they have a caravan insurance policy.

Combining this information with the column names and descriptions, I now know that the target variable is the last column, CARAVAN, which according to its description is an indicator of whether or not the customer has a caravan (mobile home) insurance policy.

Column headers

The column headers of all the datasets are just the column index. This wouldn't be a problem except that when I concatenated the target label onto the test data, it brought with it the column name 0, which is already taken. I now have two columns labelled 0.

test_df = pd.concat([df_dict['test'], df_dict['gt']], axis=1)
test_df.head()

	0	1	2	3	4	5	6	7	8	9	...	79	0
0	33	1	4	2	8	0	6	0	3	5	...	1	0
1	6	1	3	2	2	0	5	0	4	5	...	1	1
2	39	1	3	3	9	1	4	2	3	5	...	1	0
3	9	1	2	3	3	2	3	2	4	5	...	1	0
4	31	1	2	4	7	0	2	0	7	9	...	1	0

5 rows × 86 columns
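
The duplicate label is easy to reproduce on a toy example. This sketch uses two tiny made-up dataframes as stand-ins for the test features and the ground-truth labels.

```python
import pandas as pd

# Tiny stand-ins for the test features and the ground-truth labels
features = pd.DataFrame([[33, 1], [6, 1]])   # columns are 0 and 1
labels = pd.DataFrame([[0], [1]])            # column is 0

combined = pd.concat([features, labels], axis=1)

print(list(combined.columns))          # column labels after the concat
print(combined.columns.duplicated())   # True marks a repeated label
print(combined[0].shape)               # selecting 0 returns both columns
```

Because selecting column 0 now returns two columns instead of one, anything downstream that expects a single series would behave unexpectedly, which is a good reason to rename the columns right away.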

In order to concatenate the new test dataset with the training dataset, I may as well rename the columns with their appropriate column names now. It will make concatenating the test data onto the training data much easier. This is easily done by turning the Name column from the feat_info dataset in the data_dict dictionary into a list, then assigning that to my dataframe columns.

data_dict['feat_info']['Name'].to_list()[:5]

['MOSTYPE', 'MAANTHUI', 'MGEMOMV', 'MGEMLEEF', 'MOSHOOFD']

df_dict['train'].columns = data_dict['feat_info']['Name'].to_list()
test_df.columns = data_dict['feat_info']['Name'].to_list()

Now the test and train dataframes have the appropriate column names assigned as headers.

test_df.head(3)

	MOSTYPE	MAANTHUI	MGEMOMV	MGEMLEEF	MOSHOOFD	MGODRK	MGODPR	MGODOV	MGODGE	MRELGE	...	ABRAND	CARAVAN
0	33	1	4	2	8	0	6	0	3	5	...	1	0
1	6	1	3	2	2	0	5	0	4	5	...	1	1
2	39	1	3	3	9	1	4	2	3	5	...	1	0

3 rows × 86 columns

Since the number of columns and the column headers match, it is a matter of a single line to concatenate these datasets into a single dataframe.

df = pd.concat([df_dict['train'], test_df], ignore_index=True)
df

	MOSTYPE	MAANTHUI	MGEMOMV	MGEMLEEF	MOSHOOFD	MGODRK	MGODPR	MGODOV	MGODGE	MRELGE	...	APERSONG	AGEZONG	AWAOREG	ABRAND	AZEILPL	APLEZIER	AFIETS	AINBOED	ABYSTAND	CARAVAN
0	33	1	3	2	8	0	5	1	3	7	...	0	0	0	1	0	0	0	0	0	0
1	37	1	2	2	8	1	4	1	4	6	...	0	0	0	1	0	0	0	0	0	0
2	37	1	2	2	8	0	4	2	4	3	...	0	0	0	1	0	0	0	0	0	0
3	9	1	3	3	3	2	3	2	4	5	...	0	0	0	1	0	0	0	0	0	0
4	40	1	4	2	10	1	4	1	4	7	...	0	0	0	1	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9817	33	1	2	4	8	0	7	2	0	5	...	0	0	0	1	0	0	0	0	0	0
9818	24	1	2	3	5	1	5	1	3	4	...	0	0	0	1	0	0	0	0	0	1
9819	36	1	2	3	8	1	5	1	3	7	...	0	0	0	1	0	0	0	1	0	0
9820	33	1	3	3	8	1	4	2	3	7	...	0	0	0	0	0	0	0	0	0	0
9821	8	1	2	3	2	4	3	0	3	5	...	0	0	0	1	0	0	0	0	0	0

9822 rows × 86 columns

Once the data has all been joined into a single dataframe, I want to run some basic "sniff" tests on it. This allows me to be confident that there are no errors in the code that introduce incorrect data into my dataset. The only thing I'm really interested in evaluating for this merge is the number of rows and columns.

I know that the number of columns should be the same as the number of rows in the Name column. At 86 columns, this is correct.

df.shape

To get the correct number of rows I first reviewed the UCI Repo page. It says there are "over 5000 descriptions of customers" in the training dataset and "4000 customers" in the test dataset, which suggests there should be roughly 9,000 rows in the combined dataset. This number doesn't match the number of rows listed in my dataframe, which is 9,822, almost 1,000 more rows than expected.

However, in looking at the shape of my original train and test datasets, I can see that the number on the UCI page for the training data was an estimate instead of an exact figure.

* Quick Tip *

It never hurts to double check. However, it sure can hurt if you don't.

print('Training data has', df_dict['train'].shape[0], 'rows')
print('Test data has', df_dict['test'].shape[0], 'rows')

Training data has 5822 rows
Test data has 4000 rows
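
These sniff tests can be wrapped in a small function so they're easy to rerun whenever the merge changes. A minimal sketch; sniff_test is a hypothetical helper, and the counts passed in are the ones reported in this post.

```python
def sniff_test(n_train_rows, n_test_rows, combined_shape, n_col_names):
    """Return a list of problems found in the combined dataset's shape."""
    problems = []
    n_rows, n_cols = combined_shape
    if n_rows != n_train_rows + n_test_rows:
        problems.append(
            f"expected {n_train_rows + n_test_rows} rows, found {n_rows}"
        )
    if n_cols != n_col_names:
        problems.append(f"expected {n_col_names} columns, found {n_cols}")
    return problems


# Counts taken from the outputs in this post
print(sniff_test(5822, 4000, (9822, 86), 86))
```

An empty list means the row and column counts line up. In the notebook the arguments would come from df_dict['train'].shape[0], df_dict['test'].shape[0], df.shape, and the length of the Name column.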

Column names

The provided column names are very cryptic. The variables that start with 'M' in this dataset are a perfect example. Per the Dataset Information:

Note: All the variables starting with M are zipcode variables. They give information on the distribution of that variable, e.g. Rented house, in the zipcode area of the customer.

In other words, these variables are aggregated data enrichments and don't necessarily directly reflect the customer in question. This type of information could easily lead to bias within a model. While this dataset is for direct marketing purposes, and thus stereotyping isn't necessarily a big deal, if this information were used to assess the insurability of a customer or cost of their policy, there could be direct consequences that unfairly impact customers.

Based on these types of considerations, I prefer column names that let me quickly understand what the data in a column represents.

I manually copied and pasted the column descriptions and altered them to my liking. With only 86 columns, this is a feasible solution. If I had more columns, I might not be so quick to resort to manual data entry. For convenience, I've included my column names below so the copy and paste solution doesn't have to be replicated by others.

col_names = ['zip_agg_customer_subtype',
             'zip_agg_number_of_houses',
             'zip_agg_avg_size_household',
             'zip_agg_avg_age',
             'zip_agg_customer_main_type',
             'zip_agg_roman_catholic',
             'zip_agg_protestant',
             'zip_agg_other_religion',
             'zip_agg_no_religion',
             'zip_agg_married',
             'zip_agg_living_together',
             'zip_agg_other_relation',
             'zip_agg_singles',
             'zip_agg_household_without_children',
             'zip_agg_household_with_children',
             'zip_agg_high_level_education',
             'zip_agg_medium_level_education',
             'zip_agg_lower_level_education',
             'zip_agg_high_status',
             'zip_agg_entrepreneur',
             'zip_agg_farmer',
             'zip_agg_middle_management',
             'zip_agg_skilled_labourers',
             'zip_agg_unskilled_labourers',
             'zip_agg_social_class_a',
             'zip_agg_social_class_b1',
             'zip_agg_social_class_b2',
             'zip_agg_social_class_c',
             'zip_agg_social_class_d',
             'zip_agg_rented_house',
             'zip_agg_home_owners',
             'zip_agg_1_car',
             'zip_agg_2_cars',
             'zip_agg_no_car',
             'zip_agg_national_health_service',
             'zip_agg_private_health_insurance',
             'zip_agg_income_<_30.000',
             'zip_agg_income_30-45.000',
             'zip_agg_income_45-75.000',
             'zip_agg_income_75-122.000',
             'zip_agg_income_>123.000',
             'zip_agg_average_income',
             'zip_agg_purchasing_power_class',
             'contri_private_third_party_ins',
             'contri_third_party_ins_(firms)',
             'contri_third_party_ins_(agriculture)',
             'contri_car_policies',
             'contri_delivery_van_policies',
             'contri_motorcycle/scooter_policies',
             'contri_lorry_policies',
             'contri_trailer_policies',
             'contri_tractor_policies',
             'contri_agricultural_machines_policies',
             'contri_moped_policies',
             'contri_life_ins',
             'contri_private_accident_ins_policies',
             'contri_family_accidents_ins_policies',
             'contri_disability_ins_policies',
             'contri_fire_policies',
             'contri_surfboard_policies',
             'contri_boat_policies',
             'contri_bicycle_policies',
             'contri_property_ins_policies',
             'contri_ss_ins_policies',
             'nbr_private_third_party_ins',
             'nbr_third_party_ins_(firms)',
             'nbr_third_party_ins_(agriculture)',
             'nbr_car_policies',
             'nbr_delivery_van_policies',
             'nbr_motorcycle/scooter_policies',
             'nbr_lorry_policies',
             'nbr_trailer_policies',
             'nbr_tractor_policies',
             'nbr_agricultural_machines_policies',
             'nbr_moped_policies',
             'nbr_life_ins',
             'nbr_private_accident_ins_policies',
             'nbr_family_accidents_ins_policies',
             'nbr_disability_ins_policies',
             'nbr_fire_policies',
             'nbr_surfboard_policies',
             'nbr_boat_policies',
             'nbr_bicycle_policies',
             'nbr_property_ins_policies',
             'nbr_ss_ins_policies',
             'nbr_mobile_home_policies']

df.columns = col_names
df.head()

	zip_agg_customer_subtype	zip_agg_number_of_houses	zip_agg_avg_size_household	zip_agg_avg_age	zip_agg_customer_main_type	zip_agg_roman_catholic	zip_agg_protestant	zip_agg_other_religion	zip_agg_no_religion	zip_agg_married	...	nbr_fire_policies
0	33	1	3	2	8	0	5	1	3	7	...	1
1	37	1	2	2	8	1	4	1	4	6	...	1
2	37	1	2	2	8	0	4	2	4	3	...	1
3	9	1	3	3	3	2	3	2	4	5	...	1
4	40	1	4	2	10	1	4	1	4	7	...	1

5 rows × 86 columns
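
For a dataset with many more columns, a first draft of the names could be generated from the descriptions and then tidied by hand. This is only a sketch of that idea: to_snake_case is a hypothetical helper, it assumes the Domain text has already been removed from the description, and unlike my hand-written names it drops characters such as "<", "(" and "/".

```python
import re


def to_snake_case(description, prefix=""):
    """Lowercase a description and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", description.lower())
    return "_".join(([prefix] if prefix else []) + words)


print(to_snake_case("Customer Subtype", prefix="zip_agg"))
print(to_snake_case("Avg size household", prefix="zip_agg"))
```

The prefix argument covers the zip_agg, contri, and nbr groupings used in the list above.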

Reverse engineer categorical transformations

I chose this dataset because it had both numeric and categorical variables. I determined this by looking at the "Attribute Characteristics" as listed on the Insurance Company Benchmark Data Set page. It lists "Attribute Characteristics: Categorical, Integer."

Based on the initial examination, this dataset has obviously already been turned into numeric variables, thus negating the whole "includes categorical variables" aspect. This is one of the biggest reasons I don't generally use the UCI datasets: too many of them are already preprocessed and aren't representative of what I see in the real world. I cover more of the preprocessing already done to the dataset in the EDA post.

The data dictionary holds the key to which columns used to be categorical: a simple indicator at the end of the Description Domain column. This only applies to the features at indices 0, 3, 4, 5, and 43, as shown below.

Note, Domain should be its own column. However, there was no super easy way to split it out and it only applies to five rows, so I didn't bother worrying about it.

data_dict['feat_info'].iloc[[0, 3, 4, 5, 43], :]

	Nr	Name	Description Domain
1	1	MOSTYPE	Customer Subtype see L0
4	4	MGEMLEEF	Avg age see L1
5	5	MOSHOOFD	Customer main type see L2
6	6	MGODRK	Roman catholic see L3
44	44	PWAPART	Contribution private third party insurance see L4
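
Rather than reading off the indices by eye, the "see Lx" indicator could also be found with a regular expression. A minimal sketch on a few descriptions copied from the data dictionary; the pattern assumes the reference always appears at the very end of the text.

```python
import re

# Description Domain values copied from the data dictionary output
descriptions = {
    "MOSTYPE": "Customer Subtype see L0",
    "MGEMLEEF": "Avg age see L1",
    "MOSHOOFD": "Customer main type see L2",
    "ABYSTAND": "Number of social security insurance policies",
}

# A lookup reference looks like "see L0" at the very end of the text
lookup_pattern = re.compile(r"see (L\d+)\s*$")

lookups = {}
for name, text in descriptions.items():
    match = lookup_pattern.search(text)
    if match:
        lookups[name] = match.group(1)

print(lookups)
```

Only the features that point at a lookup table end up in the result, each paired with the table it references.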

The L0 - L4 datasets that are included in the Data Dictionary are starting to make more sense. They're the textual representation of the categorical features.

for key in ['L0', 'L1', 'L2', 'L3', 'L4']:
    print(key)
    display(data_dict[key].head())

L0

	Value	Label
89	1	High Income, expensive child
90	2	Very Important Provincials
91	3	High status seniors
92	4	Affluent senior apartments
93	5	Mixed seniors

L1

	Value	Bin
131	1	20-30 years
132	2	30-40 years
133	3	40-50 years
134	4	50-60 years
135	5	60-70 years

L2

	Value	Bin
138	1	Successful hedonists
139	2	Driven Growers
140	3	Average Family
141	4	Career Loners
142	5	Living well

L3

	Value	Bin
149	0	0%
150	1	1 - 10%
151	2	11 - 23%
152	3	24 - 36%
153	4	37 - 49%

L4

	Value	Bin
160	0	f 0
161	1	f 1 – 49
162	2	f 50 – 99
163	3	f 100 – 199
164	4	f 200 – 499

For brevity, I only printed the head of each of the Lx datasets. From this information I can determine that only two of the categorical features make sense to return to their textual representation.

L0 and L2 can be mapped back to their original text. I can one-hot-encode these variables in my SageMaker Pipeline. L1, L3, and L4 are binned representations of the original values. There is very little benefit to mapping these back to the textual representation, so I'll leave them alone.

Replacing values in a dataframe can be as simple as using a dictionary to map old value to new value. I can create this dictionary using the L0 and L2 dataframes. The Value column corresponds with the numeric representation in the dataset. The Label and Bin columns correspond with the original text. The only catch was to ensure that the Value column was numeric (to match the datatype in the dataframe) and to set Value to the index so it appropriately translated as the key in the dictionary.

data_dict['L0']['Value'] = pd.to_numeric(data_dict['L0']['Value'])
l0_dict = data_dict['L0'].set_index('Value').to_dict()['Label']
l0_dict

{1: 'High Income, expensive child',
 2: 'Very Important Provincials',
 3: 'High status seniors',
 4: 'Affluent senior apartments',
 5: 'Mixed seniors',
 6: 'Career and childcare',
 7: "Dinki's (double income no kids)",
 8: 'Middle class families',
 9: 'Modern, complete families',
 10: 'Stable family',
 11: 'Family starters',
 12: 'Affluent young families',
 13: 'Young all american family',
 14: 'Junior cosmopolitan',
 15: 'Senior cosmopolitans',
 16: 'Students in apartments',
 17: 'Fresh masters in the city',
 18: 'Single youth',
 19: 'Suburban youth',
 20: 'Etnically diverse',
 21: 'Young urban have-nots',
 22: 'Mixed apartment dwellers',
 23: 'Young and rising',
 24: 'Young, low educated ',
 25: 'Young seniors in the city',
 26: 'Own home elderly',
 27: 'Seniors in apartments',
 28: 'Residential elderly',
 29: 'Porchless seniors: no front yard',
 30: 'Religious elderly singles',
 31: 'Low income catholics',
 32: 'Mixed seniors',
 33: 'Lower class large families',
 34: 'Large family, employed child',
 35: 'Village families',
 36: "Couples with teens 'Married with children'",
 37: 'Mixed small town dwellers',
 38: 'Traditional families',
 39: 'Large religous families',
 40: 'Large family farms',
 41: 'Mixed rurals'}

data_dict['L2']['Value'] = pd.to_numeric(data_dict['L2']['Value'])
l2_dict = data_dict['L2'].set_index('Value').to_dict()['Bin']
l2_dict

{1: 'Successful hedonists',
 2: 'Driven Growers',
 3: 'Average Family',
 4: 'Career Loners',
 5: 'Living well',
 6: 'Cruising Seniors',
 7: 'Retired and Religeous',
 8: 'Family with grown ups',
 9: 'Conservative families',
 10: 'Farmers'}
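
The pattern used to build both dictionaries can be tried on a tiny example. This sketch uses the first three L2 rows and a made-up series of codes; it also shows how .replace() and .map() differ when a code is missing from the dictionary.

```python
import pandas as pd

# First three rows of the L2 lookup, stored as text like the parsed dictionary
lookup = pd.DataFrame({
    "Value": ["1", "2", "3"],
    "Bin": ["Successful hedonists", "Driven Growers", "Average Family"],
})
lookup["Value"] = pd.to_numeric(lookup["Value"])  # keys must be numbers, not text
mapping = lookup.set_index("Value")["Bin"].to_dict()

codes = pd.Series([3, 1, 10])        # 10 is deliberately not in the mapping

replaced = codes.replace(mapping)    # unmapped codes are left as they are
mapped = codes.map(mapping)          # unmapped codes become NaN

print(replaced.tolist())
print(mapped.tolist())
```

With .replace(), a code that has no entry passes through untouched, so a quick look at the unique values afterwards is a cheap way to spot anything the dictionary missed.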

display(data_dict['feat_info'].iloc[[0, 4], :])
print('L0:', df.columns[0])
print('L2:', df.columns[4])

	Nr	Name	Description Domain
1	1	MOSTYPE	Customer Subtype see L0
5	5	MOSHOOFD	Customer main type see L2

L0: zip_agg_customer_subtype
L2: zip_agg_customer_main_type

With the mapping dictionaries specified, all I need now is to use the .replace() method on the appropriate dataframe column.

df[df.columns[0]] = df[df.columns[0]].replace(l0_dict)
df[df.columns[4]] = df[df.columns[4]].replace(l2_dict)

df.head()

	zip_agg_customer_subtype	zip_agg_number_of_houses	zip_agg_avg_size_household	zip_agg_avg_age	zip_agg_customer_main_type	zip_agg_roman_catholic	zip_agg_protestant	zip_agg_other_religion	zip_agg_no_religion	zip_agg_married	...	nbr_fire_policies
0	Lower class large families	1	3	2	Family with grown ups	0	5	1	3	7	...	1
1	Mixed small town dwellers	1	2	2	Family with grown ups	1	4	1	4	6	...	1
2	Mixed small town dwellers	1	2	2	Family with grown ups	0	4	2	4	3	...	1
3	Modern, complete families	1	3	3	Average Family	2	3	2	4	5	...	1
4	Large family farms	1	4	2	Farmers	1	4	1	4	7	...	1

5 rows × 86 columns

Save data

The last step is to save all my hard work back to S3.

df.to_csv(f's3://{bucket}/{prefix}/full.csv', index=False)

Delete Files

To avoid ongoing charges to your account, you can delete the files from S3.

folder_prefix = '1_ins_dataset/'
s3_resource = boto3.resource("s3")

s3_bucket = s3_resource.Bucket(bucket)
s3_bucket.objects.filter(Prefix=folder_prefix).delete()

Using S3 from AWS's SageMaker: reading and writing files

Julie Fisher — Thu, 27 Oct 2022 18:00:00 +0000

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores how to create an MLOps compliant production pipeline using AWS's SageMaker Studio.

SageMaker Studio is a suite of tools that helps manage the infrastructure and collaboration for a machine learning project in the AWS ecosystem. Some of the biggest advantages of SageMaker Studio include:

Ability to spin up hardware resources as needed
Automatically spin down hardware resources once the task is complete
Ability to create a pipeline to automate the machine learning process from preprocessing data through deploying the model

This first post in the series will go over how to pull data from S3 and obtain file metadata. All outputs should match what is in the Notebook unless otherwise specified.

Prerequisites

For brevity, I'll assume that SageMaker Studio and an IAM role with the appropriate permissions have been set up. In a corporate/enterprise environment, these will generally be set up by an administrator or someone on the architecture team.

For directions on setting up the SageMaker environment see Onboard to Amazon SageMaker Domain Using Quick setup
For directions on setting up an AWS account and IAM role see Set Up Amazon SageMaker Prerequisites

This notebook can be run as a Jupyter Notebook in SageMaker Studio or as a standalone SageMaker Jupyter Notebook instance. It may work in a local environment where the AWS credentials are specified, but that use case hasn't been tried or tested. This series is designed to take advantage of the managed infrastructure and other benefits of using SageMaker Studio, so that will be the preferred environment for all posts in the series.

Write data to S3

Note, if you already have data in an S3 bucket, you can skip this step. However, the rest of the code in this post, as well as the rest of the series, uses the data saved to the default S3 bucket in this step.

The first part of any data science project is to get data. In working with AWS and SageMaker, the best practices choice for data storage is S3. S3 is the default for SageMaker inputs and outputs, including things like training data sets and model artifacts.

First, let's put some data into S3. The below cell reads in four files from the Insurance Company Benchmark Data Set hosted on the UCI Machine Learning Repository.

I chose this data set for two main reasons:

The features represent both textual/categorical and numeric data types
Multiple files are used to store the data

It is common to have to clean data prior to training. This process can easily start with data in multiple files that need to go through an ETL (extract, transform, load) process before a final single file is produced.

Read in Data from UCI Repo

First we need to pull in our sample data from the UCI Machine Learning Repository. We can do this with pandas. Like many data scientists, I use pandas as my go-to library for data import/export, storage, and wrangling.

The pandas library now utilizes functionality from the s3fs library, which allows you to work with S3 files the same way you would with files on the local machine. Note, s3fs needs to be installed on the machine you're working on, but it does not need to be imported into the notebook. In my experience it's installed by default in SageMaker notebooks.

import pandas as pd

train_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticdata2000.txt'
test_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticeval2000.txt'
gt_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/tictgts2000.txt'
cols_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/dictionary.txt'

train = pd.read_table(train_uri, header=None)
test = pd.read_table(test_uri, header=None)
ground_truth = pd.read_table(gt_uri, header=None)
columns = pd.read_table(cols_uri, encoding='latin-1')
train.head()

	0	1	2	3	4	5	6	7	8	9	...	79
0	33	1	3	2	8	0	5	1	3	7	...	1
1	37	1	2	2	8	1	4	1	4	6	...	1
2	37	1	2	2	8	0	4	2	4	3	...	1
3	9	1	3	3	3	2	3	2	4	5	...	1
4	40	1	4	2	10	1	4	1	4	7	...	1

5 rows × 86 columns

Set AWS variables

There are several variables you'll need when sending information back and forth across the AWS infrastructure. These are generally permission/access type variables. It's also useful to capture frequently used information like the bucket and prefix to the specific folder you'll be reading and writing to. This also makes it easier to change the folder path should you want to use a different base location.

import sagemaker.session

session = sagemaker.session.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = 'ins_dataset'
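
One way to lean on these variables is a tiny helper that assembles S3 URIs from the bucket, the prefix, and any further path parts. The function name s3_uri and the bucket name below are hypothetical; in the notebook, the bucket would come from session.default_bucket().

```python
def s3_uri(bucket_name, *parts):
    """Join a bucket name and path parts into an s3:// URI."""
    # Strip stray slashes so "raw/" and "/raw" both join cleanly.
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    return "s3://" + "/".join([bucket_name, *cleaned])

# Placeholder values; the real bucket name is account-specific.
example_bucket = "example-bucket"
example_prefix = "ins_dataset"

print(s3_uri(example_bucket, example_prefix, "raw", "train.csv"))
```

Keeping the path logic in one place means a change of base folder only touches the prefix variable.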

Write to S3

Just like with reading in data, you can write data back to S3 using pandas per your usual workflow.

train.to_csv(f's3://{bucket}/{prefix}/raw/train.csv', index=False)
test.to_csv(f's3://{bucket}/{prefix}/raw/test.csv', index=False)
ground_truth.to_csv(f's3://{bucket}/{prefix}/raw/gt.csv', index=False)
columns.to_csv(f's3://{bucket}/{prefix}/raw/metadata/col_info.csv', index=False)

To see your data in AWS, simply print the bucket and prefix name and visit that folder in the AWS console.

f'{bucket}/{prefix}'

The actual output has been removed for security purposes. Here is an example of what the output should look like:

Set Library Dependencies

Frequently, the installed version of a library doesn't match the version needed to run your code. The below cell demonstrates how to load packages as well as upgrade the versions. One of the most frequent library mismatches that I've run into recently is pandas. The default was pandas 1.0.X at the time this post was created. My code generally requires the updates from pandas 1.3.5 or later.

The pandas version you see will probably be different than the one listed in the below output. The second cell below is the code to install or upgrade packages. The third cell is included to double check that the changes you want have been appropriately applied. Note, the code for this particular notebook should run on just about any pandas version >= 1.X.

pd.__version__

'1.0.1'

The below two cells are optional. This code was included as an example of how to update library dependencies.

import sys
!{sys.executable} -m pip install category_encoders
!{sys.executable} -m pip install pandas numpy --upgrade

pd.__version__

'1.3.5'
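
If a notebook should only upgrade when needed, the version string can be compared against a minimum. This is a rough sketch using only the standard library; version_tuple is a hypothetical helper that looks only at the leading numeric parts, so pre-release tags such as rc1 are ignored.

```python
import re

def version_tuple(version):
    """Turn '1.3.5' into (1, 3, 5), keeping only leading numeric components."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

installed = "1.0.1"   # the version printed before the upgrade in this post
required = "1.3.5"    # the minimum mentioned above

# Tuples compare number by number, so (1, 10, 0) correctly sorts after (1, 3, 5),
# which a plain string comparison would get wrong.
needs_upgrade = version_tuple(installed) < version_tuple(required)
print(needs_upgrade)
```

In the notebook, installed would be pd.__version__ rather than a hard-coded string.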

Read from S3

Now we get to the main point of this post: reading in files and metadata from S3. First we need numpy, pandas, and boto3. numpy and pandas are packages for manipulating data, while boto3 facilitates interaction with AWS.

import numpy as np
import pandas as pd
import boto3

Read a single file

Reading a single file is easy if you know the S3 URI. Basically, we can do this the same way we initially read the file from the UCI Repo.

example = pd.read_csv(f's3://{bucket}/{prefix}/raw/train.csv')
example.head()

	0	1	2	3	4	5	6	7	8	9	...	79
0	33	1	3	2	8	0	5	1	3	7	...	1
1	37	1	2	2	8	1	4	1	4	6	...	1
2	37	1	2	2	8	0	4	2	4	3	...	1
3	9	1	3	3	3	2	3	2	4	5	...	1
4	40	1	4	2	10	1	4	1	4	7	...	1

5 rows × 86 columns

Read multiple files

Things get a little trickier when you need to read multiple files from a subdirectory.

List files in subdirectory

You may or may not need to see what's in the folder. But I typically find it handy to be able to confirm what is or isn't there. There are several ways to do this:

boto3.client
boto3.resource
command line

Note, all three of these methods return the name of the subdirectory as well as the files within it. The boto3 methods both work with objects. For more on objects:

See Resources in the Boto3 Docs Developer Guide for examples of working with Objects.
See the Object section of the Boto3 Docs Available Services - S3 for a full list of things that can be done with Objects.

I'll look at these one by one. This is where having bucket and prefix variables comes in really handy. Note, no outputs are included for the cells that would reveal account specific information, including full S3 URIs. To see this information for yourself, please clone the repo and run the notebook.

boto3.client

s3_client = boto3.client("s3")
s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)

The actual output has been removed for security purposes. Here is an example of what the output should look like:

This is a lot of information and rather messy. We can narrow it down to just the information about the files by looking at the 'Contents' key.

s3_client = boto3.client("s3")
s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']

The actual output has been removed for security purposes. Here is an example of what the output should look like:
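
As a stand-in for the removed output, the sketch below builds a hand-written dictionary with the general shape of the 'Contents' list and pulls out just the key and size of each entry. The file names and sizes are borrowed from the listings elsewhere in this post; the dictionary is an abbreviated, assumed example rather than real output, and real entries carry additional fields such as 'LastModified' and 'ETag'.

```python
# Hand-written, abbreviated stand-in for a list_objects_v2 response.
example_response = {
    "KeyCount": 3,
    "Contents": [
        {"Key": "ins_dataset/raw/gt.csv", "Size": 8002},
        {"Key": "ins_dataset/raw/test.csv", "Size": 683549},
        {"Key": "ins_dataset/raw/train.csv", "Size": 1006399},
    ],
}

def keys_and_sizes(resp):
    """Return (key, size_in_bytes) pairs; 'Contents' is absent when nothing matches."""
    return [(obj["Key"], obj["Size"]) for obj in resp.get("Contents", [])]

for key, size in keys_and_sizes(example_response):
    print(f"{size:>10,d}  {key}")
```

Using .get("Contents", []) instead of ["Contents"] avoids a KeyError when the prefix matches no objects.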

boto3.resource

s3_resource = boto3.resource("s3")
s3_bucket = s3_resource.Bucket(bucket)

for object_summary in s3_bucket.objects.filter(Prefix=prefix):
    print(object_summary.key)

ins_dataset/raw/gt.csv
ins_dataset/raw/metadata/col_info.csv
ins_dataset/raw/test.csv
ins_dataset/raw/train.csv
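
One practical difference between the two boto3 approaches: a single list_objects_v2 call returns at most 1,000 keys, while the resource collection pages through results for you. With the client, a paginator covers the same ground. The helper name list_keys is hypothetical, and s3_client is assumed to be the boto3.client("s3") object created above.

```python
def list_keys(s3_client, bucket_name, key_prefix):
    """Collect every key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=key_prefix):
        # Pages with no matching objects have no 'Contents' entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

# Usage in the notebook (requires AWS credentials):
# list_keys(s3_client, bucket, prefix)
```

For the handful of files in this post a single call is enough; the paginator only starts to matter once a prefix holds more than 1,000 objects.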

Command line

file_path = f's3://{bucket}/{prefix}/raw/'
file_path

The actual output has been removed for security purposes. Here is an example of what the output should look like:

We then include the file_path variable in a bash command entered right into the Jupyter cell.

!aws s3 ls $file_path

                           PRE metadata/
2022-10-25 18:56:26       8002 gt.csv
2022-10-25 18:56:26     683549 test.csv
2022-10-25 18:56:25    1006399 train.csv

The size of the files (the middle values in the above output) can be useful for determining the amount of memory needed. I've also used it to choose smaller files for testing/prototyping code before spinning up larger instances to process larger files.

The information can be dumped into a csv file using the following code:

!aws s3 ls $file_path | cat >> files.csv
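
Once the listing is captured as text, the sizes are easy to pull out. A minimal sketch, assuming the aws s3 ls line layout shown above (date, time, size, name); parse_ls is a hypothetical helper, and folder lines marked PRE are skipped.

```python
# The listing text copied from the output shown earlier in this post.
listing = """\
                           PRE metadata/
2022-10-25 18:56:26       8002 gt.csv
2022-10-25 18:56:26     683549 test.csv
2022-10-25 18:56:25    1006399 train.csv
"""

def parse_ls(text):
    """Return {file_name: size_in_bytes} from `aws s3 ls` style output."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split(None, 3)  # date, time, size, name
        if len(fields) == 4 and fields[2].isdigit():
            sizes[fields[3]] = int(fields[2])
    return sizes

file_sizes = parse_ls(listing)
print(file_sizes)
print(sum(file_sizes.values()), "bytes in total")
```

Sorting the resulting dictionary by value gives a quick way to pick the smallest file for prototyping.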

Capture file names and read in

I find the boto3.resource output easiest to work with, so I'll use it to capture the file names and read in what I want.

s3_resource = boto3.resource("s3")
s3_bucket = s3_resource.Bucket(bucket)

file_names = []

for object_summary in s3_bucket.objects.filter(Prefix=prefix):
    if (len(object_summary.key.rsplit('.')) == 2):
        file_names.append(object_summary.key)

file_names

['ins_dataset/raw/gt.csv',
 'ins_dataset/raw/metadata/col_info.csv',
 'ins_dataset/raw/test.csv',
 'ins_dataset/raw/train.csv']

In the above snippet I split on . to ensure that the returned object is a file instead of a folder. Also note that this code goes into subdirectories and returns those files as well.
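
Counting the pieces around . works for the keys in this post, but it would skip a file whose name contains two dots and could mistake a folder for a file when the folder name contains a dot. One alternative, sketched here with the standard library, is to check the suffix of the final path component. The helper name is_file_with_ext and the extra keys are made up for illustration.

```python
from pathlib import PurePosixPath

def is_file_with_ext(key, extensions=(".csv",)):
    """True if the key is not a folder placeholder and ends in an allowed extension."""
    if key.endswith("/"):  # folder placeholder objects end with a slash
        return False
    return PurePosixPath(key).suffix.lower() in extensions

example_keys = [
    "ins_dataset/raw/",               # folder placeholder
    "ins_dataset/raw/train.csv",
    "ins_dataset.v2/raw/",            # made-up: folder name containing a dot
    "ins_dataset/raw/train.v2.csv",   # made-up: file name containing two dots
]

print([key for key in example_keys if is_file_with_ext(key)])
```

Passing a different tuple, for example (".parquet",), narrows the result to another file type.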

These kinds of conditionals can be used in many different ways to return only specific file types (csv vs parquet vs txt, etc) or to grab only portions of the file path. One use case for this would be a dictionary where the key is the file name and the value is the file path.

s3_resource = boto3.resource("s3")
s3_bucket = s3_resource.Bucket(bucket)

files = {}

for object_summary in s3_bucket.objects.filter(Prefix=prefix):
    if len(object_summary.key.rsplit('.')) == 2 and len(object_summary.key.split('/')) <= 3:
        files[object_summary.key.split('/')[-1].split('.')[0]] = f"s3://{bucket}/{object_summary.key}"

files

The actual output has been removed for security purposes. Here is an example of what the output should look like:
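
As a stand-in for the removed output, here is the same name-to-URI logic run on the key list printed earlier, with a placeholder bucket name (example-bucket is made up; the real name is account-specific). It needs no AWS access.

```python
# Keys as printed by the boto3.resource listing earlier in this post.
example_keys = [
    "ins_dataset/raw/gt.csv",
    "ins_dataset/raw/metadata/col_info.csv",
    "ins_dataset/raw/test.csv",
    "ins_dataset/raw/train.csv",
]
example_bucket = "example-bucket"  # placeholder, not a real bucket name

example_files = {}
for key in example_keys:
    is_file = len(key.rsplit(".")) == 2
    is_shallow = len(key.split("/")) <= 3  # skips the metadata subfolder
    if is_file and is_shallow:
        name = key.split("/")[-1].split(".")[0]
        example_files[name] = f"s3://{example_bucket}/{key}"

print(example_files)
```

The depth check is what keeps col_info.csv out of the dictionary, since its key has four path components.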

From the dictionary of 'name' and 'URI' it's easy to create a dictionary of dataframes.

A dictionary of dataframes is extremely useful: each dataframe gets a meaningful name, and the same code works no matter how many files need to be read in.

df_dict = {}

for df_name in files.keys():
    print(df_name)
    df_dict[df_name] = pd.read_table(files[df_name], header=None)

df_dict

gt
test
train

{'gt':       0
 0     0
 1     0
 2     1
 ...  ..
 3998  0
 3999  0
 4000  0

 [4001 rows x 1 columns],

 'test':                                                       0
 0     0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18...
 1     33,1,4,2,8,0,6,0,3,5,0,4,1,1,8,2,2,6,0,0,1,2,6...
 2     6,1,3,2,2,0,5,0,4,5,2,2,1,4,5,5,4,0,5,0,0,4,0,...
 ...                                                 ...
 3998  36,1,2,3,8,1,5,1,3,7,0,2,2,5,3,2,3,4,2,0,0,3,4...
 3999  33,1,3,3,8,1,4,2,3,7,1,2,2,3,4,1,3,5,1,1,1,2,3...
 4000  8,1,2,3,2,4,3,0,3,5,2,2,0,6,3,8,0,1,8,0,0,0,0,...

 [4001 rows x 1 columns],

 'train':                                                       0
 0     0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18...
 1     33,1,3,2,8,0,5,1,3,7,0,2,1,2,6,1,2,7,1,0,1,2,5...
 2     37,1,2,2,8,1,4,1,4,6,2,2,0,4,5,0,5,4,0,0,0,5,0...
 ...                                                 ...
 5820  33,1,3,4,8,0,6,0,3,5,1,4,3,3,4,0,1,8,1,0,0,2,3...
 5821  34,1,3,2,8,0,7,0,2,7,2,0,0,4,5,0,2,7,0,2,0,2,4...
 5822  33,1,3,3,8,0,6,1,2,7,1,2,1,4,4,1,2,6,1,0,1,3,2...

 [5823 rows x 1 columns]}
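
Each dataframe above has a single column because the files were written with to_csv, which separates values with commas, while pd.read_table expects tabs by default. With header=None, the header row written by to_csv also shows up as the first data row. A minimal sketch of the difference, using an in-memory string in place of an S3 file (the three-column sample is a trimmed, illustrative slice, not the full data):

```python
import io
import pandas as pd

# A tiny comma-separated sample, shaped like the files written with to_csv above.
csv_text = "0,1,2\n33,1,3\n37,1,2\n"

# Tab is the default separator for read_table, so each line becomes one string.
as_table = pd.read_table(io.StringIO(csv_text), header=None)
print(as_table.shape)

# read_csv splits on commas and treats the first line as the header.
as_csv = pd.read_csv(io.StringIO(csv_text))
print(as_csv.shape)
```

In the loop above, swapping in pd.read_csv(files[df_name]) would be the equivalent change if separate columns are wanted.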

Delete Files

To avoid ongoing storage charges on your account, you can delete the files from S3.

prefix = prefix + '/'

s3_bucket = s3_resource.Bucket(bucket)
s3_bucket.objects.filter(Prefix=prefix).delete()

The actual output has been removed for security purposes. Here is an example of what the output should look like:
