<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: SHH</title>
    <description>The latest articles on Forem by SHH (@hew).</description>
    <link>https://forem.com/hew</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3553372%2Fe489d047-1914-4f93-a4f0-4cf902dae71f.png</url>
      <title>Forem: SHH</title>
      <link>https://forem.com/hew</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hew"/>
    <language>en</language>
    <item>
      <title>How to Build a Spam Detector with ML and Python</title>
      <dc:creator>SHH</dc:creator>
      <pubDate>Mon, 27 Oct 2025 00:51:58 +0000</pubDate>
      <link>https://forem.com/hew/how-to-build-a-spam-detector-with-ml-and-python-3b5p</link>
      <guid>https://forem.com/hew/how-to-build-a-spam-detector-with-ml-and-python-3b5p</guid>
<description>&lt;p&gt;All modern spam detection systems rely on machine learning. Given sufficient training data, ML has proven superior to hand-written rules at many classification tasks.&lt;/p&gt;

&lt;p&gt;This tutorial shows you how to build a spam detector using supervised learning. More specifically, you will use Python to train a logistic regression model that classifies emails into spam and non-spam. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You will work with NumPy, SciPy, scikit-learn, and Matplotlib:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download the &lt;a href="https://www.cs.ubc.ca/~murphyk/Teaching/Stat406-Spring10/spamData.mat" rel="noopener noreferrer"&gt;spam dataset&lt;/a&gt; consisting of 4,601 emails, split roughly 2:1 into a training and a test set. Each email is labeled as either 0 (non-spam) or 1 (spam) and comes with 57 features: frequency percentages of 48 words and 6 characters, plus 3 statistics on runs of consecutive capital letters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features = np.array(
    [
        "word_freq_make",
        "word_freq_address",
        "word_freq_all",
        "word_freq_3d",
        "word_freq_our",
        "word_freq_over",
        "word_freq_remove",
        "word_freq_internet",
        "word_freq_order",
        "word_freq_mail",
        "word_freq_receive",
        "word_freq_will",
        "word_freq_people",
        "word_freq_report",
        "word_freq_addresses",
        "word_freq_free",
        "word_freq_business",
        "word_freq_email",
        "word_freq_you",
        "word_freq_credit",
        "word_freq_your",
        "word_freq_font",
        "word_freq_000",
        "word_freq_money",
        "word_freq_hp",
        "word_freq_hpl",
        "word_freq_george",
        "word_freq_650",
        "word_freq_lab",
        "word_freq_labs",
        "word_freq_telnet",
        "word_freq_857",
        "word_freq_data",
        "word_freq_415",
        "word_freq_85",
        "word_freq_technology",
        "word_freq_1999",
        "word_freq_parts",
        "word_freq_pm",
        "word_freq_direct",
        "word_freq_cs",
        "word_freq_meeting",
        "word_freq_original",
        "word_freq_project",
        "word_freq_re",
        "word_freq_edu",
        "word_freq_table",
        "word_freq_conference",
        "char_freq_;",
        "char_freq_(",
        "char_freq_[",
        "char_freq_!",
        "char_freq_$",
        "char_freq_#",
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load the data
&lt;/h2&gt;

&lt;p&gt;First, load the data into appropriate train/test variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]
N = X.shape[0]
D = X.shape[1]
Xtest = data["Xtest"]
Ntest = Xtest.shape[0]
y = data["ytrain"].squeeze().astype(int)
ytest = data["ytest"].squeeze().astype(int)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, normalize the scale of each feature by computing z-scores. Note that the test set is standardized with the training set's mean and standard deviation, so that no information leaks from the test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
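&lt;p&gt;Why reuse the training mean and standard deviation for the test set? Standardizing each set with its own statistics would leak information and make the two sets inconsistent. A minimal, self-contained sketch with synthetic stand-ins for the data:&lt;/p&gt;

```python
import numpy as np

# Synthetic stand-ins for Xtrain/Xtest (the article loads these from spamData.mat).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
Xtest = rng.normal(loc=5.0, scale=3.0, size=(40, 4))

mu = np.mean(X, axis=0)     # training statistics only...
sd = np.std(X, axis=0)
Xz = (X - mu) / sd
Xtestz = (Xtest - mu) / sd  # ...are reused for the test set

# Training columns now have mean 0 and standard deviation 1.
print(np.allclose(Xz.mean(axis=0), 0.0), np.allclose(Xz.std(axis=0), 1.0))  # True True
```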



&lt;h2&gt;
  
  
  Define the logistic regression model
&lt;/h2&gt;

&lt;p&gt;Define helper functions and the log-likelihood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def logsumexp(x):
    # Subtract the max before exponentiating to avoid overflow.
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    # log sigma(x) = -logsumexp([0, -x]), stable for large |x|.
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]), -x)))

def l(y, X, w):
    # Bernoulli log-likelihood of labels y given data X and weights w.
    return np.sum(y * logsigma(X.dot(w)) + (1 - y) * logsigma(-X.dot(w)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
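&lt;p&gt;The max-shift inside &lt;code&gt;logsumexp&lt;/code&gt; is what keeps the computation numerically stable: a naive &lt;code&gt;np.log(np.sum(np.exp(x)))&lt;/code&gt; overflows for large inputs. A small self-contained check (the function is repeated so the snippet runs on its own):&lt;/p&gt;

```python
import numpy as np

def logsumexp(x):
    # Subtracting the max keeps np.exp from overflowing.
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))  # 1000 + log(2), about 1000.6931
```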



&lt;p&gt;Define the gradient of the log-likelihood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sigma(x):
    # Equivalent to exp(x)/(1 + exp(x)), but cannot overflow for large positive x.
    return 1 / (1 + np.exp(-x))

def dl(y, X, w):
    # Gradient of the log-likelihood: X^T (y - sigma(Xw)).
    return (y - sigma(X.dot(w))).dot(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
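&lt;p&gt;Before running gradient descent, it is worth verifying the analytic gradient against finite differences. A self-contained sketch (it re-declares &lt;code&gt;sigma&lt;/code&gt;, the log-likelihood, and its gradient; for the small random values used here, the plain &lt;code&gt;np.log&lt;/code&gt; form is stable enough):&lt;/p&gt;

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def l(y, X, w):
    # Bernoulli log-likelihood; np.log(sigma(.)) is fine for moderate inputs.
    p = X.dot(w)
    return np.sum(y * np.log(sigma(p)) + (1 - y) * np.log(sigma(-p)))

def dl(y, X, w):
    # Analytic gradient: X^T (y - sigma(Xw)).
    return (y - sigma(X.dot(w))).dot(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(int)
w = rng.normal(size=3)

# Central finite differences along each coordinate direction.
h = 1e-6
num = np.array(
    [(l(y, X, w + h * e) - l(y, X, w - h * e)) / (2 * h) for e in np.eye(3)]
)
print(np.allclose(num, dl(y, X, w), atol=1e-4))  # True
```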



&lt;h2&gt;
  
  
  Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)
&lt;/h2&gt;

&lt;p&gt;Here is a Python framework for implementing GD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0

    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])

        values[epoch + 1] = f(theta)
        # A step that increased the objective halves the step size;
        # otherwise the step size grows slightly for faster progress.
        if values[epoch] &amp;lt; values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05

    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps

def gd(y, X):
    def objective(w):
        # Minimizing the negative log-likelihood maximizes l.
        return -l(y, X, w)

    def update(w, eps):
        # Gradient-ascent step on l, i.e., descent on the objective.
        return w + eps * dl(y, X, w)

    return (objective, update)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
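&lt;p&gt;To see the adaptive step size at work without the spam data, you can run the same framework on a toy quadratic. A condensed, self-contained copy of &lt;code&gt;optimize&lt;/code&gt; (progress printing omitted) applied to f(theta) = theta^2:&lt;/p&gt;

```python
import numpy as np

# Condensed copy of the article's optimize(), without progress printing.
def optimize(obj_up, theta0, nepochs=50, eps0=0.01):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0
    for epoch in range(nepochs):
        theta = update(theta, eps[epoch])
        values[epoch + 1] = f(theta)
        # A worse step halves the step size; a better one grows it slightly.
        if values[epoch + 1] > values[epoch]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05
    return theta, values, eps

# Toy problem: minimize f(theta) = theta^2 with gradient step theta - eps * 2 * theta.
toy = (lambda t: t * t, lambda t, eps: t - eps * 2 * t)
theta, values, _ = optimize(toy, theta0=5.0, nepochs=100)
print(values[0] > values[-1], 1e-3 > abs(theta))  # True True
```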



&lt;p&gt;You can now run GD to obtain optimized weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.random.seed(0)
w0 = np.random.normal(size=D)
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Predict
&lt;/h2&gt;

&lt;p&gt;Finally, you can define a predictor that outputs spam confidence values, along with a classifier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def predict(Xtest, w):
    return sigma(Xtest.dot(w))

def classify(Xtest, w, threshold=0.5):
    # 0.5 is a natural starting point; revisit it after inspecting
    # the precision-recall curve.
    return (sigma(Xtest.dot(w)) &amp;gt; threshold).astype(int)

yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
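&lt;p&gt;With predictions in hand, scoring the classifier is a one-liner via &lt;code&gt;sklearn.metrics&lt;/code&gt;. A self-contained sketch on synthetic data (the weight vector &lt;code&gt;w_true&lt;/code&gt; below is made up for illustration; since it also generated the labels, accuracy comes out perfect):&lt;/p&gt;

```python
import numpy as np
import sklearn.metrics

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(Xtest, w, threshold=0.5):
    return (sigma(Xtest.dot(w)) > threshold).astype(int)

# Synthetic stand-in: a known weight vector generates the labels, so a
# classifier using that same vector scores perfectly.
rng = np.random.default_rng(2)
w_true = np.array([2.0, -3.0, 1.0])
Xtest = rng.normal(size=(200, 3))
ytest = (Xtest.dot(w_true) > 0).astype(int)

ypred = classify(Xtest, w_true)
print(sklearn.metrics.accuracy_score(ytest, ypred))  # 1.0
```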



&lt;p&gt;Plot the precision-recall-curve to find a better threshold value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
# a threshold of about 0.44 gives a good precision/recall trade-off here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8livmd0lscs4sxuhmpfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8livmd0lscs4sxuhmpfa.png" alt="Precision-recall curve with threshold annotations" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have a look at the largest weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features[wz_gd &amp;gt; 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unsurprisingly, you will find that &lt;code&gt;char_freq_$&lt;/code&gt; and &lt;code&gt;capital_run_length_longest&lt;/code&gt; carry large positive weights, i.e., spam emails frequently contain dollar signs and long runs of capital letters.&lt;/p&gt;
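&lt;p&gt;Boolean masks and &lt;code&gt;np.argsort&lt;/code&gt; are both handy for this kind of weight inspection. A tiny self-contained illustration with made-up weights for three of the features:&lt;/p&gt;

```python
import numpy as np

# Hypothetical weights, for illustration only.
features = np.array(["char_freq_$", "word_freq_free", "word_freq_hp"])
w = np.array([2.5, 0.7, -1.9])

print(features[w > 2])                   # boolean mask, as in the article
print(features[np.argsort(np.abs(w))])  # sorted by absolute impact, ascending
```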

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you built an email spam detector using machine learning and Python. If you want more practice, try finding another dataset and building a binary classification model using the framework introduced here.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Data Science Tech Stack You Must Master in 2025</title>
      <dc:creator>SHH</dc:creator>
      <pubDate>Thu, 09 Oct 2025 22:08:07 +0000</pubDate>
      <link>https://forem.com/hew/the-data-science-tech-stack-you-must-master-in-2025-1fkd</link>
      <guid>https://forem.com/hew/the-data-science-tech-stack-you-must-master-in-2025-1fkd</guid>
<description>&lt;p&gt;It is 2025, and everybody and their grandma has asked ChatGPT about the meaning of life. While we cannot be sure whether its answer was hallucinated, we do know that LLMs are developed using &lt;a href="https://python.org" rel="noopener noreferrer"&gt;Python&lt;/a&gt;. Data scientists today are expected to work with AI/ML models, and therefore with Python (see below), effectively settling the age-old "Python vs. R" debate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package and environment manager
&lt;/h2&gt;

&lt;p&gt;To keep your projects tidy and your code reproducible on any machine, you need to track the version numbers of a project's dependencies. Package and environment managers help you with that. There have been many package managers (conda, pip, etc.) and perhaps even more virtual-environment managers (virtualenv, pyenv, pipenv, etc.). The consensus nowadays is to simply use &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt;, as it combines both functions and is faster than the alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development environment
&lt;/h2&gt;

&lt;p&gt;Jupyter notebooks are great for getting started: they are easy to set up and can be run interactively (cell by cell). However, in the real world you will be expected to ship code to production as scripts and apps, not notebooks.&lt;/p&gt;

&lt;p&gt;You could copy-paste code from a Jupyter notebook into a text editor, but there's a more convenient way: integrated development environments (IDEs) like &lt;a href="https://code.visualstudio.com" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;. Not only do they combine a file explorer, text editor, and terminal in one application, they also offer many extensions, such as code formatters and linters, that will make your life easier. Plus, you don't need to give up Jupyter notebooks: you can create and run them inside VS Code/Cursor. Lastly, IDEs let you take advantage of AI features like tab/auto completions, making you even more productive.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to get started with VS Code and uv
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Download and install VS Code: &lt;a href="https://code.visualstudio.com/Download" rel="noopener noreferrer"&gt;https://code.visualstudio.com/Download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install uv by executing the following command in your VS Code terminal:

&lt;ul&gt;
&lt;li&gt;(Linux/MacOS) &lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;(Windows) &lt;code&gt;powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install Python: &lt;code&gt;uv python install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install a package into the active environment: &lt;code&gt;uv pip install pandas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a project: &lt;code&gt;uv init your-project-name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add a dependency to your project lock file: &lt;code&gt;uv add pandas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a script &lt;code&gt;example.py&lt;/code&gt; containing &lt;code&gt;print('Hello world!')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run a script: &lt;code&gt;uv run example.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Skills
&lt;/h2&gt;

&lt;p&gt;I have analyzed postings of &lt;a href="https://mljobs.io/data-science" rel="noopener noreferrer"&gt;data science jobs&lt;/a&gt; from AI frontier labs, like OpenAI, to identify those skills that are most likely &lt;strong&gt;future-proof&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Programming languages
&lt;/h4&gt;

&lt;p&gt;Python and SQL are listed as required qualifications in all listings. R was not mentioned even once.&lt;/p&gt;

&lt;h4&gt;
  
  
  General capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Design statistical experiments&lt;/li&gt;
&lt;li&gt;Conduct A/B tests&lt;/li&gt;
&lt;li&gt;Define and operationalize metrics&lt;/li&gt;
&lt;li&gt;Visualize results, dashboarding&lt;/li&gt;
&lt;li&gt;Communicate with stakeholders&lt;/li&gt;
&lt;li&gt;Prototyping&lt;/li&gt;
&lt;li&gt;Run simulations&lt;/li&gt;
&lt;li&gt;Version control (git)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Frameworks, modules and tools
&lt;/h4&gt;

&lt;p&gt;A list of tools that are popular, though not necessarily required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas, NumPy, scikit-learn, Flask&lt;/li&gt;
&lt;li&gt;Seaborn/Matplotlib, Tableau/Power BI&lt;/li&gt;
&lt;li&gt;GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI will certainly change how data scientists work going forward. However, I believe that LLMs will not replace them. Instead, there will be a growing need for capable data scientists who can uncover the failure modes of today's AIs and design better systems.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
