<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: SHH</title>
    <description>The latest articles on Forem by SHH (@hew).</description>
    <link>https://forem.com/hew</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3553372%2Fe489d047-1914-4f93-a4f0-4cf902dae71f.png</url>
      <title>Forem: SHH</title>
      <link>https://forem.com/hew</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hew"/>
    <language>en</language>
    <item>
      <title>How to Build a Spam Detector with ML and Python</title>
      <dc:creator>SHH</dc:creator>
      <pubDate>Mon, 27 Oct 2025 00:51:58 +0000</pubDate>
      <link>https://forem.com/hew/how-to-build-a-spam-detector-with-ml-and-python-3b5p</link>
      <guid>https://forem.com/hew/how-to-build-a-spam-detector-with-ml-and-python-3b5p</guid>
<description>&lt;p&gt;All modern spam detection systems rely on machine learning. Given sufficient training data, ML has proven superior to hand-written rules at many classification tasks.&lt;/p&gt;

&lt;p&gt;This tutorial shows you how to build a spam detector using supervised learning. More specifically, you will use Python to train a logistic regression model that classifies emails into spam and non-spam. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You will work with NumPy, SciPy, scikit-learn, and Matplotlib:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download the &lt;a href="https://www.cs.ubc.ca/~murphyk/Teaching/Stat406-Spring10/spamData.mat" rel="noopener noreferrer"&gt;spam dataset&lt;/a&gt; consisting of 4,601 emails, split roughly 2:1 into a training and a test set. Each email is labeled as either 0 (non-spam) or 1 (spam) and comes with 57 features: frequency percentages of 48 words and 6 characters, plus 3 statistics on runs of consecutive capital letters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features = np.array(
    [
        "word_freq_make",
        "word_freq_address",
        "word_freq_all",
        "word_freq_3d",
        "word_freq_our",
        "word_freq_over",
        "word_freq_remove",
        "word_freq_internet",
        "word_freq_order",
        "word_freq_mail",
        "word_freq_receive",
        "word_freq_will",
        "word_freq_people",
        "word_freq_report",
        "word_freq_addresses",
        "word_freq_free",
        "word_freq_business",
        "word_freq_email",
        "word_freq_you",
        "word_freq_credit",
        "word_freq_your",
        "word_freq_font",
        "word_freq_000",
        "word_freq_money",
        "word_freq_hp",
        "word_freq_hpl",
        "word_freq_george",
        "word_freq_650",
        "word_freq_lab",
        "word_freq_labs",
        "word_freq_telnet",
        "word_freq_857",
        "word_freq_data",
        "word_freq_415",
        "word_freq_85",
        "word_freq_technology",
        "word_freq_1999",
        "word_freq_parts",
        "word_freq_pm",
        "word_freq_direct",
        "word_freq_cs",
        "word_freq_meeting",
        "word_freq_original",
        "word_freq_project",
        "word_freq_re",
        "word_freq_edu",
        "word_freq_table",
        "word_freq_conference",
        "char_freq_;",
        "char_freq_(",
        "char_freq_[",
        "char_freq_!",
        "char_freq_$",
        "char_freq_#",
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load the data
&lt;/h2&gt;

&lt;p&gt;First, load the data into appropriate train/test variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]
N = X.shape[0]
D = X.shape[1]
Xtest = data["Xtest"]
Ntest = Xtest.shape[0]
y = data["ytrain"].squeeze().astype(int)
ytest = data["ytest"].squeeze().astype(int)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, normalize the scale of each feature by computing z-scores. Note that the test set is standardized with the training set's mean and standard deviation, so that no information leaks from the test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
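&lt;p&gt;Why reuse the training mean and standard deviation for the test set? Standardizing each set with its own statistics would leak information and make the two sets inconsistent. A minimal, self-contained sketch with synthetic stand-ins for the data:&lt;/p&gt;

```python
import numpy as np

# Synthetic stand-ins for Xtrain/Xtest (the article loads these from spamData.mat).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
Xtest = rng.normal(loc=5.0, scale=3.0, size=(40, 4))

mu = np.mean(X, axis=0)     # training statistics only...
sd = np.std(X, axis=0)
Xz = (X - mu) / sd
Xtestz = (Xtest - mu) / sd  # ...are reused for the test set

# Training columns now have mean 0 and standard deviation 1.
print(np.allclose(Xz.mean(axis=0), 0.0), np.allclose(Xz.std(axis=0), 1.0))  # True True
```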



&lt;h2&gt;
  
  
  Define the logistic regression model
&lt;/h2&gt;

&lt;p&gt;Define helper functions and the log-likelihood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def logsumexp(x):
    # Subtract the max before exponentiating to avoid overflow.
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    # log sigma(x) = -logsumexp([0, -x]), stable for large |x|.
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]), -x)))

def l(y, X, w):
    # Bernoulli log-likelihood of labels y given data X and weights w.
    return np.sum(y * logsigma(X.dot(w)) + (1 - y) * logsigma(-X.dot(w)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
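&lt;p&gt;The max-shift inside &lt;code&gt;logsumexp&lt;/code&gt; is what keeps the computation numerically stable: a naive &lt;code&gt;np.log(np.sum(np.exp(x)))&lt;/code&gt; overflows for large inputs. A small self-contained check (the function is repeated so the snippet runs on its own):&lt;/p&gt;

```python
import numpy as np

def logsumexp(x):
    # Subtracting the max keeps np.exp from overflowing.
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))  # 1000 + log(2), about 1000.6931
```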



&lt;p&gt;Define the gradient of the log-likelihood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sigma(x):
    # Equivalent to exp(x)/(1 + exp(x)), but cannot overflow for large positive x.
    return 1 / (1 + np.exp(-x))

def dl(y, X, w):
    # Gradient of the log-likelihood: X^T (y - sigma(Xw)).
    return (y - sigma(X.dot(w))).dot(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
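&lt;p&gt;Before running gradient descent, it is worth verifying the analytic gradient against finite differences. A self-contained sketch (it re-declares &lt;code&gt;sigma&lt;/code&gt;, the log-likelihood, and its gradient; for the small random values used here, the plain &lt;code&gt;np.log&lt;/code&gt; form is stable enough):&lt;/p&gt;

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def l(y, X, w):
    # Bernoulli log-likelihood; np.log(sigma(.)) is fine for moderate inputs.
    p = X.dot(w)
    return np.sum(y * np.log(sigma(p)) + (1 - y) * np.log(sigma(-p)))

def dl(y, X, w):
    # Analytic gradient: X^T (y - sigma(Xw)).
    return (y - sigma(X.dot(w))).dot(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(int)
w = rng.normal(size=3)

# Central finite differences along each coordinate direction.
h = 1e-6
num = np.array(
    [(l(y, X, w + h * e) - l(y, X, w - h * e)) / (2 * h) for e in np.eye(3)]
)
print(np.allclose(num, dl(y, X, w), atol=1e-4))  # True
```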



&lt;h2&gt;
  
  
  Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)
&lt;/h2&gt;

&lt;p&gt;Here is a Python framework for implementing GD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0

    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])

        values[epoch + 1] = f(theta)
        # A step that increased the objective halves the step size;
        # otherwise the step size grows slightly for faster progress.
        if values[epoch] &amp;lt; values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05

    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps

def gd(y, X):
    def objective(w):
        # Minimizing the negative log-likelihood maximizes l.
        return -l(y, X, w)

    def update(w, eps):
        # Gradient-ascent step on l, i.e., descent on the objective.
        return w + eps * dl(y, X, w)

    return (objective, update)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
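&lt;p&gt;To see the adaptive step size at work without the spam data, you can run the same framework on a toy quadratic. A condensed, self-contained copy of &lt;code&gt;optimize&lt;/code&gt; (progress printing omitted) applied to f(theta) = theta^2:&lt;/p&gt;

```python
import numpy as np

# Condensed copy of the article's optimize(), without progress printing.
def optimize(obj_up, theta0, nepochs=50, eps0=0.01):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0
    for epoch in range(nepochs):
        theta = update(theta, eps[epoch])
        values[epoch + 1] = f(theta)
        # A worse step halves the step size; a better one grows it slightly.
        if values[epoch + 1] > values[epoch]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05
    return theta, values, eps

# Toy problem: minimize f(theta) = theta^2 with gradient step theta - eps * 2 * theta.
toy = (lambda t: t * t, lambda t, eps: t - eps * 2 * t)
theta, values, _ = optimize(toy, theta0=5.0, nepochs=100)
print(values[0] > values[-1], 1e-3 > abs(theta))  # True True
```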



&lt;p&gt;You can now run GD to obtain optimized weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.random.seed(0)
w0 = np.random.normal(size=D)
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Predict
&lt;/h2&gt;

&lt;p&gt;Finally, you can define a predictor that outputs spam confidence values, along with a classifier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def predict(Xtest, w):
    return sigma(Xtest.dot(w))

def classify(Xtest, w, threshold=0.5):
    # 0.5 is a natural starting point; revisit it after inspecting
    # the precision-recall curve.
    return (sigma(Xtest.dot(w)) &amp;gt; threshold).astype(int)

yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
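&lt;p&gt;With predictions in hand, scoring the classifier is a one-liner via &lt;code&gt;sklearn.metrics&lt;/code&gt;. A self-contained sketch on synthetic data (the weight vector &lt;code&gt;w_true&lt;/code&gt; below is made up for illustration; since it also generated the labels, accuracy comes out perfect):&lt;/p&gt;

```python
import numpy as np
import sklearn.metrics

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(Xtest, w, threshold=0.5):
    return (sigma(Xtest.dot(w)) > threshold).astype(int)

# Synthetic stand-in: a known weight vector generates the labels, so a
# classifier using that same vector scores perfectly.
rng = np.random.default_rng(2)
w_true = np.array([2.0, -3.0, 1.0])
Xtest = rng.normal(size=(200, 3))
ytest = (Xtest.dot(w_true) > 0).astype(int)

ypred = classify(Xtest, w_true)
print(sklearn.metrics.accuracy_score(ytest, ypred))  # 1.0
```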



&lt;p&gt;Plot the precision-recall-curve to find a better threshold value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
# a threshold of about 0.44 gives a good precision/recall trade-off here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8livmd0lscs4sxuhmpfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8livmd0lscs4sxuhmpfa.png" alt="Precision-recall curve with threshold annotations" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have a look at the largest weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features[wz_gd &amp;gt; 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unsurprisingly, you will find that &lt;code&gt;char_freq_$&lt;/code&gt; and &lt;code&gt;capital_run_length_longest&lt;/code&gt; carry large positive weights, i.e., spam emails frequently contain dollar signs and long runs of capital letters.&lt;/p&gt;
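&lt;p&gt;Boolean masks and &lt;code&gt;np.argsort&lt;/code&gt; are both handy for this kind of weight inspection. A tiny self-contained illustration with made-up weights for three of the features:&lt;/p&gt;

```python
import numpy as np

# Hypothetical weights, for illustration only.
features = np.array(["char_freq_$", "word_freq_free", "word_freq_hp"])
w = np.array([2.5, 0.7, -1.9])

print(features[w > 2])                   # boolean mask, as in the article
print(features[np.argsort(np.abs(w))])  # sorted by absolute impact, ascending
```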

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you built an email spam detector using machine learning and Python. If you want more practice, try finding another dataset and building a binary classification model using the framework introduced here.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Data Science Tech Stack You Must Master in 2025</title>
      <dc:creator>SHH</dc:creator>
      <pubDate>Thu, 09 Oct 2025 22:08:07 +0000</pubDate>
      <link>https://forem.com/hew/the-data-science-tech-stack-you-must-master-in-2025-1fkd</link>
      <guid>https://forem.com/hew/the-data-science-tech-stack-you-must-master-in-2025-1fkd</guid>
<description>&lt;p&gt;It is 2025, and everybody and their grandma has asked ChatGPT about the meaning of life. While we cannot be sure whether its answer was hallucinated, we do know that LLMs are developed using &lt;a href="https://python.org" rel="noopener noreferrer"&gt;Python&lt;/a&gt;. Data scientists today are expected to work with AI/ML models, and therefore with Python (see below), effectively settling the age-old "Python vs. R" debate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package and environment manager
&lt;/h2&gt;

&lt;p&gt;To keep your projects tidy and your code reproducible on any machine, you need to track the version numbers of a project's dependencies. Package and environment managers help you with that. There have been many package managers (conda, pip, etc.) and perhaps even more virtual-environment managers (virtualenv, pyenv, pipenv, etc.). The consensus nowadays is to simply use &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt;, as it combines both functions and is faster than the alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development environment
&lt;/h2&gt;

&lt;p&gt;Jupyter notebooks are great for getting started: they are easy to set up and can be run interactively (cell by cell). However, in the real world you will be expected to ship code to production as scripts and apps, not notebooks.&lt;/p&gt;

&lt;p&gt;You could copy-paste code from a Jupyter notebook into a text editor, but there's a more convenient way: integrated development environments (IDEs) like &lt;a href="https://code.visualstudio.com" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;. Not only do they combine a file explorer, text editor, and terminal in one application, they also offer many extensions, such as code formatters and linters, that will make your life easier. Plus, you don't need to give up Jupyter notebooks: you can create and run them inside VS Code/Cursor. Lastly, IDEs let you take advantage of AI features like tab/auto completions, making you even more productive.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to get started with VS Code and uv
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Download and install VS Code: &lt;a href="https://code.visualstudio.com/Download" rel="noopener noreferrer"&gt;https://code.visualstudio.com/Download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install uv by executing the following command in your VS Code terminal:

&lt;ul&gt;
&lt;li&gt;(Linux/MacOS) &lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;(Windows) &lt;code&gt;powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install Python: &lt;code&gt;uv python install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install a package into the active environment: &lt;code&gt;uv pip install pandas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a project: &lt;code&gt;uv init your-project-name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add a dependency to your project lock file: &lt;code&gt;uv add pandas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a script &lt;code&gt;example.py&lt;/code&gt; containing &lt;code&gt;print('Hello world!')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run a script: &lt;code&gt;uv run example.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Skills
&lt;/h2&gt;

&lt;p&gt;I have analyzed postings of &lt;a href="https://mljobs.io/data-science" rel="noopener noreferrer"&gt;data science jobs&lt;/a&gt; from AI frontier labs, like OpenAI, to identify those skills that are most likely &lt;strong&gt;future-proof&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Programming languages
&lt;/h4&gt;

&lt;p&gt;Python and SQL are listed as required qualifications in all listings. R was not mentioned even once.&lt;/p&gt;

&lt;h4&gt;
  
  
  General capabilities
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Design statistical experiments&lt;/li&gt;
&lt;li&gt;Conduct A/B tests&lt;/li&gt;
&lt;li&gt;Define and operationalize metrics&lt;/li&gt;
&lt;li&gt;Visualize results, dashboarding&lt;/li&gt;
&lt;li&gt;Communicate with stakeholders&lt;/li&gt;
&lt;li&gt;Prototyping&lt;/li&gt;
&lt;li&gt;Run simulations&lt;/li&gt;
&lt;li&gt;Version control (git)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Frameworks, modules and tools
&lt;/h4&gt;

&lt;p&gt;A list of tools that are popular, though not necessarily required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas, NumPy, scikit-learn, Flask&lt;/li&gt;
&lt;li&gt;Seaborn/Matplotlib, Tableau/Power BI&lt;/li&gt;
&lt;li&gt;GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI will certainly change how data scientists work going forward. However, I believe that LLMs will not replace them. Instead, there will be a growing need for capable data scientists who can uncover the failure modes of today's AIs and design better systems.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
