Forem: Leonhard Kwahle

Reinforcement Learning: How Machines Learn from Trial and Error

Leonhard Kwahle — Mon, 02 Jun 2025 12:40:57 +0000

"The only real mistake is the one from which we learn nothing.” — John Powell

Before diving deep into the subject at hand, it's important to recognize that every Machine Learning algorithm is, at its core, an attempt to formalize something that humans, animals, or other living organisms naturally do in pursuit of their well-being.
So don’t get overwhelmed by the math or jargon too early. At its heart, RL is just a formal way of describing how we — as living beings — learn from experience. If you can relate it back to how you’ve personally learned things through trial and error, you’re already thinking like a reinforcement learning researcher.

Stay curious, connect concepts to your own experiences, and let intuition guide your technical understanding.

1. What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning in which an agent learns by interacting with an environment and establishes a policy based on the responses (rewards) it receives from that environment.

A policy is "a deliberate system of guidelines to guide decisions and achieve rational outcomes" (Wikipedia).

An agent is simply a representative acting on someone's behalf.

Imagine you are sent to represent someone, but you do not yet know
what is right or wrong.
All you can do is learn by trial and error:
do more of whatever brings good results,
and avoid whatever brings bad results.

You will need to spend some time exploring before you can
tell good actions from bad ones.

Reinforcement Learning has evolved from three major research threads:

Trial-and-error Learning from psychology
Optimal Control Theory
Temporal-difference learning

You can learn more about these threads in this blog post:🏃‍➡️👉here

2. What are the elements of Reinforcement Learning?

-A policy: a rule that defines which action the agent should take in a given state. Psychology summarizes it as a set of stimulus-response rules. It defines how the learning agent behaves at any given time.

-A reward signal: this defines the goal of reinforcement learning. A reward is a single number sent by the environment to the agent at each time step. The objective is to maximize the total reward earned in the long run. Rewards define which events are good and bad for the learning agent.

-A value function: a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate in the future, starting from that state. This means that in an RL system value is not the same as reward. Rewards are the immediate signals from the environment, while values indicate the long-term return from a state after taking into account the states that are likely to follow and the rewards available in those states.

-A model of the environment: something that mimics the behavior of the environment. For example, given a state and an action, the model might predict the resulting next state and next reward.

Other important terms

-State (S) - The current situation of the environment
-Action (A) – What the agent can do

3. How does an agent learn in an environment?

The agent tries actions and observes the results:

Takes an action

Gets a reward

Updates its behavior (policy)

This loop continues until it finds the best strategy, called an optimal policy.

Let's do a small, primitive hands-on exercise.

Check out the working code here

problem: Think of a smart thermostat that learns the best time to turn the heater on or off to keep the room comfortable and save energy. It gets positive rewards for comfort and negative rewards for wasting energy.

Environmental states

too hot

normal

too cold

Agents actions

Turn ON heater

Turn OFF heater

Reward response

+ 1 if room is comfortable

-1 if room is too hot or cold

OBJECTIVE : maximize reward

i. Import numpy and random

import numpy as np
import random

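ii. Define the states and keep track of how many there are

# Define the states, ordered from coldest to hottest so that a higher index means a warmer room.
# The names match the keys of the reward matrix: "Cold" = too cold, "Okay" = normal, "Hot" = too hot.
states = ["Cold", "Okay", "Hot"]
n_states = len(states)  # the number of states

The list above plays the role of a state space. When you hear "space", think of it as a set. In RL libraries such as Gymnasium the equivalent object is called the observation space, and env.observation_space.sample() draws a random element from it.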

iii. Define the actions and keep track of how many there are

# Define the actions
actions = ["ON", "OFF"]
n_actions = len(actions)  # the number of actions

iv. Initialize a Q-table Q(s, a):

This table acts as a guide that shows what you can expect to get depending on which state (s) you are in and which action (a) you take.

It tracks the expected long-term reward (value) of taking each action in each temperature state.

State	Action: ON	Action: OFF
Cold	0.00	0.00
Okay	0.00	0.00
Hot	0.00	0.00

⚠️ Note: All values start at 0.00. As the agent interacts with the environment, these values will be updated to reflect which actions are more rewarding in each state.

# Initialize Q-table: rows = states, cols = actions
r = n_states
c = n_actions
q_table = np.zeros(( r, c))

v. Reward (response signal) function

Question: should the agent be aware of this function? Why?

We will represent this function as a reward matrix using a Python dictionary.

reward_matrix = {
    "Cold":    {"ON": 0,   "OFF": -1},
    "Okay":    {"ON": -1,  "OFF": 1},
    "Hot":     {"ON": -1,  "OFF": -2}
}
# Don't let this confuse you:
# these are simply rules, and you can define your own as you wish

Reward Matrix Explanation

The thermostat has 3 possible temperature states and can take 2 actions: turning the heater ON or OFF.

State	Action	Reward	Interpretation
Cold	ON	0	Neutral: turning on heater is expected in cold.
Cold	OFF	-1	Bad: it's cold and you're not heating.
Okay	ON	-1	Bad: wasting energy when it's already comfortable.
Okay	OFF	+1	Good: maintaining comfort and saving energy.
Hot	ON	-1	Bad: worsening the situation, it's already hot.
Hot	OFF	-2	Very bad: it's hot and you're not cooling down.

vi. Parameters:

These include the learning rate, the discount factor, and the exploration rate.

-Learning rate (α): the size of the steps you take toward a goal. If your steps are too small, it may take a very long time to get there; if they are too large, you may keep overshooting the destination. Think of a swinging simple pendulum:
that is what happens when your learning rate is too high.
So choose the learning rate with care.

-Discount factor (γ): a number between 0 and 1 that tells the agent how to weigh future rewards against the immediate reward.

Suppose I buy something from you that costs 10000 CFA and ask you to choose between receiving 10000 CFA now or 20000 CFA after 12 months. Which one would you prefer?

As someone who wishes to maximize profit, you would work out the present worth of the 20000 CFA that lies 12 months in the future.

This is done using the discount factor.

-The reward at time t is written $R_t$.
-The total expected (discounted) reward from time t onward is written $G_t$:

G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \dots

If $\gamma = 1$: future rewards are as important as the immediate reward, so
you must consider the future outcomes of your actions.

If $\gamma = 0$: future rewards do not matter at all, so the agent focuses only on the immediate reward.
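To make the effect of γ concrete, here is a minimal sketch that adds up a short list of rewards with different discount factors. The function name discounted_return and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite list of rewards."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [1, 1, 1]  # hypothetical rewards at times t, t+1 and t+2

print(discounted_return(rewards, gamma=1.0))  # every reward counts fully
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # future rewards count, but a bit less each step
```

With γ = 0.9 the three rewards are worth 1 + 0.9 + 0.81, which is less than their undiscounted sum of 3 but more than the immediate reward alone.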

-Exploration rate (ε): this has to do with the question:

Should I just stick to what I know brings good results (exploitation), or should I try and experiment with alternatives before choosing the best one (exploration)?
Exploring is important because there may be a method better than what you already know.
It can also waste your time if what you knew was already the best.

So there is a constant tug-of-war between exploitation and exploration.

😊😊❤️❤️❤️I LOVE EXPLORATION

# Parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration rate

👨‍⚕️ these parameters must be chosen wisely

vii. Simulate a training session (episode)


for episode in range(1000):
    # Start at a random state
    state_idx = random.randint(0, n_states - 1)
    state = states[state_idx]

    # Choose action: explore or exploit
    if random.uniform(0, 1) < epsilon:
        action_idx = random.randint(0, n_actions - 1)
    else:
        action_idx = np.argmax(q_table[state_idx])

    action = actions[action_idx]

    # Get reward
    reward = reward_matrix[state][action]

    # Simulate next state: very basic model
    if action == "ON":
        next_state_idx = min(state_idx + 1, n_states - 1)
    else:
        next_state_idx = max(state_idx - 1, 0)

    # Update Q-table using Q-learning formula
    old_value = q_table[state_idx, action_idx]
    next_max = np.max(q_table[next_state_idx])
    q_table[state_idx, action_idx] = old_value + alpha * (reward + gamma * next_max - old_value)

# Display learned Q-values
print("Learned Q-table:")
for i, state in enumerate(states):
    print(f"{state}: ON = {q_table[i, 0]:.2f}, OFF = {q_table[i, 1]:.2f}")
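Once training is done, the policy can be read straight off the Q-table by picking the best action in each state. This is a minimal sketch: greedy_policy is a hypothetical helper, the state names are assumed to match the reward matrix keys, and the example_q values are invented for illustration (they are not the output of the training loop).

```python
import numpy as np

states = ["Cold", "Okay", "Hot"]
actions = ["ON", "OFF"]

# Made-up Q-values, one row per state and one column per action
example_q = np.array([
    [0.5, -0.8],   # Cold
    [-0.6, 1.2],   # Okay
    [-0.9, -0.3],  # Hot
])


def greedy_policy(q_values, state_names, action_names):
    """Map each state to the action with the highest Q-value in its row."""
    best = np.argmax(q_values, axis=1)
    return {s: action_names[int(a)] for s, a in zip(state_names, best)}


policy = greedy_policy(example_q, states, actions)
print(policy)
```

Passing the trained q_table instead of example_q gives the policy the agent actually learned.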

The Three Threads That Wove Reinforcement Learning into Reality

Leonhard Kwahle — Mon, 02 Jun 2025 12:39:24 +0000

Before machines could learn through experience, three powerful ideas from psychology, control theory, and learning algorithms laid the foundation. This post unpacks the origins of reinforcement learning — not in code, but in concept.

Brief historical background of reinforcement learning:

Let's explore the 3 main threads that led to reinforcement learning.
These include:
• Trial-and-error Learning
• Optimal Control
• Temporal-Difference Learning

Trial-and-error Learning: This thread comes from psychology, where scientists studied how animals learn by trying different things and seeing what works best. Edward Thorndike (Thorndike, 1911, p. 244) came up with the "Law of Effect", which says that actions with good outcomes are more likely to be repeated.

Optimal Control: This term describes the problem of designing a controller to minimize a measure (such as a cost function) of a dynamic system's behavior over time. One approach to this problem, developed in the mid-1950s by the mathematician Richard Bellman, is called dynamic programming: a way to solve a complex problem by breaking it down into smaller subproblems.

Temporal-Difference Learning: This thread has its roots partly in animal learning psychology, in particular in the notion of secondary reinforcers. Reinforcers are things that follow a behavior (action) and make it more likely to occur again.

Primary Reinforcers : These are things that are naturally rewarding or pleasing e.g food, water, etc

Secondary Reinforcers : These are things that become rewarding because they are associated with things that are naturally rewarding e.g Money will motivate you because you can use it to get food, comfort etc

go back and continue from where you left

DBSCAN: Finding Cluster of any shape

Leonhard Kwahle — Thu, 08 May 2025 17:04:55 +0000

Exploring how DBSCAN uses density, not distance from a center, to find clusters of any shape: from theory and code to real-world applications like anomaly detection and geospatial mapping.

Before diving into the code or real-world applications, it’s essential to understand what DBSCAN actually is and why it stands apart from traditional clustering techniques.

What is DBSCAN

At its core, DBSCAN (short for Density-Based Spatial Clustering of Applications with Noise) is a powerful algorithm that clusters data not based on shape or center points, but on the density of data points in a region. Let's break down the key concepts that make DBSCAN unique.

Density-Based Clustering: Instead of grouping points based on their distance from a center (like K-Means), this algorithm groups points based on how crowded a region is.
No need for a predefined number of clusters: Unlike K-Means, which must be told the number of clusters in advance, DBSCAN discovers the clusters from the density of points in each region.

It is like traversing all the points and noting each time you encounter a region of high density.

Identifies Noise and Outliers: DBSCAN naturally identifies points that do not belong to any cluster, such as isolated data points or anomalies. These are labelled as noise, which makes it a good tool for anomaly detection.
It can find clusters of any shape: It is not limited to circular or spherical clusters. It can detect spirals and other complex, irregular shapes.
Core, Border, and Noise Points: DBSCAN classifies points into 3 types.

Types of Points in DBSCAN

Core points: Have enough neighbors nearby (defined by ε and MinPts). ε is the distance used to determine the neighbors of a given point, and MinPts is the minimum number of points (counting the point itself) that must lie within ε for the point to be a core point of a dense region.
Border points: Within ε of a core point, but without enough neighbors to be core points themselves.
Noise points: Points that are not within ε of any core point.

Code implementation

1. Using numpy

ALGORITHM: According to this algorithm, a cluster is a continuous region of high density (a region containing at least a certain number of points whose distance from a core point is less than a certain value ε).

-For each instance (point), count how many instances are located within a small distance ε (epsilon) from it. This region is called the instance's ε-neighborhood.
-If an instance has at least MinPts instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
-All instances in the neighborhood of a core instance are assigned to the same cluster. This neighborhood may contain other core instances, so a long chain of neighboring core instances forms a single cluster.
-Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
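The three kinds of points can be worked out directly from these rules. The sketch below is only an illustration on a tiny hand-made dataset: classify_points is a hypothetical helper, and the eps and min_pts values are arbitrary choices.

```python
import numpy as np


def classify_points(data, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Pairwise Euclidean distances, shape (n, n)
    diffs = data[:, None, :] - data[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    # Row i marks the points inside the eps-neighborhood of point i (itself included)
    in_neighborhood = distances <= eps
    is_core = in_neighborhood.sum(axis=1) >= min_pts
    kinds = []
    for i in range(len(data)):
        if is_core[i]:
            kinds.append("core")
        elif np.any(in_neighborhood[i] & is_core):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds


points = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [1.2, 0.0], [5.0, 5.0]])
kinds = classify_points(points, eps=0.8, min_pts=3)
print(kinds)
```

The first three points are close enough to each other to be core points, the fourth only touches one of them and becomes a border point, and the last one is isolated and ends up as noise.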

1.Import numpy

#import libraries
import numpy as np

2. Create a DBSCAN class (remember Object-Oriented Programming)

class DBSCAN:
    # Constructor: eps is the neighborhood radius ε, min_samples is MinPts
    def __init__(self, eps, min_samples):
        self.eps = eps
        self.min_samples = min_samples
    # the fit method is defined next

Define the methods for this class

1. fit method:

Initialize labels for all points to noise (-1).
Iterate over all points. If a point is already labeled, skip it.
Find neighbors within epsilon distance using the get_neighbors method.
If a point is not a core point (i.e., it has fewer than min_samples neighbors), label it as noise.
Otherwise, label the point as part of a new cluster and expand the cluster using the expand_cluster method.

    def fit(self, data):
        # Initialize labels for all points to noise (-1)
        # np.full(shape, value) creates an array of the given shape filled with value
        labels = np.full(data.shape[0], -1)
        cluster_id = 0
        for i in range(data.shape[0]):
            # If point is already labeled, skip it
            if labels[i] != -1:
                continue
            # Find neighbors within epsilon distance
            neighbors = self.get_neighbors(data, i, self.eps)
            # If point is not a core point, label it as noise
            if len(neighbors) < self.min_samples:
                labels[i] = -1
            else:
                # Label point as part of a new cluster
                labels = self.expand_cluster(data, labels, i, neighbors, cluster_id, self.eps, self.min_samples)
                cluster_id += 1
        return labels

2. get_neighbors method

Calculate the Euclidean distance between a point and all other points.
Return the indices of points within epsilon distance.

3. expand_cluster method

Label a point as part of a cluster.
Iterate over all neighbors of the point. If a neighbor is noise or unlabeled, label it as part of the cluster.
Find the neighbors of each neighbor. If a neighbor is a core point, add its neighbors to the list of points still to visit.

    def get_neighbors(self, data, index, eps):
        # Calculate Euclidean distance between point and all other points
        distances = np.linalg.norm(data - data[index], axis=1)
        # Return indices of points within epsilon distance (the point itself is included)
        return np.where(distances <= eps)[0]

    def expand_cluster(self, data, labels, index, neighbors, cluster_id, eps, min_samples):
        # Label point as part of the cluster
        labels[index] = cluster_id
        # Keep the points still to visit in a list that can grow while we walk through it
        # (reassigning an array inside a for loop would not extend the iteration)
        to_visit = list(neighbors)
        position = 0
        while position < len(to_visit):
            i = to_visit[position]
            position += 1
            # If neighbor is noise or unlabeled, label it as part of the cluster
            if labels[i] == -1:
                labels[i] = cluster_id
                # Find neighbors of the neighbor
                new_neighbors = self.get_neighbors(data, i, eps)
                # If neighbor is a core point, add its neighbors to the list
                if len(new_neighbors) >= min_samples:
                    to_visit.extend(new_neighbors)
        return labels

use case

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

#Generate sample data
data, _ = make_moons(n_samples=200, noise=0.05)

#Create DBSCAN instance
dbscan = DBSCAN(eps=0.3, min_samples=10)

#Fit DBSCAN to data
labels = dbscan.fit(data)

#Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
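If you want to read the result without a plot, you can count the clusters and the noise points in the labels array. A small sketch, assuming the labels follow the convention used above (-1 for noise, 0, 1, 2, ... for clusters); summarize_labels is a hypothetical helper and example_labels is made-up output.

```python
import numpy as np


def summarize_labels(labels):
    """Count clusters, noise points and cluster sizes in a DBSCAN label array."""
    labels = np.asarray(labels)
    cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
    return {
        "n_clusters": len(cluster_ids),
        "n_noise": int(np.sum(labels == -1)),
        "sizes": {int(c): int(n) for c, n in zip(cluster_ids, counts)},
    }


example_labels = np.array([0, 0, 1, -1, 1, 0, -1])  # made-up labels for illustration
summary = summarize_labels(example_labels)
print(summary)
```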

Drawing the Line: What Makes Support Vector Machines So Effective?

Leonhard Kwahle — Mon, 05 May 2025 15:56:25 +0000

Learn how Support Vector Machines find the optimal hyperplane to classify complex data with precision and power.

In the era of ever-growing data, choosing the right algorithm to make sense of it all is crucial. One such powerful and versatile tool is the Support Vector Machine (SVM) — a supervised learning model known for its ability to classify complex datasets with remarkable accuracy.

In this blog, we’ll break down how SVMs work, the intuition behind them, their mathematical foundations, and how you can implement them in real-world applications. Whether you're just starting with machine learning or looking to solidify your understanding, this guide will equip you with the essentials.

Intuition behind Support Vector Machines

Before diving into the math, let’s first understand the core intuition behind Support Vector Machines. Imagine you’re tasked with drawing a line (or hyperplane) that best separates two groups of data points on a graph. But this isn’t just any line — the goal is to find the optimal line that maximizes the margin between the two groups, leaving the least room for error.

This simple yet powerful concept forms the foundation of SVMs. By maximizing this margin, SVMs achieve better generalization, making them highly effective even for complex and noisy datasets. Let’s break it down further and see how this intuition translates into a powerful machine learning tool.

At the heart of SVMs is the concept of finding the optimal hyperplane. This hyperplane serves as the boundary that separates data into different classes. But it’s not just any boundary — SVMs seek the hyperplane that maximizes the margin, ensuring the maximum possible separation between the classes.

The data points closest to the hyperplane are called support vectors, and they alone determine the optimal hyperplane. Because the boundary depends only on these critical points and is pushed as far away from them as possible, SVMs tend to generalize well to unseen data, although very noisy data can still cause overfitting.

In the next section, we’ll explore the mathematics behind finding this optimal hyperplane and how SVMs handle more complex cases like non-linear boundaries.

Mathematical foundation of Support Vector Machines

In this section we will lay the foundation using the 2D plane. Once we have understood the underlying concepts, we will extend them to higher dimensions.

Problem statement

Given two sets of points in a 2D plane, draw a line that separates them in such a way that the gap between that line and the nearest point on either side is as large as possible. Then define a rule that will be used to decide where new points belong.
In the real world, this is like fitting the widest possible street between two groups of people in a village, and then defining a rule that will be used in the future to allocate land to newcomers.

We will use mathematical equations to represent these planes.

Imagine you went to the market with $200 to buy two items, X and Y.
1 unit of X costs $3 and 1 unit of Y costs $4.

How many units of X and Y can you purchase with $200?

Total cost for X: 3 × quantity of X = 3X
Total cost for Y: 4 × quantity of Y = 4Y

Total cost = 3X + 4Y

But remember, this total must not exceed 200,
so
3X + 4Y <= 200
Our decision boundary: 3X + 4Y = 200 (a line, which is the hyperplane of a 2D space)
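The budget constraint already behaves like a tiny classifier: every shopping basket falls on one side of the line or the other. A quick sketch using the prices from the example ($3 for X, $4 for Y); the function name within_budget and the baskets are made up.

```python
def within_budget(x, y, price_x=3, price_y=4, budget=200):
    """True when buying x units of X and y units of Y costs at most the budget."""
    return price_x * x + price_y * y <= budget


print(within_budget(20, 30))  # 60 + 120 = 180, inside the budget
print(within_budget(40, 20))  # 120 + 80 = 200, exactly on the boundary
print(within_budget(40, 30))  # 120 + 120 = 240, over the budget
```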

Equations of plane

1D: X = 5
2D: ax + by = 5
3D: ax + by + cz = 5

Decision Rule

In our shopping example, any combination whose total price does not exceed 200 is a valid combination.
We will use the same idea to create a decision rule for our Support Vector Classifier.

Construct a 2D plane with red and black points.
Determine how to separate the red dots from the black dots.
This is achieved by fitting a hyperplane that serves as a boundary.

Imagine a vector perpendicular to this dividing plane.
We call this vector w.

We need this vector because, to measure how far a point lies
from a plane, we project the point onto a direction perpendicular
to that plane.

If you are getting confused, check out my blog post on real-world use cases of the vector dot product:
vector dot product arround us

If you understand, let's continue.

Since every point in our plane is a vector, our decision rule 1 will be:

Rule 1: for any new point u, if the dot product of u and w is at least a certain number b, it belongs to red; otherwise it belongs to black.

\text{if} \quad \mathbf{u} \cdot \mathbf{w} \geq b

u belongs to red,
else
u belongs to black.

The problem at this stage is to find optimal values of w and b.
Moving b to the left-hand side (and letting b absorb the minus sign), we can rewrite rule 1 as

$\mathbf{u} \cdot \mathbf{w} + b \geq 0$

There are many possible values of w and b; an SVM chooses the optimal ones.
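In code, the decision rule is a single dot product followed by a sign check. The values of w and b below are made up for illustration, not the result of training an SVM, and classify is a hypothetical helper name.

```python
import numpy as np


def classify(u, w, b):
    """Apply the rule u . w + b >= 0: red on the positive side, black otherwise."""
    return "red" if np.dot(u, w) + b >= 0 else "black"


w = np.array([1.0, 1.0])  # hypothetical vector perpendicular to the boundary
b = -3.0                  # hypothetical offset

print(classify(np.array([3.0, 2.0]), w, b))  # 3 + 2 - 3 = 2, not negative
print(classify(np.array([0.5, 1.0]), w, b))  # 0.5 + 1 - 3 = -1.5, negative
```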

We will add additional constraints.

If u_r is a red point:

\mathbf{u}_r \cdot \mathbf{w} + b \geq 0 + 1 .........(1)

If u_b is a black point:

\mathbf{u}_b \cdot \mathbf{w} + b \leq 0 - 1 ........(2)

The 0 + 1 and 0 - 1 show that there is a gap of 1 unit (measured in units of u · w + b) on either side of the decision boundary.

Take $\mathbf{u}_r$ and $\mathbf{u}_b$ to be at the edges of the gap between the 2 sets of points, so that (1) and (2) hold with equality, with the decision line at the center of this gap.
Then the width of the gap follows from

eqn(1) - eqn(2)

#pick up your pen and paper and do the subtraction to get rid of b

(\mathbf{u}_r \cdot \mathbf{w} + b = 1) - (\mathbf{u}_b \cdot \mathbf{w} + b = - 1)

(\mathbf{u}_r \cdot \mathbf{w}) - (\mathbf{u}_b \cdot \mathbf{w}) = 1 - (-1) = 1 + 1 = 2

\mathbf{w} \cdot (\mathbf{u}_r - \mathbf{u}_b) = 2 ..............(*)

Finally, divide both sides by the magnitude of w. The left-hand side becomes the projection of u_r - u_b onto the unit vector perpendicular to the boundary, which is exactly the width of the gap (the margin):

\frac{\mathbf{w}}{||\mathbf{w}||} \cdot (\mathbf{u}_r - \mathbf{u}_b) = \frac{2}{||\mathbf{w}||}

\text{Margin} = \frac{2}{||\mathbf{w}||}
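You can check the margin formula numerically. In this sketch w and b are made-up values, and u_r and u_b are constructed to sit exactly on the two edges of the gap, so the distance between them should equal 2 / ||w||.

```python
import numpy as np


def margin_width(w):
    """Width of the gap between the two margin lines: 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)


w = np.array([3.0, 4.0])  # hypothetical weight vector, ||w|| = 5
b = 0.0                   # hypothetical offset

# Two points sitting exactly on the margins: w . u_r + b = 1 and w . u_b + b = -1
u_r = w / np.dot(w, w)
u_b = -w / np.dot(w, w)

print(margin_width(w))            # from the formula
print(np.linalg.norm(u_r - u_b))  # measured directly between the two points
```

Both numbers agree because u_r - u_b points along w, so its length is the full width of the gap.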

Our objective is to maximize this margin.

Put another way,

our objective is to minimize ||w||.

At this point I am about to do something that may look strange.

Instead of minimizing

\min ||\mathbf{w}||

it is more convenient and computationally efficient to minimize

\min \frac{1}{2} ||\mathbf{w}||^2

:) Trust me, this is the right thing to do: squaring does not change which w is smallest, it removes the square root hidden inside ||w||, and the factor 1/2 cancels the 2 that appears when we differentiate.

\min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 \quad \text{subject to } y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \ \forall i

In the constraints

y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \ \forall i

If $y_i = +1$: the point $\mathbf{x}_i$ should lie on the positive side of the hyperplane.
If $y_i = -1$: the point $\mathbf{x}_i$ should lie on the negative side.

This constraint ensures that all data points are correctly classified and lie outside the margin: not just on the correct side, but also at least 1 unit of w · x + b (a distance of 1/||w||) away from the hyperplane.
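The constraint is easy to test on data. In the sketch below the points, labels, w and b are all invented for illustration; the last point is on the correct side of the boundary but sits inside the margin, so it violates the constraint.

```python
import numpy as np


def satisfies_margin(X, y, w, b):
    """Check y_i * (w . x_i + b) >= 1 for every point; returns one boolean per point."""
    return y * (X @ w + b) >= 1


X = np.array([[3.0, 3.0], [4.0, 2.0], [0.0, 0.0], [1.0, 0.5], [2.0, 1.5]])
y = np.array([1, 1, -1, -1, 1])  # +1 for one class, -1 for the other
w = np.array([1.0, 1.0])         # hypothetical weights
b = -3.0                         # hypothetical offset

ok = satisfies_margin(X, y, w, b)
print(ok)
```

For the last point, w · x + b = 0.5: positive, so the label is right, but less than 1, so the point lies between the boundary and the margin line.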

At this point, simply finding the turning points of the objective will not help, because we
must also respect the constraints.

For that we will introduce Lagrange multipliers.

Lagrange Multipliers
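As a sketch of where this step usually leads (this is the standard textbook form, stated here as an assumption about the direction of the derivation rather than as the author's own continuation): each constraint gets a multiplier $\alpha_i \geq 0$, and the constrained problem is folded into a single function.

```latex
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} ||\mathbf{w}||^2 - \sum_{i} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], \quad \alpha_i \geq 0

\frac{\partial L}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i

\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i} \alpha_i y_i = 0
```

The first condition says that w is a weighted sum of the training points, and only the points with $\alpha_i > 0$ (the support vectors) contribute to it.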

K-Means Clustering Made Simple: Grouping Data the Smart Way

Leonhard Kwahle — Tue, 29 Apr 2025 17:03:30 +0000

K-Means is an easy yet powerful algorithm that organizes data into groups based on similarity. It's widely used in customer segmentation, image compression, and more—perfect for making sense of messy data.

Before we dive into how K-Means works, let's first understand what clustering actually means.
Clustering refers to the task of identifying similar instances in a dataset and assigning them to unlabelled groups called clusters.
It is an unsupervised machine learning task.

It is widely applicable in areas like:
-Customer segmentation: group customers based on their purchases and activity on an e-commerce site.

-Data analysis: when analyzing a large dataset (think of performing inferential statistics), you can run a clustering algorithm to reveal the various groups in the dataset, which can then inform your sampling strategy.

-Dimensionality reduction: take a dataset with a lot of features and group the datapoints into clusters, such that the number of clusters is smaller than the number of features. You can then describe each datapoint by the cluster it belongs to rather than by its original features. For example, when detecting non-plant objects in a picture, you can group all the different plants into one cluster and all the non-plant objects into another, leaving you with just 2 features: plant and non-plant.

-Anomaly detection: an instance that has low affinity to all the clusters (it does not belong to any of them), that is, a datapoint exhibiting characteristics that are not common in any cluster, is likely an anomaly.

-Image segmentation: cluster pixels according to their colors, then replace each pixel with the mean color of its cluster. This reduces the number of colors in the image (used for object detection).

-Search engines: first apply a clustering algorithm to all the images in the database. When a user searches with an image, find its cluster and return the members of that cluster.

What is K-Means clustering

This algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation. In 1965, Edward W. Forgy published virtually the same algorithm, which is why it is also called Lloyd-Forgy.

How does this algorithm work

For a given dataset on which to perform K-Means clustering:

Choose the number of clusters k
Randomly pick k points as the initial centroids
Assign each datapoint to its nearest centroid to form clusters
Move each centroid to the mean of the datapoints assigned to it
Repeat the assignment and update steps until the centroids stop moving (converge)
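The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name kmeans, the fixed seed and the toy dataset are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: random initial centroids, then assign and update until stable."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids stopped moving
        centroids = new_centroids
    return centroids, labels


# Toy dataset with two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

On data this well separated the result does not depend on the starting points; on real data it often does, which is why initialization matters.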

POOR CENTROID INITIALIZATION CAN CAUSE YOUR ALGORITHM TO CONVERGE TO A NON-OPTIMAL SOLUTION
and that is why you must learn how to initialize centroids correctly.

The centroid of a cluster is the mean of all the datapoints (instances) in that cluster. It serves as a single point that represents the whole cluster, and it does not have to be one of the datapoints.

How to effectively initialize centroids

If you ran the algorithm earlier and happen to know approximately where the centroids should be, you can set the init hyperparameter to a NumPy array containing the list of centroids and set n_init = 1.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder: fill each empty list with the coordinates of one approximate centroid
good_init_array = np.array([[],[],[],[],[]])
kmeans = KMeans(n_clusters=5, init=good_init_array, n_init=1)
# n_init = 1 means the algorithm will run only once

Another solution is to run the algorithm multiple times with different random initializations and keep the best solution.

The number of random initializations is controlled by the n_init hyperparameter.
With n_init=10 (the long-standing default for random initialization; recent scikit-learn releases use n_init="auto"), the whole algorithm runs 10 times when you call the .fit() method, and scikit-learn keeps the best solution.

But how does scikit-learn know the best solution :) ?
It uses a performance metric called the model's inertia.

Model's inertia

It is the sum of the squared distances between each instance and its closest centroid.
The KMeans class runs the algorithm n_init times and keeps the model with the lowest inertia.
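To see what the number means, here is a small NumPy sketch that computes inertia as the sum of squared distances to the closest centroid (the definition scikit-learn uses for inertia_). The helper name inertia, the dataset and both sets of centroids are made up.

```python
import numpy as np


def inertia(X, centroids):
    """Sum of squared distances from each instance to its closest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.sum(np.min(distances, axis=1) ** 2))


X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
good_centroids = np.array([[1.0, 0.0], [11.0, 0.0]])  # one centroid per group
bad_centroids = np.array([[0.0, 0.0], [2.0, 0.0]])    # both centroids in the same group

print(inertia(X, good_centroids))  # 1 + 1 + 1 + 1
print(inertia(X, bad_centroids))   # 0 + 0 + 64 + 100
```

The better placement gives the lower inertia, which is the criterion used to pick the best run.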

How to get the model's inertia

kmeans.inertia_   # inertia of the fitted model
kmeans.score(X)   # negative inertia, because a score follows the "greater is better" rule

kmeans++

In their 2006 paper, David Arthur and Sergei Vassilvitskii proposed a smarter initialization step that tends to select centroids that are distant from one another. This improvement makes the K-Means algorithm much less likely to converge to a suboptimal solution.

They showed that even though this method requires additional computation for the smarter initialization, it is worth it because it drastically reduces the number of times the algorithm needs to run to find the optimal solution.

K-Means++ initialization algorithm

Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset.
Take a new centroid $c^{(i)}$, choosing an instance $x^{(i)}$ with probability

$\frac{D(x^{(i)})^2}{\sum_{j=1}^{m} D(x^{(j)})^2}$

where,

$x^{(i)}$ : The $i^\text{th}$ data point in your dataset.
$D(x^{(i)})$ : The distance from $x^{(i)}$ to the nearest already chosen centroid.
$D(x^{(i)})^2$ : The squared distance, giving higher weight to points farther away.
$\sum_{j=1}^{m} D(x^{(j)})^2$ : The sum, over all m points $x^{(j)}$, of the squared distance to the nearest already chosen centroid.

-This ensures that instances far from the already chosen centroids are much more likely to be selected as the next centroid.

Repeat the previous step until all k centroids have been chosen.
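The seeding procedure can be sketched like this in NumPy. It is an illustration of the steps above under simple assumptions (no duplicate points, a fixed seed), not scikit-learn's implementation; kmeans_plus_plus_init is a hypothetical name.

```python
import numpy as np


def kmeans_plus_plus_init(X, k, seed=0):
    """Choose k initial centroids, favoring points far from those already chosen."""
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the dataset
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Pick the next centroid with probability proportional to D(x)^2
        probabilities = d2 / d2.sum()
        next_index = rng.choice(len(X), p=probabilities)
        centroids.append(X[next_index])
    return np.array(centroids)


X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
init_centroids = kmeans_plus_plus_init(X, k=2)
print(init_centroids)
```

A point that is already a centroid has D(x) = 0, so it can never be picked twice, and points in a far-away group get almost all of the probability.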

Accelerated kmeans clustering

This variant was proposed in 2003 by Charles Elkan.
It considerably accelerates the algorithm by avoiding many unnecessary distance calculations.

Elkan achieved this by exploiting the triangle inequality (i.e. that a straight line is always the shortest path between 2 points)
and by keeping track of lower and upper bounds for the distances between instances and centroids.
Older versions of scikit-learn's KMeans class used this algorithm by default for dense data; in recent versions you can select it with algorithm="elkan".
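Here is a small sketch of one bound that follows from the triangle inequality: if the distance between two centroids is at least twice the distance from a point to its current centroid, the other centroid cannot be closer, so that distance never needs to be computed. The helper can_skip and the coordinates are made up, and the full accelerated algorithm keeps more bookkeeping than this.

```python
import numpy as np


def can_skip(dist_to_current, dist_between_centroids):
    """True when the triangle inequality guarantees the other centroid is not closer."""
    return dist_between_centroids >= 2 * dist_to_current


x = np.array([1.0, 0.0])
c_current = np.array([0.0, 0.0])
c_other = np.array([10.0, 0.0])

d_current = np.linalg.norm(x - c_current)        # distance to the current centroid
d_between = np.linalg.norm(c_current - c_other)  # distance between the two centroids

skip = can_skip(d_current, d_between)
d_other = np.linalg.norm(x - c_other)  # computed here only to confirm the shortcut

print(skip, d_other >= d_current)
```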

Mini-batch kmeans clustering

This algorithm was proposed in a 2010 paper by David Sculley. Instead of using the whole dataset at each iteration, the algorithm uses mini-batches, moving the centroids just slightly at each iteration.

This speeds up the algorithm typically by a factor of three or four and makes it possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements this algorithm in the MiniBatchKMeans class. You can use this class just like the KMeans class.

# import
from sklearn.cluster import MiniBatchKMeans

# this is how it is used
minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)  # X is our dataset

Although the Mini-batch K-Means algorithm is much faster than the regular K-Means algorithm, its inertia is generally slightly worse, especially as the number of clusters increases.
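To get a feel for "moving the centroids just slightly", here is a sketch of a single mini-batch update in which each centroid uses a learning rate of 1 / (number of points it has absorbed so far). This follows the general idea of the mini-batch approach as I understand it; scikit-learn's MiniBatchKMeans differs in its details, and the function name and numbers are made up.

```python
import numpy as np


def minibatch_kmeans_step(centroids, counts, batch):
    """Update the centroids in place using one mini-batch and return them."""
    # Find the nearest centroid of every batch point before moving anything
    distances = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(distances, axis=1)
    for x, j in zip(batch, nearest):
        counts[j] += 1
        learning_rate = 1.0 / counts[j]  # shrinks as the centroid sees more points
        centroids[j] = (1 - learning_rate) * centroids[j] + learning_rate * x
    return centroids


centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.zeros(2)  # how many points each centroid has absorbed so far
batch = np.array([[1.0, 1.0], [9.0, 9.0], [3.0, 3.0]])

minibatch_kmeans_step(centroids, counts, batch)
print(centroids)
print(counts)
```

Because the learning rate keeps shrinking, each centroid ends up at the running mean of the points it has absorbed, without ever touching the full dataset at once.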

How to find the optimal number of clusters

This is when you will find out that your elbow is not useless.

In general, it is not easy to know the right number of clusters for your algorithm, and setting a wrong value of k leads to bad results.
What if we just take the value of k with the smallest inertia?
This will not work, because inertia keeps decreasing as k increases: even after you exceed the optimal k, the inertia still goes down, until every data point becomes a cluster of its own and the inertia is 0.

What actually happens is that, as you increase k, the inertia drops very quickly until you reach the optimal k, and from there on it decreases much more slowly. Plotted against k, the curve bends like an arm, and the bend (the "elbow") is a reasonable guess for k. Going past it only splits perfectly good clusters for no good reason, and the elbow is not always clear-cut, so inertia alone is a rather coarse way to choose k.

Since inertia falls short, a more precise but computationally more expensive approach is to use the silhouette score, which is the mean silhouette coefficient over all instances.

How to calculate the silhouette score

Remember, the silhouette score is the mean (average) silhouette coefficient over all instances.

So, for each instance:

Calculate the mean distance from that instance to the other members of its own cluster and assign this value to a variable a
Find the closest neighboring cluster, i.e. the other cluster with the smallest mean distance to the instance
Calculate the mean distance to all the instances of this neighboring cluster and assign it to b
Then the silhouette coefficient = $\frac{b - a}{\max(a, b)}$
The silhouette coefficient varies between -1 and 1
a value approaching 1 means the instance is well inside its own cluster
a value approaching 0 means the instance is close to a cluster boundary
a value approaching -1 means the instance may be in the wrong cluster
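The steps above can be sketched directly in NumPy and compared with scikit-learn's silhouette_samples. The helper name silhouette_coefficients, the toy points and the hand-written labels are all hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_coefficients(X, labels):
    """Silhouette coefficient of every instance, following the steps above."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all instances
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    coeffs = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the instance itself
        if not same.any():
            continue  # a cluster of one keeps coefficient 0 by convention
        a = dists[i, same].mean()
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        coeffs[i] = (b - a) / max(a, b)
    return coeffs

# Toy points and labels (made up for illustration)
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6], [5, 8], [8, 8], [9, 11]])
labels = np.array([0, 0, 0, 1, 1, 1])

manual = silhouette_coefficients(X, labels)
print("per-instance coefficients:", np.round(manual, 3))
print("silhouette score (mean):  ", round(manual.mean(), 3))
```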

Computing the silhouette score in scikit-learn is very easy:

#import
from sklearn.metrics import silhouette_score

#use
silhouette_score(X, kmeans.labels_)  # note the trailing underscore on labels_
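A common way to use this score is to fit K-Means for several values of k and keep the one with the highest silhouette score. The sketch below assumes synthetic, well-separated blobs; the centers, the range of k and the variable names are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs (an assumption for illustration)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = silhouette_score(X, kmeans.labels_)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette score={score:.3f}")
print("best k:", best_k)
```

Unlike inertia, the silhouette score does not keep improving as k grows, which is what makes it usable for picking k.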

Limitations of K-Means Clustering

While K-Means is simple and widely used, it comes with several important limitations:

1. Requires Predefined Number of Clusters (k)

You must specify the number of clusters in advance, which is often unknown. Choosing the wrong value can lead to poor clustering results.

2. Assumes Spherical, Evenly Sized Clusters

K-Means assumes that all clusters are circular and similar in size. It struggles with:

Non-spherical (e.g., elongated or irregular) clusters
Clusters of different densities
Unevenly sized clusters

3. Sensitive to Initialization

The initial placement of centroids greatly affects the final output. Bad initialization can cause the algorithm to converge to a suboptimal solution.

🔧 Using K-Means++ helps improve initial centroid placement.

4. Sensitive to Outliers and Noise

Outliers can significantly skew the position of cluster centroids, leading to inaccurate groupings.

5. Only Works with Numeric Features

Since K-Means relies on Euclidean distance, it performs poorly with:

Categorical or mixed-type data
Features that aren’t properly scaled

6. May Converge Slowly on Large or Complex Data

For large datasets or poor starting points, K-Means may take many iterations to converge or get stuck in local minima.

Despite these limitations, K-Means remains a powerful tool — especially when combined with techniques like K-Means++ initialization, dimensionality reduction, or outlier filtering.

Hands-on code implementation

Want to See the Code in Action?
Check out my follow-up post where I walk through the full code implementation of K-Means (including K-Means++ initialization):

🔗 K-Means Clustering: Code Implementation in Python

If you found this helpful, don't forget to 💬 comment or 🧡 react!

K-Means Clustering: A Practical Implementation Guide

Leonhard Kwahle — Tue, 29 Apr 2025 17:00:41 +0000

Unveiling the power of unsupervised learning through a step-by-step implementation of the K-Means algorithm, transforming raw data into meaningful clusters.

1. implementation using numpy only

step 1: import numpy and matplotlib

import numpy as np
import matplotlib.pyplot as plt

step 2: generate sample data

# 1. Generate sample data
np.random.seed(0)  # Set the random seed for reproducibility (like starting a game with the same dice roll every time)

# normal distribution N(mean,std,[rows,columns])
X = np.concatenate([np.random.normal(0, 1, (100, 2)),
                    np.random.normal(5, 1, (100, 2))])

step 3:initialize centroids randomly

k = 2 #number of clusters

#This function randomly selects k (which is 2 in this case) distinct indices (positions) from the range of 0 to the total number of data points.
#replace=False ensures that the same index is not chosen twice.
#X.shape[0] = number of rows

centroids = X[np.random.choice(X.shape[0], k, replace=False)]

step 4: iteration

#K-means iterations
max_iterations = 100
for _ in range(max_iterations):
  # Assign points to nearest centroid
  distances = np.sqrt(np.sum((X[:, np.newaxis, :] - centroids)**2, axis=2))
  labels = np.argmin(distances, axis=1)

  # Update centroids
  new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

  # Check for convergence
  if np.allclose(centroids, new_centroids):
    break

  centroids = new_centroids

step 5:plot

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='red')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
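The loop above can also be wrapped in a reusable function. The sketch below is one possible packaging, not the original implementation: the function name kmeans_numpy, the seed parameter and the guard that keeps the old centroid when a cluster becomes empty are all additions made for illustration.

```python
import numpy as np

def kmeans_numpy(X, k, max_iterations=100, seed=0):
    """Plain NumPy K-Means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids
    centroids = X[rng.choice(X.shape[0], k, replace=False)]
    for _ in range(max_iterations):
        # Distance from every point to every centroid, shape (n_samples, k)
        distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Same kind of sample data as in step 2: two Gaussian blobs around 0 and 5
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

centroids, labels = kmeans_numpy(X, k=2)
print(centroids)
```

Without the empty-cluster guard, taking the mean of an empty selection would produce NaN centroids, which can happen with unlucky initial centroids on other datasets.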

2. implementation using scikit-learn

step 1: import libraries

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

step 2: generate sample data( could be replaced with real world data)

# Sample data (replace with your actual data)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

step 3: Choose the number of clusters (k); here the value is simply picked by hand

k = 2  # Example: 2 clusters

step 4: Create a KMeans object

kmeans = KMeans(n_clusters=k) # Initialize the KMeans model with the specified number of clusters

step 5: fit data(X) to model for training

kmeans.fit(X)  # Train the model on the data

step 6: get the cluster centers

# Get the cluster centers
centroids = kmeans.cluster_centers_ # Get the coordinates of the cluster centers

step 7: Predict cluster labels for each data point

labels = kmeans.predict(X)

step 8: visualization of clusters

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')  # Plot data points, colored by cluster
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, linewidths=3, color='r') # Plot centroids
plt.xlabel('Feature 1')  # Label the x-axis
plt.ylabel('Feature 2')  # Label the y-axis
plt.title('KMeans Clustering')  # Set the title of the plot
plt.show()  # Display the plot

kmeans clustering using a real world dataset (wine dataset)

The dataset for our small implementation can be downloaded from
archive.ics.uci.edu/ml/datasets/wine+quality

i) import necessary libraries

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ii)load the data

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"  # URL of the dataset
try:
  wine_data = pd.read_csv(url, sep=";")
except Exception as e:  # Catch any download or parsing error
  print(f"Error loading data from URL: {e}")
  raise SystemExit(1)  # stop here; the remaining steps need wine_data

iii)Select relevant features for clustering (example)

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
X = wine_data[features]

iv)Determine the optimal number of clusters using the Elbow method (optional)

wcss = []
for i in range(1, 11):
  kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
  kmeans.fit(X)
  wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

v)apply KMeans clustering

k = 3  # example value only; read the elbow off the plot above and set k accordingly
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
kmeans.fit(X)

vi)Add cluster labels to the DataFrame

wine_data['cluster'] = kmeans.labels_

vii)Analyze the clusters (example: calculate the mean of each feature for each cluster)

cluster_means = wine_data.groupby('cluster').mean()
print(cluster_means)

viii)Visualize the clusters (example: using the first two features)

plt.scatter(wine_data['fixed acidity'], wine_data['volatile acidity'], c=wine_data['cluster'], cmap='viridis')
plt.xlabel('Fixed Acidity')
plt.ylabel('Volatile Acidity')
plt.title('Wine Quality Clusters')
plt.show()
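Because K-Means relies on Euclidean distance, columns with large numeric ranges can dominate the clustering, so it is usually worth standardizing the features first. The sketch below uses randomly generated stand-in columns instead of downloading the wine data; the numbers are not real measurements and the column names are only borrowed to show two very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data (randomly generated, NOT the real wine measurements):
# two columns on very different scales
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "pH": rng.normal(3.3, 0.15, 200),
    "total sulfur dioxide": rng.normal(46, 30, 200),
})

# Standardize each feature to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(demo)

kmeans_scaled = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
demo["cluster"] = kmeans_scaled.fit_predict(X_scaled)

print(demo.groupby("cluster").mean())
```

With the real dataset, the same two lines (StandardScaler followed by fit_predict on the scaled array) would slot in before the clustering step.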

If you found this helpful, please like and share :)
If I made any errors, I will be very grateful for corrections.

Logistic Regression from theory to code implementation

Leonhard Kwahle — Mon, 21 Apr 2025 15:56:36 +0000

Imagine you are building a document filter that takes documents as input and decides whether they are fraudulent or not. You will need a model that doesn't just predict yes or no but gives you a probability, like "this document is 40% likely to be fraudulent".

Logistic Regression is perfect for this kind of problem.
In this post, we'll break down the math behind logistic regression step by step. No scary equations, just clear, intuitive explanations, with a little help from Python code along the way!

What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for classification based on a threshold value between 0 and 1 (call it a threshold probability if you wish). This could involve predicting whether something belongs to 1 out of 2 categories (binary classification) or 1 out of many discrete categories (multiclass classification). Examples include classifying emails as spam or not spam, or classifying people as sick or healthy, infected or not infected, just to name a few.
While it has "regression" in the name, logistic regression is actually about classification, not predicting a continuous number like standard linear regression does.

But regression is the basis for this classification, as the classification is done using continuous values between 0 and 1.

Mathematical background

1. Probability: A probability is just a number between 0 and 1 that tells us how likely an event is to occur. 0 means impossible, 1 means certain, and 0.6 means a 60% chance of occurring.

2. Odds: Odds are just another way of expressing probability. The odds for an event are the ratio of the probability of success (p) to the probability of failure (1 - p). The odds against are the inverse: the probability of failure (1 - p) divided by the probability of success (p).

From this point on I will only be talking about the odds for.

$\text{Odds} = \frac{p}{1 - p}$

Odds range from 0 to infinity.

odds > 1 means success is more likely
odds < 1 means failure is more likely
odds = 1 means 50-50 chance for failure and success

Odds can only take non-negative values, and we need a way to map them onto the whole range from -infinity to +infinity.
Logarithms are perfect for solving this problem.

3. log(Odds) or Logit function:
So we take the logarithm of the odds, called the log-odds (or logit).
Formula:

$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$

if p > 0.5 logit(p) is positive
if p < 0.5 logit(p) is negative
if p = 0.5 logit(p) = 0
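A few lines of Python make the relationship between probability, odds and log-odds tangible. The helper names odds and logit and the three example probabilities are chosen only for illustration.

```python
import math

def odds(p):
    """Odds for an event with probability p (0 <= p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: natural logarithm of the odds (0 < p < 1)."""
    return math.log(odds(p))

for p in (0.2, 0.5, 0.8):
    print(f"p={p}: odds={odds(p):.3f}, logit={logit(p):+.3f}")
```

Note how p = 0.2 and p = 0.8 give log-odds of the same size with opposite signs, which is exactly the symmetry around 0 described in the list above.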

Why are we doing all this?
The linear part of a model like Logistic Regression produces outputs from -infinity to +infinity
But probabilities are restricted to between 0 and 1
So we model the log-odds (which stretch across the whole real number line) as a linear function of the inputs.

That is:

$\log\left(\frac{p}{1 - p}\right) = w \cdot x + b$

where w and x are vectors and b is a scalar bias

Here log is the natural logarithm (base e, where e = 2.718...)

If we know that z = w.x + b gives the log(odds), how do we get the probability (p)?

The Sigmoid function:

This function takes a number between -infinity and +infinity and maps it to a number between 0 and 1 (a probability)
It is obtained by solving for p in the logit function

Take $z = w \cdot x + b$, then

$\sigma(z) = \frac{1}{1 + e^{-z}}$

if z is large and positive output will be close to 1
if z is large and negative output will be close to 0
if z = 0 output will be 0.5
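For readers who want the missing algebra, here is how solving the logit equation for p leads to the sigmoid, using the same symbols p and z as above:

```latex
\begin{aligned}
\log\left(\frac{p}{1 - p}\right) &= z \\
\frac{p}{1 - p} &= e^{z} \\
p &= e^{z}\,(1 - p) \\
p\,(1 + e^{z}) &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \sigma(z)
\end{aligned}
```

The last step divides the numerator and the denominator by $e^{z}$.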

To see what it looks like:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

#create 100 numbers from -10 to 10
z = np.linspace(-10, 10, 100)

#Plot in a graph
plt.plot(z, sigmoid(z))
plt.title("Sigmoid Function")
plt.xlabel("z")
plt.ylabel("σ(z)")
plt.grid()
plt.show()

The Decision Boundary

Once we have the output probability

$\hat{y} = \sigma(z)$, how do we decide the class?

Easy:

If $\hat{y} \geq 0.5$, predict class 1 (positive)
If $\hat{y} < 0.5$, predict class 0 (negative)

The decision boundary is where $\sigma(z) = 0.5$, which happens when $z = 0$. So, the equation $w \cdot x + b = 0$ defines the decision boundary: a straight line (or a hyperplane in higher dimensions).
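The thresholding rule can be sketched in a few lines of NumPy. The weights w, the bias b and the three input points are hypothetical values picked so that one point falls on each side of the boundary and one lands exactly on it.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration
w = np.array([1.0, -2.0])
b = 0.5

X = np.array([[3.0, 1.0],   # z = 3 - 2 + 0.5 = 1.5   (positive side)
              [0.0, 1.0],   # z = 0 - 2 + 0.5 = -1.5  (negative side)
              [1.5, 1.0]])  # z = 1.5 - 2 + 0.5 = 0   (exactly on the boundary)

z = X @ w + b
y_hat = sigmoid(z)
classes = (y_hat >= 0.5).astype(int)  # the >= sends boundary points to class 1

print("z:      ", z)
print("y_hat:  ", y_hat)
print("classes:", classes)
```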

Cost Function: Measuring How Bad Our Predictions Are.

We need a cost function to measure how bad our model's predictions are.
In linear regression, we used Mean Squared Error (MSE).
But in logistic regression, MSE doesn't work well: combined with the non-linear sigmoid it gives a messy, non-convex optimization problem.

Instead, we use Log-Loss (aka Cross-Entropy Loss)

The cross-entropy loss tells us how good our prediction $\hat{y}$ is compared to the true label $y$ .

The formula is:

$L(\hat{y}, y) = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$

y is a label, a class, a category (0 or 1)
If the true label $y = 1$, we want $\hat{y}$ to be close to 1.
If $y = 0$, we want $\hat{y}$ to be close to 0.
The closer the prediction is to the truth, the smaller the loss!

For multiple examples, we just average the losses:

$\text{Average loss} = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

where $m$ is the number of examples.
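Here is a small NumPy sketch of the loss for one example and the average over several. The function names, the clipping constant eps (added to avoid log(0)) and the example predictions are assumptions made for illustration.

```python
import numpy as np

def log_loss_single(y, y_hat, eps=1e-12):
    """Cross-entropy loss for true label(s) y and predicted probability y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep the logarithm finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def average_log_loss(y, y_hat):
    """Mean cross-entropy loss over m examples."""
    return np.mean(log_loss_single(np.asarray(y), np.asarray(y_hat)))

confident_right = log_loss_single(1, 0.9)  # true label 1, prediction close to 1
confident_wrong = log_loss_single(1, 0.1)  # true label 1, prediction close to 0
batch_loss = average_log_loss([1, 0, 1], [0.9, 0.2, 0.6])

print("loss when right:", confident_right)
print("loss when wrong:", confident_wrong)
print("average loss:   ", batch_loss)
```

The confidently wrong prediction is penalized far more heavily than the confidently right one is rewarded, which is the behaviour described in the list above.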

Goal: Make the loss as small as possible by adjusting the model!

Optimization: Finding the Best Weights that minimizes the loss

We want to minimize the total loss over all data points.
We'll do this using gradient descent
The gradient is defined using partial derivatives.
The gradient at any point gives the direction of steepest ascent, i.e. the direction to follow if you want to climb as fast as possible.
Gradient descent does the opposite: it follows the fastest direction to the bottom.

How is this achieved?

1. Compute the gradient of the loss with respect to each parameter (w and b).
2. Update the parameters a little bit in the direction opposite to the gradient (downhill!).

Update Rules (Gradient Descent)

After computing the gradients, we update the weights and bias like this:

Weight update:

$w := w - \alpha \frac{\partial L}{\partial w}$

Bias update:

$b := b - \alpha \frac{\partial L}{\partial b}$

where:

$L$ is the loss
$\alpha$ (alpha) is the learning rate; it controls how big the update steps are.

# Assume we have loss gradient dw, db
w = w - learning_rate * dw
b = b - learning_rate * db
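To show where dw and db come from, here is a minimal end-to-end sketch of logistic regression trained with gradient descent in NumPy. For the sigmoid combined with the cross-entropy loss, the gradients of the average loss are dw = X^T(y_hat - y)/m and db = mean(y_hat - y). The function name, the toy data, the learning rate and the iteration count are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, n_iterations=1000):
    m, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iterations):
        y_hat = sigmoid(X @ w + b)   # forward pass: predicted probabilities
        dw = X.T @ (y_hat - y) / m   # gradient of the average loss w.r.t. w
        db = np.mean(y_hat - y)      # gradient of the average loss w.r.t. b
        w = w - learning_rate * dw   # step opposite to the gradient
        b = b - learning_rate * db
    return w, b

# Toy, linearly separable data with one feature (made up for illustration)
X = np.array([[-2.0], [-1.0], [-0.5], [1.0], [1.5], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train_logistic_regression(X, y)
predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)

print("w:", w, "b:", b)
print("predictions:", predictions)
```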

If you have more than two categories (e.g., cat, dog, rabbit), you extend logistic regression into Softmax Regression (a.k.a. Multinomial Logistic Regression).

Softmax generalizes sigmoid to handle multi-class classification.
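A short sketch shows what "generalizes" means here: softmax turns a vector of scores into probabilities that sum to 1, and with only two classes (one score fixed at 0) it gives the same value as the sigmoid. The scores are hypothetical and the max-subtraction is a standard numerical-stability trick, not something from the original post.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for cat, dog, rabbit
probs = softmax(scores)
print("class probabilities:", probs, "sum:", probs.sum())

# Two-class case: softmax over [z, 0] matches sigmoid(z)
z = 1.5
two_class = softmax(np.array([z, 0.0]))[0]
print(two_class, sigmoid(z))
```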

Now that we understand the math behind logistic regression, from the linear model (w.x + b) to the sigmoid function $\sigma(z)$ and log-loss, it's time to bring it to life with real-world data.

We'll start by implementing logistic regression using Scikit-learn, a popular machine learning library that makes applying models incredibly easy.
After that, we'll also build the same model using TensorFlow/Keras to show how logistic regression fits naturally into deep learning workflows

Even though libraries like Scikit-learn and TensorFlow handle all the math for us under the hood — like computing the log-odds, applying the sigmoid function, and minimizing the cross-entropy loss — understanding the math gives us intuition about what the model is doing behind the scenes

We'll use the Pima Indians Diabetes Dataset, a famous dataset where the goal is to predict whether a patient has diabetes based on medical information like glucose level, BMI, age, and more

I) Implementation using scikit learn

Step 1: Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load the Dataset

# Load dataset
data = pd.read_csv('diabetes.csv')

# View the first few rows
print(data.head())

Step 3: Prepare the Data

# Split into features (X) and labels (y)
X = data.drop('Outcome', axis=1)  # 'Outcome' is the target
y = data['Outcome']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 4: Train the Model

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

Step 5: Evaluate the Model

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

II)Implementation using TensorFlow/keras

Step 1: Import Libraries

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import pandas as pd  # Import pandas for data handling
from sklearn.model_selection import train_test_split # Import train_test_split for splitting data
from sklearn.preprocessing import StandardScaler # Import StandardScaler for feature scaling

step 2: Load and prepare data

#2. Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(url, header=None)

# View the first few rows
print(data.head())

#give labels to its columns 
data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

#LOOK at few rows again :)
data.head()

# Split into features (X) and labels (y)
X = data.drop('Outcome', axis=1)  # 'Outcome' is the target
y = data['Outcome']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# Feature scaling (important for logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Build the Model

model = Sequential([
    Dense(1, activation='sigmoid', input_shape=(X_train.shape[1],))
])

Explanation:
Only 1 neuron because it’s binary classification.
Sigmoid activation because we want output probabilities between 0 and 1.

Step 4: Compile the Model

model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

Explanation:
Adam optimizer for efficient gradient descent.
Binary Crossentropy because it’s binary classification.

Step 5: Train the Model

history = model.fit(
                 X_train, 
                 y_train, 
                 epochs=100,
                 batch_size=32, 
                 validation_split=0.2, 
                 verbose=1
                 )

Epochs = 100: Train 100 passes through the dataset.
Batch size = 32: Process 32 samples at a time.

Step 6: Evaluate the Model

# Evaluate on the test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Test Accuracy:", accuracy)

As we can see, both Scikit-learn and TensorFlow make it incredibly easy to build a logistic regression model.
While Scikit-learn is perfect for quick classical machine learning models, TensorFlow shines when you want to extend logistic regression into deep learning architectures later on.

Visualizing model performance

To better understand how our logistic regression model performed, let's visualize the confusion matrix and also take a look at the model's learning curves over the training process.

1. Plotting the Confusion Matrix (Scikit-learn)
Confusion matrices help you see not just accuracy, but where the model is making mistakes.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix  # y_test and y_pred come from the Scikit-learn model above

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Scikit-learn Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

2. Plotting Training History (TensorFlow/Keras)
TensorFlow gives you the history object, which tracks loss and accuracy during training.
Let’s plot how the model learned over time:

# Plot training & validation accuracy values
plt.figure(figsize=(12,5))

# Accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy over Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')

# Loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss over Epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')

plt.tight_layout()
plt.show()

See all the working code at
https://colab.research.google.com/drive/1o2OH-ohslAur0R565W63bwGCLLzo12A7?usp=sharing

Linear Regression from a high level to code implementation

Leonhard Kwahle — Mon, 14 Apr 2025 17:48:46 +0000

Regression is simply a statistical process used to understand and model the relationship between two or more variables, with the goal of predicting continuous output values from a set of input values.

Regression can be linear or non-linear.

Linear Regression is a statistical process that models the relationship between a target feature (variable) and one or more independent input features
using a linear equation.

Or:
Linear regression is a type of regression that creates and uses a linear model to make its predictions.

A model is a simplified (abstract) representation of a complex entity (object) in the real world

An Object is anything which can have a defined role to play when solving a problem

In our context our Models will be Mathematical equations
For a better understanding let's explore Linear Regression from the idea of Covariance and Correlations

Covariance is a measure of how two variables change together. More specifically, it tells you whether an increase in one variable will likely result in an increase or decrease in another variable. In simple terms, covariance shows the direction of the relationship between two variables. However, it doesn't provide much information about the strength of that relationship, nor is it easy to interpret in a standardized way because its value depends on the scale of the variables.
Mathematically, the covariance of two variables X and Y is given as

$\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

Covariance can be positive, zero, or negative:
positive: both variables increase together
negative: one increases as the other decreases
zero: no linear relationship between the variables.

While covariance tells us whether two variables move together, it doesn't standardize how strongly they move together. This is where correlation comes in. By dividing the covariance of two variables by the product of their standard deviations, we get a dimensionless measure called the correlation coefficient (usually denoted as r):

$r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$, where $\sigma_X = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$ and $\sigma_Y$ is defined in the same way

A simple example in python

import matplotlib.pyplot as plt

# Sample data
x = [2, 4, 6, 8, 10]
y = [1, 3, 5, 7, 9]

# Function to compute mean
def mean(data):
    return sum(data) / len(data)

# Function to compute the sample covariance (divides by n - 1, as in the formula above)
def covariance(x, y):
    x_mean = mean(x)
    y_mean = mean(y)
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (len(x) - 1)

# Function to compute the sample standard deviation (also divides by n - 1)
def stddev(data):
    data_mean = mean(data)
    return (sum((x - data_mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5

# Function to compute correlation
def correlation(x, y):
    return covariance(x, y) / (stddev(x) * stddev(y))

# Compute once and reuse in the printout and the plot title
cov = covariance(x, y)
corr = correlation(x, y)

# Output
print("Covariance:", round(cov, 4))
print("Correlation:", round(corr, 4))

# Plot
plt.figure(figsize=(6, 4))
plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y, color='green', linestyle='--', label='Trend line')

# Add text info
plt.title(f"Scatter Plot\nCovariance: {round(cov, 2)} | Correlation: {round(corr, 2)}")
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
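The same numbers can be cross-checked with NumPy, which also makes it easy to see the unit dependence of covariance. This is a sketch; the factor of 100 stands in for an arbitrary change of units (for example metres to centimetres).

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 5, 7, 9])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (divides by n - 1 by default)
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation coefficient r

# Rescaling x changes the covariance but leaves the correlation untouched
cov_scaled = np.cov(100 * x, y)[0, 1]
corr_scaled = np.corrcoef(100 * x, y)[0, 1]

print("covariance: ", cov_xy, "-> after rescaling x:", cov_scaled)
print("correlation:", corr_xy, "-> after rescaling x:", corr_scaled)
```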

example with a csv file

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file
df = pd.read_csv('data.csv')  # Make sure this file exists in your working directory

# Extract two columns
x = df['height']
y = df['weight']

# Compute covariance and correlation
cov = df[['height', 'weight']].cov().iloc[0, 1]
corr = df[['height', 'weight']].corr().iloc[0, 1]

# Print results
print(f"Covariance: {round(cov, 4)}")
print(f"Correlation: {round(corr, 4)}")

# Fit a straight line so the "trend line" is a real trend, not a zigzag through unsorted points
slope, intercept = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)

# Plot
plt.figure(figsize=(6, 4))
plt.scatter(x, y, color='teal', label='Data Points')
plt.plot(x_line, slope * x_line + intercept, color='coral', linestyle='--', label='Trend Line')

plt.title(f"Height vs Weight\nCovariance: {round(cov, 2)} | Correlation: {round(corr, 2)}")
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In a nutshell, the correlation coefficient (r) describes the strength and direction of the linear relationship between two variables in a standard way that is independent of their units of measurement. Covariance captures the direction of the same relationship, but its value is highly affected by the units of measurement of the variables concerned.

Now that we understand covariance and the correlation coefficient, we are ready to step into linear regression—a cornerstone of predictive modeling.

While the correlation coefficient(r) tells us how strong the relationship is between two variables, it doesn’t tell us the exact nature of the relationship. This is where linear regression comes in.

Linear regression aims to model the relationship between an independent variable (x) and a dependent variable (y) by fitting a straight line:

$y = m x + b$

Where:

y is the predicted value (dependent variable),
x is the input feature (independent variable),
m is the slope of the line,
b is the y-intercept.

Our goal is to find the values of m and b such that the line "best fits" the data.

Deriving the Best Fit Line

To compute the slope m and intercept b, we use the least squares method. This method minimizes the sum of the squared differences between the observed values and the predicted values.

Slope (m):

$m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

You might recognize the pieces: the numerator is the covariance between x and y and the denominator is the variance of x, each without its $\frac{1}{n-1}$ factor. Those factors cancel in the ratio.

So, another way to write it is:

$m = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$

Intercept (b):

$b = \bar{y} - m\bar{x}$

Where:

$\bar{x}$ is the mean of x,
$\bar{y}$ is the mean of y.

Intuition Behind the Line

The line $y = m x + b$ passes through the point ($\bar{x}$, $\bar{y}$). That is, the line always passes through the mean of the data.

Also, the slope (m) tells us how much y changes for a unit increase in X. A positive slope means a positive relationship, and vice versa.

Implementing Linear Regression in Code (No Libraries)
Let’s implement this in pure Python, so we understand every step.

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Step 1: Calculate means
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Step 2: Calculate slope (m)
numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
denominator = sum((xi - x_mean) ** 2 for xi in x)
m = numerator / denominator

# Step 3: Calculate intercept (b)
b = y_mean - m * x_mean

# Final equation
print(f"Regression line: y = {m:.2f}x + {b:.2f}")
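As a cross-check, the same slope and intercept can be obtained from the covariance form of the formula and from NumPy's least-squares fit. This is a sketch on the same sample data; np.polyfit is swapped in here only as an independent reference, not as the method used above.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Slope as Cov(x, y) / Var(x); both use the same n - 1 denominator, so it cancels
m = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b = y.mean() - m * x.mean()

# Independent reference: least-squares fit of a degree-1 polynomial
m_np, b_np = np.polyfit(x, y, 1)

# The fitted line passes through the point of means
y_at_mean = m * x.mean() + b

print(f"covariance form: m = {m:.2f}, b = {b:.2f}")
print(f"np.polyfit:      m = {m_np:.2f}, b = {b_np:.2f}")
print(f"line at x-mean: {y_at_mean:.2f}, y-mean: {y.mean():.2f}")
```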

Evaluation Metrics for Linear Regression

Once you've trained your regression model, it's crucial to assess how well it's performing. Below are the most widely used metrics for evaluating linear regression.

1. Mean Absolute Error (MAE)

Mean Absolute Error measures the average of the absolute differences between predicted and actual values.

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$

$y_i$ = actual value
$\hat{y}_i$ = predicted value
n = number of data points

It gives a linear score, meaning all errors are weighted equally.

2. Mean Squared Error (MSE)

Mean Squared Error measures the average of the squared differences between predicted and actual values.

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Squaring emphasizes larger errors more than smaller ones, making this a sensitive metric to outliers.

3. Root Mean Squared Error (RMSE)

RMSE is simply the square root of MSE. It brings the error metric back to the original unit of the output variable.

$\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }$

4. R-squared (Coefficient of Determination)

R-squared ($R^2$) represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

$R^2 = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 }$

Where:

$\bar{y}$ is the mean of the actual values.

An $R^2$ of:

1.0 means perfect prediction,
0 means no better than predicting the mean,
Negative values can occur when the model performs worse than a horizontal line.
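The three cases in the list can be reproduced with a tiny example. The values below are made up; the reversed predictions are deliberately bad so that the model does worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

perfect = r2_score(y_true, y_true)                       # predictions equal the truth
mean_only = r2_score(y_true, np.full(3, y_true.mean()))  # always predict the mean
reversed_pred = r2_score(y_true, y_true[::-1])           # predictions in reverse order

print("perfect prediction:  ", perfect)
print("predicting the mean: ", mean_only)
print("reversed predictions:", reversed_pred)
```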

Summary

Metric	Description	Sensitive to Outliers
MAE	Average absolute error	Less (errors count linearly)
MSE	Average squared error	Yes
RMSE	Square root of MSE	Yes
$R^2$	Proportion of explained variance	Yes (built from squared errors)

Choose the metric that aligns best with your business goals and sensitivity to error types.

Metric evaluation in python

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Actual and predicted values
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# --- Using NumPy (manual calculations) ---

# MAE
mae = np.mean(np.abs(y_true - y_pred))

# MSE
mse = np.mean((y_true - y_pred)**2)

# RMSE
rmse = np.sqrt(mse)

# R-squared
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - (ss_res / ss_tot)

print("Manual Calculation:")
print(f"MAE: {mae:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")

# --- Using Scikit-learn (recommended for real-world use) ---

mae_sk = mean_absolute_error(y_true, y_pred)
mse_sk = mean_squared_error(y_true, y_pred)
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE; the squared=False argument is deprecated or removed in recent scikit-learn releases
r2_sk = r2_score(y_true, y_pred)

print("\nUsing Scikit-learn:")
print(f"MAE: {mae_sk:.3f}")
print(f"MSE: {mse_sk:.3f}")
print(f"RMSE: {rmse_sk:.3f}")
print(f"R²: {r2_sk:.3f}")