Forem: Koki Esaki

Brushing Up on k-NN for Classification in Python: Theory to Practice

Koki Esaki — Tue, 06 Feb 2024 03:42:53 +0000

Theory

The k-Nearest Neighbor (k-NN) algorithm is often described as one of the simplest algorithms in machine learning. It calculates the distances between each point in a test dataset and the points in a training dataset to identify the closest ones, termed "nearest neighbors." The method is not restricted to a single nearest neighbor: a specific number (k) of nearest neighbors can be used for each prediction.

Euclidean and Manhattan distances are commonly used to calculate distances.

Euclidean distance:

d = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2}

Manhattan distance:

d = |(b_1 - a_1)| + |(b_2 - a_2)|
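
As a quick numerical check of the two formulas, here is a minimal sketch using NumPy. The points `a` and `b` are arbitrary values chosen only for illustration.

```python
import numpy as np

# Two illustrative 2-D points (values chosen only for this example).
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((b - a) ** 2))

# Manhattan distance: sum of the absolute differences.
manhattan = np.sum(np.abs(b - a))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```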

This method is not only applicable to classification tasks but can also be used for regression problems.

Implementation

To implement the k-NN algorithm, we will use the Iris dataset, which is a popular dataset for classification tasks. The dataset contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The target variable is the species of the iris flower, which can be one of three classes: setosa, versicolor, or virginica.

First, we will load the dataset and split it into training and test datasets.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection


dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))

samples: 150; features: 4
samples: 150; values: [0 1 2]

The following code snippet implements the k-NN algorithm with the Euclidean distance metric. For simplicity, it uses only the single nearest neighbor (k = 1) to decide the class label.

from typing import List


class KNeighborsClassifier:

    def __init__(self) -> None:
        self._X_train = None  # The training features to be saved
        self._y_train = None  # The training target to be saved

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """Fit the model from the training dataset.

        :param X: The training features.
        :param y: The training target.
        """

        self._X_train = X
        self._y_train = y

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the class labels for the provided data.

        :param X: The data to be classified.
        :return: The class labels for the provided data.
        """
        classlabels = []
        for p0 in X.values:
            distances = []
            for p1 in self._X_train.values:
                # Calculate the Euclidean distance between two points.
                distance = self.calculate_euclidean_distance(p0, p1)
                distances.append(distance)

            # In this classification model, the nearest point is the class label.
            # It is possible to use a different number of nearest points to get outcomes in other problems.
            nearest_index = np.array(distances).argmin()
            classlabels.append(self._y_train.values[nearest_index])

        return np.array(classlabels)

    def calculate_euclidean_distance(self, p0: List[float], p1: List[float]) -> float:
        """Calculate the Euclidean distance between two points.

        :param p0: The first point.
        :param p1: The second point.
        :return: The Euclidean distance between the two points.
        """
        return np.sqrt(np.sum((p0 - p1) ** 2, axis=0))

Now that we have implemented the k-NN algorithm, we can fit the model to the training dataset.

model = KNeighborsClassifier()
model.fit(X_train, y_train)

Finally, we can use the model to predict the class labels for the test dataset and evaluate the model's performance.

from sklearn.metrics import accuracy_score

y_test_pred = model.predict(X_test)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred)}")

Accuracy score for test data: 0.9736842105263158
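
The class above always takes the single nearest point. As a rough sketch of how the same idea extends to k neighbors, the function below takes a majority vote among the k closest training points. The name `knn_predict` and the toy data are assumptions made for this example, not part of the implementation above, and ties are resolved simply by whichever label `Counter` encounters first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_test = np.asarray(X_test, dtype=float)

    predictions = []
    for p0 in X_test:
        # Euclidean distance from p0 to every training point.
        distances = np.sqrt(np.sum((X_train - p0) ** 2, axis=1))
        # Indices of the k smallest distances.
        nearest = np.argsort(distances)[:k]
        # The most common label among those neighbors wins.
        label = Counter(y_train[nearest].tolist()).most_common(1)[0][0]
        predictions.append(label)
    return np.array(predictions)


# Tiny hand-made dataset (hypothetical values): two well-separated clusters.
X_demo = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_demo = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_demo, y_demo, [[0.3, 0.3], [4.8, 5.0]], k=3))  # [0 1]
```

With `k=1` this reduces to the nearest-point rule used in the class above.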

Exploring Gradient Descent Variants, and Fundamentals of Implementation

Koki Esaki — Mon, 05 Feb 2024 04:51:06 +0000

Introduction

After grasping the concepts of linear regression and its optimization technique, gradient descent, in the previous article, here's an opportunity to dive deeper into gradient descent for a more comprehensive understanding.

Types of Gradient Descent

The gradient descent methods can be broadly categorized into three primary types.

Batch Gradient Descent

In this optimization technique, the entire dataset is used to compute the gradient of the cost function. Essentially, it involves evaluating the loss and updating model parameters once per epoch (a complete pass through the dataset). Batch Gradient Descent is computationally efficient for small to medium-sized datasets but can be slow for large datasets.

Stochastic Gradient Descent, SGD

Unlike Batch Gradient Descent, SGD processes one training example at a time to calculate the gradient. This results in frequent updates to the model parameters and introduces more randomness into the optimization process. While it can be faster and can escape local minima, it may exhibit more oscillations in convergence due to the noise from individual data points.

Mini-batch Gradient Descent

Mini-batch Gradient Descent strikes a balance between Batch and Stochastic Gradient Descent. It divides the dataset into smaller subsets called mini-batches. The gradient is calculated and model parameters are updated after processing each mini-batch. This approach combines some benefits of both previous methods: it's computationally efficient and introduces some noise for faster convergence.
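
The three variants differ only in how many samples feed each parameter update. The sketch below illustrates this with a single function whose `batch_size` argument selects the variant. It assumes a linear regression model with the mean squared error cost and uses synthetic, noise-free data; the function name, learning rate, and epoch count are illustrative choices.

```python
import numpy as np


def minibatch_gd(X, y, batch_size, alpha=0.1, epochs=200, seed=0):
    """Fit linear-regression weights with (mini-)batch gradient descent."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the samples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            grad = X[idx].T @ error / len(idx)  # mean-squared-error gradient
            theta = theta - alpha * grad
    return theta


# Noise-free synthetic data: y = 1 + 2 * x (the first column is the bias term).
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_batch = minibatch_gd(X, y, batch_size=len(X))  # batch gradient descent
theta_sgd = minibatch_gd(X, y, batch_size=1)         # stochastic gradient descent
theta_mini = minibatch_gd(X, y, batch_size=5)        # mini-batch gradient descent
print(theta_batch, theta_sgd, theta_mini)            # each should be close to [1, 2]
```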

Implementation of Gradient Descent

For the scope of this article, our primary focus will be on the batch gradient descent method. Let's consider a simple example to illustrate the gradient descent method. We will use the following cost function:

f(x, y) = 3x^2 - 2xy + 3y^2 + 5x - 5y

The gradient of $f(x, y)$, which collects the partial derivatives with respect to $x$ and $y$, is:

\nabla f(x, y) = \begin{bmatrix} 6x - 2y + 5 \\ -2x + 6y - 5 \end{bmatrix}

Here, we will minimize the cost function $f(x, y)$ using the gradient descent method. We will start with an initial point $(x_0, y_0) = (1.0, 1.0)$.
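
Because this particular cost function is quadratic, its minimizer can also be found directly by setting the gradient to zero, which gives a useful reference value for checking the gradient descent result. The following is a small sketch of that check; the matrix `A` and vector `b` are simply the coefficients of the two gradient equations.

```python
import numpy as np

# Setting the gradient to zero gives the linear system
#    6x - 2y = -5
#   -2x + 6y =  5
A = np.array([[6.0, -2.0], [-2.0, 6.0]])
b = np.array([-5.0, 5.0])

minimizer = np.linalg.solve(A, b)
print(minimizer)  # approximately [-0.625  0.625]
```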

To begin, let's import the required libraries and define the problem we aim to solve.

pip install numpy==1.23.5 matplotlib==3.7.4

import numpy as np


def f(solution: np.ndarray) -> float:
    """The function to minimize.

    :param solution: The solution to the function.
    :return: The value of the function.
    """
    x, y = solution  # Unpack the solution
    return 3 * x ** 2 - 2 * x * y + 3 * y ** 2 + 5 * x - 5 * y


def df(solution: np.ndarray) -> np.ndarray:
    """The derivative of the function.

    :param solution: The solution to the function.
    :return: The gradient of the function.
    """
    x, y = solution  # Unpack the solution
    return np.array([6 * x - 2 * y + 5, -2 * x + 6 * y - 5])
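
Before running an optimizer, it can be worth confirming that the hand-derived gradient matches the function. One common way is a finite-difference check, sketched below. The helper `numerical_gradient` is a hypothetical name introduced for this example, and `f` and `df` are repeated so the snippet runs on its own.

```python
import numpy as np


def f(solution):
    x, y = solution
    return 3 * x ** 2 - 2 * x * y + 3 * y ** 2 + 5 * x - 5 * y


def df(solution):
    x, y = solution
    return np.array([6 * x - 2 * y + 5, -2 * x + 6 * y - 5])


def numerical_gradient(func, point, h=1e-6):
    """Approximate the gradient with central differences (hypothetical helper)."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h  # perturb one coordinate at a time
        grad[i] = (func(point + step) - func(point - step)) / (2 * h)
    return grad


point = np.array([1.0, 1.0])
print(df(point))                     # [ 9. -1.]
print(numerical_gradient(f, point))  # should be close to the analytic gradient
```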

Subsequently, we will proceed to implement the gradient descent method without relying on existing libraries.

from typing import Callable


class GradientDescent:
    """Gradient Descent Method."""

    def __init__(self, f: Callable, df: Callable, alpha: float = 0.01, eps: float = 1e-6) -> None:
        """Initialize the gradient descent method.

        :param f: The function to minimize.
        :param df: The derivative of the function.
        :param alpha: The learning rate.
        :param eps: The convergence criterion.
        """
        self.f = f
        self.df = df
        self.alpha = alpha
        self.eps = eps

        self.solutions = []  # Store the solutions (parameters) at each iteration
        self.answers = []  # Store the value of the function at each iteration
        self.gradients = []  # Store the gradient of the function at each iteration

    def solve(self) -> None:
        """Solve the optimization problem."""
        self.solutions = []  # Empty the solutions
        self.answers = []  # Empty the value of the function
        self.gradients = []  # Empty the gradient of the function

        solution = np.array([1.0, 1.0])  # Initial solution
        answer = self.f(solution)  # Value of the function at the initial solution
        grad = self.df(solution)  # Gradient of the function at the initial solution

        self.solutions.append(solution)
        self.answers.append(answer)
        self.gradients.append(grad)

        # Iterate until the gradient is close to zero
        while (grad ** 2).sum() > self.eps ** 2:
            solution = solution - self.alpha * grad  # Update the solution
            answer = self.f(solution)  # Value of the function at the updated solution
            grad = self.df(solution)  # Gradient of the function at the updated solution

            self.solutions.append(solution)
            self.answers.append(answer)
            self.gradients.append(grad)

        self.solutions = np.array(self.solutions)
        self.answers = np.array(self.answers)
        self.gradients = np.array(self.gradients)

With the implementation described above, we can address the optimization problem as follows while also visualizing the optimization process.

problem = GradientDescent(f, df)
problem.solve()

import matplotlib.pyplot as plt


plt.scatter(problem.solutions[0, 0], problem.solutions[0, 1], color="k", marker="o", label="Initial Solution")
plt.plot(problem.solutions[:, 0], problem.solutions[:, 1], color="k", linewidth=1.5)
xs = np.linspace(-2.5, 1.5, 100)
ys = np.linspace(-1.5, 2.5, 100)
xmesh, ymesh = np.meshgrid(xs, ys)
z = np.concatenate([xmesh.reshape(1, -1), ymesh.reshape(1, -1)], axis=0)
levels = [-3, -2.8, -2.6, -2.4, -2.2, -2, -1, 0, 1, 2, 3, 4]
plt.contour(xs, ys, f(z).reshape(xmesh.shape), levels=levels, colors="k", linestyles="dotted")
plt.show()

The problem.solutions attribute will contain the solutions at each iteration, while the problem.answers attribute will contain the value of the function at each iteration. We can visualize the convergence of the gradient descent method using the following code.

fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(2, 2, 1)
ax.set_title("Gradient (x)")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.gradients)), problem.gradients[:, 0], color="b")

ax = fig.add_subplot(2, 2, 2)
ax.set_title("Gradient (y)")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.gradients)), problem.gradients[:, 1], color="r")

ax = fig.add_subplot(2, 2, 3)
ax.set_title("Answer")
ax.set_ylabel("Value")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.answers)), problem.answers, color="r")

plt.show()

References

https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a

Brushing Up on Logistic Regression in Python: Theory to Practice

Koki Esaki — Sun, 04 Feb 2024 15:09:23 +0000

Introduction

Contrary to its name, logistic regression is not a regression algorithm but a classification algorithm. In standard regression algorithms, the predicted value of $y$ is continuous. In logistic regression, however, the output of the hypothesis function falls within the range $0 \leq h_\theta(x) \leq 1$, because we want to categorize into discrete values such as 0 or 1. The output is split by a threshold value (0.5 in this case): if $h_\theta(x) \geq 0.5$, then $y = 1$; if $h_\theta(x) < 0.5$, then $y = 0$.

Binary Logistic Regression

Binary Logistic Regression is used for binary classification tasks, where the objective is to categorize instances into one of two possible classes. These two classes are often represented as 0 and 1, which correspond to outcomes such as false/true, negative/positive, fail/pass, etc.

Activation Function

In order to categorize by discrete values, the Logistic Function, also known as the Sigmoid Function, is introduced. The characteristic feature is that the function satisfies $0 < g (z) < 1$ and $g (0) = 0.5$ .

g(z) = \frac{1}{1 + e^{-z}}

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

import matplotlib.pyplot as plt

x = np.arange(-5.0, 5.0, 0.1)
y = sigmoid(x)
plt.plot(x, y)
plt.grid()
plt.show()

Hypothesis Function

In logistic regression, the hypothesis function is a composite function that unites the hypothesis function of linear regression with the sigmoid function.

h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}

Cost Function

Logistic regression cannot use the same cost function as linear regression, because combining the squared error with the sigmoid hypothesis yields a non-convex ("wavy") cost function with many local optima.

Revisit the cost function of Linear regression:

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2

Instead, we want a cost that becomes small as $h_\theta(x)$ approaches the value of $y$ and large as it moves away. This can be expressed using the log function:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log (h_\theta(x^{(i)})) + (1-y^{(i)})\log (1 - h_\theta(x^{(i)}))\right)

This cost function is commonly known as the "Cross-entropy loss".
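
To make the behavior of this cost concrete, here is a minimal sketch that evaluates it for two sets of predictions. The function name `cross_entropy` and the example probabilities are assumptions for illustration only.

```python
import numpy as np


def cross_entropy(y, h):
    """Mean binary cross-entropy for labels y and predicted probabilities h."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


y_true = [1, 0]
loss_good = cross_entropy(y_true, [0.9, 0.1])  # confident and correct
loss_bad = cross_entropy(y_true, [0.1, 0.9])   # confident and wrong
print(loss_good)  # small loss, about 0.105
print(loss_bad)   # large loss, about 2.303
```

Predictions close to the true labels give a loss near zero, while confident wrong predictions are penalized heavily.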

Optimization

The gradient descent method in logistic regression is basically the same as in linear regression, but the contents of the hypothesis function $h_\theta(x)$ are different.

Repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}
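
The update rule can be applied to all parameters at once with matrix operations. The sketch below performs a single update on a tiny hand-made dataset; `gradient_step` is a hypothetical helper name, and the data and learning rate are chosen only so the result is easy to verify by hand.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def gradient_step(theta, X, y, alpha):
    """Apply one batch gradient-descent update to theta."""
    m = X.shape[0]
    h = sigmoid(X @ theta)      # predictions h_theta(x) for all samples
    grad = X.T @ (h - y) / m    # (1/m) * sum((h - y) * x_j) for every j
    return theta - alpha * grad


X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
theta = gradient_step(np.zeros(2), X, y, alpha=0.1)
print(theta)  # approximately [ 0.025 -0.025]
```

Starting from zero parameters, every prediction is 0.5, so the first weight moves up (toward the positive sample) and the second moves down.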

Implementation

We will implement the binary logistic regression algorithm using Python. The following code is a simple implementation of binary logistic regression using the breast cancer dataset from the scikit-learn library.

To start, we will load the dataset and divide it into training and test sets.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection


dataset = datasets.load_breast_cancer()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))

samples: 569; features: 30
samples: 569; values: [0 1]

Next, we will standardize the dataset. Standardization is a method of normalizing the dataset by subtracting the mean and dividing by the standard deviation. This is done to prevent the influence of large values on the model.

def standardize(X: pd.DataFrame) -> pd.DataFrame:
    """Standardize the dataset. (z-score normalization)
    :param X: The dataset to be standardized.
    :return: The standardized dataset.
    """
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)


X_train_std = standardize(X_train)
X_test_std = standardize(X_test)
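
The code above standardizes the training and test sets separately, each with its own mean and standard deviation. A common alternative, shown here only as a sketch and not used in this article, is to compute the statistics on the training set and reuse them for the test set so that both are on exactly the same scale. The helper names and toy arrays below are hypothetical.

```python
import numpy as np


def fit_standardizer(X_train):
    """Return the per-feature mean and standard deviation of the training data."""
    return np.mean(X_train, axis=0), np.std(X_train, axis=0)


def apply_standardizer(X, mean, std):
    """Standardize X with previously computed statistics."""
    return (X - mean) / std


# Hypothetical toy arrays standing in for the training and test features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

mean, std = fit_standardizer(train)
train_std = apply_standardizer(train, mean, std)
test_std = apply_standardizer(test, mean, std)  # reuses the training statistics
print(test_std)
```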

Now, we will proceed to implement the binary logistic regression algorithm. To gain a deeper understanding of the implementation, please refer to the comments provided within the code.

class BinaryLogisticRegression:

    def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
        self.alpha = alpha  # Learning rate for gradient descent
        self.eps = eps  # Threshold of convergence

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "BinaryLogisticRegression":
        """Fit the model to the training dataset. Optimizing the parameters by gradient descent.

        :param X: The training dataset.
        :param y: The target.
        :return: The trained model.
        """
        self._m = X.shape[0]  # The number of samples
        num_features = X.shape[1]  # The number of features

        self._theta = np.zeros(num_features)  # The parameters (weight)

        self._error_values = []  # The output values of the cost function in each iteration
        self._grad_values = []  # Gradient values in each iteration
        self._iter_counter = 0  # The counter of iterations

        error = self.J(X, y)  # The initial output value of the cost function with zero-initialized parameters
        diff = 1.0  # The difference between the previous and the current output values of the cost function

        # Repeat until convergence
        while diff > self.eps:
            # Update the parameters by gradient descent
            grad = (1 / self._m) * np.dot(self.h(X, self._theta) - y, X)  # Calculate the gradient using the formula
            self._theta = self._theta - self.alpha * grad  # Update the parameters

            # Print the current status
            _error = self.J(X, y)  # Compute the error with the updated parameters
            diff = abs(error - _error)  # Compute the difference between the previous and the current error
            error = _error  # Update the error
            self._error_values.append(error)
            self._grad_values.append(grad.sum())
            self._iter_counter += 1
            print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")
        print(f"Convergence in {self._iter_counter} iterations.")
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the target values.

        :param X: The dataset to be predicted.
        :return: The predicted target values.
        """
        return np.where(self.h(X, self._theta) >= 0.5, 1, 0)

    def activate(self, z: np.ndarray) -> np.ndarray:
        """Activation function (sigmoid/logistic function).

        :param z: The output of the hypothesis function.
        :return: The activated output. 0 <= activate(z) <= 1
        """
        return 1 / (1 + np.exp(-z))

    def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
        """Hypothesis function.

        :param X: The dataset
        :param theta: The parameters (weight)
        :return: The activated output. 0 <= h(x, theta) <= 1
        """
        return self.activate(np.dot(X, theta))

    def J(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Cost function (cross-entropy loss).

        :param X: The dataset
        :param y: The target
        :return: The loss value.
        """
        delta = 1e-7  # To avoid log(0)
        return - (1 / self._m) * (
            np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
        )

Now that the model is prepared, we can go ahead and train it using the standardized training dataset while also visualizing the training process.

model = BinaryLogisticRegression()
model.fit(X_train_std, y_train)

import matplotlib.pyplot as plt


fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(1, 2, 1)
ax.set_title("Cross-entropy Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")

ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")

plt.show()

With each iteration, the cross-entropy loss decreases and the gradient approaches zero. This indicates that the model is converging.

Finally, we will evaluate the model using the standardized training and test datasets.

from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train_std)
print(f"Accuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred)}")

Accuracy score for train data: 0.9882629107981221
Accuracy score for test data: 0.958041958041958

Multiple Logistic Regression

In the previous section, we implemented a binary logistic regression model. In this section, we will implement a multiple logistic regression model (also known as multinomial or softmax regression), which can handle more than two classes.

Activation Function

The activation function of the multiple logistic regression model is the softmax function. The softmax function is defined as follows:

g_k(z) = \frac{e^{z_k}}{\sum_{i=1}^n e^{z_i}}

The softmax function takes a vector of real numbers and returns a vector of the same length, where each element is in the range (0, 1), and the sum of the elements is 1. This is useful for representing the probability distribution of the classes.

import numpy as np

def softmax(z):
    z = z - np.max(z, axis=0)  # Prevent overflow
    return np.exp(z) / np.sum(np.exp(z))

We can visualize the softmax function using the following code:

import matplotlib.pyplot as plt

x = np.arange(-5.0, 5.0, 0.1)
y = softmax(x)
plt.plot(x, y)
plt.ylim(0, 0.1)
plt.grid()
plt.show()

Implementation

The multiple logistic regression model is similar to the binary logistic regression model, but predicts the probability distribution of the classes using the softmax function and takes the class with the highest probability as the predicted class.

As a training dataset, we will use the Iris dataset, which contains 150 samples of three classes of iris flowers. The dataset has four features: sepal length, sepal width, petal length, and petal width.

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection


dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))

samples: 150; features: 4
samples: 150; values: [0 1 2]

We will use One-hot encoding to convert the target values to a binary matrix. The one-hot encoding is a representation of categorical variables as binary vectors. This method is commonly used to ensure that categorical variables do not imply any ordinal relationship, and each category is treated independently.

y_train_encoded = pd.get_dummies(y_train, dtype=int)
print(y_train.head(3))
y_train_encoded.head(3)

The encoded target values will be in a format that matches the predicted probabilities for each target value, as in $[0.20, 0.30, 0.50]$ .

After training the model, the predicted target, determined by selecting the maximum probability, will be compared with the true target values. For instance, if $y = [0.20, 0.30, 0.50]$, then $\operatorname{argmax}(y) = 2$.
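
The round trip between class indices, one-hot vectors, and predicted classes can be sketched with plain NumPy as follows. The labels and probabilities are made-up values used only for illustration.

```python
import numpy as np

labels = np.array([2, 0, 1])  # hypothetical class indices
num_classes = 3

# One-hot encode: row i of the identity matrix is the encoding of class i.
encoded = np.eye(num_classes, dtype=int)[labels]
print(encoded)

# Decode predicted probabilities by taking the index of the largest entry per row.
probabilities = np.array([[0.20, 0.30, 0.50],
                          [0.70, 0.20, 0.10],
                          [0.10, 0.80, 0.10]])
decoded = probabilities.argmax(axis=1)
print(decoded)  # [2 0 1]
```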

Next, as in the binary logistic regression model, we will implement the standardize function and apply it to the training and test datasets.

def standardize(X: pd.DataFrame) -> pd.DataFrame:
    """Standardize the dataset. (z-score normalization)
    :param X: The dataset to be standardized.
    :return: The standardized dataset.
    """
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)


X_train_std = standardize(X_train)
X_test_std = standardize(X_test)

We will now proceed to implement the multiple logistic regression model. It's important to note that the implementation is similar to the binary logistic regression model, but in this case, we will use the softmax function as the activation function. Additionally, we will determine the predicted class based on the class with the highest probability.

class MultipleLogisticRegression:

    def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
        self.alpha = alpha  # Learning rate for gradient descent
        self.eps = eps  # Threshold of convergence

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "MultipleLogisticRegression":
        """Fit the model to the training dataset. Optimizing the parameters by gradient descent.

        :param X: The training dataset.
        :param y: The target.
        :return: The trained model.
        """
        self._m = X.shape[0]  # The number of samples
        num_features = X.shape[1]  # The number of features
        num_targets = y.shape[1]  # The number of targets

        self._theta = np.zeros([num_targets, num_features])  # The parameters (weight)

        self._error_values = []  # The output values of the cost function in each iteration
        self._grad_values = []  # Gradient values in each iteration
        self._iter_counter = 0  # The counter of iterations

        error = self.J(X, y)  # The initial output value of the cost function with zero-initialized parameters
        diff = np.ones(num_targets)  # The difference between the previous and the current output values of the cost function

        # Repeat until convergence
        while diff.sum() > self.eps:
            # Update the parameters by gradient descent
            grad = (1 / self._m) * np.dot((self.h(X, self._theta) - y).T, X)  # Calculate the gradient using the formula
            self._theta = self._theta - self.alpha * grad  # Update the parameters

            # Print the current status
            _error = self.J(X, y)  # Compute the error with the updated parameters
            diff = abs(error - _error)  # Compute the difference between the previous and the current error
            error = _error  # Update the error
            self._error_values.append(error.sum())
            self._grad_values.append(grad.sum())
            self._iter_counter += 1
            print(f"[{self._iter_counter}] error: {error.sum()}, diff: {diff.sum()}, grad: {grad.sum()}")
        print(f"Convergence in {self._iter_counter} iterations.")
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the target values.

        :param X: The dataset to be predicted.
        :return: The predicted target values.
        """
        return self.h(X, self._theta).argmax(1)

    def activate(self, z: np.ndarray) -> np.ndarray:
        """Activation function (softmax function).

        :param z: The output of the hypothesis function.
        :return: The activated output. 0 <= activate(z) <= 1
        """
        return np.exp(z)/np.sum(np.exp(z), axis=1, keepdims=True)

    def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
        """Hypothesis function.

        :param X: The dataset
        :param theta: The parameters (weight)
        :return: The activated output. 0 <= h(x, theta) <= 1
        """
        return self.activate(np.dot(X, theta.T))

    def J(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Cost function (cross-entropy loss).

        :param X: The dataset
        :param y: The target
        :return: The loss value.
        """
        delta = 1e-7  # To avoid log(0)
        return - (1 / self._m) * (
            np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
        )

Since the model is ready, we will move on to train the model using the standardized features and one-hot encoded targets and evaluate its performance.

model = MultipleLogisticRegression()
model.fit(X_train_std, y_train_encoded)

from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train_std)
print(f"Accuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred)}")

Accuracy score for train data: 0.875
Accuracy score for test data: 0.7105263157894737

The model performs reasonably on the training dataset but noticeably worse on the test dataset, which suggests overfitting. To address this issue, regularization techniques can be employed to mitigate overfitting and improve the model's generalization.
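
As one illustration of what regularization could look like here, the sketch below adds an L2 (ridge) penalty term to a gradient. This is an assumption about how one might extend the model, not something the implementation above does; the function name, the penalty strength `lam`, and the example weights are hypothetical.

```python
import numpy as np


def l2_regularized_gradient(grad, theta, lam, m):
    """Add the gradient of the penalty (lam / (2 * m)) * sum(theta ** 2) to grad."""
    return grad + (lam / m) * theta


theta = np.array([[1.0, -2.0], [0.5, 0.0]])  # hypothetical weights
grad = np.zeros_like(theta)                  # pretend the data gradient is zero
penalized = l2_regularized_gradient(grad, theta, lam=1.0, m=10)
print(penalized)
```

The extra term pulls every weight toward zero on each update, which discourages large parameter values.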

References

https://www.coursera.org/specializations/machine-learning-introduction

Deciphering Standardization and Normalization: Understanding Feature Scaling Techniques

Koki Esaki — Sat, 03 Feb 2024 10:22:36 +0000

Importance of Feature Scaling

Machine learning algorithms such as linear regression and neural networks work better or converge faster when the features are on a similar scale, and feature scaling techniques such as standardization bring the features onto a similar scale.

For example, when considering features like age and income, your model may prioritize income over age due to the significant difference in the scale of values.

Standardization (Z-score normalization)

Standardization rescales the features of a dataset so that each has a mean of 0 and a standard deviation (SD) of 1. This is achieved by subtracting the feature's mean from each value and then dividing by the feature's standard deviation.

The formula for standardization is:

x_i' = \frac{x_i - \text{mean}(x)}{\text{SD}(x)}

It is less affected by outliers than normalization. Therefore, this method is often used when the maximum and minimum values are not fixed or when outliers exist.

from sklearn import preprocessing
import numpy as np


X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled)

array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

Normalization (Min-Max scaling)

Normalization scales the features of a dataset to a specific range, typically between 0 and 1. This is achieved by subtracting the feature's minimum value from each value and then dividing by the feature's range.

The formula for normalization is:

x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
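
To see how a single extreme value influences each technique, the following sketch applies both formulas to a small made-up feature containing one outlier. The data is hypothetical and the snippet only lets the two results be compared side by side.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # the last value is an outlier

x_minmax = (x - x.min()) / (x.max() - x.min())
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # the four ordinary values end up squeezed close to 0
print(x_std)     # values are centered on 0 but not confined to a fixed range
```

With min-max scaling, the outlier defines the upper end of the range, so the remaining values occupy only a small part of the interval from 0 to 1.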

Implementations from Scratch

First, we will import the necessary libraries, load the dataset, and use the two features from the Iris dataset for the demonstration.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
X = data.iloc[:, 2:]

Standardization rescales each feature to a mean of zero and a variance of one. The following code demonstrates how to standardize the dataset.

def standardize(X):
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)


X_std = standardize(X)

Normalization is a 0-1 scaling method that maps the minimum value to 0 and the maximum value to 1. The following code shows how to normalize the dataset.

def normalize(X):
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))


X_norm = normalize(X)
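
A simple way to sanity-check the two functions is to confirm the properties they are supposed to guarantee. The sketch below does this on a small NumPy array; `X_demo` is a hypothetical stand-in for the petal measurements, and the two functions are repeated so the snippet runs on its own.

```python
import numpy as np


def standardize(X):
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)


def normalize(X):
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))


# Hypothetical two-feature sample standing in for the petal measurements.
X_demo = np.array([[1.4, 0.2], [4.7, 1.4], [6.0, 2.5], [5.1, 1.8]])

X_demo_std = standardize(X_demo)
X_demo_norm = normalize(X_demo)

print(X_demo_std.mean(axis=0), X_demo_std.std(axis=0))   # about 0 and 1 per feature
print(X_demo_norm.min(axis=0), X_demo_norm.max(axis=0))  # 0 and 1 per feature
```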

The preprocessing results can be visualized using the following plotting method. The first plot shows the original dataset, the second plot shows the standardized dataset, and the third plot shows the normalized dataset.

import matplotlib.pyplot as plt


fig = plt.figure(figsize=(16, 12))

ax = fig.add_subplot(2, 2, 1)
ax.scatter(X.iloc[:, 0], X.iloc[:, 1])
ax.set_title("Before Standardization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

ax = fig.add_subplot(2, 2, 3)
ax.scatter(X_std.iloc[:, 0], X_std.iloc[:, 1])
ax.set_title("After Standardization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

ax = fig.add_subplot(2, 2, 4)
ax.scatter(X_norm.iloc[:, 0], X_norm.iloc[:, 1])
ax.set_title("After Normalization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

plt.show()

References

Brushing Up on Linear Regression in Python: Theory to Practice

Koki Esaki — Thu, 01 Feb 2024 11:23:59 +0000

Having completed an extensive machine learning course, I've noticed that my memory of the material is starting to diminish. To address this, I've made the decision to write a series of articles.

Introduction

Suppose the x-axis represents age and the y-axis represents income. The plotted data follow a roughly linear trend, so it seems possible to describe them with a linear function.

The blue line is merely a visual guide and is not derived mathematically. To determine the actual equation of this line, we need a hypothesis function, a cost function, and an optimization method, which the following sections introduce.

Hypothesis Function

Adjust the free parameters $\theta_0, \theta_1$ of the function to formulate an expression that fits the data with minimal error.

h_\theta(x) = \theta_0 + \theta_1x_1

For scenarios involving multiple variables, the formula is structured as follows. It no longer describes a straight line but a hyperplane; it is still linear in the parameters, and the foundational principle stays the same.

h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n
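
The multi-variable hypothesis above can be evaluated for many samples at once with a matrix-vector product. The sketch below is a minimal illustration; the function name `h` mirrors the formula, while the arrays and parameter values are hypothetical.

```python
import numpy as np


def h(X, theta0, theta):
    """Evaluate theta0 + theta_1 * x_1 + ... + theta_n * x_n for each row of X."""
    return theta0 + X @ theta


# Hypothetical data: 2 samples with 2 features each
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
theta0_demo = 0.5                     # bias term theta_0
theta_demo = np.array([2.0, -1.0])    # weights theta_1, theta_2

pred_demo = h(X_demo, theta0_demo, theta_demo)
print(pred_demo)  # [0.5 2.5]
```

The first prediction is 0.5 + 2·1 − 1·2 = 0.5 and the second is 0.5 + 2·3 − 1·4 = 2.5.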

Cost Function

The cost function is the tool used to fit the hypothesis function. Simply put, it calculates the average squared discrepancy between the predicted results and the actual outputs. By determining the parameters $\theta$ that minimize this error, the parameters of the hypothesis function can be determined. This cost is commonly known as the "mean squared error (MSE)".

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2

The division by 2 in the function $J(\theta)$ simplifies the differentiation performed later: the factor of 2 produced by differentiating the square cancels it.
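
To make the cost function concrete, here is a small sketch that evaluates $J(\theta)$ for two candidate parameter settings. The helper name `cost` and the toy data are assumptions made for illustration only.

```python
import numpy as np


def cost(X, y, theta0, theta):
    """Cost J(theta): sum of squared residuals divided by 2m."""
    m = X.shape[0]
    residuals = theta0 + X @ theta - y
    return np.sum(residuals ** 2) / (2 * m)


# Hypothetical data generated by y = 2x
X_c = np.array([[1.0], [2.0], [3.0]])
y_c = np.array([2.0, 4.0, 6.0])

perfect = cost(X_c, y_c, 0.0, np.array([2.0]))  # parameters that match the data exactly
poor = cost(X_c, y_c, 0.0, np.array([1.0]))     # slope too small

print(perfect)  # 0.0
print(poor)     # about 2.33, since (1 + 4 + 9) / (2 * 3) = 14 / 6
```

The better the parameters fit the data, the smaller the cost, which is why minimizing $J(\theta)$ yields the fitted model.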

Optimization Using Gradient Descent

A strategy is needed to minimize the cost function, that is, to find the parameters that give the lowest error.

The mean squared error is minimized where the partial derivative of the cost function with respect to each parameter equals zero. The gradient descent method moves toward that point step by step using the update formula below.

The update is applied repeatedly until the parameter values converge:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}

Partial differentiation is applied: the cost function is differentiated with respect to one parameter while the other parameters are treated as constants. The resulting gradient is multiplied by α, called the learning rate, and subtracted from the current $\theta_j$ to obtain the updated $\theta_j$. As the gradient approaches 0, the change in $\theta_j$ also approaches 0, whatever the value of α. When the change becomes small enough, the method is said to have converged.

Note that if the value of α is too high, the change in $\theta_j$ becomes too large, and the method may fail to converge. On the other hand, a smaller α gives slower but more reliable convergence. Additionally, $\theta_0, \theta_1, ..., \theta_n$ must all be updated simultaneously, using gradients computed from the same previous parameter values.
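
The update rule and the effect of the learning rate can be tried on a small example. The following is a minimal sketch on synthetic, noise-free data; the function name `gradient_descent`, the data, and the chosen values of α are assumptions made for illustration.

```python
import numpy as np

# Hypothetical noise-free data generated by y = 1 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 1.0 + 3.0 * x


def gradient_descent(x, y, alpha, n_iter):
    """Fit theta0 + theta1 * x by gradient descent on the cost J(theta)."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(n_iter):
        residual = theta0 + theta1 * x - y
        grad0 = residual.sum() / m        # partial derivative with respect to theta0
        grad1 = (residual * x).sum() / m  # partial derivative with respect to theta1
        # Simultaneous update: both gradients were computed from the old parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# A moderate learning rate converges to the true parameters
theta0_fit, theta1_fit = gradient_descent(x, y, alpha=0.5, n_iter=5000)
print(theta0_fit, theta1_fit)

# A learning rate that is too large makes the parameters blow up instead
theta0_div, theta1_div = gradient_descent(x, y, alpha=2.5, n_iter=50)
print(theta0_div, theta1_div)
```

With α = 0.5 the fitted values approach 1 and 3, whereas with α = 2.5 each step overshoots the minimum and the parameters grow without bound.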

Implementation of Linear Regression

In this section, we will develop a linear regression model utilizing the gradient descent technique. We will use the California Housing dataset from the scikit-learn library for this example.

To begin, we will import the necessary libraries and load the dataset.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4 seaborn==0.13.2

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection

dataset = datasets.fetch_california_housing()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

The dataset comprises 8 features and 20,640 samples. The target variable is the median house value of each block group, expressed in units of 100,000 USD.

The code provided next will generate a plot of the correlation matrix for this dataset.

import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = pd.concat([X, y], axis=1).corr()

plt.figure(figsize=(9, 6))
plt.title("Correlation Matrix")
sns.heatmap(corr_matrix, annot=True, square=True, cmap="Blues", fmt=".2f", linewidths=.5)
plt.savefig("california_housing_corr_matrix.png")

The correlation matrix reveals that the median income is the most strongly correlated with the target variable. The correlation between the target variable and the other features is comparatively lower. However, for simplicity in this example, all features will be used.

We are now set to build the linear regression model. To gain a deeper understanding of its mechanics, we'll create it from the ground up, without relying on pre-existing machine learning libraries.

class LinearRegression:

    def __init__(self, alpha: float = 1e-7, eps: float = 1e-4) -> None:
        self.alpha = alpha  # Learning rate for gradient descent
        self.eps = eps  # Threshold of convergence

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "LinearRegression":
        """Train the model. Optimization method is gradient descent.

        :param X: The feature values of the training data.
        :param y: The target values of the training data.
        :return: The trained model.
        """
        self._m = X.shape[0]  # The number of samples
        num_features = X.shape[1]  # The number of features

        self._theta = np.zeros(num_features)  # Parameters (weight) of the model (without bias)
        self._theta0 = np.zeros(1)  # Bias of the model

        self._error_values = []  # The output values of the cost function in each iteration
        self._grad_values = []  # Gradient values in each iteration
        self._iter_counter = 0  # The counter of iterations

        error = self.J(X, y)  # The initial output value of the cost function with zero-initialized parameters
        diff = 1.0  # The difference between the previous and the current output values of the cost function

        # Repeat until convergence
        while diff > self.eps:
            # Update the parameters by gradient descent
            y_pred = self.predict(X)  # Predict the target values with the current parameters
            grad = (1 / self._m) * np.dot(y_pred - y, X)  # Calculate the gradient using the formula
            self._theta -= self.alpha * grad  # Update the parameters
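            # Update the bias. Note: alpha is not applied here, so the bias uses a step size of 1,
            # which differs from the update formula used for the other parameters.
            self._theta0 -= (1 / self._m) * np.sum(y_pred - y)

            # Record and print the current status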
            _error = self.J(X, y)  # Compute the error with the updated parameters
            diff = abs(error - _error)  # Compute the difference between the previous and the current error
            error = _error  # Update the error
            self._error_values.append(error)
            self._grad_values.append(grad.sum())
            self._iter_counter += 1
            print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")
        print(f"Convergence in {self._iter_counter} iterations.")
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the target values using the hypothesis function.

        :param X: The feature values of the data.
        :return: The predicted target values.
        """
        # Pass the bias and the parameters to the hypothesis function
        theta = np.concatenate([self._theta0, self._theta])
        return self.h(X, theta)

    def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
        """Hypothesis function.

        :param X: The feature values of the data.
        :param theta: The parameters (weight) of the model.
        :return: The predicted target values.
        """
        # theta[0] is bias and theta[1:] is parameters
        return np.dot(X, theta[1:].T) + theta[0]

    def J(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Cost function. Mean squared error (MSE).

        :param X: The feature values of the data.
        :param y: The target values of the data.
        :return: The error value.
        """
        y_pred = self.predict(X)  # Predict the target values with the current parameters
        return (1 / (2 * self._m)) * np.sum((y_pred - y) ** 2)  # Compute the error using the formula

The code is explained in its comments. Next, we will train the model and assess its performance on both the training and test datasets.

model = LinearRegression()
model.fit(X_train, y_train)

The results of the training process can be visualized using the following plotting method.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(1, 2, 1)
ax.set_title("MSE")
ax.set_ylabel("Error")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")

ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")

plt.show()

Now, let's evaluate the model on both the training and test data sets.

from sklearn.metrics import mean_squared_error

y_train_pred = model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, y_test_pred)}")

The Mean Squared Error (MSE) for the training data stands at 1.33, while for the test data, it is 1.32. The marginally lower MSE for the test data suggests that the model is not overfitting, which is a positive indication of its generalization capability.

MSE for train data: 1.3350646294600155
MSE for test data: 1.322791709774531

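By using the scikit-learn library, the same model can be implemented with far less code. Note that the results are not the same: in the run below, scikit-learn reaches an MSE of about 0.52 on the training data, compared with 1.33 for the implementation above. scikit-learn's LinearRegression solves the least-squares problem directly, whereas the gradient descent loop above stops as soon as the change in the cost falls below the threshold, which probably happens before it reaches the minimum.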

from sklearn.linear_model import LinearRegression as SklearnLinearRegression

sklearn_model = SklearnLinearRegression()
sklearn_model.fit(X_train, y_train)

sklearn_y_train_pred = sklearn_model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, sklearn_y_train_pred)}")
sklearn_y_test_pred = sklearn_model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, sklearn_y_test_pred)}")

MSE for train data: 0.5192270684511334
MSE for test data: 0.5404128061709085

Regularization

Regularization shrinks the weights to prevent overfitting by making it difficult for any single weight to take a large value. It seeks the set of weights that minimizes the cost function subject to a constraint on their size.

Ridge Regression

Ridge regression is a linear regression method. The equations used for prediction are the same as those in linear regression, but L2 regularization is used to avoid overfitting. It improves generalization by keeping each weight close to zero.

Cost function:

J(\theta) = \frac{1}{2m}(\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2)

Gradient descent with Regularization:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha (\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j)
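
The regularized update differs from plain gradient descent only by the extra $\frac{\lambda}{m}\theta_j$ term. The sketch below illustrates this on synthetic data; the function name `ridge_gradient_descent`, the data, and the values of α and λ are assumptions for illustration. The bias $\theta_0$ is left unregularized, matching the penalty sum that starts at $j = 1$.

```python
import numpy as np


def ridge_gradient_descent(X, y, alpha, lam, n_iter):
    """Gradient descent for linear regression with an L2 penalty on the weights."""
    m, n = X.shape
    theta0, theta = 0.0, np.zeros(n)
    for _ in range(n_iter):
        residual = theta0 + X @ theta - y
        grad0 = residual.sum() / m                     # bias: no penalty term
        grad = X.T @ residual / m + (lam / m) * theta  # weights: extra (lambda / m) * theta_j
        theta0, theta = theta0 - alpha * grad0, theta - alpha * grad
    return theta0, theta


# Hypothetical synthetic data with known weights and a little noise
rng = np.random.default_rng(0)
X_r = rng.normal(size=(50, 3))
y_r = X_r @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

_, theta_plain = ridge_gradient_descent(X_r, y_r, alpha=0.1, lam=0.0, n_iter=2000)
_, theta_ridge = ridge_gradient_descent(X_r, y_r, alpha=0.1, lam=50.0, n_iter=2000)

print(theta_plain)  # close to the weights used to generate the data
print(theta_ridge)  # same signs, but shrunk toward zero
```

Setting λ = 0 recovers ordinary linear regression, and increasing λ pulls the weights closer to zero.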

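Thus, ridge regression uses the L2 norm for the regularization. The L2 norm of the weight vector is its Euclidean distance from the origin, and the penalty term above is the square of that norm. For two points $a$ and $b$, the Euclidean distance is: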

\sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2}

Lasso Regression

Lasso regression applies L1 regularization, which can drive some weights to exactly zero. The corresponding features are then entirely excluded from the model. With some weights set to zero, the model becomes simpler and shows more clearly which features are significant.

Cost function:

J(\theta) = \frac{1}{2m}(\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}|\theta_j|)

Thus, lasso regression uses the L1 norm for the regularization. The L1 norm of the weight vector is its Manhattan distance from the origin. For two points $a$ and $b$, the Manhattan distance is:

d = |(b_1 - a_1)| + |(b_2 - a_2)|
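
To compare the two penalties side by side, the following sketch computes the L1 and L2 norms of a small weight vector, both by hand and with `np.linalg.norm`. The vector `theta_demo` is a hypothetical example.

```python
import numpy as np

# Hypothetical weight vector
theta_demo = np.array([3.0, -4.0, 0.0])

# L1 norm: sum of absolute values (Manhattan distance from the origin)
l1 = np.sum(np.abs(theta_demo))
# L2 norm: square root of the sum of squares (Euclidean distance from the origin)
l2 = np.sqrt(np.sum(theta_demo ** 2))

print(l1, np.linalg.norm(theta_demo, ord=1))  # 7.0 7.0
print(l2, np.linalg.norm(theta_demo, ord=2))  # 5.0 5.0

# The ridge penalty uses the squared L2 norm, the lasso penalty uses the L1 norm
print(l2 ** 2, l1)  # 25.0 7.0
```

Because the ridge penalty squares each weight, large weights are penalized much more heavily than small ones, whereas the lasso penalty grows linearly with each weight.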

References

https://www.coursera.org/specializations/machine-learning-introduction