<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Arbash Hussain</title>
    <description>The latest articles on Forem by Arbash Hussain (@arbashhussain).</description>
    <link>https://forem.com/arbashhussain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1682530%2Ff7d00260-1e61-4ae8-a18f-ec1db53aa6e0.webp</url>
      <title>Forem: Arbash Hussain</title>
      <link>https://forem.com/arbashhussain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/arbashhussain"/>
    <language>en</language>
    <item>
      <title>A Step-by-Step Guide to K-Nearest Neighbors (KNN) in Machine Learning</title>
      <dc:creator>Arbash Hussain</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:05:44 +0000</pubDate>
      <link>https://forem.com/arbashhussain/a-step-by-step-guide-to-k-nearest-neighbors-knn-in-machine-learning-40g2</link>
      <guid>https://forem.com/arbashhussain/a-step-by-step-guide-to-k-nearest-neighbors-knn-in-machine-learning-40g2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome back, everyone, to the 3rd blog post in our &lt;a href="https://dev.to/cc-keh/series/37678"&gt;Machine Learning Algorithms Series&lt;/a&gt;! Today, we'll dive into K-Nearest Neighbors (KNN), a fundamental algorithm in machine learning. We'll be implementing the KNN algorithm from scratch in Python. By the end of this blog, you'll have a clear understanding of how KNN works, how to implement it, and when to use it. Let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is KNN?
&lt;/h2&gt;

&lt;p&gt;K-Nearest Neighbors (KNN) is a straightforward yet powerful supervised machine learning algorithm used for both classification and regression tasks. Its simplicity comes from its non-parametric nature, meaning it doesn't assume anything about the underlying data distribution. Instead, KNN works by finding the 'k' closest data points (neighbors) in the training dataset to a new input point and making predictions based on these neighbors.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;classification&lt;/strong&gt; tasks, KNN predicts the class label of the new data point by a majority vote among its nearest neighbors. The class label that appears most frequently among the nearest neighbors is assigned to the new data point.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;regression&lt;/strong&gt; tasks, KNN predicts the value of the new data point by taking the average of the values of its nearest neighbors. This average value serves as the predicted output for the new data point.&lt;/p&gt;
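
&lt;p&gt;As a quick illustration (a toy sketch, separate from the full implementation below), here is what these two prediction rules look like on a hypothetical list of neighbor labels and values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from collections import Counter

# Hypothetical labels/values of the k = 5 nearest neighbors
neighbor_labels = ["cat", "dog", "cat", "cat", "dog"]
neighbor_values = [3.1, 2.9, 3.4, 3.0, 3.2]

# Classification: majority vote among the neighbors
print(Counter(neighbor_labels).most_common(1)[0][0])  # cat

# Regression: average of the neighbors' values
print(np.mean(neighbor_values))  # 3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;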

&lt;h2&gt;
  
  
  Step-by-Step Implementation
&lt;/h2&gt;

&lt;p&gt;Code is available on &lt;a href="https://github.com/CC-KEH/AI-ALGORITHMS-FROM-SCRATCH/tree/main/Machine%20Learning" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing Necessary Libraries
&lt;/h3&gt;

&lt;p&gt;We start by importing the necessary libraries. These help us handle data, compute distances, and visualize results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_regression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.colors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ListedColormap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;numpy&lt;/code&gt;: For numerical operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Counter&lt;/code&gt;: For counting occurrences of elements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;train_test_split&lt;/code&gt;: To split data into training and testing sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;make_regression&lt;/code&gt; and &lt;code&gt;make_classification&lt;/code&gt;: To generate synthetic datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;matplotlib&lt;/code&gt;: For plotting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Defining the Euclidean Distance Function
&lt;/h3&gt;

&lt;p&gt;This function calculates the Euclidean distance between two points. It’s essential for determining the nearest neighbors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;euclidean_distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
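
&lt;p&gt;As a quick sanity check (a hypothetical pair of points, not part of the tutorial's dataset), the function we just defined agrees with NumPy's built-in &lt;code&gt;np.linalg.norm&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(euclidean_distance(a, b))  # 5.0 (the classic 3-4-5 triangle)
print(np.linalg.norm(a - b))     # 5.0, NumPy's built-in equivalent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;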



&lt;h3&gt;
  
  
  Implementing the KNN Class
&lt;/h3&gt;

&lt;p&gt;The KNN class encapsulates the algorithm’s logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;__init__&lt;/code&gt; method initializes the KNN class with the number of neighbors &lt;code&gt;k&lt;/code&gt; and a flag &lt;code&gt;isclassifier&lt;/code&gt; to indicate whether the task is classification or regression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KNN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isclassifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isclassifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;isclassifier&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;fit&lt;/code&gt; method stores the training data. There’s no complex training process in KNN—just storing the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prediction
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;predict&lt;/code&gt; method generates predictions for the test data by calling &lt;code&gt;_predict_single&lt;/code&gt; for each test point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;
        &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_predict_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Single Prediction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;_predict_single&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="n"&gt;calculates&lt;/span&gt; &lt;span class="n"&gt;distances&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;finds&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="n"&gt;nearest&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;makes&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="n"&gt;based&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="nf"&gt;task &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;regression&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_predict_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Find distance between x1 and all other points of x_train
&lt;/span&gt;        &lt;span class="n"&gt;distances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;euclidean_distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Sort the distances, and get the index of top k points closest to x1.
&lt;/span&gt;        &lt;span class="n"&gt;k_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;k_nearest_nbrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;k_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isclassifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k_nearest_nbrs&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k_nearest_nbrs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Main Function for Testing
&lt;/h2&gt;

&lt;p&gt;This section tests our KNN implementation with both classification and regression tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification Task
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ListedColormap&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#FF0000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#00FF00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#0000FF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Classification
&lt;/span&gt;    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_redundant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_clusters_per_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isclassifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;On Classification Task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Generation:&lt;/strong&gt; Creates a synthetic dataset for classification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Splitting:&lt;/strong&gt; Splits the data into training and testing sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt; Stores the training data in the KNN classifier object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prediction and Accuracy:&lt;/strong&gt; Predicts the labels for the test set and calculates accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Regression Task
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Regression
&lt;/span&gt;    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;regressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isclassifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;On Regression Task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RMSE:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Generation:&lt;/strong&gt; Creates a synthetic dataset for regression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Splitting:&lt;/strong&gt; Splits the data into training and testing sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt; Trains the KNN regressor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prediction and RMSE:&lt;/strong&gt; Predicts the values for the test set and calculates Root Mean Squared Error (RMSE).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qpel10qiowa4bnzbydc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qpel10qiowa4bnzbydc.png" alt="Output" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our KNN algorithm seems to be performing quite well on both classification and regression tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Misconceptions about KNN
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KNN is always accurate:&lt;/strong&gt; KNN can be effective but is sensitive to noise and irrelevant features. Proper feature selection and preprocessing are essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KNN works well with high-dimensional data:&lt;/strong&gt; In high-dimensional spaces, the concept of distance becomes less meaningful (curse of dimensionality).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KNN is computationally efficient:&lt;/strong&gt; Prediction can be slow for large datasets due to the need to calculate distances to all training points. Techniques like KD-Trees can help (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
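
&lt;p&gt;If you want that KD-Tree speed-up in practice, scikit-learn's &lt;code&gt;KNeighborsClassifier&lt;/code&gt; exposes it through its &lt;code&gt;algorithm&lt;/code&gt; parameter. Below is a minimal sketch (the dataset size and parameters are placeholders, not from this tutorial) that also standardizes the features first, since KNN is distance-based:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so no single feature dominates the distance computation
scaler = StandardScaler().fit(x_train)
x_train, x_test = scaler.transform(x_train), scaler.transform(x_test)

# KD-Tree backed neighbor search speeds up prediction on larger datasets
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(x_train, y_train)
print(knn.score(x_test, y_test))  # mean accuracy on the test set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;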

&lt;h2&gt;
  
  
  When to Apply K-Nearest Neighbors: Key Points to Consider
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Type of Task: Classification or Regression
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification:&lt;/strong&gt; Classifying a new sample based on the majority class of its nearest neighbors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regression:&lt;/strong&gt; Predicting a continuous value based on the average value of its nearest neighbors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Dataset Size and Dimensionality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Small to Medium-Sized Datasets:&lt;/strong&gt; KNN works well with small to medium-sized datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low to Moderate Dimensionality:&lt;/strong&gt; KNN performs best in low to moderate dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Data Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Locally Homogeneous Data:&lt;/strong&gt; KNN assumes that nearby points are similar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Smooth Decision Boundaries:&lt;/strong&gt; Effective when decision boundaries between classes are smooth.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. No Assumption of Data Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-Parametric Nature:&lt;/strong&gt; KNN makes no assumptions about data distribution, making it flexible and model-free.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advantages of KNN
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity:&lt;/strong&gt; Easy to understand and implement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Versatility:&lt;/strong&gt; Suitable for both classification and regression tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Training Phase:&lt;/strong&gt; No complex training process—just storing the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disadvantages of KNN
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computationally Intensive:&lt;/strong&gt; Prediction can be slow for large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sensitivity to Irrelevant Features:&lt;/strong&gt; All features contribute equally, which can be problematic if some features are irrelevant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curse of Dimensionality:&lt;/strong&gt; Performance degrades in high-dimensional spaces.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Applications
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image Recognition:&lt;/strong&gt; KNN can be used for tasks like handwritten digit recognition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommender Systems:&lt;/strong&gt; Helps in collaborative filtering by finding similar users or items.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medical Diagnosis:&lt;/strong&gt; Assists in diagnosing diseases based on historical patient data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this guide has been helpful and encourages you to explore and experiment further with K-Nearest Neighbors (KNN). If you liked this blog, please leave a like and a follow. You can also check out my other blogs on machine learning algorithms; I've been posting them as a series, and I hope you enjoy them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Decision Trees in Machine Learning</title>
      <dc:creator>Arbash Hussain</dc:creator>
      <pubDate>Mon, 30 Mar 2026 00:42:30 +0000</pubDate>
      <link>https://forem.com/arbashhussain/a-step-by-step-guide-to-decision-trees-in-machine-learning-3h8h</link>
      <guid>https://forem.com/arbashhussain/a-step-by-step-guide-to-decision-trees-in-machine-learning-3h8h</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome back, everyone! In this blog post, we will build a Decision Tree model from scratch, explaining every step, and later test the model on the Breast Cancer dataset. By the end, you’ll have a solid understanding of Decision Trees and how to implement them in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Decision Tree?
&lt;/h2&gt;

&lt;p&gt;A Decision Tree is a type of supervised learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on the value of input features, making decisions at each node until reaching a final prediction at the leaf nodes. Let's understand this with the help of a hypothetical scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjg085c5k6se8ixe4r5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjg085c5k6se8ixe4r5r.png" alt="Decision Tree" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates the decision-making flow of a decision tree, using the following labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rain = Yes&lt;/li&gt;
&lt;li&gt;No Rain = No&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Splitting Happens
&lt;/h2&gt;

&lt;p&gt;The way a Decision Tree decides how to split the data involves different techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gini Impurity (for Classification):

&lt;ul&gt;
&lt;li&gt;Think of Gini Impurity as a measure of how mixed up the labels are in a group. If you randomly pick an item from a group, Gini Impurity tells you the chance of it being mislabeled. Lower Gini Impurity means the group is more pure, with mostly the same labels.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Information Gain (for Classification):

&lt;ul&gt;
&lt;li&gt;Information Gain is like tidying up messy information. It uses entropy, which is a measure of chaos or randomness. By splitting the data based on a feature, we aim to make the subsets more organized and less random. Higher Information Gain means the data becomes more ordered after the split. We'll use this technique for our implementation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mean Squared Error (for Regression):

&lt;ul&gt;
&lt;li&gt;Imagine you're trying to predict someone's weight. Mean Squared Error measures how far off your predictions are from the actual weights, squared (to make all differences positive). Lower MSE means your predictions are closer to the truth, minimizing the overall error.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mean Absolute Error (for Regression):

&lt;ul&gt;
&lt;li&gt;Mean Absolute Error is similar to MSE, but instead of squaring the differences, we just take their absolute values. This gives us a measure of how much, on average, our predictions differ from the actual values. Lower MAE means our predictions are more accurate, with smaller errors on average.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These techniques help the Decision Tree decide the best way to split the data at each step, ensuring that the final tree is as accurate and efficient as possible.&lt;/p&gt;
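
&lt;p&gt;Before we implement this, here is a small, self-contained sketch (with toy labels, purely illustrative) of how entropy and information gain are computed for one candidate split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def entropy(y):
    # H(y) = -sum(p * log2(p)) over the class probabilities p
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left = np.array([1, 1, 1, 0])    # one side of a candidate split
right = np.array([1, 0, 0, 0])   # the other side

# Information gain = parent entropy - weighted average child entropy
n = len(parent)
weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
print(entropy(parent) - weighted)  # ~0.19, so this split reduces impurity a little
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;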

&lt;h2&gt;
  
  
  Key Concepts
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nodes and Leaves:&lt;/strong&gt; Each decision point in the tree is called a node, and the final output points are called leaves.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Splitting:&lt;/strong&gt; Dividing a node into two or more sub-nodes based on certain criteria.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entropy and Information Gain:&lt;/strong&gt; Metrics used to decide the best split. Entropy measures the randomness/impurity in a dataset, and information gain calculates the reduction in entropy after the dataset is split on an attribute, so the higher the information gain, the better the split.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stopping Criteria:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum depth: the maximum number of levels the decision tree can grow.&lt;/li&gt;
&lt;li&gt;Minimum number of samples a node must have; if a node has fewer samples, it is not split.&lt;/li&gt;
&lt;li&gt;Minimum entropy change required for a split to take place.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step-by-Step Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Import Necessary Libraries
&lt;/h3&gt;

&lt;p&gt;We’ll use NumPy for numerical operations, &lt;code&gt;Counter&lt;/code&gt; from the collections module for counting labels, and some utilities from Scikit-learn for loading datasets and splitting data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Define the Node Class
&lt;/h3&gt;

&lt;p&gt;The Node class represents each node in the tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;    &lt;span class="c1"&gt;# value of node is Yes/No only incase of Leaf, else None
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_leaf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Returns true if the node is leaf else false
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node stores information about the feature and threshold for splitting, pointers to left and right child nodes, and the value if it is a leaf node.&lt;/p&gt;
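
&lt;p&gt;For example (hypothetical values, just to show the mechanics), a leaf node and an internal node would be created like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A leaf node carries only a prediction value
leaf = Node(value="Yes")
print(leaf.is_leaf())   # True

# An internal node stores a feature index, a threshold, and two children
inner = Node(feature=0, threshold=0.5, left=Node(value="No"), right=leaf)
print(inner.is_leaf())  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;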

&lt;h3&gt;
  
  
  Step 3: Define the Decision Tree Class
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Decision_Tree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_sample_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;                 &lt;span class="c1"&gt;# Stopping Criteria
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_sample_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min_sample_split&lt;/span&gt;   &lt;span class="c1"&gt;# Stopping Criteria
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;                   &lt;span class="c1"&gt;# Criteria type, entropy in our case.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;           &lt;span class="c1"&gt;# No of features we'll be using for constructing the tree.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Training the Tree&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;fit&lt;/code&gt; method trains the tree on the provided dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# So that the no of features in the tree do not exceed the actual no of features we have in data. 
&lt;/span&gt;        &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# No of samples.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;construct_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X_train.shape[0] gives us the number of samples (rows).&lt;/li&gt;
&lt;li&gt;X_train.shape[1] gives us the number of features (columns).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Constructing the Tree&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;construct_tree&lt;/code&gt; method recursively builds the tree by splitting the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;construct_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# No of labels there are in a specific feature,
&lt;/span&gt;        &lt;span class="c1"&gt;# if 1 then no need to split. 
&lt;/span&gt;        &lt;span class="c1"&gt;# For eg. in case of Wind, we can go to 2 labels (strong and weak) so we split.
&lt;/span&gt;        &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  

        &lt;span class="c1"&gt;# Check the stopping criteria 
&lt;/span&gt;        &lt;span class="c1"&gt;# if met create a leaf node, based on label with max frequence
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_sample_split&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;leaf_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;leaf_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Find the best split
&lt;/span&gt;        &lt;span class="n"&gt;feat_indexs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;best_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_feature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;best_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feat_indexs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create child nodes (Recursively Create Tree)
&lt;/span&gt;        &lt;span class="n"&gt;left_indxs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_indxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;best_feature&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;best_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;construct_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left_indxs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left_indxs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;construct_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right_indxs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right_indxs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;most_common_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Here we first check whether the stopping criteria are met.&lt;/li&gt;
&lt;li&gt;If the stopping criteria are met, we create a leaf node.&lt;/li&gt;
&lt;li&gt;If not, we find the best split.&lt;/li&gt;
&lt;li&gt;We then create the left and right children based on the best split.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Splitting the Data&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;split&lt;/code&gt; method divides the data based on the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Left Split, val &amp;lt;= threshold
&lt;/span&gt;        &lt;span class="n"&gt;left_idxs&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argwhere&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Right Split, val&amp;gt; threshold
&lt;/span&gt;        &lt;span class="n"&gt;right_idxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argwhere&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# np.argwhere returns the indices in a list of lists, so we flatten the result.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;left_idxs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_idxs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;left_idxs&lt;/code&gt; are the indices of &lt;code&gt;X_col&lt;/code&gt; whose values are less than or equal to the threshold.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;right_idxs&lt;/code&gt; are the indices of &lt;code&gt;X_col&lt;/code&gt; whose values are greater than the threshold.&lt;/li&gt;
&lt;/ul&gt;
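&lt;p&gt;As a quick sanity check, here's a minimal standalone sketch of the same splitting logic on a toy column (the values and threshold are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

X_col = np.array([2.0, 7.5, 3.1, 9.4, 5.0])
threshold = 5.0

left_idxs = np.argwhere(X_col &amp;lt;= threshold).flatten()   # indices with value &amp;lt;= threshold
right_idxs = np.argwhere(X_col &amp;gt; threshold).flatten()    # indices with value &amp;gt; threshold

print(left_idxs)   # [0 2 4]
print(right_idxs)  # [1 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;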

&lt;p&gt;&lt;strong&gt;Finding the Best Split&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;best_split&lt;/code&gt; method iterates over all features and thresholds to find the best split based on information gain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;best_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;feat_indexs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;best_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;split_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;split_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;feat_index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;feat_indexs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;X_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;feat_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;       &lt;span class="c1"&gt;# Values of feature X_col.
&lt;/span&gt;            &lt;span class="n"&gt;thresholds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Selecting all the unique values of X_col as thresholds.
&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;thr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# For each threshold in thresholds.
&lt;/span&gt;                &lt;span class="c1"&gt;# Find the threshold with maximum Information Gain
&lt;/span&gt;                &lt;span class="n"&gt;gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_gain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;thr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gain&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_gain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;best_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gain&lt;/span&gt;
                    &lt;span class="n"&gt;split_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feat_index&lt;/span&gt;
                    &lt;span class="n"&gt;split_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thr&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;split_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split_index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Initially, &lt;code&gt;best_gain&lt;/code&gt;, &lt;code&gt;split_index&lt;/code&gt;, and &lt;code&gt;split_threshold&lt;/code&gt; are set to -1, None, and None respectively.&lt;/li&gt;
&lt;li&gt;For each feature column &lt;code&gt;X_col&lt;/code&gt;, we collect its unique values as candidate thresholds and keep whichever one yields the highest information gain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Calculating Entropy and Information Gain&lt;/strong&gt;&lt;br&gt;
In this blog we use information gain to decide how to split the nodes of the decision tree. Information gain is calculated using the formula:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Information Gain = Entropy(Parent) - Weighted Average of Entropy(Children)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Entropy(Parent)&lt;/code&gt; is the entropy of the parent node.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Weighted Average&lt;/code&gt; is the average of the children's entropies, each weighted by the fraction of samples that falls into that child.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Entropy(Children)&lt;/code&gt; is the entropy of each child node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Entropy is calculated using the formula:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Entropy = - ∑ [ p(x) * log(p(x)) ]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;p(x)&lt;/code&gt; is the probability of occurrence of class x, i.e. the number of times class x occurs divided by the total number of samples.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;p(x) = count(x) / n&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bincount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# returns a frequency list of elements, from 0 to max(y).
&lt;/span&gt;        &lt;span class="n"&gt;ps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# [p(x1), p(x2), p(x3),..., p(xN)]
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Only consider non-zero probabilities
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_gain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;entropy_parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create children
&lt;/span&gt;        &lt;span class="n"&gt;left_idxs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_idxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_idxs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right_idxs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="c1"&gt;# Calculate the weighted average entropy of the children.
&lt;/span&gt;        &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# No of samples in left and right.
&lt;/span&gt;        &lt;span class="n"&gt;n_l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_idxs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right_idxs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="c1"&gt;# Left entropy and right entropy.
&lt;/span&gt;        &lt;span class="n"&gt;e_l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left_idxs&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right_idxs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
        &lt;span class="c1"&gt;# No of samples in left/total samples times left entropy + No of samples in right/total samples times right entropy.
&lt;/span&gt;        &lt;span class="n"&gt;weighted_avg_entropy_children&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_l&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;e_l&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_r&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;e_r&lt;/span&gt;

        &lt;span class="c1"&gt;# Calculate Information Gain
&lt;/span&gt;        &lt;span class="n"&gt;info_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entropy_parent&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;weighted_avg_entropy_children&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;info_gain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
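&lt;p&gt;To make the numbers concrete, here's a minimal standalone sketch (toy labels, natural log as in the implementation above) that reproduces the entropy and information gain calculation outside the class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def entropy(y):
    ps = np.bincount(y) / len(y)                        # class probabilities
    return -np.sum([p * np.log(p) for p in ps if p &amp;gt; 0])

y       = np.array([0, 0, 1, 1])                        # parent labels
y_left  = np.array([0, 0])                              # labels going left of the threshold
y_right = np.array([1, 1])                              # labels going right

parent = entropy(y)                                     # ln(2) ≈ 0.693
children = ((len(y_left) / len(y)) * entropy(y_left)
            + (len(y_right) / len(y)) * entropy(y_right))  # 0.0 here, both children are pure
print('Information gain:', parent - children)           # ≈ 0.693
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;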


&lt;p&gt;&lt;strong&gt;Making Predictions&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;predict&lt;/code&gt; method traverses the tree to make predictions on new data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;traverse_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;traverse_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_leaf&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="c1"&gt;# Recursively travel
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;traverse_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# val &amp;lt;= threshold, Goto Left child.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;traverse_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# val &amp;gt; threshold, Goto Right child.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We pass the data &lt;code&gt;X&lt;/code&gt; to the &lt;code&gt;predict&lt;/code&gt; function, as illustrated in the sketch below.&lt;/li&gt;
&lt;li&gt;Each sample travels down the tree, comparing its feature value against each node's threshold, until it reaches a leaf, whose stored value becomes the prediction.&lt;/li&gt;
&lt;/ul&gt;
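&lt;p&gt;The recursion is easiest to see on a tiny hand-built tree. This is just an illustration using plain dictionaries rather than the &lt;code&gt;Node&lt;/code&gt; class above, to show the control flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A two-level toy tree: split on feature 0 at threshold 5.0,
# leaves hold the class labels directly.
tree = {'feature': 0, 'threshold': 5.0,
        'left':  {'value': 'A'},          # feature 0 &amp;lt;= 5.0
        'right': {'value': 'B'}}          # feature 0 &amp;gt;  5.0

def traverse(x, node):
    if 'value' in node:                   # leaf: return the stored prediction
        return node['value']
    if x[node['feature']] &amp;lt;= node['threshold']:
        return traverse(x, node['left'])
    return traverse(x, node['right'])

print(traverse([3.2], tree))  # 'A'
print(traverse([7.9], tree))  # 'B'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;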

&lt;h2&gt;
  
  
  Step 4: Testing the Model
&lt;/h2&gt;

&lt;p&gt;Finally, we test our Decision Tree model on the breast cancer dataset. Since this is a binary classification task with 0/1 labels, the &lt;code&gt;mse&lt;/code&gt; printed below is simply the misclassification rate, i.e. &lt;code&gt;1 - accuracy&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Decision_Tree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_breast_cancer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
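&lt;p&gt;Since the task is classification, we can also report the &lt;code&gt;accuracy&lt;/code&gt; helper defined above (this extra line is a suggestion, not part of the original script):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inside the __main__ block, after computing preds:
print('Accuracy:', accuracy(y_test, preds))   # fraction of correctly classified samples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;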



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gllpi4szit5rqieb97k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gllpi4szit5rqieb97k.png" alt="Output" width="800" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Points to Note: Common Misconceptions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Misconception 1:&lt;/strong&gt; Decision Trees Are Always Prone to Overfitting&lt;br&gt;
While Decision Trees can be prone to overfitting, especially with deep trees that capture noise in the training data, this is not always the case. Proper pruning techniques and parameter tuning can significantly mitigate overfitting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; The misconception arises because Decision Trees are flexible models that can adapt closely to the training data. However, by setting constraints like maximum depth and minimum samples per leaf, or by using ensemble methods like Random Forests, we can control overfitting effectively.&lt;/p&gt;
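&lt;p&gt;For instance, with scikit-learn (already used above for the dataset and the split), capping the depth and leaf size is a one-line change; the exact numbers here are illustrative, not tuned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Unconstrained tree vs. one with depth/leaf-size constraints to curb overfitting.
for clf in (DecisionTreeClassifier(random_state=42),
            DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)):
    clf.fit(X_train, y_train)
    print(clf.get_depth(), clf.score(X_test, y_test))   # depth and test accuracy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;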

&lt;p&gt;&lt;strong&gt;Misconception 2:&lt;/strong&gt; Decision Trees Are Always Better with More Features&lt;br&gt;
Adding more features to a Decision Tree does not always improve its performance. Irrelevant or redundant features can confuse the model and lead to poorer performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; Including too many features, especially those that do not contribute meaningful information, can lead to a more complex tree with unnecessary splits. Feature selection techniques or regularization methods can help identify the most relevant features for building a robust model.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Apply Decision Trees: Key Points to Consider
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Non-linearity:&lt;/strong&gt; Decision Trees do not require the relationship between input and output variables to be linear. They handle non-linear relationships and interactions between features well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling of Mixed Data Types:&lt;/strong&gt; Decision Trees can handle both numerical and categorical data, making them versatile for different types of datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Missing Values:&lt;/strong&gt; Some decision tree algorithms (for example C4.5, or CART with surrogate splits) can handle missing values natively during the splitting process, though many implementations, including the simple one in this post, still require imputation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robustness to Outliers:&lt;/strong&gt; Decision Trees are relatively robust to outliers, as splits are based on feature thresholds that can separate outliers from the majority of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small to Medium-Sized Data:&lt;/strong&gt; Decision Trees work well with small to medium-sized datasets. However, for very large datasets, ensemble methods like Random Forests or Gradient Boosting Trees might be more efficient, which we'll discuss soon.&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Linear Regression in Machine Learning</title>
      <dc:creator>Arbash Hussain</dc:creator>
      <pubDate>Sun, 29 Mar 2026 01:04:06 +0000</pubDate>
      <link>https://forem.com/arbashhussain/a-step-by-step-guide-to-linear-regression-in-machine-learning-54lp</link>
      <guid>https://forem.com/arbashhussain/a-step-by-step-guide-to-linear-regression-in-machine-learning-54lp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;In the vast landscape of machine learning, understanding the basics is crucial, and linear regression is an excellent starting point. In this blog post, we'll learn about linear regression by breaking down the concepts step-by-step. We won't stop at theory; we'll also code linear regression from scratch, enabling you to understand it in depth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Understanding the Basics
&lt;/h3&gt;

&lt;p&gt;At its core, linear regression involves predicting an outcome based on one or more input variables. Imagine trying to predict the score a student might achieve based on the number of hours they study – that's where linear regression comes in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: The Equation
&lt;/h3&gt;

&lt;p&gt;Let's start with the equation of a straight line.&lt;/p&gt;

&lt;p&gt;y = mx + c&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Here m is the slope/gradient of the line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;x is the input value (the data point's x-coordinate)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;c is the y-intercept (where the line crosses the y-axis)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how it translates into our student example:&lt;/p&gt;

&lt;p&gt;Score = Study Hours * Study Efficiency + Baseline Score&lt;/p&gt;

&lt;p&gt;Here, the student's score &lt;code&gt;Score&lt;/code&gt; plays the role of y. &lt;code&gt;Study Hours&lt;/code&gt; is the input x, &lt;code&gt;Study Efficiency&lt;/code&gt; is the slope m, telling us how much the score changes for each additional hour of study, and &lt;code&gt;Baseline Score&lt;/code&gt; is the y-intercept c, the score achieved with zero study hours.&lt;/p&gt;

&lt;p&gt;So, in essence, the equation is a tool that helps us predict an outcome based on the relationship between variables. It's the foundation of our journey into understanding and utilizing linear regression in the world of machine learning.&lt;/p&gt;
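&lt;p&gt;A quick numeric sketch of that prediction (the efficiency and baseline values are invented purely for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;study_efficiency = 4.5   # slope m: points gained per hour of study (made up)
baseline_score = 35.0    # intercept c: score with zero study hours (made up)

for study_hours in (0, 4, 10):
    score = study_efficiency * study_hours + baseline_score
    print(study_hours, 'hours -&amp;gt;', score)  # 35.0, 53.0, 80.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;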

&lt;h3&gt;
  
  
  Step 3: Training the Model
&lt;/h3&gt;

&lt;p&gt;Training the model involves finding the optimal values for &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;. The key method we employ is the "least squares" approach, which minimizes the difference between our predicted and actual values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least Squares: Getting to the Core&lt;/strong&gt;&lt;br&gt;
Least squares is pretty straightforward. It's a method that aims to minimize the sum of the squared differences between our predicted values and the actual values. Imagine adjusting our parameters &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; so that our predicted line fits snugly through our data points, minimizing the gaps between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction to Gradient Descent&lt;/strong&gt;&lt;br&gt;
In our quest for optimal values, we introduce another concept called "Gradient Descent." This is a technique that helps us iteratively adjust our parameters to reduce the difference between our predictions and the real values. Think of it as a step-by-step process, gradually refining our predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what exactly is Gradient Descent?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In simple terms, it's like finding the best path down a hill. We're trying to adjust &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; in the direction that minimizes the difference between our predictions and the actual outcomes. It's a practical approach to fine-tuning our model.&lt;/p&gt;
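&lt;p&gt;As a taste of what's coming, here's a toy sketch of gradient descent for &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; under the mean squared error loss (the data and learning rate are made up; the constant factor of 2 from differentiating the square is folded into the learning rate, as in the code later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])   # generated by y = 2x + 1

m, c, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    y_pred = m * x + c
    dm = (1 / len(x)) * np.dot(x, y_pred - y)   # gradient w.r.t. the slope
    dc = (1 / len(x)) * np.sum(y_pred - y)      # gradient w.r.t. the intercept
    m, c = m - lr * dm, c - lr * dc

print(m, c)  # converges towards 2 and 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;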

&lt;p&gt;The world of Gradient Descent is vast, and we'll explore its details in a future blog post. This method plays a crucial role in optimizing our model, and we'll dive deeper into its mechanics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For those curious about Gradient Descent right away, you can check out this amazing &lt;a href="https://www.ruder.io/optimizing-gradient-descent/" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/strong&gt; Otherwise, stay tuned as we continue our journey through the basics of linear regression.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Evaluation
&lt;/h3&gt;

&lt;p&gt;Having trained our model, it's time to assess its performance. The metric we'll employ for this task is the Mean Squared Error (MSE), a reliable measure that quantifies the average squared difference between our predicted values and the actual outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why MSE?&lt;/strong&gt; While there are various methods to measure error, MSE is particularly favored for its ability to penalize larger errors more significantly. This makes it a suitable choice when we want to prioritize minimizing the impact of substantial prediction deviations.&lt;/p&gt;
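&lt;p&gt;Concretely, MSE is computed as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MSE = (1/n) * ∑ (y_pred - y_actual)²&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;so an MSE of, say, 100 means the predictions are off by roughly 10 units on average (the square root of the MSE).&lt;/p&gt;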
&lt;h3&gt;
  
  
  Step 5: Visualizing the Model
&lt;/h3&gt;

&lt;p&gt;To gain insights, we'll create a scatter plot with our regression line to visualize the relationship between the variables.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Real-world Applications
&lt;/h3&gt;

&lt;p&gt;Linear regression finds applications in predicting housing prices, stock values, and much more. Its simplicity makes it a powerful tool for understanding and predicting real-world phenomena.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Coding Linear Regression from Scratch
&lt;/h3&gt;

&lt;p&gt;Now, let's transition from theory to practice. We'll code a simple linear regression model in Python, and then evaluate its performance on unseen data.&lt;br&gt;
Code can be found on &lt;a href="https://github.com/CC-KEH/AI-ALGORITHMS-FROM-SCRATCH/tree/main/Machine%20Learning" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7.1 Importing Libraries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;numpy&lt;/code&gt; for matrix operations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;datasets&lt;/code&gt; from &lt;code&gt;sklearn&lt;/code&gt; for generating a regression dataset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;train_test_split&lt;/code&gt; from sklearn for splitting the dataset into training and testing sets.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;matplotlib.pyplot&lt;/code&gt; for data visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 7.2 Define Linear Regression Class:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Linear_Regression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_iters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we define a class &lt;code&gt;Linear_Regression&lt;/code&gt; to encapsulate linear regression functionality. Its initialization method sets a default learning rate (&lt;code&gt;lr&lt;/code&gt;) and number of iterations (&lt;code&gt;n_iters&lt;/code&gt;). &lt;code&gt;weights&lt;/code&gt; is initialized to &lt;code&gt;None&lt;/code&gt; because each feature gets its own weight, i.e. the number of weights equals the number of features, which we only know once &lt;code&gt;fit&lt;/code&gt; sees the data. There is only one bias term.&lt;/p&gt;

&lt;p&gt;So, if we have n features the equation of our line will be like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;y = θ0 + θ1*X1 + θ2*X2 + ... + θn*Xn&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7.3 Fit Method:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            Here,
            fs are features
                 f1  f2   f3  f4  f5  
            X = [x11,x12,x13,x14,x15]  weights = [w1]  bias = bias
                [x21,x22,x23,x24,x25]            [w2]
                [x31,x32,x33,x34,x35]            [w3]
                [x41,x42,x43,x44,x45]            [w4]
                [x51,x52,x53,x54,x55]            [w5]
            &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
        &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;

        &lt;span class="n"&gt;dw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dw&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fit&lt;/code&gt; method trains the linear regression model using gradient descent. It initializes the weights and bias to zero, then iteratively updates them in the direction that reduces the error. Note that &lt;code&gt;dw&lt;/code&gt; and &lt;code&gt;db&lt;/code&gt; drop the constant factor of 2 that comes from differentiating the squared error; it is simply absorbed into the learning rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7.4 Predict Method&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;predicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7.5 Main Execution Block&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Data Generation and Splitting
&lt;/span&gt;    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Model Initialization and Training
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Linear_Regression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Model Prediction and Evaluation
&lt;/span&gt;    &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean Squared Error:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;mse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Visualization
&lt;/span&gt;    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;viridis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Training Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Test Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Best Fit Line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Here we generate synthetic regression data and split it into training and testing sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initialize the linear regression model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Train the model on the training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make predictions on the test data and evaluate the model using Mean Squared Error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, visualize the training and testing data along with the regression line.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;7.6 Best Fit Line:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfu4knk4kiakecd7hrwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfu4knk4kiakecd7hrwe.png" alt="Best fit line" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Points to Note: Common Misconceptions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Misconception 1: Linear Regression Assumes Linearity in All Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While linear regression assumes a linear relationship between variables, it doesn't mean that the variables themselves must be linear. Transformations can be applied to make the relationship linear, even if the original variables are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; This misconception often arises from the name "linear regression." People may assume that the technique is only applicable when relationships are strictly linear. In reality, it's about the linearity in the coefficients, not necessarily the raw variables.&lt;/p&gt;
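&lt;p&gt;A sketch of one such transformation: if y grows exponentially with x, fitting a line to &lt;code&gt;log(y)&lt;/code&gt; recovers a linear relationship (the data here is synthetic, made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0, 0.05, size=x.size)  # exponential growth + noise

# y = a * exp(b * x)  =&amp;gt;  log(y) = log(a) + b * x, which is linear in x.
b, log_a = np.polyfit(x, np.log(y), 1)   # least-squares fit of a degree-1 polynomial
print(b, np.exp(log_a))                  # close to 0.5 and 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;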

&lt;p&gt;&lt;strong&gt;Misconception 2: Outliers Always Negatively Affect Linear Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While outliers can influence linear regression models, they don't always have a negative impact. Sometimes outliers contain valuable information or highlight specific patterns in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; Outliers may disproportionately affect the model if they have a substantial impact on the overall pattern. However, not all outliers are detrimental; they can represent unique scenarios or anomalies that are important to capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Apply Linear Regression: Key Points to Consider
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linearity:&lt;/strong&gt; Linear regression is most effective when there is a linear relationship between the input and output variables. Visualizing the data through scatter plots can help identify linearity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Homoscedasticity:&lt;/strong&gt; The variance of the errors should be consistent across all levels of the independent variable. If the spread of errors widens or narrows systematically, it indicates heteroscedasticity, which may violate linear regression assumptions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independence:&lt;/strong&gt; Observations should be independent of each other. For example, in time-series data, consecutive observations may be correlated, violating the independence assumption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normality of Residuals:&lt;/strong&gt; The residuals (the differences between actual and predicted values) should be approximately normally distributed. This assumption is important for statistical inference.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
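&lt;p&gt;Points 2 and 4 above can be eyeballed with residual plots; here is a minimal sketch, assuming the fitted &lt;code&gt;model&lt;/code&gt; and the &lt;code&gt;X_test&lt;/code&gt;/&lt;code&gt;y_test&lt;/code&gt; split from the code earlier in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib.pyplot as plt

preds = model.predict(X_test)
residuals = y_test - preds

# Homoscedasticity check: residuals should scatter evenly around zero,
# with no funnel shape as the predicted value grows.
plt.scatter(preds, residuals, s=10)
plt.axhline(0, color='black', linewidth=1)
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()

# Normality check: the histogram should look roughly bell-shaped.
plt.hist(residuals, bins=20)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;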




</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>fromscratch</category>
    </item>
  </channel>
</rss>
