<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Berk Hakbilen</title>
    <description>The latest articles on Forem by Berk Hakbilen (@berk_hakbilen).</description>
    <link>https://forem.com/berk_hakbilen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F518444%2Faadafa89-9a2c-4b99-aa79-e2b23a184641.jpg</url>
      <title>Forem: Berk Hakbilen</title>
      <link>https://forem.com/berk_hakbilen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/berk_hakbilen"/>
    <language>en</language>
    <item>
      <title>Dealing with Categorical Data: Encoding Features for ML Algorithms</title>
      <dc:creator>Berk Hakbilen</dc:creator>
      <pubDate>Sun, 13 Nov 2022 18:34:17 +0000</pubDate>
      <link>https://forem.com/berk_hakbilen/dealing-with-categorical-data-encoding-features-for-ml-agorithms-43pp</link>
      <guid>https://forem.com/berk_hakbilen/dealing-with-categorical-data-encoding-features-for-ml-agorithms-43pp</guid>
      <description>&lt;p&gt;In real life, the raw data recieved is rarely in the format which we can take and use direclty for our machine learning models. Therefore, some preprocessing is necessary to bring the data to the right format, select informative data or reduce its dimension to be able to extract the most out of the data. In this post, we will talk about encoding to be able to use categorical data as features for our ML models, types of encoding and when they are suitable. Other preprocessing techniques for Features scaling, features selection and dimension reduction are topics for another post… Let’s get started with encoding!&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoding Categorical Data
&lt;/h2&gt;

&lt;p&gt;Numerical data, as the name suggests, has features with only numbers (integers or floating-point). On the other hand, categorical data has variables that contain label values (text) and not numerical values. Machine learning models can only accept numerical input variables. What happens if we have a dataset with categorical data instead of numerical data?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--20ReofA6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ogtv0dbfgqjuzh7lonu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--20ReofA6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ogtv0dbfgqjuzh7lonu.png" alt="Image description" width="484" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we have to convert the categorical variables in the data to numbers before we can train an ML model. This is called encoding. The two most popular encoding techniques are Ordinal Encoding and One-Hot Encoding.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ordinal Encoding: This technique is used to encode categorical variables which have a natural rank ordering. For example, good, very good and excellent could be encoded as 1, 2, 3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SzLx9vPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ryzqa0nmeat1hcx7ixkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SzLx9vPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ryzqa0nmeat1hcx7ixkd.png" alt="Image description" width="544" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-Hot Encoding: This technique is used to encode categorical variables which do not have a natural rank ordering. For example, male and female do not have any ordering between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2tq7mG6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yhpkajrub85wl1wj0c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2tq7mG6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yhpkajrub85wl1wj0c8.png" alt="Image description" width="484" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ordinal Encoding
&lt;/h2&gt;

&lt;p&gt;In this technique, each category is assigned an integer value, e.g. Miami is 1, Sydney is 2 and New York is 3. However, it is important to realise that this introduces an ordinality to the data which ML models will try to use when looking for relationships in the data. Therefore, using this technique where no ordinal relationship (ranking between the categories) exists is not good practice. As you may have realised already, the city example we just used is actually not a good idea, because Miami, Sydney and New York do not have any ranking relationship between them. In this case, a One-Hot encoder would be a better option, which we will see in the next section. Let’s create a better example for ordinal encoding.&lt;/p&gt;

&lt;p&gt;The ordinal encoding transformation is available in the scikit-learn library. So let’s use the OrdinalEncoder class to build a small example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# example of an ordinal encoding
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# define data
data = np.asarray([['good'], ['very good'], ['excellent']])
df = pd.DataFrame(data, columns=["Rating"],  index=["Rater 1", "Rater 2", "Rater 3"])
print("Data before encoding:")
print(df)

# define ordinal encoding
encoder = OrdinalEncoder()

# transform data
df["Encoded Rating"] = encoder.fit_transform(df)
print("\nData after encoding:")
print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data before encoding:
            Rating
Rater 1       good
Rater 2  very good
Rater 3  excellent

Data after encoding:
            Rating  Encoded Rating
Rater 1       good             1.0
Rater 2  very good             2.0
Rater 3  excellent             0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the encoder assigned the integer values according to alphabetical order, which is the default behaviour for text variables. Although we often do not need to explicitly define the order of the categories, as ML algorithms will be able to extract the relationship anyway, for the sake of this example we can define an explicit order using the categories parameter of the OrdinalEncoder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# define data
data = np.asarray([['good'], ['very good'], ['excellent']])
df = pd.DataFrame(data, columns=["Rating"], index=["Rater 1", "Rater 2", "Rater 3"])
print("Data before encoding:")
print(df)

# define ordinal encoding
categories = [['good', 'very good', 'excellent']]
encoder = OrdinalEncoder(categories=categories)

# transform data
df["Encoded Rating"] = encoder.fit_transform(df)
print("\nData after encoding:")
print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data before encoding:
            Rating
Rater 1       good
Rater 2  very good
Rater 3  excellent

Data after encoding:
            Rating  Encoded Rating
Rater 1       good             0.0
Rater 2  very good             1.0
Rater 3  excellent             2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Label Encoding
&lt;/h2&gt;

&lt;p&gt;The LabelEncoder class from scikit-learn is used to encode the target labels in the dataset. It does essentially the same thing as OrdinalEncoder, but expects only a one-dimensional input, which comes in very handy when encoding the target column.&lt;/p&gt;
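
&lt;p&gt;As a quick illustration, here is a minimal sketch of encoding a hypothetical, made-up target column with LabelEncoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder

# hypothetical one-dimensional target labels
y = ["spam", "ham", "ham", "spam", "ham"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(encoder.classes_)  # unique labels, sorted alphabetically: ['ham' 'spam']
print(y_encoded)         # integer codes: [1 0 0 1 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;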

&lt;h2&gt;
  
  
  One-Hot Encoding
&lt;/h2&gt;

&lt;p&gt;As we mentioned previously, for categorical data where there is no ordinal relationship, ordinal encoding is not a suitable technique, because it makes the model look for a natural order within the categorical data which does not actually exist, and this can worsen model performance.&lt;/p&gt;

&lt;p&gt;This is where One-Hot encoding comes into play. This technique works by creating a new column for each unique category in the data and representing the presence of that category using a binary value (0 or 1). Looking at the previous example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C6jToYww--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/saxh0uxbwuh015g831qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C6jToYww--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/saxh0uxbwuh015g831qu.png" alt="Image description" width="248" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simple table transforms into the following table, where we have a new column representing each unique category (male and female) and a binary value to mark whether it applies to that sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9pLPYw4j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n61nxsgd65rmd2vcux5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9pLPYw4j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n61nxsgd65rmd2vcux5m.png" alt="Image description" width="344" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like the OrdinalEncoder class, the scikit-learn library also provides us with the OneHotEncoder class, which we can use to encode categorical data. Let’s use it to encode a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# define data
data = np.asarray([['Miami'], ['Sydney'], ['New York']])
df = pd.DataFrame(data, columns=["City"], index=["Alex", "Joe", "Alice"])
print("Data before encoding:")
print(df)

# define onehot encoding (sparse=False returns a dense array instead of a sparse matrix)
encoder = OneHotEncoder(categories='auto', sparse=False)

# transform data
encoded_data = encoder.fit_transform(df)

# fit_transform returns an array, so we convert it back to a DataFrame
df_encoded = pd.DataFrame(encoded_data, columns=encoder.categories_, index= df.index)
print("\nData after encoding:")
print(df_encoded)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data before encoding:
           City
Alex      Miami
Joe      Sydney
Alice  New York

Data after encoding:
      Miami New York Sydney
Alex    1.0      0.0    0.0
Joe     0.0      0.0    1.0
Alice   0.0      1.0    0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the encoder generated a new column for each unique category and assigned 1 if it applies to that specific sample and 0 if it does not. This is a powerful method to encode non-ordinal categorical data. However, it also has its drawbacks… As you can imagine, for datasets with many unique categories, one-hot encoding results in a huge dataset, because each category has to be represented by a new column. For example, if we had a column/feature with 10,000 unique categories (high cardinality), one-hot encoding would result in 10,000 additional columns, producing a very sparse matrix and a huge increase in memory consumption and computational cost (related to the curse of dimensionality). For dealing with categorical features with high cardinality, we can use target encoding…&lt;/p&gt;
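
&lt;p&gt;A minimal sketch to illustrate this growth in dimensionality, using pandas get_dummies with made-up city names purely for the size comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# a made-up feature with 1000 unique categories (high cardinality)
df = pd.DataFrame({"city": ["city_" + str(i) for i in range(1000)]})
print(df.shape)  # (1000, 1)

# one-hot encoding turns the single column into 1000 binary columns
df_onehot = pd.get_dummies(df, columns=["city"])
print(df_onehot.shape)  # (1000, 1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;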

&lt;h2&gt;
  
  
  Target Encoding
&lt;/h2&gt;

&lt;p&gt;Target encoding, also called mean encoding, is a technique where the occurrences of a categorical variable are taken into account along with the target variable to encode the categories into numerical values. Basically, we replace each category with the mean of the target variable for that category. We can explain it better using a simple example dataset…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j0EtC1Kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyesivedw486u3kouuf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j0EtC1Kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyesivedw486u3kouuf5.png" alt="Image description" width="342" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Group the table by each category to calculate its probability for target = 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tm7MqYeN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq0pn7i2vysf3sf9kpan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tm7MqYeN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq0pn7i2vysf3sf9kpan.png" alt="Image description" width="828" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we take these probabilities that we calculated for target = 1 and use them to encode the corresponding categories in the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OnpP5faM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y26fgag14umfdiqwptx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OnpP5faM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y26fgag14umfdiqwptx8.png" alt="Image description" width="534" height="494"&gt;&lt;/a&gt;&lt;/p&gt;
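
&lt;p&gt;The per-category means above can be reproduced with a simple pandas groupby. Here is a minimal sketch using the same fruit/target data as in the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

fruit = ["Apple", "Banana", "Banana", "Tomato", "Apple", "Tomato", "Apple", "Banana", "Tomato", "Tomato"]
target = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
df = pd.DataFrame({"Fruit": fruit, "Target": target})

# mean of the target per category = probability of target being 1 for that category
print(df.groupby("Fruit")["Target"].mean())
# Apple 0.666667, Banana 0.333333, Tomato 0.250000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;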

&lt;p&gt;Similar to ordinal encoding and one-hot encoding, we can use the TargetEncoder class, but this time we import it from the category_encoders library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from category_encoders import TargetEncoder

# define data
fruit = ["Apple", "Banana", "Banana", "Tomato", "Apple", "Tomato", "Apple", "Banana", "Tomato", "Tomato"]
target = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
df = pd.DataFrame(list(zip(fruit, target)), columns=["Fruit", "Target"])
print("Data before encoding:")
print(df)

# define target encoding
# smoothing balances the per-category average against the prior; a higher value means stronger regularization
encoder = TargetEncoder(smoothing=0.1)

# transform data
df["Fruit Encoded"] = encoder.fit_transform(df["Fruit"], df["Target"])
print("\nData after encoding:")
print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data before encoding:
    Fruit  Target
0   Apple       1
1  Banana       0
2  Banana       0
3  Tomato       0
4   Apple       1
5  Tomato       1
6   Apple       0
7  Banana       1
8  Tomato       0
9  Tomato       0

Data after encoding:
    Fruit  Target  Fruit Encoded
0   Apple       1       0.666667
1  Banana       0       0.333333
2  Banana       0       0.333333
3  Tomato       0       0.250000
4   Apple       1       0.666667
5  Tomato       1       0.250000
6   Apple       0       0.666667
7  Banana       1       0.333333
8  Tomato       0       0.250000
9  Tomato       0       0.250000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have obtained the same table that we calculated manually ourselves…&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of target encoding
&lt;/h3&gt;

&lt;p&gt;Target encoding is a simple and fast technique and it does not add additional dimensionality to the dataset. Therefore, it is a good encoding method for datasets involving features with high cardinality (more than 10,000 unique categories).&lt;/p&gt;

&lt;h3&gt;
  
  
  Disadvantages of target encoding
&lt;/h3&gt;

&lt;p&gt;Target encoding makes use of the distribution of the target variable, which can result in overfitting and data leakage: because we use the target values to encode the feature, the encoded feature can become biased towards the target. This is why there is a smoothing parameter when initializing the class. This parameter helps us reduce the problem (in our example above, we deliberately set it to a very small value to achieve the same results as our hand calculation).&lt;/p&gt;
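
&lt;p&gt;To give an intuition for what smoothing does, here is a minimal sketch of additive smoothing, where the category mean is blended with the global (prior) mean depending on how many samples the category has. This is a common formulation used purely for illustration; the exact weighting used internally by the TargetEncoder class may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

fruit = ["Apple", "Banana", "Banana", "Tomato", "Apple", "Tomato", "Apple", "Banana", "Tomato", "Tomato"]
target = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
df = pd.DataFrame({"Fruit": fruit, "Target": target})

m = 2.0                          # smoothing weight (illustrative value)
prior = df["Target"].mean()      # global mean of the target
stats = df.groupby("Fruit")["Target"].agg(["count", "mean"])

# blend the category mean with the prior: the more samples a category has, the closer to its own mean
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
print(smoothed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;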

&lt;p&gt;In this post, we covered encoding methods to convert categorical data to numerical data so that it can be used as features in our machine learning models!&lt;/p&gt;

&lt;p&gt;If you liked this content, feel free to follow me for more free machine learning tutorials and courses!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>encoding</category>
      <category>categorical</category>
      <category>data</category>
    </item>
    <item>
      <title>Regression in Machine Learning</title>
      <dc:creator>Berk Hakbilen</dc:creator>
      <pubDate>Mon, 17 Oct 2022 19:55:10 +0000</pubDate>
      <link>https://forem.com/berk_hakbilen/regression-in-machine-learning-100d</link>
      <guid>https://forem.com/berk_hakbilen/regression-in-machine-learning-100d</guid>
      <description>&lt;p&gt;In statistical modeling, regression analysis estimates the relationship between one or more independent variables and the dependent variable which represents the outcome.&lt;/p&gt;

&lt;p&gt;To explain with an example, you can imagine a list of houses with information regarding the size, distance to the city center, and garden (independent variables). Using this information, you can try to understand how the price (dependent variable) changes.&lt;/p&gt;

&lt;p&gt;So for a regression analysis we have a set of observations or samples with one or more variables/features. We then define a dependent variable (the outcome) and try to find a relation between the dependent variable and the independent variables. The best way to do this is by finding a function which best represents the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Models for Regression
&lt;/h2&gt;

&lt;p&gt;In the linear regression model, we will use regression analysis to best represent the dataset through a linear function. Then, we will use this function to predict the outcome of a new sample/observation which was not in the dataset.&lt;/p&gt;

&lt;p&gt;Linear regression is one of the most used regression models due to its simplicity and ease of understanding the results. Let’s move on to the model formulations to understand this better.&lt;/p&gt;

&lt;p&gt;The linear regression function is written assuming a linear relationship between the variables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oK3AkDQa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhvspssi8phcw5wfacw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oK3AkDQa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhvspssi8phcw5wfacw0.png" alt="linear regression function" width="391" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where the w terms are the regression coefficients, the x terms are the independent variables or features, y is the dependent variable/outcome and c is the constant bias term.&lt;/p&gt;

&lt;p&gt;We can write a simple linear regression function for the houses example we mentioned above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DytGAMlq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eih5vg01finhamvqkga5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DytGAMlq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eih5vg01finhamvqkga5.png" alt="linear regression for houses example" width="637" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So if we plug in the features of a new house into this function, we can predict its price (let’s assume size is 150m2 and distance to city center is 5 km).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A9QnG3qx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yo6tlbe6wielknupigic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9QnG3qx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yo6tlbe6wielknupigic.png" alt="Plugging in values" width="880" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note how the coefficient of distance to city center is negative, meaning the closer to the center, the more expensive the house will be.&lt;/p&gt;
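
&lt;p&gt;As a small numerical sketch of this kind of prediction, using hypothetical coefficient values purely for illustration (real coefficients would be learned from the data), plugging a new house into such a function is just a weighted sum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical coefficients for illustration only
w_size = 3000        # price change per m2
w_distance = -10000  # price change per km from the city center
c = 50000            # bias term

size = 150    # m2
distance = 5  # km to city center

price = w_size * size + w_distance * distance + c
print(price)  # 450000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;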

&lt;p&gt;We can create a simple fake regression dataset with only one feature and plot it to see the data behaviour more clearly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
plt.figure()
plt.title('Samples in a dataset with only one feature (dependent variable)')
X, y = make_regression(n_samples = 80, n_features=1,
                            n_informative=1, bias = 50,
                            noise = 40, random_state=42)
plt.scatter(X, y, marker= 'o', s=50)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g-40DKfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9re861awjk3a83samfss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g-40DKfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9re861awjk3a83samfss.png" alt="Samples in a plot" width="411" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset above has only one independent variable (feature). In this case, the regression function would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BUDjSnSz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijsjul10dhu0kn5lxoeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BUDjSnSz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijsjul10dhu0kn5lxoeu.png" alt="Image description" width="230" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where w1 would be the slope of the curve and c would be the offset value.&lt;/p&gt;

&lt;p&gt;When we train our model on this data, the coefficients and the bias term will be determined automatically so that the regression function best fits the dataset.&lt;/p&gt;

&lt;p&gt;The model algorithm finds the best coefficients for the dataset by optimizing an objective function, which in this case would be the loss function. The loss function represents the difference between the predicted outcome values and the real outcome values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Least-Squared Linear Regression
&lt;/h3&gt;

&lt;p&gt;In the least-squares linear regression model, the coefficients and bias are determined by minimizing the residual sum of squares (RSS), i.e. the sum of squared differences over all of the samples in the data. This model is also called Ordinary Least Squares.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O_8Hgh_L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkiyvvuat3nnr98ledbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O_8Hgh_L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkiyvvuat3nnr98ledbo.png" alt="Sum of Squared Differences" width="516" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we interpret the function, it sums the squared differences between the predicted outcome values and the real outcome values over all samples.&lt;/p&gt;

&lt;p&gt;Let's train the Linear Regression model using the fake dataset we previously created and have a look at the calculated coefficients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 42)
model = LinearRegression()
model.fit(X_train, y_train)

print('feature coefficient (w_1): {}'.format(model.coef_))
print('intercept (c): {:.3f}'.format(model.intercept_))
print('R-squared score (training):{:.3f}'.format(model.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(model.score(X_test, y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E6q551L0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpcuao83ezeauwx7f7v4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E6q551L0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpcuao83ezeauwx7f7v4.png" alt="Image description" width="880" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, R^2 is the coefficient of determination. This term represents the amount of variation in the outcome (y) explained by the dependence on the features (x variables). Therefore, a larger R^2 indicates better model performance, i.e. a better fit.&lt;/p&gt;

&lt;p&gt;When R^2 is equal to one, RSS is equal to 0, meaning the predicted outcome values and the real outcome values are exactly the same. We will be using the R^2 term to measure the performance of our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(6,5))
plt.scatter(X, y, marker= 'o', s=50, alpha=0.7)
plt.plot(X, model.coef_*X + model.intercept_, 'r-')
plt.title('Least-squares linear regression model')
plt.xlabel('Variable/feature value (x)')
plt.ylabel('Outcome (y)')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--loZyuzrn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmh1tknpku3vcimcsu8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--loZyuzrn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmh1tknpku3vcimcsu8l.png" alt="Least squares linear regression" width="397" height="333"&gt;&lt;/a&gt;&lt;/p&gt;
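
&lt;p&gt;As a quick sanity check, we can also compute R^2 manually from the residual sum of squares and compare it to what model.score reports. This is a minimal sketch reusing the model and training split from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# predictions on the training split used above
y_pred = model.predict(X_train)

# residual sum of squares and total sum of squares
rss = np.sum((y_train - y_pred) ** 2)
tss = np.sum((y_train - np.mean(y_train)) ** 2)

# coefficient of determination: 1 when rss is 0, smaller as the residuals grow
r_squared = 1 - rss / tss
print('manually computed R-squared: {:.3f}'.format(r_squared))
print('model.score R-squared: {:.3f}'.format(model.score(X_train, y_train)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;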

&lt;h3&gt;
  
  
  Ridge Regression - L2 Regularization
&lt;/h3&gt;

&lt;p&gt;The Ridge regression model calculates the coefficients and the bias (w and c) using the same criterion as least squares, but with an extra term.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jas8H7bg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhuoa3dhjetlj93jh16f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jas8H7bg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhuoa3dhjetlj93jh16f.png" alt="RSS with L2 regularization" width="688" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This term is a penalty to adjust the large variations in the coefficients. The linear prediction formula is still the same but only the way coefficients are calculated differs due to this extra penalty term. This is called &lt;strong&gt;regularization&lt;/strong&gt;. It serves to prevent overfitting by restricting the variation of the coefficients which results in a less complex or simpler model.&lt;/p&gt;

&lt;p&gt;This extra term is basically the sum of squares of the coefficients. Therefore, when we try to minimize the RSS function, we also minimize the sum of squares of the coefficients, which is called &lt;strong&gt;L2 regularization&lt;/strong&gt;. The alpha constant serves to control the influence of this regularization. This way, in comparison to the least-squares model, we can actually control the complexity of our model with the help of the alpha term. The higher the alpha term, the stronger the regularization and the simpler the model will be.&lt;/p&gt;

&lt;p&gt;The accuracy improvement on datasets with a single feature is not significant. However, for datasets with multiple features, regularization can be very effective at reducing model complexity and therefore overfitting, increasing model performance on the test set.&lt;/p&gt;

&lt;p&gt;Let's have a look at its implementation in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import datasets
X,y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 42)
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X_train, y_train)

print('feature coefficients: {}'.format(model.coef_))
print('intercept (c): {:.3f}'.format(model.intercept_))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DAOGcqPp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u3dr8epm5kwfa2okslat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DAOGcqPp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u3dr8epm5kwfa2okslat.png" alt="Image description" width="880" height="73"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10,5))
alphas = [1,5,10,20,50,100,500]
features = ['w_'+str(i+1) for i,_ in enumerate(model.coef_)]
for alpha in alphas:
  model = Ridge(alpha=alpha).fit(X_train,y_train)
  plt.scatter(features,model.coef_, alpha=0.7,label=('alpha='+str(alpha)))

plt.axhline(0)
plt.xticks(features)
plt.legend(loc='upper left')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vs-O2toE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t3ljqt4aebyap0i1ttwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vs-O2toE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t3ljqt4aebyap0i1ttwm.png" alt="Effect of normalization on coefficients" width="606" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Regularization can penalize the features unfairly when they have different scales (when one feature has values around 0-1 and another from 100-1000), which can cause inaccuracies in our model. In this case, feature scaling comes to our help to normalize all the values in the dataset, so that we can get rid of the scale differences. We will look into feature scaling in another section...&lt;/p&gt;
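
&lt;p&gt;A minimal sketch of what this could look like, scaling the features with StandardScaler before fitting Ridge (reusing the diabetes train/test split from above; the alpha value is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# scale features to zero mean and unit variance so the penalty treats them equally
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = Ridge(alpha=1).fit(X_train_scaled, y_train)
print('R-squared score (test): {:.3f}'.format(model.score(X_test_scaled, y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;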

&lt;h3&gt;
  
  
  Lasso Regression - L1 Regularization
&lt;/h3&gt;

&lt;p&gt;Lasso regression is also a regularized linear regression model. In comparison to Ridge regression, it uses L1 regularization as the penalty term while calculating the coefficients.&lt;/p&gt;

&lt;p&gt;Let's have a look at what the RSS function looks like with the penalty term for L1 regularization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KcTNtAnS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ybmf32f1q1brze3dpp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KcTNtAnS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ybmf32f1q1brze3dpp6.png" alt="RSS with L1 regularization" width="685" height="142"&gt;&lt;/a&gt;&lt;br&gt;
The penalty term for L1 regularization is the sum of absolute values of the coefficients. Therefore, when the algorithm tries to minimize RSS, it enforces the regularization by minimizing the sum of absolute values of the coefficients.&lt;/p&gt;

&lt;p&gt;This drives the coefficients of the least effective parameters to 0, which acts as a kind of feature selection. Therefore, it is most effectively used for datasets where a few features have a more dominant effect compared to the others, because features with a small effect are eliminated by setting their coefficients to 0.&lt;/p&gt;

&lt;p&gt;The alpha term is again used to control the amount of regularization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X_train, y_train)

print('feature coefficients: {}'.format(model.coef_))
print('intercept (c): {:.3f}'.format(model.intercept_))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZiNC2Lr7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zhyrq1r2a77xdfl12e6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZiNC2Lr7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zhyrq1r2a77xdfl12e6x.png" alt="Image description" width="880" height="74"&gt;&lt;/a&gt;&lt;br&gt;
After finding the coefficients of the dominant features, we can go ahead and list their labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
data = datasets.load_diabetes()
np.take(data.feature_names,np.nonzero(model.coef_))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BOBD3wrV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rf0lk2tnn5owwdw7ute7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BOBD3wrV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rf0lk2tnn5owwdw7ute7.png" alt="Image description" width="880" height="27"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alphas = [0.1,0.5,1,2,5,10]
for alpha in alphas:
  model = Lasso(alpha=alpha).fit(X_train,y_train)
  print('feature coefficients for alpha={}: \n{}'.format(alpha,model.coef_))
  print('R-squared score (training): {:.3f}'.format(model.score(X_train, y_train)))
  print('R-squared score (test): {:.3f}\n'.format(model.score(X_test, y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_Lw4Lma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlcocipew8nvqdpp6jy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_Lw4Lma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlcocipew8nvqdpp6jy3.png" alt="Image description" width="880" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ridge or Lasso?
&lt;/h4&gt;

&lt;p&gt;To sum up, it makes sense to use the Ridge regression model when there are many features with small to medium effects. If there are only a few dominantly effective features, use the Lasso regression model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Polynomial Regression
&lt;/h3&gt;

&lt;p&gt;Linear regression performs well under the assumption that the relationship between the independent variables (features) and the dependent variable (outcome) is linear. If the distribution of the data is more complex and does not show a linear behaviour, can we still use linear models to represent such datasets? This is where polynomial regression comes in very useful.&lt;/p&gt;

&lt;p&gt;To capture this complex behaviour, we can add higher order terms to represent the features in the data. Transforming the linear model with one feature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T_O8V6L0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba9ckkdrdxr38w4zgc29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T_O8V6L0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba9ckkdrdxr38w4zgc29.png" alt="Polynomial Regression" width="531" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the coefficients are related to the features linearly, this is still a linear model. However, it contains quadratic terms and the fitted curve is a polynomial curve.&lt;/p&gt;

&lt;p&gt;Let's continue with an example for Polynomial regression. To convert the features to higher order terms, we can use the PolynomialFeatures class from scikit-learn. Then we can use the Linear regression model from before to train the model.&lt;/p&gt;
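
&lt;p&gt;To see what this transformation actually produces, here is a minimal sketch of PolynomialFeatures expanding a single feature into a bias term, the feature itself and its square:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])  # one feature, three samples

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))
# each row becomes [1, x, x**2]:
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;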

&lt;p&gt;But first, let us create a dataset which could be a good fit for a 2nd degree function. For that we will use numpy to create random X points and plug them into a representative function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.random.seed(0)
X = 2 - 3 * np.random.normal(0, 1, 100)
y = X - 2 * (X ** 2)  + np.random.normal(-3, 3, 100)

plt.scatter(X, y, s=10)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2_O2HpLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/117gfwcry3y3lvrpynqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2_O2HpLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/117gfwcry3y3lvrpynqp.png" alt="Image description" width="383" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can reshape the arrays we created so that we can feed them into the model. First, we will train a LinearRegression model to see how it fits this data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = X[:, np.newaxis]
y = y[:, np.newaxis]

model = LinearRegression()
model.fit(X,y)
print('feature coefficients: \n{}'.format(model.coef_))
print('R-squared score (training): {:.3f}'.format(model.score(X, y)))

plt.plot(X, model.coef_*X + model.intercept_, 'r-')
plt.scatter(X,y, s=10)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hfTkX8-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sjpos2lyzkpfv9t0u8ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hfTkX8-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sjpos2lyzkpfv9t0u8ju.png" alt="Image description" width="880" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pV0TNaV2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3easnlymhhk03gqugrtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pV0TNaV2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3easnlymhhk03gqugrtt.png" alt="Image description" width="383" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As expected, the Linear Regression model does not provide a very good fit using the original features for a dataset with this behaviour. Now, we can create 2nd order polynomial features using the PolynomialFeatures class from the scikit-learn library and then use these new features to train the same linear regression model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import PolynomialFeatures

poly_features= PolynomialFeatures(degree=2)
X_poly = poly_features.fit_transform(X)

model = LinearRegression()
model.fit(X_poly,y)
print('feature coefficients: \n{}'.format(model.coef_))
print('R-squared score (training): {:.3f}'.format(model.score(X_poly, y)))
plt.scatter(X,model.predict(X_poly),s=10,label="polynomial prediction")
plt.scatter(X,y,s=10,label="real data")
plt.legend(loc='lower left')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fZ3WDNA4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/md20aynfc2heeqm2en1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fZ3WDNA4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/md20aynfc2heeqm2en1d.png" alt="Image description" width="880" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k_jkoG42--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bwdk0g409dt9c1kydjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k_jkoG42--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bwdk0g409dt9c1kydjj.png" alt="Image description" width="383" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, we were able to obtain a very good fit using the same linear regression model but with 2nd order features obtained from the PolynomialFeatures class. This is a good example of how polynomial regression can be used to obtain better fits for data which does not have a linear relationship between the features and the outcome value.&lt;/p&gt;

&lt;p&gt;If you like this, feel free to follow me for more free machine learning tutorials and courses!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>regression</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Decision Trees in Machine Learning Explained</title>
      <dc:creator>Berk Hakbilen</dc:creator>
      <pubDate>Tue, 16 Aug 2022 21:10:34 +0000</pubDate>
      <link>https://forem.com/berk_hakbilen/decision-trees-in-machine-learning-explained-2nbf</link>
      <guid>https://forem.com/berk_hakbilen/decision-trees-in-machine-learning-explained-2nbf</guid>
      <description>&lt;p&gt;Decision Tree is a supervised machine learning algorithm which works on the basis of recursively answering some questions (if-else conditions). The algorithm is used both for regression and classification. However mostly for classification problems.&lt;/p&gt;

&lt;p&gt;The questions in boxes are called internal nodes, where the answers to the questions split the data into branches. The nodes which do not split any further are called leaves, and they represent the decisions/outputs of the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yg7jp4R7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yuz158136apnrf10rl5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yg7jp4R7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yuz158136apnrf10rl5v.png" alt="Image description" width="875" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tree is of course much bigger and more complex for larger datasets compared to our simple example above. The tree grows and forms according to the data we provide to it (training the model). However, this simple diagram also shows how simply the algorithm actually works. You can already imagine that to split the data properly, you need to ask the right questions starting from the top node. This means that which features and which conditions to use are crucial to building a well-performing decision tree. Well, how is this done?&lt;/p&gt;

&lt;p&gt;Firstly, the root node feature is selected based on the results of the Attribute Selection Measure (ASM). Afterwards, the ASM is applied recursively to the emerging nodes until no more splits are possible (when we reach the leaves).&lt;/p&gt;

&lt;h2&gt;
  
  
  Attribute Selection Measure (ASM)
&lt;/h2&gt;

&lt;p&gt;An Attribute Selection Measure is a heuristic for selecting the feature (attribute) that best splits the data at each node, so that the target variable can be analyzed and predicted better. This is how the decision tree chooses the nodes that make the best splits for the data.&lt;/p&gt;

&lt;p&gt;The two main ASM techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gini index&lt;/li&gt;
&lt;li&gt;Information Gain (ID3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gini Index&lt;/strong&gt;&lt;br&gt;
Gini index or Gini impurity is a measure of impurity (the probability of a sample being classified incorrectly if it were labelled randomly according to the class distribution in the node) used for creating the decision tree. A feature with a low Gini index value is preferred when deciding the splits while creating the decision tree. The Gini index is used to create only binary splits in the tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hpsqyw5J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e2s2nwyw3h2yo9re5w5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hpsqyw5J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e2s2nwyw3h2yo9re5w5l.png" alt="Image description" width="409" height="141"&gt;&lt;/a&gt;&lt;/p&gt;
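
&lt;p&gt;For intuition, here is a minimal sketch computing the Gini impurity of a node from its class counts (the counts are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def gini(class_counts):
    # gini impurity: 1 minus the sum of squared class probabilities
    p = np.array(class_counts) / np.sum(class_counts)
    return 1 - np.sum(p ** 2)

print(gini([50, 50]))   # 0.5  (maximally impure for two classes)
print(gini([90, 10]))   # 0.18 (mostly one class)
print(gini([100, 0]))   # 0.0  (pure node)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;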

&lt;p&gt;&lt;strong&gt;Information Gain (ID3)&lt;/strong&gt;&lt;br&gt;
Information Gain tells how informative a feature is by measuring the change in entropy after splitting the data on that feature. The decision tree algorithm always tries to maximize the Information Gain, so the feature with the highest information gain is chosen for the first node (first split). Therefore the tree is first split by the feature that reduces the entropy the most, and the entropy keeps decreasing all the way down to the leaves.&lt;/p&gt;

&lt;p&gt;The entropy formula is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Uum91jr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9o8dpuxrkcq3qsgdo5pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Uum91jr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9o8dpuxrkcq3qsgdo5pk.png" alt="Image description" width="359" height="138"&gt;&lt;/a&gt;&lt;/p&gt;
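
&lt;p&gt;For intuition, here is a minimal sketch computing the entropy of a node and the information gain of a candidate split (the class counts are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def entropy(class_counts):
    # entropy: negative sum of p * log2(p) over the class probabilities
    p = np.array(class_counts) / np.sum(class_counts)
    p = p[p != 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

# parent node with 10 positive and 10 negative samples
parent = [10, 10]
# a candidate split producing two child nodes
left, right = [9, 1], [1, 9]

n = np.sum(parent)
weighted_children = (np.sum(left) / n) * entropy(left) + (np.sum(right) / n) * entropy(right)
info_gain = entropy(parent) - weighted_children
print('information gain: {:.3f}'.format(info_gain))  # about 0.531
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;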

&lt;p&gt;We can use the DecisionTreeClassifier model from the scikit-learn library.&lt;/p&gt;

&lt;p&gt;Let’s use the breast cancer dataset from the scikit-learn library and apply the Decision Tree model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import train_test_split function and the dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
y = cancer.target
X = cancer.data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=0)  # 'gini' is the default criterion
clf.fit(X_train, y_train)

print("Training accuracy {:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JsjmblPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30v1ib42no3pp3jx8aem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JsjmblPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30v1ib42no3pp3jx8aem.png" alt="Image description" width="336" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our training accuracy being higher than our test accuracy shows us that our model is overfitting to the data. Let’s plot our decision tree and examine its complexity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import tree
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (16,8), dpi=100)
tree.plot_tree(clf, feature_names = cancer.feature_names, class_names=cancer.target_names, filled = True, fontsize = 5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vXqt4Bjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eyljfuxqau1wv4s8f8z2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vXqt4Bjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eyljfuxqau1wv4s8f8z2.png" alt="Image description" width="875" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Changing the ASM criterion from gini to entropy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

print("Training accuracy:{:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YonYmYUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ofrd3u0uile8urx316x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YonYmYUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ofrd3u0uile8urx316x.png" alt="Image description" width="347" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our test accuracy improved when using the entropy attribute selection measure as the splitting criterion. We can have a look at the tree again to see whether any of the splits changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (16,8), dpi=100)

tree.plot_tree(clf, feature_names = cancer.feature_names, class_names=cancer.target_names, filled = True, fontsize = 5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DHqUm8uB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ksy2qr01a6mu187atj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DHqUm8uB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ksy2qr01a6mu187atj9.png" alt="Image description" width="875" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, our tree is pretty complex, which results in overfitting the data. We can reduce the complexity of the tree by setting a maximum depth (max_depth) to prevent overfitting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

print("Training accuracy:{:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VxOc1_Km--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iw1n7dhsmwmgb5fzt3hl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VxOc1_Km--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iw1n7dhsmwmgb5fzt3hl.png" alt="Image description" width="346" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By reducing the maximum depth of our decision tree to three, we were able to decrease overfitting and increase our test accuracy slightly. If we examine our tree diagram now, we will see a much simpler tree…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CU-u0M7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnis5epgzr070lyupazo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CU-u0M7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnis5epgzr070lyupazo.png" alt="Image description" width="875" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting max_depth equal to 3, we reduced the complexity of the decision tree by pruning it. This pruned model is less complex and a little easier to understand in comparison to the previous model, where the tree kept splitting until all leaves were pure (Gini impurity = 0).&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Tree Regression
&lt;/h2&gt;

&lt;p&gt;Let’s have a look at how the decision tree works for a regression problem. For this we can again generate a random dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

np.random.seed(5)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scikit-learn library provides us with a decision tree model for regression called DecisionTreeRegressor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor(criterion="squared_error", random_state=0)

dt_reg.fit(X, y)

fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(16,8), dpi=100)
tree.plot_tree(dt_reg, feature_names=['X'], filled=True, fontsize=5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dd5DWzG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/betzlcv9kpbdccs7t0it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dd5DWzG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/betzlcv9kpbdccs7t0it.png" alt="Image description" width="875" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#generate a random Test data
T = np.linspace(0, 5, 100)[:, np.newaxis]

#creating two regression trees with different depths
dt_reg_1 = DecisionTreeRegressor(max_depth = 10, random_state=0)
dt_reg_2 = DecisionTreeRegressor(max_depth = 3, random_state=0)

#training the models
dt_reg_1.fit(X, y)
dt_reg_2.fit(X, y)

#making predictions for the random test data we generated above
y_pred_1 = dt_reg_1.predict(T)
y_pred_2 = dt_reg_2.predict(T)

#comparison plot to see the effect of tree depth
plt.figure()
plt.scatter(X, y, s=40, c="orange", label="actual")
plt.plot(T, y_pred_1, color="b", label="max_depth=10", linewidth=2)
plt.plot(T, y_pred_2, color="g", label="max_depth=3", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l9qKyz0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ryvfsdyiy6ysrbjo5hjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l9qKyz0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ryvfsdyiy6ysrbjo5hjc.png" alt="Image description" width="394" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the figure, we can see that the decision tree regressor with max depth set to ten overfits the data, capturing all of its noise. In contrast, the tree with max depth set to three generalizes much better and produces a better fit without capturing all the noise.&lt;/p&gt;
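
&lt;p&gt;Instead of guessing a good maximum depth, we can also compare a few depths with cross-validation. A minimal sketch, assuming the X and y arrays generated above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# 5-fold cross-validation for a few candidate depths (assumes X, y from above)
for depth in [2, 3, 5, 10]:
    scores = cross_val_score(DecisionTreeRegressor(max_depth=depth, random_state=0),
                             X, y, cv=5)
    print("max_depth={}: mean CV R^2 = {:.2f}".format(depth, scores.mean()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;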

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
Easy to understand, interpret and visualize&lt;br&gt;
Usually requires little to no feature scaling, normalization or feature selection&lt;br&gt;
Works well with mixed data types (categorical, numerical, binary), which makes data preprocessing easier&lt;br&gt;
Also suitable for multi-output problems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
Decision trees tend to overfit and do not generalize very well&lt;br&gt;
An ensemble of trees or other models is usually needed for better generalization performance (see the sketch below)&lt;br&gt;
Decision trees tend to form biased trees if there is data imbalance; datasets with dominating classes should be balanced first&lt;/p&gt;
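
&lt;p&gt;As a hedged sketch of the ensemble idea mentioned above, we can train a random forest (an ensemble of decision trees) on the same breast cancer train/test split and compare its scores to the single tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier

# an ensemble of 100 trees usually generalizes better than a single tree
# (assumes X_train, X_test, y_train, y_test from the classification example above)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("Training accuracy: {:.2f}".format(rf.score(X_train, y_train)))
print("Test accuracy: {:.2f}".format(rf.score(X_test, y_test)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;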

&lt;p&gt;If you like this, feel free to follow me for more machine learning tutorials and courses!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>decision</category>
      <category>trees</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Support Vector Machines</title>
      <dc:creator>Berk Hakbilen</dc:creator>
      <pubDate>Mon, 15 Aug 2022 20:06:00 +0000</pubDate>
      <link>https://forem.com/berk_hakbilen/support-vector-machines-gl7</link>
      <guid>https://forem.com/berk_hakbilen/support-vector-machines-gl7</guid>
<description>&lt;p&gt;Support Vector Machine is another simple algorithm which is definitely worth knowing about. It performs relatively well at a low computational cost. For regression, SVM works by finding a hyperplane in an N-dimensional space (N being the number of features) which fits the multidimensional data while considering a margin. For classification, the same kind of hyperplane is calculated, but this time to distinctly separate the data points, again while considering a margin. There can be many possible hyperplanes to select from. However, the objective is to find the hyperplane with the maximum margin, meaning the maximum distance between the target classes.&lt;/p&gt;

&lt;p&gt;SVM can be used both for regression and classification problems but it is widely used for classification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SE-6Sijl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e10abnuyj6xf1mjnlpxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SE-6Sijl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e10abnuyj6xf1mjnlpxn.png" alt="Image description" width="416" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s explain some terms before we go any further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel&lt;/strong&gt; is the function used to map the data into a higher dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hyperplane&lt;/strong&gt; is the line separating the classes (for classification problems). For regression, it is the line that we fit to our data to predict continuous outcome values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boundary lines&lt;/strong&gt; are the lines that form the area with the tolerated error we previously mentioned. They are the two lines around the hyperplane which represent the margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support vectors&lt;/strong&gt; are the data points closest to these boundary lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel
&lt;/h2&gt;

&lt;p&gt;We mentioned that the kernel is the function that converts our data into higher dimensions. Well, how is that useful to us?&lt;/p&gt;

&lt;p&gt;Sometimes the data is distributed in a way that makes it impossible to get an accurate fit with a linear separator. SVR can handle such highly non-linear data using the kernel function, which implicitly maps the features into a higher-dimensional feature space. This makes it possible to describe the data with a linear hyperplane in that space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vm77EOjl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4wzxbtsowjrso5fe6pve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vm77EOjl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4wzxbtsowjrso5fe6pve.png" alt="Image description" width="336" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YNoyZCz6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/norkj6ex0flwxjzhrjkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YNoyZCz6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/norkj6ex0flwxjzhrjkf.png" alt="Image description" width="429" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three most commonly used kernels are (a small numpy sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear kernel: a plain dot product, suitable when the data is already linearly separable&lt;/li&gt;
&lt;li&gt;Polynomial kernel: maps the features to polynomial combinations up to a given degree&lt;/li&gt;
&lt;li&gt;Radial Basis Function (RBF): good for dealing with overlapping data&lt;/li&gt;
&lt;/ul&gt;
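
&lt;p&gt;To make these kernels a bit more concrete, here is a minimal numpy sketch of the standard kernel functions; the gamma, degree and coef0 values below are arbitrary illustration choices, not library defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def linear_kernel(x1, x2):
    # plain dot product
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=3, coef0=1):
    # (x1 . x2 + coef0) raised to the given degree
    return (np.dot(x1, x2) + coef0) ** degree

def rbf_kernel(x1, x2, gamma=0.5):
    # exp(-gamma * squared distance); useful for overlapping, non-linear data
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 0.5])
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), rbf_kernel(x1, x2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;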

&lt;h2&gt;
  
  
  Support Vector Regression (SVR)
&lt;/h2&gt;

&lt;p&gt;Similar to linear regression models, SVR also tries to find a curve that best fits the dataset. Recall our equation from linear regression for a dataset with one feature:&lt;/p&gt;

&lt;p&gt;y = w1x1 + c&lt;/p&gt;

&lt;p&gt;For SVR on a dataset with one feature, the equation looks similar, but with an error term taken into account.&lt;/p&gt;

&lt;p&gt;−e ≤ y − (w1x1 + c) ≤ e&lt;/p&gt;

&lt;p&gt;Looking at the equation, it is clear that only the points outside the e error area will be considered in the cost calculation. SVR can of course also be used for complicated datasets with more features by using higher-dimensional feature terms, similar to polynomial regression.&lt;/p&gt;

&lt;p&gt;The hyperplane is the best fit to the data when it coincides with the maximum number of points possible. We determine the boundary lines (the value of e, which is the distance from the hyperplane) so that the points closest to the hyperplane lie within the boundary lines.&lt;/p&gt;

&lt;p&gt;Keep in mind that because the margin (between the boundary lines) is tolerated, it will not be counted as error. You can probably already imagine how this term allows us to tune the complexity of our model (underfitting/overfitting).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt

np.random.seed(5)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 5)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

# Fit regression model
SVR_rbf = SVR(kernel='rbf' )
SVR_lin = SVR(kernel='linear')
SVR_poly = SVR(kernel='poly')

y_rbf = SVR_rbf.fit(X, y).predict(X)
y_lin = SVR_lin.fit(X, y).predict(X)
y_poly = SVR_poly.fit(X, y).predict(X)

# look at the results
plt.scatter(X, y, c='k', label='data')
plt.plot(X, y_rbf, c='b', label='RBF')
plt.plot(X, y_lin, c='r',label='Linear')
plt.plot(X, y_poly, c='g',label='Polynomial')
plt.xlabel('data')
plt.ylabel('outcome')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wlVL9C_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v9clqvat02xji8eyfhar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wlVL9C_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v9clqvat02xji8eyfhar.png" alt="Image description" width="394" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Support Vector Machines for Classification
&lt;/h2&gt;

&lt;p&gt;We already saw how the “Support Vector Machines (SVM)” algorithm works for regression. For classification the idea is actually almost the same. In fact SVM is mostly used for classification problems. I believe you can already imagine why…&lt;/p&gt;

&lt;p&gt;For regression, we mentioned that SVM tries to find a curve that best fits the dataset, and then makes predictions for new points using that curve. The same idea can easily be used to classify data into two different classes. For a multidimensional space with n dimensions (meaning data with n features), the model fits a hyperplane (also called the decision boundary) that best separates the two classes. Remember the image from the regression section where we explained the kernel…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2tUl-Ol2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yb5eorqnzg01xfd9wgcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2tUl-Ol2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yb5eorqnzg01xfd9wgcm.png" alt="Image description" width="429" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The margin is the gap between the two closest points from each class, i.e. the distance from the hyperplane to the closest points (the support vectors). The hyperplane which fits the data best, meaning the one which separates the two classes best, is the hyperplane with the maximum possible margin. Therefore, the SVM algorithm searches for the hyperplane with the maximum margin (distance to the closest points).&lt;/p&gt;

&lt;p&gt;As we already mentioned in the regression section, some datasets are simply not suitable for classification with a linear hyperplane… In this case, the “kernel trick” again comes to our rescue, implicitly mapping the data to higher dimensions and therefore making it possible to separate the data with a linear hyperplane. Since we already discussed the kernel types and how they work, I will continue with an implementation example…&lt;/p&gt;

&lt;p&gt;Let’s continue using the breast cancer dataset from the scikit-learn library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn import svm

X, y = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_sub = X[:, 0:2]

# create a mesh to plot in
x_min, x_max = X_sub[:, 0].min() - 1, X_sub[:, 0].max() + 1
y_min, y_max = X_sub[:, 1].min() - 1, X_sub[:, 1].max() + 1
h = (x_max / x_min)/100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear').fit(X_sub, y)

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X_sub[:, 0], X_sub[:, 1], c=y, cmap=plt.cm.Paired)
# highlight the support vectors in black
plt.scatter(X_sub[svc.support_, 0], X_sub[svc.support_, 1], c='k')
plt.xlabel('Mean radius')
plt.ylabel('Mean texture')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CAJtqRnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32so2csnezdu1sx3tnma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CAJtqRnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32so2csnezdu1sx3tnma.png" alt="Image description" width="382" height="278"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
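
&lt;p&gt;For comparison, a minimal sketch (reusing X_sub, y and the mesh grid xx, yy built above) that swaps in the RBF kernel; the gamma value is an arbitrary choice for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# same decision boundary plot with an RBF kernel (assumes X_sub, y, xx, yy from above)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.1, C=1.0).fit(X_sub, y)

Z = rbf_svc.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X_sub[:, 0], X_sub[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Mean radius')
plt.ylabel('Mean texture')
plt.title('SVC with RBF kernel')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;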

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create and instance of the classifier model with a linear kernel
lsvm = svm.SVC(kernel="linear")

#fit the model to our train split from previous example
lsvm.fit(X_train,y_train)

#Make predictions using the test split so we can evaluate its performance
y_pred = lsvm.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s evaluate the performance of the model by comparing the predictions to the real values from the test set…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zQKAx2IB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhe193f2zo6sxyzolzwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zQKAx2IB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhe193f2zo6sxyzolzwc.png" alt="Image description" width="423" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We were able to achieve an accuracy of 95.6%, which is very good. Let’s compare the training and test scores to check for overfitting…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("training set score: %f" % lsvm.score(X_train, y_train))
print("test set score: %f" % lsvm.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--09Svh8iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlkdljhbi9tuw54yiku6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--09Svh8iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qlkdljhbi9tuw54yiku6.png" alt="Image description" width="375" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The training score seems to be a little higher than the test score, so we can say the model is overfitting slightly, although not by much.&lt;/p&gt;
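
&lt;p&gt;If the gap between the training and test score were larger, a natural next step would be to tune the regularization parameter C with cross-validation. A minimal sketch using GridSearchCV on the same training split (the parameter grid is an arbitrary example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import GridSearchCV

# search over a few values of the regularization parameter C
# (assumes svm, X_train and y_train from above)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best C:", grid.best_params_['C'])
print("Best cross-validation score: {:.3f}".format(grid.best_score_))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;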

&lt;p&gt;If you like this, feel free to follow me for more free machine learning tutorials and courses!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>supportvectormachines</category>
      <category>svm</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Evaluation for Regression Models in Machine Learning</title>
      <dc:creator>Berk Hakbilen</dc:creator>
      <pubDate>Sun, 14 Aug 2022 15:01:00 +0000</pubDate>
      <link>https://forem.com/berk_hakbilen/evaluation-for-regression-models-in-machine-learning-386m</link>
      <guid>https://forem.com/berk_hakbilen/evaluation-for-regression-models-in-machine-learning-386m</guid>
      <description>&lt;p&gt;Model evaluation is very important since we need to understand how well our model is performing. In comparison to classification, performance of a regression model is slightly harder to determine because, unlike classification, it is almost impossible to predict the exact value of a target variable. Therefore, we need a way to calculate how close our prediction value is to the real value.&lt;/p&gt;

&lt;p&gt;There are different model evaluation metrics that are used popularly for regression models which we are going to dive into in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mean Absolute Error
&lt;/h2&gt;

&lt;p&gt;Mean absolute error is a simple and very intuitive technique, which is why it is also popular. It is the average of the distances between the predicted and the true values; these distances are the error terms. The overall error for the whole dataset is the average of all prediction error terms. We take the absolute value of the distances/errors to prevent negative and positive terms from cancelling each other out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OwB12BuO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjqxd16rf6gca24e9phg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OwB12BuO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjqxd16rf6gca24e9phg.png" alt="Image description" width="399" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MAE is not sensitive to outliers. Use MAE when you do not want outliers to play a big role in error calculated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MAE is not differentiable globally. This is not convenient when we use it as a loss function, due to the gradient optimization method.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Mean Squared Error (MSE)
&lt;/h2&gt;

&lt;p&gt;MSE is one of the most widely used metrics for regression problems. It is the average of the squared distances between the actual and the predicted values. Squaring the terms keeps negative differences from cancelling out positive ones in the total error.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rUybZElC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ejke8dqlkokr0vikejfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rUybZElC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ejke8dqlkokr0vikejfp.png" alt="Image description" width="402" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The graph of MSE is differentiable, which means it can easily be used as a loss function.&lt;/li&gt;
&lt;li&gt;MSE can be decomposed into variance and squared bias. This helps us understand the effect of variance or bias in the data on the overall error.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vghquh5Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b86axizwcevyrvwy8u87.png" alt="Image description" width="453" height="77"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The value calculated by MSE has a different unit than the target variable since it is squared (e.g. meter → meter²).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If there are outliers in the data, they will result in a larger error. Therefore, MSE is not robust to outliers (this can also be an advantage if you are looking to penalize outliers).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Root Mean Squared Error (RMSE)&lt;/strong&gt;&lt;br&gt;
As the name suggests, RMSE is the square root of the mean of the squared distances, i.e. the root of MSE. RMSE is also a popular evaluation metric, especially in deep learning.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N0GfpIpm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf10tsp3zmd82szwn6kg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N0GfpIpm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf10tsp3zmd82szwn6kg.png" alt="Image description" width="483" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The error calculated has the same unit as the target variables making the interpretation relatively easier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just like MSE, RMSE is also susceptible to outliers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  R-Squared
&lt;/h2&gt;

&lt;p&gt;R-squared is a different kind of metric compared to the ones we have discussed so far. It does not directly measure the error of the model.&lt;/p&gt;

&lt;p&gt;R-squared evaluates the scatter of the data points around the fitted regression line. It is the percentage of the variation in the target variable that the model explains, relative to the total variance of the target variable. It is also known as the “coefficient of determination” or goodness of fit.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bztk9ucb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rry2yd59n6kjx0vld578.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bztk9ucb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rry2yd59n6kjx0vld578.png" alt="Image description" width="682" height="182"&gt;&lt;/a&gt;&lt;br&gt;
As we can see above, R-squared is one minus the ratio of the sum of squared prediction errors to the total sum of squares, where the prediction in the denominator is replaced by the mean of the actual values.&lt;/p&gt;
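
&lt;p&gt;The same formula can be written out directly in numpy and compared against scikit-learn’s r2_score; a minimal sketch with made-up values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print("manual R-squared:", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;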

&lt;p&gt;R-squared is always between 0 and 1. A value of 0 indicates that the model does not explain any of the variation of the target variable around its mean; the regression model essentially just predicts the mean of the target variable. A value of 1 indicates that the model explains all the variance of the target variable around its mean.&lt;/p&gt;

&lt;p&gt;A larger R-squared value usually indicates that the regression model fits the data better. However, a high R-squared value does not necessarily mean a good model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import seaborn as sns; sns.set_theme(color_codes=True)

X, y = make_regression(n_samples = 80, n_features=1,
n_informative=1, bias = 50, noise = 15, random_state=42)

plt.figure()
ax = sns.regplot(x=X,y=y)

model = LinearRegression()
model.fit(X, y)
print('R-squared score: {:.3f}'.format(model.score(X, y)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IQS5Ebmv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zufiumqyyjyoofbw2oih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IQS5Ebmv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zufiumqyyjyoofbw2oih.png" alt="Image description" width="322" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZXsyXOzL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gn2bjlh8907006hz3ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZXsyXOzL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gn2bjlh8907006hz3ax.png" alt="Image description" width="388" height="251"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X, y = make_regression(n_samples = 80, n_features=1,
        n_informative=1, bias = 50, noise = 200, random_state=42)

plt.figure()
ax = sns.regplot(x=X,y=y)

model = LinearRegression()
model.fit(X, y)
print('R-squared score: {:.3f}'.format(model.score(X, y)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E6VvVq6e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzgwq15zi92keg22q01y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E6VvVq6e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzgwq15zi92keg22q01y.png" alt="Image description" width="282" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TU1s5JnR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rualjqnhpnw8pkuv1967.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TU1s5JnR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rualjqnhpnw8pkuv1967.png" alt="Image description" width="388" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;R-squared is a handy and intuitive metric of how well the model fits the data. Therefore, it is a good metric for a baseline model evaluation. However, due to the disadvantages we are going to discuss now, it should be used carefully.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;R-squared can’t determine whether the predictions are biased, which is why looking at the residual plots in addition is a good idea.&lt;/li&gt;
&lt;li&gt;R-squared does not necessarily indicate that a regression model is good to go. It is also possible to have a low R-squared score for a good regression model and a high R-squared score for a bad model (especially due to overfitting).&lt;/li&gt;
&lt;li&gt;When new input variables (predictors) are added to the model, R-squared increases (because more of the variance in the data is captured) regardless of whether the model actually performs better. It never decreases when new input variables are added. Therefore, a model with many input variables may seem to perform better just because it has more input variables. This is an issue we address with adjusted R-squared, sketched after this list.&lt;/li&gt;
&lt;/ul&gt;
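
&lt;p&gt;Adjusted R-squared penalizes predictors that do not actually improve the model. A minimal sketch of the standard formula (n samples, p predictors), applied to a made-up R-squared value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def adjusted_r2(r2, n_samples, n_features):
    # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# the same R^2 of 0.9 is barely changed with 1 feature,
# but noticeably penalized with 40 features (values are made up)
print(adjusted_r2(0.9, 80, 1))   # ~0.899
print(adjusted_r2(0.9, 80, 40))  # ~0.797
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;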

&lt;p&gt;It is still possible to fit a good model to a dataset with a lot of variance, and that model is likely to have a low R-squared. This does not necessarily mean the model is bad if it still captures the general trend in the dataset and the effect of a change in a predictor on the target variable. A low R-squared only becomes a big problem when we want to predict the target variable with high precision, meaning with a small prediction interval.&lt;/p&gt;

&lt;p&gt;A high R-squared score also does not necessarily mean a good model, because it cannot detect bias; this is why checking the residual plots is also a good idea. As we mentioned previously, a model with a high R-squared score can also be overfitting since it captures most of the variance in the data. Therefore, it is always a good idea to check the R-squared score of the model’s predictions and compare it to the R-squared score on the training data.&lt;/p&gt;

&lt;p&gt;If you like this, feel free to follow me for more free machine learning tutorials and courses!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>regression</category>
      <category>evaluation</category>
    </item>
  </channel>
</rss>
