<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Madeline</title>
    <description>The latest articles on Forem by Madeline (@madeline_pc).</description>
    <link>https://forem.com/madeline_pc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F388037%2Fbd1ae4f1-e2e4-4a8f-b7a1-4d45a874ce2c.jpg</url>
      <title>Forem: Madeline</title>
      <link>https://forem.com/madeline_pc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/madeline_pc"/>
    <language>en</language>
    <item>
      <title>EDA part 2: Learning about Estimates of Location </title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Wed, 24 Nov 2021 19:42:20 +0000</pubDate>
      <link>https://forem.com/madeline_pc/eda-part-2-learning-about-estimates-of-location-5epa</link>
      <guid>https://forem.com/madeline_pc/eda-part-2-learning-about-estimates-of-location-5epa</guid>
      <description>&lt;p&gt;This is the second article in a series on EDA. What is EDA? For an overview of Exploratory Data Analysis see &lt;a href="https://madelinecaples.hashnode.dev/exploratory-data-analysis-what-is-it-1"&gt;my last article&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Getting estimates of location
&lt;/h2&gt;

&lt;p&gt;In this article I'm going to go over some specific ways we can summarize numerical data to gain more general information from it - ways to use math to see patterns in our data, rather than a jumble of numbers. We'll be talking about numerical data exclusively. &lt;/p&gt;

&lt;p&gt;As such this article will contain some equations, but fear not! I will explain each piece of the math so you can understand what the fancy math symbols are trying to tell you. I'll also provide some code examples for you to see a working example of what the equation is doing. &lt;/p&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;So you've got a bunch of data in your dataset. You look at the items in your dataset. &lt;/p&gt;

&lt;p&gt;Let's imagine we have a dataset of dog food varieties and we're looking at a column in our dataset for prices. Dog food A costs $5, Dog food B costs $20, Dog food C also costs $20, and so on, all the way to Dog food &lt;em&gt;n&lt;/em&gt;, which costs $29 (&lt;em&gt;n&lt;/em&gt; in math just refers to whatever number is the last in a sequence - if there are 9 types of dog food then &lt;em&gt;n&lt;/em&gt; = 9). &lt;/p&gt;

&lt;p&gt;Each item in the dog food price column has a different value. How does knowing the prices of each dog food variety help you understand the dataset as a whole? Well it doesn't, really. When your variables are made up of measured data the data itself can have any number of different values. Looking at each of these values individually will take a lot of time, especially when you have more than nine varieties of dog food. It's also hard to keep all those values in your working memory so that you can recognize patterns.&lt;/p&gt;

&lt;p&gt;For any given variable we are interested in exploring (a variable is a column of the dataset), we need a way to summarize all those different numerical values. &lt;/p&gt;

&lt;p&gt;Here’s an example: instead of looking at each price of dog food &lt;em&gt;one at a time&lt;/em&gt;, let’s find a way to see &lt;em&gt;at a glance&lt;/em&gt; how much dog food costs &lt;em&gt;in general&lt;/em&gt; (or, you know, on average, &lt;em&gt;wink, wink&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;This is called summary statistical analysis. We are finding an &lt;strong&gt;estimate of location&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's location?
&lt;/h3&gt;

&lt;p&gt;An estimate of location is an estimate of where most of the data falls; in statistics it is sometimes called the &lt;strong&gt;central value&lt;/strong&gt; or &lt;strong&gt;central tendency&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;We can make a bar graph to visualize where most of our dog food prices fall. We can see they're mostly in the $20-$25 per bag range. We can also see that this one fluke price of $5 is way off from the rest of the data (we'll come back to that later in the article). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--khLiWY39--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674223352/R3UUyPJfM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--khLiWY39--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674223352/R3UUyPJfM.png" alt="bar-graph(1).png" width="600" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mean
&lt;/h2&gt;

&lt;p&gt;In addition to looking at a graph, we can find an actual value to represent the data's central tendency. One such value is the &lt;em&gt;mean&lt;/em&gt;, although you might know it by its other common name, the &lt;em&gt;average&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;To get the mean for a column in your dataset, you add all the values in the column together and then divide that resulting sum by the number of members in that column. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the fancy equation:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UB3WU0cO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674257749/0NFYB50vD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UB3WU0cO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674257749/0NFYB50vD.png" alt="mean.PNG" width="208" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;mean&lt;/em&gt; = &lt;em&gt;x̄&lt;/em&gt; = (Σ&lt;sub&gt;&lt;em&gt;i&lt;/em&gt;=1&lt;/sub&gt;&lt;sup&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sup&gt; &lt;em&gt;x&lt;/em&gt;&lt;sub&gt;&lt;em&gt;i&lt;/em&gt;&lt;/sub&gt;) / &lt;em&gt;n&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What does this all … &lt;em&gt;ahem&lt;/em&gt;…mean?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;x&lt;/em&gt; with the line over it is pronounced "x bar" and is used to talk about the mean of a sample from a population. It's the value we're looking for. &lt;/li&gt;
&lt;li&gt;The big Greek letter sigma (&lt;strong&gt;Σ&lt;/strong&gt;) may look intimidating, but it is an abbreviation for “sum”. It tells us we are adding together a bunch of values&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;i&lt;/em&gt;=1 subscript on the Sigma tells us that we are going to do this for each of the values in our column, starting with the first, &lt;em&gt;x&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, all the way to &lt;em&gt;x&lt;/em&gt;&lt;sub&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sub&gt; (&lt;em&gt;n&lt;/em&gt; is the index of the final value - however many values you have in your dataset).&lt;/li&gt;
&lt;li&gt;Divided by &lt;em&gt;n&lt;/em&gt; - again, &lt;em&gt;n&lt;/em&gt; is just the number of values in the dataset. In our example the &lt;em&gt;n&lt;/em&gt;th item is Dog food I, which costs $29.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we'll write this equation out for our dog food price example: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mean dog food price&lt;/strong&gt; = (5 + 20 + 20 + 21 + 23 + 23 + 24 + 25 + 29) / 9&lt;/p&gt;

&lt;p&gt;Let’s write that in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_mean_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does this code do? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;first we put the numbers of all the dog food prices into a list&lt;/li&gt;
&lt;li&gt;we write a function &lt;code&gt;get_mean_price&lt;/code&gt; whose job is to calculate the mean dog food price&lt;/li&gt;
&lt;li&gt;we pass the &lt;code&gt;dog_food_prices&lt;/code&gt; list as an argument to our function&lt;/li&gt;
&lt;li&gt;then we add all the numbers from that list together using the &lt;code&gt;sum&lt;/code&gt; function built into Python&lt;/li&gt;
&lt;li&gt;we divide the value returned by &lt;code&gt;sum&lt;/code&gt; by the length of &lt;code&gt;dog_food_prices&lt;/code&gt; - in other words we divide the sum by &lt;em&gt;how many&lt;/em&gt; different kinds of dog food there are&lt;/li&gt;
&lt;li&gt;then &lt;code&gt;round&lt;/code&gt; just rounds off our answer to 2 decimal places so that it doesn't have a ridiculous number of decimal places trailing off into the aether...&lt;/li&gt;
&lt;li&gt;we save all that stuff to a variable called &lt;code&gt;mean&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;and we return the result &lt;code&gt;mean&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
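
&lt;p&gt;In everyday code you don't have to hand-roll this: Python's built-in &lt;code&gt;statistics&lt;/code&gt; module does the same sum-divided-by-count arithmetic for us. A quick sanity check against our function:&lt;/p&gt;

```python
import statistics

dog_food_prices = [5, 20, 20, 21, 23, 23, 24, 25, 29]

# statistics.mean performs the same sum-divided-by-count calculation
mean_price = round(statistics.mean(dog_food_prices), 2)
print(mean_price)  # 21.11
```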

&lt;h3&gt;
  
  
  Quick definitions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What is a sample?&lt;/strong&gt; A sample is a part or subset of a population. In our example the sample is the 9 dog food varieties we are looking at. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a population?&lt;/strong&gt; A population is the whole of the group of things or people being studied. In our example the population would be all dog food. &lt;/p&gt;

&lt;h3&gt;
  
  
  Variations on the mean
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trimmed mean (sometimes called the truncated mean or average): the trimmed mean is calculated like the mean, except you first drop a fixed number (or proportion) of the lowest and highest values. This keeps your estimate from being skewed by extreme high or extreme low values (like the fluke $5 bag of dog food).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Wg5byup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674297876/tzMbnVUvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Wg5byup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674297876/tzMbnVUvh.png" alt="trimmed_mean.PNG" width="306" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how we'd change the code to find the trimmed mean instead of the mean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_trimmed_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
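
&lt;p&gt;A note on convention: in practice the trimmed mean usually drops a fixed &lt;em&gt;proportion&lt;/em&gt; of values from each end (SciPy's &lt;code&gt;scipy.stats.trim_mean&lt;/code&gt; works this way), not just the single minimum and maximum. Here's a small plain-Python sketch of that idea - the 10% trim below is an illustrative choice, not a rule:&lt;/p&gt;

```python
def get_trimmed_mean(prices, trim=0.1):
    # drop `trim` proportion of the values from each end of the
    # sorted data, always dropping at least one value per end
    k = max(1, int(len(prices) * trim))
    trimmed = sorted(prices)[k:-k]
    return round(sum(trimmed) / len(trimmed), 2)

dog_food_prices = [5, 20, 20, 21, 23, 23, 24, 25, 29]
print(get_trimmed_mean(dog_food_prices))  # 22.29
```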



&lt;ul&gt;
&lt;li&gt;Weighted mean (weighted average): similar to the mean, except before adding the values together you multiply each one by a chosen weight (&lt;em&gt;w&lt;/em&gt;), and then divide by the sum of the weights instead of by &lt;em&gt;n&lt;/em&gt;. Weights let some values count more than others - useful when some observations are more reliable or represent larger groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LSYDqOh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674335867/-n_ujqudl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LSYDqOh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1637674335867/-n_ujqudl.png" alt="weighted_mean.PNG" width="305" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Median
&lt;/h2&gt;

&lt;p&gt;To get the median you sort the data and then take the middle value. If there are an even number of values, then take the two middle values and get &lt;em&gt;their&lt;/em&gt; mean to find your median value. Since the median depends only on the values in the center of the data, it is already protected from the scary influence of outliers.&lt;/p&gt;

&lt;p&gt;In Python this would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;dog_food_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;find_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dog_food_prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
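
&lt;p&gt;As with the mean, the standard library can double-check our work: &lt;code&gt;statistics.median&lt;/code&gt; sorts the data for us and handles both the odd and even cases.&lt;/p&gt;

```python
import statistics

dog_food_prices = [5, 20, 20, 21, 23, 23, 24, 25, 29]

# statistics.median sorts internally and averages the two
# middle values when the count is even
print(statistics.median(dog_food_prices))  # 23
```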



&lt;h3&gt;
  
  
  Variation on the median
&lt;/h3&gt;

&lt;p&gt;Weighted median: the weighted cousin of the median. Instead of sorting the values and taking the one with half the &lt;em&gt;count&lt;/em&gt; of the data on either side, you take the value with half the total &lt;em&gt;weight&lt;/em&gt; on either side. &lt;/p&gt;
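
&lt;p&gt;Here's a minimal sketch of one common convention for the weighted median: walk the sorted values, accumulating weight until you reach half the total (ties are resolved by taking the first such value). This is an illustrative implementation, not the only convention:&lt;/p&gt;

```python
def weighted_median(values, weights):
    # pair each value with its weight and sort by value
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        # stop at the first value where cumulative weight reaches half
        if running >= half:
            return value

dog_food_prices = [5, 20, 20, 21, 23, 23, 24, 25, 29]
# with equal weights the weighted median is just the ordinary median
print(weighted_median(dog_food_prices, [1] * 9))  # 23
```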

&lt;h3&gt;
  
  
  Some more terms
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Robust&lt;/strong&gt;: robust means that the measure is not easily influenced by outliers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outlier&lt;/strong&gt;: an outlier is an extreme value. It can be at either the lower or the upper end of the values. E.g. using the example of dog food prices from earlier, let’s say we have a bag of dog food that costs only $5. Since the other bags cost between $20 and $29, this value would be considered an outlier because it is so different from the others. You can see that if we take the mean of the dog food prices, we’re going to get a deceptively low average. That’s why you want to either use the median, or trim your outliers before taking the mean, to give a more accurate picture of the central value for dog food prices. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomaly detection&lt;/strong&gt;: is an area where we study outliers. We want to focus on what is weird in a dataset. The weird stuff, the outliers, are the points of interest, and we compare them to the rest of the values which serve as a reference point. &lt;/p&gt;
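
&lt;p&gt;As a tiny taste of anomaly detection, here's one simple (and deliberately naive) rule: flag any value more than two standard deviations from the mean. The two-standard-deviation cutoff is a common rule of thumb, not something from the book:&lt;/p&gt;

```python
import statistics

def find_outliers(values, threshold=2):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    # keep values whose distance from the mean exceeds `threshold` stdevs
    return [v for v in values if abs(v - mean) > threshold * stdev]

dog_food_prices = [5, 20, 20, 21, 23, 23, 24, 25, 29]
print(find_outliers(dog_food_prices))  # [5]
```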

&lt;p&gt;Come back in 2 weeks for more on Exploratory Data Analysis! I'll be giving an overview on &lt;strong&gt;estimates of variability:&lt;/strong&gt; things like standard deviation and variance. &lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;These articles on EDA are adapted from my notes on &lt;em&gt;Practical Statistics for Data Science&lt;/em&gt; by Peter Bruce, et al. So far I like this book because the writing style is straightforward and concise, yet doesn't skimp on important technical terminology. I have not finished the book, but based on what I've seen so far and its reviews, it's a good one to read if you want further information on this topic. &lt;/p&gt;

&lt;p&gt;Here's a &lt;a href="https://books.google.com/books?id=k2XcDwAAQBAJ&amp;amp;printsec=frontcover#v=onepage&amp;amp;q&amp;amp;f=false"&gt;link&lt;/a&gt; to the Google books preview, if you're interested.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis: what is it?</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Tue, 09 Nov 2021 15:39:22 +0000</pubDate>
      <link>https://forem.com/madeline_pc/exploratory-data-analysis-what-is-it-g8k</link>
      <guid>https://forem.com/madeline_pc/exploratory-data-analysis-what-is-it-g8k</guid>
      <description>&lt;p&gt;This is the beginning of a series on Exploratory Data Analysis. &lt;/p&gt;

&lt;h2&gt;
  
  
  Defining EDA
&lt;/h2&gt;

&lt;p&gt;EDA is an acronym for Exploratory Data Analysis. As the name suggests, it's all about exploring your data using simple diagrams and summary statistics.  EDA features visualizations like scatter plots, box plots, correlation matrices, and summary values such as mean, median, interquartile range, and much much more! The goal of EDA is to give you a visual overview of your dataset before you jump into predictive modeling. As such, it's a process that typically takes place at the beginning stages of a machine learning project. &lt;/p&gt;

&lt;h3&gt;
  
  
  Some other Definitions, because I care
&lt;/h3&gt;

&lt;p&gt;Hold on! What's a dataset? What's predictive modeling? &lt;/p&gt;

&lt;p&gt;I'm glad you asked. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;dataset&lt;/strong&gt; is just a collection of all the data you are going to use in a machine learning project usually arranged in a &lt;em&gt;tabular&lt;/em&gt; structure (more on that later). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive modeling&lt;/strong&gt; is using statistics to predict outcomes of an unknown event—this could be a future event, like the price of a certain type of dog food at a store in a certain area next month—or it could even be an unknown past event (for example, it can be used to predict who most likely  committed a crime). &lt;/p&gt;

&lt;h3&gt;
  
  
  Motivation for EDA
&lt;/h3&gt;

&lt;p&gt;The reason you'll want to do EDA at the beginning of the project is because your discoveries during this process will guide what models you build with the data. However, you can  return to EDA at other points during your project cycle, especially when your model isn't giving you results you expect. If you're getting strange results it can be a good idea to take another look at your data—see if there are any messy outliers, for example—because a machine learning model is only as good as the data that goes into it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Defining Structured Data
&lt;/h3&gt;

&lt;p&gt;The types of EDAs I will be discussing in this and future blog posts work on structured data—that means data consisting of rows and columns. Think of a table, an excel spreadsheet, or a dataframe from the Python pandas library (I will have more to say about pandas in later articles). Since it's arranged in a table, this type of data is usually called &lt;strong&gt;tabular&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unstructured data&lt;/strong&gt;, on the other hand, is data like the pixels making up an image, or the sounds making up a stream of speech. These things can't be used with the EDA techniques described in this article. We can't get a meaningful average from raw pixel values, for example. &lt;/p&gt;

&lt;h3&gt;
  
  
  Types of structured data
&lt;/h3&gt;

&lt;p&gt;There are 2 kinds of structured data: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;numerical&lt;/strong&gt; data consists of numbers

&lt;ul&gt;
&lt;li&gt;for example: data points that are &lt;em&gt;prices&lt;/em&gt; of dog food&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;categorical&lt;/strong&gt; data has a certain fixed amount of categories 

&lt;ul&gt;
&lt;li&gt;for example: data points that are &lt;em&gt;brands&lt;/em&gt; of dog food&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are 2 kinds of numerical data: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;continuous&lt;/strong&gt; numerical data which you can visualize as data on a number line

&lt;ul&gt;
&lt;li&gt;for example: prices which can be $1, $5, $10, and any of the numbers in between &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;discrete&lt;/strong&gt; numerical data which can only be whole, concrete numbers 

&lt;ul&gt;
&lt;li&gt;for example: the exact number of dogs that eat a certain type of dog food. We can't have half a dog! &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;and there are 2 kinds of categorical data: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;binary/boolean&lt;/strong&gt; data is a special kind of categorical data with just 2 possible options: 

&lt;ul&gt;
&lt;li&gt;for example: whether the dog food is made for a particular dietary restriction

&lt;ul&gt;
&lt;li&gt;0 or 1&lt;/li&gt;
&lt;li&gt;true or false&lt;/li&gt;
&lt;li&gt;yes or no&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ordinal&lt;/strong&gt; data is categorical data that has a specific ordering: 

&lt;ul&gt;
&lt;li&gt;for example: customer satisfaction ratings of dog food on a scale of 1-5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
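
&lt;p&gt;To make these four flavors concrete, here's how they might look as columns of a pandas dataframe. The column names and values below are invented for illustration:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "price": [5.0, 20.0, 29.0],           # continuous numerical
    "num_dogs_fed": [3, 7, 2],            # discrete numerical
    "grain_free": [True, False, True],    # binary/boolean
    # ordinal: categories with a meaningful order
    "rating": pd.Categorical([1, 5, 3], categories=[1, 2, 3, 4, 5], ordered=True),
})
print(df.dtypes)
```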

&lt;h3&gt;
  
  
  Why is it important to categorize your data?
&lt;/h3&gt;

&lt;p&gt;After all, at some point, all the data is going to be turned into numbers when you feed it into a computer program, so why go to all this trouble of categorizing it beforehand? &lt;/p&gt;

&lt;p&gt;The type of data will help you choose what kind of analysis to use on it. It will also help the computer program you feed the data into determine how to process it. For example, what kind of visual display you want for your data (scatter plot or box plot?), or what kind of predictive modeling to use (linear regression or decision tree?) &lt;/p&gt;

&lt;p&gt;Machine learning and data science libraries in Python, like scikit-learn and pandas, have special functions that operate on certain types of data but not others, so you will need to know what type of data you are dealing with to use these technologies successfully. For example, some functions need to know whether the data is plain categorical or ordinal. &lt;/p&gt;

&lt;h3&gt;
  
  
  Tabular/Rectangular data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tabular&lt;/strong&gt; data, also known as &lt;strong&gt;rectangular&lt;/strong&gt; data, refers to a data object. A data object can take different forms depending on the technology you are working with. It could be in the form of an excel spreadsheet, a database table, or a pandas &lt;em&gt;dataframe&lt;/em&gt;. All these technologies have this in common: they describe a two dimensional matrix that consists of cases (the rows) and features (the columns). &lt;/p&gt;

&lt;p&gt;A quick note: if you are starting with &lt;em&gt;unstructured&lt;/em&gt; data, it won't be in this rectangular form, and may need to be processed before EDA is possible. (Alternatively, you may choose different approaches for your EDA, other than discussed here.) &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Non-rectangular&lt;/em&gt; data structures include time series, spatial data structures, and graph/network data structures, but I won't be discussing these types of data in this series. &lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;The above article was adapted from my notes on part of Chapter 1 of &lt;em&gt;Practical Statistics for Data Science&lt;/em&gt; by Peter Bruce, et al. So far I like this book because the writing style is straightforward and concise, yet doesn't skimp on technical terminology. I have not finished the book, but based on what I've seen so far and its reviews, it's a good one to read if you want further information on this topic. &lt;/p&gt;

&lt;p&gt;Here's a &lt;a href="https://www.google.com/books/edition/Practical_Statistics_for_Data_Scientists/F2bcDwAAQBAJ?hl=en&amp;amp;gbpv=1"&gt;link to the Google books preview&lt;/a&gt;, if you're interested. &lt;/p&gt;

&lt;p&gt;Come back in two weeks to hear more about EDA! I will be discussing different options for summary statistics. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Softmax Activation Function: walk through the math with me</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Mon, 19 Apr 2021 00:47:37 +0000</pubDate>
      <link>https://forem.com/madeline_pc/softmax-activation-function-walk-through-the-math-with-me-46jm</link>
      <guid>https://forem.com/madeline_pc/softmax-activation-function-walk-through-the-math-with-me-46jm</guid>
      <description>&lt;p&gt;I chose to walk through the Softmax activation function for this week, because it was one of the most confusing activation functions that I worked to understand. Thus I have a soft spot (&lt;em&gt;corny laughter&lt;/em&gt;) in my heart for it. &lt;/p&gt;

&lt;h2&gt;
  
  
  The motivation for Softmax
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Softmax&lt;/strong&gt; is used when you have a classification problem with more than one category/class. Suppose you want to know if a certain image is an image of a cat, dog, alligator, or onion. Softmax turns your model's outputs into values that represent probabilities. It is similar to sigmoid (used for binary classification), but works for any number of outputs. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Softmax is not so good if you want your algorithm to tell you that it doesn't recognize any of the classes in the picture, as it will still assign a probability to each class. In other words, Softmax won't tell you "I don't see any cats, dogs, alligators, or onions."&lt;/p&gt;

&lt;h3&gt;
  
  
  The mathematical formula for Softmax:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1618792943228%2FSghWmb27r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1618792943228%2FSghWmb27r.png" alt="softmax_mine.PNG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Steps: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I am using &lt;em&gt;S&lt;/em&gt; here to stand for the function Softmax, which takes &lt;em&gt;y&lt;/em&gt; as its input (&lt;em&gt;y&lt;/em&gt; is what we are imagining comes from the output of our machine learning model)&lt;/li&gt;
&lt;li&gt;the subscript &lt;em&gt;i&lt;/em&gt; lets us know that we do these steps for all of the &lt;em&gt;y&lt;/em&gt; inputs that we have&lt;/li&gt;
&lt;li&gt;the numerator in the formula tells us that we need to raise &lt;em&gt;e&lt;/em&gt; to the power of each number &lt;em&gt;y&lt;/em&gt; that goes through the Softmax function—this is called an exponential&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;exponential&lt;/strong&gt; is the mathematical constant &lt;em&gt;e&lt;/em&gt; (approximately 2.718) raised to a power.&lt;/li&gt;
&lt;li&gt;the denominator in the formula tells us to add together the &lt;em&gt;exponentials&lt;/em&gt; of all the numbers from your model's output

&lt;ul&gt;
&lt;li&gt; ∑ (the Greek capital letter Sigma) stands for "sum", which means "add all the following stuff together"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;What did we just find?&lt;/strong&gt; The probability is the numerator (which will be different depending on the number &lt;em&gt;y&lt;/em&gt;) divided by the denominator (which will be the same for all the outputs)&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why use exponentials?
&lt;/h3&gt;

&lt;p&gt;Because we take the exponential of each number, the results will always be positive, and they will get very large very quickly. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this nifty?&lt;/strong&gt; If one number &lt;em&gt;y&lt;/em&gt; among the inputs to Softmax is higher than the others, the exponential will make that number &lt;em&gt;even higher.&lt;/em&gt; Thus, Softmax will select the class that is the most likely to be in the image out of all the classes that it can choose from—and that output will be closer to 1. The closer the number is to 1, the more &lt;em&gt;likely&lt;/em&gt; the image is of that class. &lt;/p&gt;

&lt;h3&gt;
  
  
  What did we just do to the outputs?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;the outputs will always be in the range between 0 and 1&lt;/li&gt;
&lt;li&gt;the outputs will always add up to 1&lt;/li&gt;
&lt;li&gt;each value of the outputs represents a probability&lt;/li&gt;
&lt;li&gt;taken together the outputs form a probability distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A brief example of Softmax at work
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;We have our &lt;strong&gt;inputs&lt;/strong&gt; to the Softmax function: -1, 1, 2, 3&lt;/li&gt;
&lt;li&gt;We use each of these inputs as an exponent on &lt;em&gt;e&lt;/em&gt;: &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;e&lt;sup&gt;-1&lt;/sup&gt;, e&lt;sup&gt;1&lt;/sup&gt;, e&lt;sup&gt;2&lt;/sup&gt;, e&lt;sup&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which gives us approximately these results: 0.37, 2.72, 7.39, 20.09&lt;/li&gt;
&lt;li&gt;we add together the exponentials to form our denominator (we will use the same denominator for each input):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;e&lt;sup&gt;-1&lt;/sup&gt; + e&lt;sup&gt;1&lt;/sup&gt; + e&lt;sup&gt;2&lt;/sup&gt; + e&lt;sup&gt;3&lt;/sup&gt; ≈ 30.56&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;then we divide each numerator by the denominator that we just found:

&lt;ul&gt;
&lt;li&gt;0.37 / 30.56 = 0.01&lt;/li&gt;
&lt;li&gt;2.72 / 30.56 = 0.09&lt;/li&gt;
&lt;li&gt;7.39 / 30.56 = 0.24&lt;/li&gt;
&lt;li&gt;20.09 / 30.56 = 0.66&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;So Softmax turns these inputs into these outputs:

&lt;ul&gt;
&lt;li&gt;-1 → 0.01&lt;/li&gt;
&lt;li&gt;1 → 0.09&lt;/li&gt;
&lt;li&gt;2 → 0.24&lt;/li&gt;
&lt;li&gt;3 → 0.66&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;and if we add up all the outputs, they equal 1 (try it, it actually works!)&lt;/li&gt;

&lt;/ul&gt;
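&lt;p&gt;The worked example above can be reproduced in a few lines of Python (a minimal sketch using only the standard library; the function name &lt;code&gt;softmax&lt;/code&gt; is just my own choice):&lt;/p&gt;

```python
import math

def softmax(ys):
    # numerator: raise e to the power of each raw output
    exps = [math.exp(y) for y in ys]
    # denominator: the sum of all the exponentials (same for every output)
    total = sum(exps)
    # each probability is its exponential divided by that shared sum
    return [e / total for e in exps]

probs = softmax([-1, 1, 2, 3])
print([round(p, 2) for p in probs])  # → [0.01, 0.09, 0.24, 0.66]
print(round(sum(probs), 10))         # → 1.0
```

&lt;p&gt;Note that the probabilities come out in the same order as the inputs, so the largest input (3) gets the largest probability (0.66).&lt;/p&gt;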

&lt;p&gt;Thanks for reading this week's blog post. I hope you enjoyed it and have a clearer understanding of how the mathematics behind Softmax works.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How Neural Networks work: activation functions</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 11 Apr 2021 19:40:40 +0000</pubDate>
      <link>https://forem.com/madeline_pc/how-neural-networks-work-activation-functions-2nd7</link>
      <guid>https://forem.com/madeline_pc/how-neural-networks-work-activation-functions-2nd7</guid>
      <description>&lt;p&gt;In my post &lt;a href="https://madelinecaples.hashnode.dev/deep-learning-when-and-why-is-it-useful"&gt;Deep learning: when and why is it useful?&lt;/a&gt;, I discussed why a neural network uses non-linear functions. In this article we will see an example of &lt;em&gt;how&lt;/em&gt; to use those non-linear functions in a neural network. We'll be considering classification problems for the sake of simplicity. We'll also look at a few different kinds of non-linear functions, and see the different effects they have on the network. &lt;/p&gt;

&lt;h2&gt;
  
  
  What actually &lt;em&gt;is&lt;/em&gt; an activation function?
&lt;/h2&gt;

&lt;p&gt;A neural network can have any number of layers. Each layer has a linear function followed by a non-linear function, called an &lt;strong&gt;activation function.&lt;/strong&gt; The activation function takes the output from the linear function and transforms it somehow. That activation becomes the input features for the next layer in the neural network. &lt;/p&gt;

&lt;p&gt;Here's a 3 layer neural network to help us visualize the process: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j98M30cd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168960678/XMnZZXGfs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j98M30cd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168960678/XMnZZXGfs.jpeg" alt="20210411_151200.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have features (x1, x2, x3, x4, x5) that go into the first hidden layer&lt;/li&gt;
&lt;li&gt;this layer calculates the linear function (y = wx + b) on each input and then puts the result from that calculation through the activation function ReLU (max(0, x))&lt;/li&gt;
&lt;li&gt;the activations output from layer 1 become the inputs for layer 2, and the same calculations happen there&lt;/li&gt;
&lt;li&gt;the activations output from layer 2 become the input for layer 3 and then layer 3 does the same linear function-activation function combo&lt;/li&gt;
&lt;li&gt;the last set of activations from layer 3 go through a final activation layer, that will be different depending on what your model is trying to predict: if you have only two classes (a binary classifier), then you can use a &lt;strong&gt;sigmoid&lt;/strong&gt; function, but if you have more than two classes you will want to use &lt;strong&gt;softmax&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;those final activations are your predicted labels&lt;/li&gt;
&lt;/ul&gt;
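&lt;p&gt;The steps above can be sketched as a tiny forward pass in NumPy. This is a hypothetical network with made-up layer sizes and random weights, just to show the linear-then-activation pattern; nothing here is trained:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# hypothetical layer sizes: 5 input features -> 4 hidden -> 3 hidden -> 1 output
sizes = [5, 4, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    # hidden layers: linear function (wx + b) followed by ReLU
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(w @ x + b)
    # final layer: linear function followed by sigmoid, for a binary label
    return sigmoid(weights[-1] @ x + biases[-1])

prediction = forward(np.array([0.5, -1.2, 0.3, 2.0, 0.7]))
print(prediction)  # a single value between 0 and 1
```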

&lt;h3&gt;
  
  
  Why would we want to do that?
&lt;/h3&gt;

&lt;p&gt;Linear models are great, but sometimes non-linear relationships exist and we want to know what they are. Consider the example below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cQMZTyNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168740763/hJu2pIeI3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cQMZTyNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168740763/hJu2pIeI3.jpeg" alt="20210409_145007.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We want to predict the boundary line between the blue and pink dots (fascinating, I know!).  &lt;/p&gt;

&lt;p&gt;This is what a linear function, such as logistic regression, can uncover: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UHzuLG1D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168750925/-dNmQHb5b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UHzuLG1D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168750925/-dNmQHb5b.jpeg" alt="20210409_145210.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what a non linear neural network can uncover, which gives us a better visualization of the boundaries between blue and pink dots: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd8Q-fzn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168797808/HzCF2vJaZ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd8Q-fzn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168797808/HzCF2vJaZ.jpeg" alt="20210409_145518.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, the more hidden layers we add to our network, the more complex relationships we can potentially find in the data—each layer is learning about some feature in the data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Okay, but how?
&lt;/h3&gt;

&lt;p&gt;I find it helpful to think of activation functions in two categories (I don't know if this is an "official" distinction, it's just the way I think about them)—activations on hidden units and activations for the final output. The activations for the hidden units exist to make training easier for the neural network, and allow it to uncover non-linear relationships in the data. The activations for the final output layer are there to give us an answer to whatever question we are asking the neural network.&lt;/p&gt;

&lt;p&gt;For example, let's imagine we're training a binary classifier that distinguishes between pictures of cats and mice. We might use the &lt;strong&gt;ReLU activation function&lt;/strong&gt; on our &lt;em&gt;hidden&lt;/em&gt; units, but for our &lt;em&gt;final&lt;/em&gt; output layer we need to know the answer to our question: is this picture of a cat or a mouse? So we will want an activation function that outputs 0 or 1. &lt;/p&gt;

&lt;p&gt;Let's take a look at what each different activation function is actually doing. &lt;/p&gt;

&lt;h3&gt;
  
  
  A few popular activation functions:
&lt;/h3&gt;

&lt;p&gt;x here stands for the output from the linear function that is being fed into the activation function  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mAHoxVWj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618169062564/Q3JxGG6to.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mAHoxVWj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618169062564/Q3JxGG6to.jpeg" alt="20210411_151748.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sigmoid&lt;/strong&gt; converts outputs to be between 0 and 1 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HCjM_pqN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168116473/2QMkQxm1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HCjM_pqN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168116473/2QMkQxm1k.png" alt="sigmoid.PNG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tanh:&lt;/strong&gt;  converts numbers to range from -1 to 1—you can picture it as a shifted version of sigmoid. It has the effect of centering the data so that the mean is closer to 0, which improves learning for the following layer.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ByybrVuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168142479/4U7DZhFDf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ByybrVuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168142479/4U7DZhFDf.png" alt="tanh.PNG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ReLU (rectified linear unit):  max(0, x)&lt;/strong&gt; if the number is negative the function gives back 0, and if the number is positive it just gives back the number with no changes—ReLU tends to run faster than tanh in computations, so it is generally used as the default activation function for hidden units in deep learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leaky ReLU:  max(0.01x, x)&lt;/strong&gt;—if the number x is negative, it gets multiplied by 0.01, but if the number x is positive, it stays the same &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
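&lt;p&gt;Each of the activation functions listed above is only a line or two of NumPy (the leaky slope of 0.01 is the commonly used default, but it is adjustable):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes outputs into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes outputs into (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)          # negative -> 0, positive unchanged

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)  # negatives keep a small slope instead of 0

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # → [0. 0. 2.]
print(leaky_relu(x))  # → [-0.02  0.    2.  ]
```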

&lt;h3&gt;
  
  
  Final layer activation functions:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sigmoid&lt;/strong&gt; is used to turn the activations into something interpretable for predicting the class in &lt;strong&gt;binary classification,&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;since we want to get an output of either 0 or 1, a further step is added: &lt;/li&gt;
&lt;li&gt;decide which class your predictions belong to according to a certain threshold (often if the number is less than 0.5 the output is 0, and if the number is 0.5 or higher the output is 1)&lt;/li&gt;
&lt;li&gt;(Yes, sigmoid is on both lists—in deep learning it is most useful for producing final outputs, but it's also helpful to know about for understanding Tanh and ReLU).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Softmax&lt;/strong&gt; is used when you need more than one final output, such as in a classifier with more than two categories/classes. Suppose you want to know if a certain image depicts a cat, dog, alligator, or onion. The &lt;strong&gt;motivation&lt;/strong&gt;: it turns the model's raw outputs into probabilities&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
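&lt;p&gt;The thresholding step for sigmoid described above can be written as a one-liner (0.5 is the common default threshold, but it is a choice you can tune):&lt;/p&gt;

```python
def predict_class(probability, threshold=0.5):
    # map a sigmoid output (a number between 0 and 1) to a hard 0/1 label
    return 1 if probability >= threshold else 0

print(predict_class(0.73))  # → 1
print(predict_class(0.31))  # → 0
```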

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dqSN7sHC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168170744/vaGO3TGvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dqSN7sHC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1618168170744/vaGO3TGvh.png" alt="softmax.PNG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If that formula looks gross to you, come back next week—I plan to break it down step by step until it seems painfully simple&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Random Forests: improving decision trees (ML Log 12) </title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 28 Mar 2021 15:50:29 +0000</pubDate>
      <link>https://forem.com/madeline_pc/random-forests-improving-decision-trees-ml-log-12-30bj</link>
      <guid>https://forem.com/madeline_pc/random-forests-improving-decision-trees-ml-log-12-30bj</guid>
      <description>&lt;h2&gt;
  
  
  Why improve on Decision Trees?
&lt;/h2&gt;

&lt;p&gt;At the end of my &lt;a href="https://madelinecaples.hashnode.dev/decision-trees-an-overview-for-classification-problems"&gt;article&lt;/a&gt; on Decision Trees we looked at some drawbacks to decision trees. One of them was that they have a tendency to overfit on the training data. Overfitting means the tree learns what features classify the training data very well, but isn't so good at making generalizations that accurately predict the testing set. &lt;/p&gt;

&lt;p&gt;I mentioned that one way we can try to solve the problem of overfitting is by using a Random Forest. Random Forests are an &lt;strong&gt;ensemble&lt;/strong&gt; learning method, so called because they involve combining more than one model into an ensemble. Essentially, we are going to train a bunch of decision trees, and take the majority vote on class predictions for the test data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Quick reminder:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Training data&lt;/strong&gt; is the subset of your data that you use to train your model. The tree in the random forest will use this data to learn which features are likely to explain which class a given data sample belongs to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test data&lt;/strong&gt; is the subset of your data that you use to make sure your model is predicting properly. It's supposed to simulate the kind of data your model would make predictions on in the real world (for example, if you are making a dog breed classifying app, the test data should mimic the kinds of images you might get from app users, uploading pictures of dogs).&lt;/p&gt;

&lt;h2&gt;
  
  
  Wisdom in a crowd
&lt;/h2&gt;

&lt;p&gt;Have you ever been in a classroom when the teacher, for some reason, asks the class as a whole a question? (Not necessarily a great teaching technique, but let's move on.) You think you know the answer, but you are afraid of being wrong, so you wait a little while, until bolder classmates have given their answers, before agreeing with what the majority is saying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dGscg3JQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946231159/eYmTKpSaq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dGscg3JQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946231159/eYmTKpSaq.jpeg" alt="decisiontree.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can be a good strategy, if not for learning the material, at least for being right. You wait to see what the consensus is in the classroom before casting your vote. &lt;/p&gt;

&lt;p&gt;You're not sure of your answer, but if enough other people agree with you, it seems more likely to you that your answer is the right one. &lt;/p&gt;

&lt;p&gt;While this technique probably isn't the best predictor of whether a group of students is learning the material, it can be used to good effect in machine learning. &lt;/p&gt;

&lt;h2&gt;
  
  
  How do Random Forests work?
&lt;/h2&gt;

&lt;p&gt;Building a random forest classifier can be broken into two steps. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Training, by building a forest of many different trees, each of which learns from a &lt;strong&gt;bagged&lt;/strong&gt; random sample of the data&lt;/li&gt;
&lt;li&gt;Making predictions, by taking the predictions of each tree in the forest and taking the majority vote on which class each sample in the test set belongs to&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To train a Random Forest, we train each decision tree on &lt;em&gt;random&lt;/em&gt; groups of the training data, using  a technique called &lt;strong&gt;bagging&lt;/strong&gt; or &lt;strong&gt;bootstrap aggregating.&lt;/strong&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  What actually &lt;em&gt;is&lt;/em&gt; bagging?
&lt;/h3&gt;

&lt;p&gt;Bagging is &lt;em&gt;not&lt;/em&gt; dividing the training data into subsets and building a tree from each subset. &lt;/p&gt;

&lt;p&gt;Instead, each individual tree randomly grabs samples from the training set. The training set has &lt;em&gt;n&lt;/em&gt; samples. The tree chooses &lt;em&gt;n&lt;/em&gt; samples &lt;em&gt;randomly&lt;/em&gt; from a bag of all the training samples, but after considering each sample it puts it back into the bag before picking out another one. This is called sampling with replacement. &lt;/p&gt;
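&lt;p&gt;Sampling with replacement is easy to see in code (a toy sketch: the letters just stand in for training samples, like the shapes in the picture below):&lt;/p&gt;

```python
import random

random.seed(42)  # fixed seed so the draw is repeatable

training_set = ["A", "B", "C", "D", "E"]  # n = 5 samples

def bootstrap_sample(data):
    # draw n samples *with replacement*: each draw goes "back in the bag",
    # so some samples may repeat and others may be left out entirely
    return [random.choice(data) for _ in data]

print(bootstrap_sample(training_set))  # one tree's bagged training data
```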

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--btrN2jFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946268189/Yu18fA76A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--btrN2jFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946268189/Yu18fA76A.jpeg" alt="bagging.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this very simplified image the different shapes and colors in the bag just represent different samples in the training data (there is no significance intended to the shapes and colors). Each tree grabs a sample from the bag and then puts it back before grabbing another sample, so each tree ends up with a different set of data that it uses to build its tree. &lt;/p&gt;

&lt;p&gt;Note that this does mean that any given tree in the random forest might end up with the same sample more than once (as you can see in my little picture). But because there are multiple trees in the forest, and each one chooses samples randomly, there will be enough variation in the trees that it won't really matter too much if samples are repeated for a given tree. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction time
&lt;/h3&gt;

&lt;p&gt;Once the trees are made they can make their predictions. We feed the test set to the trees in our random forest classifier, each tree makes its predictions on the test set, and then we compare the predictions and take the ones that the majority of trees agrees on. &lt;/p&gt;
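&lt;p&gt;The majority-vote step can be sketched with the standard library's &lt;code&gt;Counter&lt;/code&gt; (the class names here are made up for illustration):&lt;/p&gt;

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    # predictions_per_tree: one list of class predictions per tree,
    # all over the same test samples; zip regroups the votes per sample
    votes_per_sample = zip(*predictions_per_tree)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

tree_preds = [
    ["cat", "dog", "cat"],  # tree 1's predictions for 3 test samples
    ["cat", "cat", "cat"],  # tree 2
    ["dog", "dog", "cat"],  # tree 3
]
print(majority_vote(tree_preds))  # → ['cat', 'dog', 'cat']
```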

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ETDwkHFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946283216/jupKHV_od.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ETDwkHFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1616946283216/jupKHV_od.jpeg" alt="randomforest.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The benefits of Random Forests
&lt;/h3&gt;

&lt;p&gt;Like decision trees, random forests are straightforward and explainable. Since a random forest is made up of decision trees, we still have access to the Feature Importance (using Scikit-Learn, for example) for understanding the model. Our model can still tell us how important a given feature is in predicting the class of any sample. &lt;/p&gt;

&lt;p&gt;Since we are getting the answer from more than one tree, we are able to get an answer that the majority agrees upon. This helps reduce the overfitting we see in a Decision Tree. Each tree in the random forest is searching for the best feature in a random subset of the data, rather than the best feature in &lt;em&gt;all&lt;/em&gt; of the training set. This helps the model achieve more stable predictions. The Random Forest Classifier &lt;em&gt;as an ensemble&lt;/em&gt; can't memorize the training data, because each tree in the forest doesn't have access to all the training data when it makes its tree. &lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: Hyperparameters in Scikit-Learn
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hyperparameters&lt;/strong&gt; is a term used in machine learning to refer to the details of the model that you can tweak to improve its predictive power. These are different from &lt;strong&gt;parameters&lt;/strong&gt;, which are the actual things that your model uses to compute functions (like the weights &lt;strong&gt;w&lt;/strong&gt; and the bias &lt;strong&gt;b&lt;/strong&gt; in a linear function). &lt;/p&gt;

&lt;p&gt;Some hyperparameters you can tweak in Random Forests are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of trees—in sklearn (Scikit-Learn's machine learning library) this hyperparameter is called &lt;code&gt;n_estimators&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;more trees generally improves the model's predictive power, but also slows down training the model, because there are more trees to build&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;n_jobs&lt;/code&gt; hyperparameter in sklearn tells your computer how many processors to use at once, so if you want to have the model run faster, you can set this hyperparameter to &lt;code&gt;-1&lt;/code&gt;, which tells the computer to use as many processors as it has&lt;/li&gt;
&lt;/ul&gt;
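&lt;p&gt;Here is a minimal sklearn sketch of the two hyperparameters above, using a small synthetic dataset (the data is made up just so the snippet runs end to end):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# a synthetic binary classification problem with 8 features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators: the number of trees; n_jobs=-1: use all available processors
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))      # accuracy on the test set
print(model.feature_importances_)       # one importance score per feature
```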

&lt;p&gt;Other hyperparameters that you can change are the same ones found in Decision Trees—for example, &lt;code&gt;max_features&lt;/code&gt; and &lt;code&gt;min_samples_leaf&lt;/code&gt;—which I discussed in my &lt;a href="https://madelinecaples.hashnode.dev/decision-trees-an-overview-for-classification-problems"&gt;post&lt;/a&gt; and demonstrated in &lt;a href="https://www.kaggle.com/madelinecaples/if-mushrooms-grew-on-trees"&gt;this Kaggle notebook&lt;/a&gt; on Decision Trees.&lt;/p&gt;

&lt;h3&gt;
  
  
  In conclusion
&lt;/h3&gt;

&lt;p&gt;Random Forests are a handy boost to your baseline Decision Tree model, either in classification or regression problems. You can usually reduce overfitting, while not giving up too much of the model explainability that you have access to with a Decision Tree algorithm.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Deep Learning: when and why is it useful? (ML Log 11) </title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 14 Mar 2021 15:06:53 +0000</pubDate>
      <link>https://forem.com/madeline_pc/deep-learning-when-and-why-is-it-useful-ml-log-11-45nc</link>
      <guid>https://forem.com/madeline_pc/deep-learning-when-and-why-is-it-useful-ml-log-11-45nc</guid>
      <description>&lt;p&gt;Deep learning, neural networks, ANNs, CNNs, RNNs—What does it all mean? And why would we want to use deep learning when a decision tree or linear regression works just fine? Well, the short answer is, we wouldn't. If your simple machine learning approach is solving your problem satisfactorily, then there isn't much reason to employ a neural network, since training them tends to be expensive in terms of time and computing power. &lt;/p&gt;

&lt;p&gt;Problems that work well for traditional machine learning methods are ones that involve structured data—data where the relationship between features and labels is already understood. For example, a table of data that matches some traits of a person (such as age, number of children, smoker or non-smoker, known health conditions, etc.) with the price of that person's health insurance. &lt;/p&gt;

&lt;p&gt;With some problems, the relationship between features and targets is less clear, and a neural network will be your best bet to make predictions on the data. These will be problems that involve unstructured data—things like images, audio, and natural language text. For example, what arrangement of pixels in an image (the features) makes it more likely that it is a picture of a cat (the label), rather than any other thing?&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually &lt;em&gt;is&lt;/em&gt; a neural network?
&lt;/h2&gt;

&lt;p&gt;It's called an artificial neural network, not because it artificially replicates what our brains do, but rather because it is inspired by the biological process of neurons in the brain. A neuron receives inputs with the dendrites (represented by the X, or feature vector in machine learning) and sends a signal to the axon terminal, which is the output (represented by the y, or label vector). &lt;/p&gt;

&lt;p&gt;Here's a nice picture I borrowed from &lt;a href="https://en.wikipedia.org/wiki/Artificial_neuron"&gt;Wikipedia&lt;/a&gt; that illustrates a neuron and its relationship to the inputs and outputs we see in machine learning problems: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2m5A6DWs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1615729364892/Qwg8sA2Ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2m5A6DWs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1615729364892/Qwg8sA2Ja.png" alt="Neuron3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the idea is to do something similar using computers and math. The following little picture represents a single layer neural network, with an input layer that contains the features, a hidden layer that puts the inputs through some functions, and the output layer which spits out the answer to whatever problem we are trying to solve. What actually goes on behind the scenes is just numbers. I don't know if that needs to be said or not, but there aren't actually little circles attached to other circles by lines—this is just a visual way to represent the mathematical interactions between the inputs, hidden layer, and the output. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KiHlgE5E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1615730084730/z99oyX94w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KiHlgE5E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1615730084730/z99oyX94w.png" alt="neuralnetworkkkkkk.PNG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://texample.net//tikz/examples/neural-network/"&gt;Image credits&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inputs could be pixels in an image, then the hidden layer(s) use some functions to try to find out what arrangement of pixels are the ones that represent a cat (our target), and the output layer tells us whether a given arrangement of pixels probably represents a cat or not. &lt;/p&gt;

&lt;h3&gt;
  
  
  What do all these terms mean?
&lt;/h3&gt;

&lt;p&gt;From what I've seen, &lt;em&gt;neural network&lt;/em&gt; and &lt;em&gt;deep learning&lt;/em&gt; are used mostly interchangeably, although a neural network is a kind of architecture that is used in the problem space of deep learning. &lt;/p&gt;

&lt;p&gt;There are variations on neural networks, each with special abilities that come from that class of model's specific architecture: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CNNs: convolutional neural networks—these are used for images &lt;/li&gt;
&lt;li&gt;RNNs: recurrent neural networks—used for something like text, where the order of input (words or characters) matters &lt;/li&gt;
&lt;li&gt;GANs: generative adversarial networks—networks that make something new out of the input, like turning an image into another image &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problems that are solved in the deep learning space are so called, because the networks used to solve them have multiple hidden layers between the input and output—they are &lt;em&gt;deep&lt;/em&gt;. Each layer learns something about the data, then feeds that into the layer that comes next. &lt;/p&gt;

&lt;h3&gt;
  
  
  What do the layers learn?
&lt;/h3&gt;

&lt;p&gt;In a shallow network, like linear regression, for example, the only layer is linear, and contains a linear function. The model can learn to predict a linear relationship between input and output. &lt;/p&gt;

&lt;p&gt;In deep learning there is a linear and a non-linear function, called an activation function, at work in each layer, which allows the network to uncover non-linear relationships in the data. Instead of just a straight line relationship between features and label, the network can learn more complicated insights about the data, by using the functions in multiple layers to learn about the data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: In networks that are trained on image data, the earlier layers learn general attributes of image data, like vertical and horizontal edges, and the later layers learn attributes more specific to the problem the network as a whole is trying to solve. In the case of the cat or not-cat network, those features would be the ones specific to cats—maybe things like pointy ears, fur, whiskers, etc. &lt;/p&gt;

&lt;p&gt;The exact architecture of the neural network will vary, depending on the input features, what problem is being solved, and how many layers we decide to put between the input and output layers, but the principle is the same: at each layer a linear function and an activation function produce an output that is fed into the layer that comes next, all the way until the final output layer, which answers the question we are asking about the data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why can't we just have multiple linear layers?
&lt;/h3&gt;

&lt;p&gt;What's the point of the activation function, anyway? To answer that question, let's look at what would happen if we didn't have an activation function, and instead had a string of linear equations, one at each layer. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have our linear equation: y = wx + b&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Let's assign some values and solve for y:

&lt;ul&gt;
&lt;li&gt;w=5, b=10, x=100: y = (5*100) + 10 → y = 510&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;510 is our output for the first layer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;We pass that to the next layer: 510 is now the input for this layer, so it's the new x value&lt;/li&gt;
&lt;li&gt;Let's set our parameters to different values: 

&lt;ul&gt;
&lt;li&gt;w = 4, b = 6, and now we have the equation: &lt;/li&gt;
&lt;li&gt;y = (4*510) + 6 → our new output is y = 2046&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But here's the thing&lt;/strong&gt;: instead of doing this two-layer process, we could have just set w and b equal to whatever values would let us get 2046 in the first place. For example: w=20, b=46, which would also give us y = 20 * 100 + 46 = 2046 in a single layer.&lt;/li&gt;
&lt;li&gt;Most importantly, we won't achieve a model that recognizes non-linear relationships in data while only using linear equations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't matter how many layers of linear equations you have—they can always be combined into a single linear equation by setting the parameters w and b to different values. Our model will always be linear unless we introduce a non-linear function into the mix. That is why we need to use activation functions. We can string together multiple linear functions, as long as we separate each one by an activation function, and that way our model can do more complex computations, and discover more complicated relationships in the data.&lt;/p&gt;
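&lt;p&gt;We can check this collapse directly in code, reusing the numbers from the example above: &lt;/p&gt;

```python
# Reusing the numbers above: two stacked linear layers collapse into one
def linear(x, w, b):
    return w * x + b

x = 100

# Layer 1 uses w=5, b=10; Layer 2 uses w=4, b=6
layer1 = linear(x, w=5, b=10)      # 510
layer2 = linear(layer1, w=4, b=6)  # 2046

# One equivalent linear layer: w = 4*5 = 20, b = 4*10 + 6 = 46
combined = linear(x, w=20, b=46)   # 2046
```

&lt;p&gt;No matter how many linear layers we stack, the same trick works, which is exactly why an activation function has to sit between them. &lt;/p&gt;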

&lt;h2&gt;
  
  
  In brief
&lt;/h2&gt;

&lt;p&gt;If you have a lot of data and a problem you want to solve with it, but you aren't sure how to represent the structure of that data, deep learning might be for you. Images, audio, and anything involving human language are likely candidates for deep learning, and each of those problems has its own flavor of neural network architecture that can be used to solve it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Entropy: what is it and why should you care? (ML log 10)</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 07 Mar 2021 21:58:53 +0000</pubDate>
      <link>https://forem.com/madeline_pc/entropy-what-is-it-and-why-should-you-care-ml-log-10-241p</link>
      <guid>https://forem.com/madeline_pc/entropy-what-is-it-and-why-should-you-care-ml-log-10-241p</guid>
      <description>&lt;h3&gt;
  
  
  Decision trees
&lt;/h3&gt;

&lt;p&gt;Last week we explored  &lt;a href="https://madelinecaples.hashnode.dev/if-mushrooms-grew-on-trees?guid=3ab5480b-fa89-4fbd-b72a-bf9c937a0414&amp;amp;deviceId=563e5fb5-11dc-4499-aed1-4e7d66502744"&gt;Decision Trees&lt;/a&gt;, and I said that the algorithm chooses the split of data that is most advantageous as it makes each branch in the tree. But how does the algorithm know which split is the most advantageous? In other words, what criterion does the algorithm use to determine which split will help it answer a question about the data, as opposed to other possible splits? &lt;/p&gt;

&lt;p&gt;Entropy is this criterion. &lt;/p&gt;

&lt;h2&gt;
  
  
  What actually &lt;em&gt;is&lt;/em&gt; entropy?
&lt;/h2&gt;

&lt;p&gt;Entropy measures randomness. But that might sound a little abstract if you aren't used to thinking about randomness, so instead let's shift our attention to something concrete: socks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1st Situation&lt;/strong&gt;: Imagine you have a laundry basket filled with 50 red socks and 50 blue socks. What is the likelihood that you will pull a red sock out of the basket? There is a 50/50 chance that it will be a red sock, so the chance is 0.5. When there is an equal chance that you will get a red sock or a blue sock, we say that the information is very random - there is high entropy in this situation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2nd Situation&lt;/strong&gt;: Now you divide those socks into two different boxes. You put the 50 red socks into a box, labeled Box 1, and the 50 blue socks into another box, labeled Box 2. The likelihood that you will pull a red sock out of Box 1 is 1.0, and the likelihood that you will pull a red sock out of Box 2 is 0. Similarly, the likelihood that you'll pull a blue sock out of Box 2 is 1.0, and the likelihood that you'll pull a blue sock out of Box 1 is 0. There is no randomness here, because you know whether you will get a red sock or a blue sock. In this situation, there is no entropy. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3rd Situation&lt;/strong&gt;: But let's suppose that we split up the basket of socks another way - 50 socks in Box 1 and 50 socks in Box 2 - but we keep the &lt;em&gt;distribution&lt;/em&gt; of the socks the same: 25 red socks and 25 blue socks in Box 1, and 25 red socks and 25 blue socks in Box 2. In this situation, although we have divided the total number of socks in half, the entropy is the &lt;em&gt;same&lt;/em&gt; as in the first situation, because whether you will grab a red sock or a blue sock is just as unpredictable as in the first example. &lt;/p&gt;

&lt;p&gt;So we can say that entropy measures the unpredictability or randomness of the information within a given distribution. &lt;/p&gt;

&lt;p&gt;Just so you know, this concept is also called impurity. In the 2nd situation we reduced the entropy, which gives us low impurity. In the 1st and 3rd examples we had high impurity. &lt;/p&gt;

&lt;p&gt;A Decision Tree is tasked with reducing the impurity of the data when it is making a new branch in the tree. With those situations above, a better split of the data would be the kind seen in Situation 2, where we separated the blue socks from the red socks. A bad split would be Situation 3, because we didn't make it any easier for us to guess whether a sock would be red or blue. &lt;/p&gt;

&lt;h2&gt;
  
  
  Information gain
&lt;/h2&gt;

&lt;p&gt;Why do we want to reduce the randomness/impurity? Remember, the goal of a Decision Tree is to predict which class a given sample belongs to, by looking at all the features of that sample. So we want to reduce the unpredictability as much as possible, and then make an accurate prediction. &lt;/p&gt;

&lt;p&gt;This is called information gain. We are measuring how much information a given feature gives us about the class that a sample belongs to. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have a sample&lt;/li&gt;
&lt;li&gt;We know the value of a feature of that sample (for example, in Box 1 or Box 2) &lt;/li&gt;
&lt;li&gt;Does knowing that feature reduce the randomness of predicting which class that sample belongs to?&lt;/li&gt;
&lt;li&gt;If it does, then we have reduced the entropy and achieved information gain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The term information gain is more or less self-explanatory, but in case it is less for you: it is telling us how much we learned from a feature. We want to find the most informative feature — the feature that we learn the most from. And when a feature is informative we say we have gained information from it, hence information gain. &lt;/p&gt;

&lt;h3&gt;
  
  
  Math
&lt;/h3&gt;

&lt;p&gt;I'm going to do my best to break down the mathy part now, for calculating entropy and information gain.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is some notation for probabilities:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True (i.e. a sample belongs to a certain class): 1&lt;/li&gt;
&lt;li&gt;False (i.e. a sample doesn't belong to a certain class): 0&lt;/li&gt;
&lt;li&gt;A given sample: X&lt;/li&gt;
&lt;li&gt;Probability (likelihood) that given sample X is of a certain class (True): &lt;em&gt;p&lt;/em&gt;(X = 1)&lt;/li&gt;
&lt;li&gt;Probability that X is not of a certain class (False): &lt;em&gt;p&lt;/em&gt;(X = 0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notation for &lt;strong&gt;entropy&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;H(X) = −&lt;em&gt;p&lt;/em&gt;(X=1) log&lt;sub&gt;2&lt;/sub&gt;(&lt;em&gt;p&lt;/em&gt;(X=1)) − &lt;em&gt;p&lt;/em&gt;(X=0) log&lt;sub&gt;2&lt;/sub&gt;(&lt;em&gt;p&lt;/em&gt;(X=0))&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;H(X): entropy of X&lt;/li&gt;
&lt;li&gt;Note: when &lt;em&gt;p&lt;/em&gt;(X=1)=0 or &lt;em&gt;p&lt;/em&gt;(X=0)=0 there is no entropy&lt;/li&gt;
&lt;/ul&gt;
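&lt;p&gt;Here is a small sketch of that formula in code (the function name is mine, not standard notation): &lt;/p&gt;

```python
import math

def entropy(p):
    # H(X) = -p*log2(p) - (1-p)*log2(1-p),
    # with the convention that 0*log2(0) counts as 0
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

entropy(0.5)  # 1.0 -- the 50/50 laundry basket: maximally random
entropy(1.0)  # 0.0 -- a box of only red socks: no randomness at all
```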

&lt;p&gt;Notation for &lt;strong&gt;information gain&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;IG(X,a) = H(X)−H(X|a)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IG(X,a): information gain for X when we know an attribute &lt;em&gt;a&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;IG(X, a) is defined as the entropy of X, minus the conditional entropy of X given &lt;em&gt;a&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The idea behind information gain is that it is the entropy that we &lt;em&gt;lose&lt;/em&gt; if we know the value of the attribute &lt;em&gt;a&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Specific conditional entropy&lt;/strong&gt;: for any actual value a&lt;sub&gt;0&lt;/sub&gt; of the attribute &lt;em&gt;a&lt;/em&gt; we calculate: &lt;/p&gt;

&lt;p&gt;H(X|a=a&lt;sub&gt;0&lt;/sub&gt;) = −p(X=1|a=a&lt;sub&gt;0&lt;/sub&gt;) log&lt;sub&gt;2&lt;/sub&gt;(p(X=1|a=a&lt;sub&gt;0&lt;/sub&gt;)) − p(X=0|a=a&lt;sub&gt;0&lt;/sub&gt;) log&lt;sub&gt;2&lt;/sub&gt;(p(X=0|a=a&lt;sub&gt;0&lt;/sub&gt;))&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;So let's imagine there are several different possible values of the attribute &lt;em&gt;a&lt;/em&gt;, numbered as follows: a&lt;sub&gt;0&lt;/sub&gt;, a&lt;sub&gt;1&lt;/sub&gt;, a&lt;sub&gt;2&lt;/sub&gt;, ... a&lt;sub&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sub&gt;
&lt;/li&gt;
&lt;li&gt;We want to calculate the specific conditional entropy for each of those values...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conditional entropy&lt;/strong&gt;: for all possible values of &lt;em&gt;a&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;H(X|a) = ∑&lt;sub&gt;i&lt;/sub&gt; p(a=a&lt;sub&gt;i&lt;/sub&gt;) ⋅ H(X|a=a&lt;sub&gt;i&lt;/sub&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is the entropy of X given &lt;em&gt;a&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;We take the specific conditional entropy for each value a&lt;sub&gt;0&lt;/sub&gt; through a&lt;sub&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sub&gt;, weight it by the probability p(a=a&lt;sub&gt;i&lt;/sub&gt;) of that value occurring, and add the results together&lt;/li&gt;
&lt;li&gt;∑ means "sum" - here, a sum over all the possible values a&lt;sub&gt;i&lt;/sub&gt;&lt;/li&gt;
&lt;li&gt;It is &lt;em&gt;conditional&lt;/em&gt; because it is the entropy that depends on an attribute &lt;em&gt;a&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here are the steps you can take with those equations to actually find the conditional entropy for a given attribute: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For the first possible value a&lt;sub&gt;0&lt;/sub&gt; of &lt;em&gt;a&lt;/em&gt;, calculate the probability p(a=a&lt;sub&gt;0&lt;/sub&gt;) of that value occurring&lt;/li&gt;
&lt;li&gt;Calculate the specific conditional entropy H(X|a=a&lt;sub&gt;0&lt;/sub&gt;) for that value&lt;/li&gt;
&lt;li&gt;Multiply the values from Step 1 and Step 2 together&lt;/li&gt;
&lt;li&gt;Do that for each possible value of &lt;em&gt;a&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Add together all those weighted values you found for &lt;em&gt;a&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;The sum reached in Step 5 is your conditional entropy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then to calculate information gain for an attribute &lt;em&gt;a:&lt;/em&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Subtract the conditional entropy from the entropy &lt;/li&gt;
&lt;li&gt;At this point you have found the value for your information gain with respect to that attribute &lt;em&gt;a&lt;/em&gt; &lt;/li&gt;
&lt;/ol&gt;
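&lt;p&gt;Putting those steps together for the sock boxes, here is a sketch in code (the helper names and argument layout are mine): &lt;/p&gt;

```python
import math

def entropy(p):
    # Binary entropy, with 0*log2(0) taken as 0
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(p_true, values):
    # values: one (p_value, p_true_given_value) pair per possible value of a
    h = entropy(p_true)  # H(X)
    h_cond = sum(p_v * entropy(p_t) for p_v, p_t in values)  # H(X|a)
    return h - h_cond  # IG(X, a) = H(X) - H(X|a)

# Situation 2: Box 1 is all red, Box 2 is all blue -- a perfect split
information_gain(0.5, [(0.5, 1.0), (0.5, 0.0)])  # 1.0

# Situation 3: each box is still half red, half blue -- nothing gained
information_gain(0.5, [(0.5, 0.5), (0.5, 0.5)])  # 0.0
```

&lt;p&gt;Knowing the box in Situation 2 removes all the randomness (one full bit of information gained), while in Situation 3 knowing the box tells us nothing at all. &lt;/p&gt;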

&lt;p&gt;What you have now discovered is how random your data &lt;em&gt;really&lt;/em&gt; is, if you know the value of a given attribute &lt;em&gt;a&lt;/em&gt; (i.e. a given feature). Information gain lets us know how much knowing &lt;em&gt;a&lt;/em&gt; reduces the unpredictability. &lt;/p&gt;

&lt;p&gt;And this is what the decision tree algorithm does. It uses the entropy to figure out which is the most informative question to ask the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  For more details on the math behind the concept of entropy
&lt;/h3&gt;

&lt;p&gt;This awesome Twitter thread by &lt;a href="https://twitter.com/TivadarDanka/status/1360237067826065411"&gt;Tivadar Danka&lt;/a&gt; gives a more detailed breakdown into the mathematical concept of entropy, and why there are all those logarithms in the equation. And while we're on the subject, I highly recommend that you follow him on Twitter for great threads on math, probability theory, and machine learning. &lt;/p&gt;

&lt;p&gt;If you want a refresher on what logarithms are, check out this very no-nonsense explanation from  &lt;a href="https://www.mathsisfun.com/algebra/logarithms.html"&gt;Math Is Fun&lt;/a&gt;. I also highly recommend visiting this site before Wikipedia for all of your math basics needs. It is written with a younger audience in mind, which is perfect if your brain isn't used to all of the fancy math lingo and symbols yet. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citations&lt;/strong&gt; &lt;br&gt;
In case you're wondering, my understanding of entropy was drawn from the book &lt;a href="https://learning.oreilly.com/library/view/doing-data-science/9781449363871/ch07.html"&gt;&lt;em&gt;Doing Data Science&lt;/em&gt;&lt;/a&gt; by Rachel Schutt and Cathy O'Neil and the Udacity Machine Learning course lectures. &lt;/p&gt;




&lt;p&gt;I hope you enjoyed this Machine Learning Log, and please share any thoughts, questions, concerns, or witty replies in the comments section. Thank you for reading! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mathematics</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning Log 9: Decision trees and mushrooms</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 28 Feb 2021 20:29:42 +0000</pubDate>
      <link>https://forem.com/madeline_pc/machine-learning-log-8-decision-trees-and-mushrooms-9of</link>
      <guid>https://forem.com/madeline_pc/machine-learning-log-8-decision-trees-and-mushrooms-9of</guid>
      <description>&lt;p&gt;If you've ever made a decision you've unconsciously used the structure of a decision tree. Here's an example: You want to decide whether you are going to go for a run tomorrow: yes or no. If it is sunny out, and your running shorts are clean, and you don't have a headache when you wake up, you will go for a run. &lt;/p&gt;

&lt;p&gt;The next morning you wake up. No headache, running shorts are clean, but it's raining, so you decide not to go for a run. &lt;/p&gt;

&lt;p&gt;But if you had answered yes to all three questions, then you would be going for a run. &lt;/p&gt;

&lt;p&gt;You can use a flow chart to represent this thought process. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614541037456%2FvGfQxpXCS.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614541037456%2FvGfQxpXCS.jpeg" alt="20210228_143449.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That flow chart is a simple decision tree. Follow the answers and you will reach the conclusion about whether you will run or not tomorrow. &lt;/p&gt;

&lt;h2&gt;
  
  
  Decision trees in machine learning
&lt;/h2&gt;

&lt;p&gt;We can teach a computer to follow this process. We want the computer to categorize some data by asking a series of questions that will progressively divide the data into smaller and smaller portions. &lt;/p&gt;

&lt;p&gt;We will get the computer to: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask a yes or no question of all the data...&lt;/li&gt;
&lt;li&gt;...which splits the data into 2 portions based on answer&lt;/li&gt;
&lt;li&gt;Ask each of those portions a yes or no question...&lt;/li&gt;
&lt;li&gt;...which splits each of those portions into 2 more portions (now there are 4 portions of data) &lt;/li&gt;
&lt;li&gt;Continue this process until...

&lt;ul&gt;
&lt;li&gt;All the data is divided&lt;/li&gt;
&lt;li&gt;Or we tell it to stop&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This is an algorithm, and it is called &lt;strong&gt;recursive binary splitting&lt;/strong&gt;. &lt;em&gt;Recursive&lt;/em&gt; means it's a process that is repeated again and again. &lt;em&gt;Binary&lt;/em&gt; means there are 2 outcomes: yes or no / 0 or 1. And &lt;em&gt;splitting&lt;/em&gt; is what we call dividing the data into 2 portions, or as they're more fancily known, &lt;em&gt;splits&lt;/em&gt;. &lt;/p&gt;
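&lt;p&gt;As a toy sketch (not how a real library implements it, and with a fixed list of questions rather than questions chosen by the algorithm), recursive binary splitting might look like this: &lt;/p&gt;

```python
def recursive_split(samples, questions):
    # Stop when we run out of questions or the group cannot be split further
    if not questions or len(samples) in (0, 1):
        return samples
    ask = questions[0]
    yes = [s for s in samples if ask(s)]
    no = [s for s in samples if not ask(s)]
    # Each portion is split again with the remaining questions
    return [recursive_split(yes, questions[1:]),
            recursive_split(no, questions[1:])]

# Toy data: (color, size) pairs, split first on color, then on size
socks = [("red", "big"), ("red", "small"), ("blue", "big"), ("blue", "small")]
tree = recursive_split(socks, [lambda s: s[0] == "red",
                               lambda s: s[1] == "big"])
# tree is now a nested list of portions, one sock per leaf
```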

&lt;p&gt;&lt;strong&gt;How does the algorithm decide which way to split up the data?&lt;/strong&gt; It uses a cost function and tries to reduce the cost. Whichever split reduces the cost the most will be the split that the algorithm chooses. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is reducing the cost?&lt;/strong&gt; Briefly, cost is a measure of how wrong an answer is. In a decision tree, we are trying to gain as much information as possible, so a split that reduces the cost is one which will group similar data into similar classes. For example, say you are trying to sort a basket of socks based on if they are red or blue. A bad split would be to divide the basket of socks in half but keep the same ratio of red and blue socks. A good split would be to put all the red socks in one pile and all the blue socks in another. (A decision tree algorithm uses &lt;strong&gt;entropy&lt;/strong&gt; to measure the cost of each split, but that discussion is beyond the scope of this article.) &lt;/p&gt;

&lt;p&gt;We can use decision trees in classification problems (predicting whether or not an item belongs to a certain group) or regression problems (predicting a continuous numerical value).&lt;/p&gt;

&lt;h3&gt;
  
  
  Using these answers to categorize new data instances
&lt;/h3&gt;

&lt;p&gt;Then when we have some new data, we compare the features of the new data to the features of the old data. Whichever split a given sample from the new data matches up with is the category that it belongs to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a given test sample belongs to the split where the training samples had the same set of features as that test sample &lt;/li&gt;
&lt;li&gt;For Classification problems: at the end the prediction is 0 or 1, whether the item belongs to a class&lt;/li&gt;
&lt;li&gt;For Regression problems: 

&lt;ul&gt;
&lt;li&gt;We assign a prediction value to each group (instead of a class)&lt;/li&gt;
&lt;li&gt;The prediction value is the target mean of the items in the group&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Now let's discuss the data
&lt;/h2&gt;

&lt;p&gt;Now let's look at an actual dataset so we can see how a decision tree could be useful in machine learning. &lt;/p&gt;

&lt;h3&gt;
  
  
  If mushrooms grew on trees...
&lt;/h3&gt;

&lt;p&gt;We're going to look at the mushrooms dataset from Kaggle. We have over 8,000 examples of mushrooms, with information about their physical appearance, color, and habitat arranged in a table. About half of the samples are poisonous and about half of the samples are edible. &lt;/p&gt;

&lt;p&gt;Our goal will be to predict, given an individual mushroom's features, if that mushroom is edible or poisonous. This makes it a &lt;strong&gt;binary classification&lt;/strong&gt; problem, since we are sorting the data into 2 categories, or classes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By the way&lt;/strong&gt;, I don't recommend you use the model we produce to actually decide whether or not to eat a mushroom in the wild. &lt;/p&gt;

&lt;h3&gt;
  
  
  Check out the code
&lt;/h3&gt;

&lt;p&gt;Here, you can find a  &lt;a href="https://www.kaggle.com/madelinecaples/if-mushrooms-grew-on-trees" rel="noopener noreferrer"&gt;Kaggle Notebook&lt;/a&gt; to go along with the example discussed in this article. I provided a complete walkthrough of importing the necessary libraries, loading the data, splitting it up into training and test sets, and making a decision tree classifier with the scikit-learn library.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we know when to stop splitting the data?
&lt;/h2&gt;

&lt;p&gt;At a certain point you have to stop dividing the data up further. This will naturally happen when you run out of data to divide. But that would be after we have a massive tree of over 8,000 leaf nodes - one for each sample in our training data! That would not be very useful, because we want to have a tree that generalizes well to new data. If we wait too long and let our algorithm split the data into too many nodes, it will overfit. This means it will understand the relationships between features and labels in the training data really well - too well - and it won't be able to predict the class of new data samples that we ask the model about. &lt;/p&gt;

&lt;p&gt;Some criteria for stopping the tree: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting the max depth: tell the algorithm to stop dividing the data once the tree reaches a certain number of levels &lt;/li&gt;
&lt;li&gt;Setting the minimum number of samples required to be at a &lt;em&gt;leaf&lt;/em&gt; node &lt;/li&gt;
&lt;li&gt;Setting the minimum number of samples required to split an &lt;em&gt;internal&lt;/em&gt; node 

&lt;ul&gt;
&lt;li&gt;this is helpful if we want to avoid having a split for just a few samples, since this would not be representative of the data as a whole &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, here is how we can set the max depth with the scikit-learn library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.tree import DecisionTreeClassifier

model_shallow = DecisionTreeClassifier(max_depth=4, random_state=42)
model_shallow.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will yield a tree that is at most 4 levels deep. &lt;/p&gt;
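&lt;p&gt;The other two stopping criteria can be set the same way, with the &lt;code&gt;min_samples_leaf&lt;/code&gt; and &lt;code&gt;min_samples_split&lt;/code&gt; parameters. (The tiny stand-in dataset below is mine; the notebook uses the mushrooms data.) &lt;/p&gt;

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in dataset (the notebook's X_train/y_train come from
# the mushrooms data): feature 0 determines the label
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 0, 1, 1]

# Criteria 2 and 3: a leaf must hold at least 2 samples, and an
# internal node needs at least 4 samples before it may split
model_pruned = DecisionTreeClassifier(min_samples_leaf=2,
                                      min_samples_split=4,
                                      random_state=42)
model_pruned.fit(X_train, y_train)
```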

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614536790484%2FB5KGCoj3xO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614536790484%2FB5KGCoj3xO.png" alt="shallowtree.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the drawbacks of decision trees?
&lt;/h2&gt;

&lt;p&gt;A decision tree algorithm is a type of &lt;strong&gt;greedy algorithm&lt;/strong&gt;, which means that it wants to reduce the cost as much as possible each time it makes a split. It chooses the &lt;em&gt;locally optimal solution&lt;/em&gt; at each step.&lt;/p&gt;

&lt;p&gt;As a result, the decision tree may not find the globally optimal solution - the solution that is best for the data as a whole. At each point where it asks "is this the best possible split for the data?" it answers that question for that one node, at that one point in time. &lt;/p&gt;

&lt;p&gt;Decision trees are also prone to learning the relationship between features and targets in the training data &lt;em&gt;too&lt;/em&gt; well, so that they don't generalize to new data. This is called &lt;strong&gt;overfitting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One way we can deal with this overfitting is to use a Random Forest, instead of a Decision Tree. A Random Forest takes a bunch of decision trees and then uses the average prediction from all of the trees to predict the class of a given sample. &lt;/p&gt;
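&lt;p&gt;With scikit-learn, swapping in a Random Forest is a small change (again shown with a tiny stand-in dataset): &lt;/p&gt;

```python
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in dataset: feature 0 determines the label
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# n_estimators is the number of decision trees whose votes are combined
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X, y)
predictions = forest.predict(X)
```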

&lt;h2&gt;
  
  
  What are the benefits of decision trees?
&lt;/h2&gt;

&lt;p&gt;Decision trees are fairly easy to visualize and understand. We say that they are &lt;em&gt;explainable&lt;/em&gt;, because we can see how the decision process works, step by step. This is helpful if we want to understand which features are important and which are not. We can use the decision tree as a step in developing a more complicated model, or on its own. For example, we can use &lt;code&gt;feature_importances_&lt;/code&gt; to decide which features we can safely trim from our model without it performing worse. &lt;/p&gt;
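&lt;p&gt;For example, here is a sketch of reading &lt;code&gt;feature_importances_&lt;/code&gt; from a fitted tree, on a made-up dataset where only the first feature matters: &lt;/p&gt;

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up data: feature 0 determines the label, feature 1 is pure noise
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier(random_state=42).fit(X, y)

# One score per feature, summing to 1; uninformative features score 0
# and are candidates for trimming
model.feature_importances_  # array([1., 0.])
```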

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614539807033%2Fj5mp1XmSz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1614539807033%2Fj5mp1XmSz.png" alt="features.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Decision Tree is an excellent starting point for a classification problem, since it will not just give you predictions, but help you understand your data better. As such, it is a good choice for your baseline. &lt;/p&gt;




&lt;h2&gt;
  
  
  Terms cheat-sheet
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision Tree Anatomy:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;node&lt;/strong&gt;: the parts of the tree that ask the questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;root&lt;/strong&gt;: the first node--creates the initial split of data into 2 portions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;branches&lt;/strong&gt; or &lt;strong&gt;edges&lt;/strong&gt;: internal nodes--they come between the root node and the leaf nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;leaf node&lt;/strong&gt; or &lt;strong&gt;terminal node&lt;/strong&gt;: when we reach the end of a sequence of questions, this is the node that gives the final answer (for example, of what class a sample belongs to)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;split&lt;/strong&gt;: the portion of data that results from splitting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Other terms:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;training data&lt;/strong&gt;: the data used to fit the model &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validation data&lt;/strong&gt;: data used to fine-tune the model and make it better (we left out that step in the Kaggle notebook)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;testing data&lt;/strong&gt;: the data used to test if the model predicts well on new information &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;instance / sample&lt;/strong&gt;: one example from a portion of your data - for example, a single mushroom &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;algorithm&lt;/strong&gt;: step by step instructions that we give to a computer to accomplish some task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;baseline&lt;/strong&gt;: a simple model we train at the beginning stages of exploring our data to gain insights for improving our predictions later on (all future models will be compared to this one)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you enjoyed this article, please take a look at the Kaggle notebook that I made to go with it. It is a beginner friendly example of using the Mushrooms dataset to build a decision tree, evaluate it, and then experiment a bit with the model. &lt;/p&gt;

&lt;p&gt;Additionally, I'd love to get feedback about the format of breaking the general overview of a topic apart from the code notebook. I felt that both could stand on their own, so someone could go through the code example to see how it works, or someone could read this article. If you want to read both, hey that's cool too! &lt;/p&gt;

&lt;p&gt;Thank you for reading! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Do you need math to get started with Machine Learning? My slightly ranty opinion and some tips for growth</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 21 Feb 2021 15:47:14 +0000</pubDate>
      <link>https://forem.com/madeline_pc/do-you-need-math-to-get-started-with-machine-learning-my-slightly-ranty-opinion-and-some-tips-for-growth-4akf</link>
      <guid>https://forem.com/madeline_pc/do-you-need-math-to-get-started-with-machine-learning-my-slightly-ranty-opinion-and-some-tips-for-growth-4akf</guid>
      <description>&lt;h2&gt;
  
  
  A little story
&lt;/h2&gt;

&lt;p&gt;(If you don't like personal anecdotes, skip to Tips 😉)&lt;/p&gt;

&lt;p&gt;Okay so you're interested in machine learning and you ask Google "what do I need to know to start machine learning?" &lt;/p&gt;

&lt;p&gt;"Learn calculus, probability, statistics, linear algebra, learn to code, and then you can start learning machine learning," Google tells you. &lt;/p&gt;

&lt;p&gt;Your heart sinks. &lt;/p&gt;

&lt;p&gt;Maybe you haven't touched math since high school. Being told you need to learn that amount of math just to get started might be enough to send you away in discouragement, never to revisit the idea of machine learning. &lt;/p&gt;

&lt;p&gt;It did me at first. &lt;/p&gt;

&lt;p&gt;"Stick to learning web dev. Everyone says it is where people without a tech background should go," I told myself again and again. &lt;/p&gt;

&lt;p&gt;The thing is, the more I tried to learn web development, the less interested in it I became. While the more I thought about machine learning, the more I just wanted to find out what it was all about. &lt;/p&gt;

&lt;p&gt;Several false starts at learning algebra so I could learn precalculus so I could learn calculus so I could learn linear algebra later, I realized it would be months and months before I could start actually playing around with machine learning models.&lt;/p&gt;

&lt;p&gt;After one more lackluster attempt at building an ecommerce site with React, I finally just started over. "What's the worst that could happen? I only know a little about matrices, I have no exposure to calculus yet, but it's not going to hurt anything to just see what it's like." So I started studying machine learning, developing intuitions for the math concepts machine learning is built on as I needed them. &lt;/p&gt;

&lt;h1&gt;
  
  
  We all like Tips
&lt;/h1&gt;

&lt;p&gt;Okay, with that personal anecdote out of the way... &lt;/p&gt;

&lt;h2&gt;
  
  
  Here are some practical tips for studying the math you need for machine learning:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Do &lt;em&gt;not&lt;/em&gt; start by learning probability, statistics, linear algebra, and multivariate calculus. Machine learning is built on that mathematical foundation - so it absolutely is important (don't let anyone tell you otherwise) - but you don't need to start there. That is to say, they may be prerequisites in a course catalog, and topics that will help you understand machine learning algorithms more quickly, but if you are learning on your own and start there, you may never get to the point where you feel ready to begin with machine learning. &lt;/li&gt;
&lt;li&gt;Begin with an introduction to machine learning and when you come to a term you don't understand - for example vector, tensor, function, mean - look it up! &lt;/li&gt;
&lt;li&gt;Understand what is the &lt;strong&gt;motivation&lt;/strong&gt; for using that particular element of math. For example, before learning what a derivative &lt;em&gt;does&lt;/em&gt; - before looking at the mathematical formulas - find out &lt;em&gt;why&lt;/em&gt; they are used in the first place.&lt;/li&gt;
&lt;li&gt;If you don't understand a mathematical formula break it down into the smallest components that you &lt;em&gt;do&lt;/em&gt; understand.&lt;/li&gt;
&lt;li&gt;Alternatively, if you aren't able to break down the formula into smaller components, find a different representation of the concept. Look for a: 

&lt;ul&gt;
&lt;li&gt;Code snippet &lt;/li&gt;
&lt;li&gt;Picture &lt;/li&gt;
&lt;li&gt;Video explanation &lt;/li&gt;
&lt;li&gt;Worked out algebraic example &lt;/li&gt;
&lt;li&gt;All of the above! &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The internet is your friend. If you don't understand one explanation that does not mean you are "not a math person." Instead it may mean you need to learn the concept from a different angle. There is no shame in seeking out another resource if the first one isn't serving you. Don't just keep smashing your head into a brick wall and hope you'll somehow get through it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  And now it's gonna get a little deep
&lt;/h2&gt;

&lt;p&gt;I also want to address the fact that many people consciously or subconsciously feel their self worth is tied to being good or not so good at mathematics. Or, at least, I suspect I'm not the only one.&lt;/p&gt;

&lt;p&gt;When I can't remember how to do something in algebra, it still shatters my self esteem. This is probably for a variety of reasons - one of them is probably because I spent a lot of time (4 years?) thinking that if I couldn't solve algebra problems, I wouldn't do well on the SAT, and wouldn't be able to prove to people that I have good logical reasoning skills. &lt;/p&gt;

&lt;p&gt;This could be you, if you've ever thought of yourself as "being bad at math" or "not a math person" or "more of a creative person" (or any other euphemisms for &lt;em&gt;bad at math&lt;/em&gt;). If maybe you've heard the phrase "you should be able to figure this out with basic high school algebra," and you couldn't. Or in a math explanation someone said "it clearly follows," and it didn't. &lt;/p&gt;

&lt;p&gt;You need to start shedding those identities right away. Start thinking of yourself as someone who is learning math. Someone who doesn't know everything yet, but is building their intuitions and is curious to discover more. &lt;/p&gt;

&lt;p&gt;You are not unintelligent because you don't understand math. Math isn't obvious. It isn't something we intrinsically know. (Just ask any 5 or 6 year old!)&lt;/p&gt;

&lt;p&gt;Math is a learned skill, and as such I believe anyone can learn it. I don't think there are people who are good at math and people who are bad at math. I think that the idea of "a math person" probably has more to do with how well that person learned math the way it was taught to them. &lt;/p&gt;

&lt;p&gt;If your brain did well with how math was taught in your school, then you probably excelled in math subjects. But maybe you weren't so lucky. Maybe you've always felt some underlying inferiority because you didn't succeed at math in the classroom. Maybe you've been avoiding math for the rest of your life. &lt;/p&gt;

&lt;p&gt;Well I'm here to tell you math is nothing to be afraid of. It is beautiful, it is a useful tool, and you can learn it too. &lt;/p&gt;

&lt;p&gt;I got a D in precalculus because I was intimidated by my teacher, never went to office hours, and was so thoroughly confused I didn't even know what questions to ask. And I'm learning how to use partial derivatives in gradient descent from the internet. Will I ever be a calculus expert? Probably not, but that isn't preventing me from learning what I can. &lt;/p&gt;

&lt;p&gt;Start with what you know, and build up to what you don't know one step at a time. &lt;/p&gt;




&lt;p&gt;What are your thoughts about math? Share them with me in the comments! I'd love to have a conversation with you. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>motivation</category>
    </item>
    <item>
      <title>Machine Learning Log 8: an intuitive overview of gradient descent </title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Tue, 16 Feb 2021 21:05:13 +0000</pubDate>
      <link>https://forem.com/madeline_pc/machine-learning-log-8-an-intuitive-overview-of-gradient-descent-5764</link>
      <guid>https://forem.com/madeline_pc/machine-learning-log-8-an-intuitive-overview-of-gradient-descent-5764</guid>
      <description>&lt;p&gt;The final step in the linear regression model is creating an optimizer function to improve our weights and bias. I'm going to explain how gradient descent works in this article, and also give you a quick explanation of what a derivative and a partial derivative are, so you can follow the process. So if you haven't studied calculus yet,  don't worry. You won't become a calculus expert by reading this article (I'm certainly not one), but I think you'll be able to follow the process of gradient descent a little bit better. &lt;/p&gt;

&lt;p&gt;I will also link an article which helped me understand the math behind partial derivatives, which you can look at to fill in some details I won't be covering here. &lt;/p&gt;

&lt;h1&gt;
  
  
  Why do we need gradient descent?
&lt;/h1&gt;

&lt;p&gt;The goal of gradient descent is to minimize the loss. In an ideal world we want our loss to be 0 (but keep in mind that this isn't realistically possible). We minimize the loss by improving the parameters of the model, which are the weight w and the bias b in linear regression. We improve those parameters either by making them larger, or smaller - whichever makes the loss go down. &lt;/p&gt;

&lt;h3&gt;
  
  
  If you need to get up to date:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://madelinecaples.hashnode.dev/explaining-linear-regression-as-a-non-mathy-person"&gt;Read my article on Linear Regression&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://madelinecaples.hashnode.dev/understanding-the-loss-function-fancy-math-symbols-explained"&gt;Read my article on Mean Squared Error from last week&lt;/a&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  How long does gradient descent...descend?
&lt;/h1&gt;

&lt;p&gt;Gradient descent is an iterative process - this is just a fancy way to say that the process repeats over and over again until you reach some condition for ending it. &lt;/p&gt;

&lt;p&gt;The condition for ending could be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We are tired of waiting: i.e. we let gradient descent run for a certain number of iterations and then tell it to stop&lt;/li&gt;
&lt;li&gt;The loss is minimized as much as we need for our problem: i.e. the loss is equal to or less than a certain number that we decide on &lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  How can we improve the loss?
&lt;/h1&gt;

&lt;p&gt;This is where derivatives come in.&lt;/p&gt;

&lt;h4&gt;
  
  
  What does a derivative do?
&lt;/h4&gt;

&lt;p&gt;For a given function: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It tells us how much a small change in the input will change the output of that function &lt;/li&gt;
&lt;li&gt;For example, for the MSE loss function: how much will changing w a little bit change the loss? &lt;/li&gt;
&lt;li&gt;Basically, the derivative tells us the slope of the function at a given point&lt;/li&gt;
&lt;/ul&gt;
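&lt;p&gt;You can see this "slope at a point" idea without any calculus by nudging the input a tiny bit and measuring how much the output changes. Here's a quick sketch - the function and the step size are just illustrative choices, not part of our model: &lt;/p&gt;

```python
# numerically estimate the derivative (slope) of a function at a point:
# nudge the input by a tiny step h and see how much the output changes
def slope_at(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

# example: f(x) = x ** 2, whose slope at any x is 2 * x
f = lambda x: x ** 2
print(round(slope_at(f, 3.0), 4))  # close to 6.0, the slope at x = 3
```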

&lt;h4&gt;
  
  
  But, what is a partial derivative?
&lt;/h4&gt;

&lt;p&gt;A partial derivative is the derivative of a function that has more than one variable, taken with respect to one of those variables at a time. In the linear regression equation we have w and b, which both can change, so there are two variables that can affect the loss. We want to isolate each of those variables so that we can figure out how much w affects the loss and how much b affects the loss, separately. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;So we measure the derivative, or the slope, one variable at a time &lt;/li&gt;
&lt;li&gt;Whichever variable we are &lt;em&gt;not&lt;/em&gt; measuring, we treat as a constant - we hold it fixed while the other one changes&lt;/li&gt;
&lt;li&gt;First we calculate the derivative of the loss with respect to w&lt;/li&gt;
&lt;li&gt;And then we calculate the derivative of the loss with respect to b&lt;/li&gt;
&lt;/ul&gt;
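&lt;p&gt;The same nudge-and-measure trick from above works one variable at a time. This sketch uses a made-up toy function with two variables (not our MSE loss) just to show what "holding the other variable fixed" looks like: &lt;/p&gt;

```python
# numerically estimate partial derivatives: nudge one variable by a
# tiny step h while holding the other one fixed
def partials(f, w, b, h=1e-6):
    df_dw = (f(w + h, b) - f(w, b)) / h  # b is held constant
    df_db = (f(w, b + h) - f(w, b)) / h  # w is held constant
    return df_dw, df_db

# a made-up toy function of two variables (not our MSE loss)
f = lambda w, b: w ** 2 + 3 * b
dw, db = partials(f, 2.0, 1.0)
print(round(dw, 3), round(db, 3))  # close to 4.0 and 3.0
```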

&lt;p&gt;Rather than illustrating the formula for partial derivatives of MSE here (which I am still learning to understand myself), I am going to include a link to a &lt;em&gt;very&lt;/em&gt; &lt;a href="https://towardsdatascience.com/gradient-descent-from-scratch-e8b75fa986cc"&gt;helpful article&lt;/a&gt; that goes through the mathematical formula step by step for finding the partial derivatives of mean squared error. The author basically does what I was hoping to do in this article before I became a little overwhelmed by the amount of background I would need to provide. &lt;/p&gt;

&lt;h1&gt;
  
  
  Once you know the derivatives, how big of a step do you take when updating w and b?
&lt;/h1&gt;

&lt;p&gt;Now that we have calculated the derivatives we need to actually use them to update the parameters w and b. &lt;/p&gt;

&lt;p&gt;We will use something called the Learning Rate to tell us how big of a step to take in our gradient descent. It is called the learning rate because it affects how quickly our model will learn the patterns in the data. What do we do with it? We multiply each derivative (with respect to w and with respect to b) by the learning rate when we update w and b in each iteration of training our model. &lt;/p&gt;

&lt;p&gt;So, in short, it's a number that controls how quickly our parameters w and b change. A lower learning rate will cause w and b to change slowly (the model learns slower), and a higher learning rate will cause w and b to change more quickly (the model learns faster).&lt;/p&gt;
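&lt;p&gt;To see the learning rate in action, here is a tiny sketch of repeated update steps on a toy one-parameter loss - the loss function, starting value, and learning rates here are illustrative choices, not the MSE from our model: &lt;/p&gt;

```python
# repeated gradient descent updates on a toy loss, loss(w) = w ** 2,
# whose derivative is 2 * w and whose minimum is at w = 0
def descend(w, lr, steps):
    for _ in range(steps):
        gradient = 2 * w    # derivative of the toy loss
        w -= gradient * lr  # the learning rate scales each step
    return w

# starting from w = 10, the higher learning rate gets much closer to
# the minimum in the same number of steps
print(descend(w=10.0, lr=0.01, steps=20))
print(descend(w=10.0, lr=0.1, steps=20))
```

&lt;p&gt;One caveat: a learning rate that is too high can overshoot the minimum and make training unstable, so faster isn't always better. &lt;/p&gt;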

&lt;h1&gt;
  
  
  How does gradient descent work along with linear regression?
&lt;/h1&gt;

&lt;p&gt;Remember in my overview of linear regression article I discussed how after we find the loss we'll need to use that information to update our weight and bias to minimize the loss? Well we're finally ready for that step. &lt;/p&gt;

&lt;p&gt;A quick summary before we get started with the code. We have a forward pass, where we calculate our predictions and our current loss, based on those predictions. Then we have a backward pass, where we calculate the partial derivative of the loss with respect to each of our parameters (w and b). Then, using those gradients that we gained through calculating the derivatives, we train the model by updating our parameters in the direction that reduces the loss. We use the learning rate to control how much those parameters are changed at a time in each iteration of training. &lt;/p&gt;

&lt;h2&gt;
  
  
  Here are the steps for one iteration of gradient descent in linear regression:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A bit of review - Moving forwards
&lt;/h3&gt;

&lt;p&gt;This is called the forward pass:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First we initialize our parameters: 

&lt;ul&gt;
&lt;li&gt;we can start them off at 0 &lt;/li&gt;
&lt;li&gt;or we can start them off at random numbers (but I've decided to start them at 0, to simplify the code) &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We calculate linear regression with our current weight and bias&lt;/li&gt;
&lt;li&gt;We calculate the current loss, based on the current values for w and b&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import the Python library used for scientific computing 
import numpy as np 

# predict function, based on y = wx + b equation 
​def​ ​predict​(X, w, b):
    ​return​ X * w + b

# loss function, based on MSE equation 
def​ ​mse​(X, Y, w, b):
​    return​ np.average(Y - (predict(X, w, b)) ** 2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  And now the new stuff - Moving backwards
&lt;/h3&gt;

&lt;p&gt;This part is called the backward pass: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using the current loss we calculate the derivative of the loss with respect to w...&lt;/li&gt;
&lt;li&gt;...and with respect to b
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# calculate the gradients 
def gradients(X, Y, w, b): 
    w_gradient = np.average(-2 * (X * (predict(X, w, b) - Y)))
    b_gradient = np.average(-2 * (predict(X, w, b) - Y))
    return (w_gradient, b_gradient) 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  And using the gradients to train the model
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Then we update the weight and bias with the derivative of the loss in the direction that minimizes the loss, by multiplying each derivative with the learning rate &lt;/li&gt;
&lt;li&gt;Then we repeat that process as long as we want (set in the number of epochs) to reduce the loss as much as we want
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# train the model 
# lr stands for learning rate 
def train(X, Y, iterations, lr): 
    # initialize w and b to 0 
    w = 0
    b = 0

    # empty lists to keep track of parameters current values and loss 
    log = []
    mse = []

    # the training loop 
    for i in range(iterations):
        w_gradient, b_gradient = gradient(X, Y, w, b)
        # update w and b
        w -= w_gradient * lr
        b -= b_gradient * lr

        # recalculate loss to see
        log.append((w,b))
        mse.append(mse(X, Y, w, b)
    return w, b, log, mse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A parting note:&lt;/strong&gt; There are tricks to avoid using explicit loops in your code, so that the code will run faster when we start to train on very large datasets. But to give an idea of what is going on, I thought it made sense to visualize the &lt;code&gt;train&lt;/code&gt; function as a loop. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed this overview of Gradient Descent. My code might not be very elegant, but hopefully it gives you an idea of what's going on here. &lt;/p&gt;

&lt;p&gt;If you like this style of building up the functions used in machine learning models a little bit at a time, you may enjoy this book, &lt;a href="https://www.amazon.com/Programming-Machine-Learning-Zero-Deep/dp/1680506609"&gt;&lt;em&gt;Programming Machine Learning&lt;/em&gt;&lt;/a&gt;, whose code I relied on in preparing this article. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning Log 7: understanding the loss function - fancy math symbols explained! </title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 07 Feb 2021 21:01:15 +0000</pubDate>
      <link>https://forem.com/madeline_pc/machine-learning-log-7-understanding-the-loss-function-fancy-math-symbols-explained-1jil</link>
      <guid>https://forem.com/madeline_pc/machine-learning-log-7-understanding-the-loss-function-fancy-math-symbols-explained-1jil</guid>
      <description>&lt;p&gt;This post will be an overview of the loss function called mean squared error, as a follow up to last week's discussion of &lt;a href="https://madelinecaples.hashnode.dev/explaining-linear-regression-as-a-non-mathy-person"&gt;linear regression&lt;/a&gt;. &lt;/p&gt;

&lt;h1&gt;
  
  
  Loss/Error function
&lt;/h1&gt;

&lt;p&gt;Last week we left off our discussion of Linear Regression with recognizing the need for a loss function. After a first look at our prediction, we found that our model priced a 100 year old violin at about $57, instead of the real world value of $8000. Clearly, there is some serious work to be done so that our predictions can become closer to the real world values. This process is called optimization. And the process begins with a loss function. &lt;/p&gt;

&lt;h2&gt;
  
  
  Quick review:
&lt;/h2&gt;

&lt;p&gt;Linear regression is based on the equation y = wx + b &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are predicting the y value (the violin's price) based on &lt;/li&gt;
&lt;li&gt;the x value - our feature (the violin's age)&lt;/li&gt;
&lt;li&gt;multiplied by the weight w &lt;/li&gt;
&lt;li&gt;and added to the bias b, which controls where the line intercepts the y-axis &lt;/li&gt;
&lt;li&gt;at first w and b are initialized to random values (in other words, an uninformed guess) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once we get our first prediction (using the random values for w and b that we chose) we need to find out how wrong it is. Then we can begin the process of updating those weights to produce more accurate values for y. &lt;/p&gt;

&lt;p&gt;But first, since so much of machine learning depends on using functions...&lt;/p&gt;

&lt;h3&gt;
  
  
  In case you forgot what a function is:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;I like to think of functions as little factories that take in some numbers, do some operations to them, and then spit out new numbers &lt;/li&gt;
&lt;li&gt;the operations are always the same&lt;/li&gt;
&lt;li&gt;so, for example, if you put in 2 and get out 4 you will always get out 4 whenever you put in 2&lt;/li&gt;
&lt;li&gt;basically a function defines a relationship between an independent variable (x) and a dependent variable (y) &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Mean Squared Error
&lt;/h2&gt;

&lt;p&gt;I'll use mean squared error in this article, since it is a popular loss function (for example, it is the loss that ordinary least squares linear regression minimizes, and you'll find it throughout the scikit-learn machine learning library).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: What are we measuring? The error, also called loss - i.e. how big of a difference there is between the real world value and our predicted value. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The math&lt;/strong&gt;: &lt;br&gt;
We are going to... &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subtract the predicted y value from the actual value&lt;/li&gt;
&lt;li&gt;Square the result &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That would be it for our tiny example, because we made one prediction, so we have only one instance.  In the real world, however, we will be making more than one prediction at a time, so the next step is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add together all the squared differences of all the predictions - this is the squared error&lt;/li&gt;
&lt;li&gt;Divide the sum of all the squared errors by the number of instances (i.e. take the mean of the squared error)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bonus&lt;/strong&gt;: if you want to find the Root Mean Squared Error (an evaluation metric often used in Kaggle competitions), you take the square root of that last step. (You might report RMSE instead of MSE because it is in the same units as the thing you are predicting, which makes the error easier to interpret.)&lt;/p&gt;
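&lt;p&gt;For example, here's a quick sketch with made-up actual and predicted prices, just to show the arithmetic: &lt;/p&gt;

```python
import numpy as np

# made-up actual and predicted prices (each prediction is off by $50)
actual = np.array([800.0, 950.0, 700.0])
predicted = np.array([750.0, 1000.0, 650.0])

mse = np.average((actual - predicted) ** 2)  # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
print(mse, rmse)  # 2500.0 50.0 - the RMSE is back in dollars
```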

&lt;p&gt;&lt;strong&gt;The equation&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;The mathematical formula looks like this. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SqKnT8f1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1612727060406/qLUaBH61S.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SqKnT8f1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1612727060406/qLUaBH61S.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first it might look intimidating, but I'm going to explain the symbols so that you can start to understand the notation for formulas like this one. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;n&lt;/em&gt; stands for the number of instances (so if there are 10 instances - i.e. 10 violins, &lt;em&gt;n&lt;/em&gt;=10) &lt;/li&gt;
&lt;li&gt;The huge sigma (looks like a funky E) stands for "sum" &lt;/li&gt;
&lt;li&gt;1/&lt;em&gt;n&lt;/em&gt; is another way of writing "divide by the number of instances", since multiplying by 1/&lt;em&gt;n&lt;/em&gt; is the same as dividing by &lt;em&gt;n&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Taken together, this part of the formula just stands for "mean" (i.e. average): 
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PvUgXVqY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1612727334471/zs8fSKNBC.png" alt="image.png"&gt;
&lt;/li&gt;
&lt;li&gt;Then Y stands for the observed values of y for all the instances &lt;/li&gt;
&lt;li&gt;And Ŷ (pronounced Y-hat) stands for the predicted y values for all the instances &lt;/li&gt;
&lt;li&gt;Taken together, this part of the formula just stands for "square of the errors": 
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k-sA0XVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1612729238979/P5PSMdiqL.png" alt="image.png"&gt;
&lt;/li&gt;
&lt;li&gt;What do I mean by all the instances? That's if you made predictions on more than one violin at a time, and put all of those predictions into a &lt;a href="https://www.mathsisfun.com/algebra/matrix-introduction.html"&gt;matrix&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code: &lt;br&gt;
In the example below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X stands for all the instances of x - the features matrix&lt;/li&gt;
&lt;li&gt;Y stands for all the instances of y - the ground truth labels matrix &lt;/li&gt;
&lt;li&gt;w and b stand for the same things they did before: weight and bias
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import the Python library used for scientific computing 
import numpy as np 

# last week's predict function 
​def​ ​predict​(X, w, b):
    ​return​ X * w + b

def​ ​mse​(X, Y, w, b):
​    return​ np.average(Y - (predict(X, w, b)) ** 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point we know how wrong we are, but we don't know how to get less wrong. That's why we need to use an optimization function. &lt;/p&gt;

&lt;h1&gt;
  
  
  Optimizing predictions with Gradient Descent
&lt;/h1&gt;

&lt;p&gt;This is the part of the process where the model "learns" something. Optimization describes the process of gradually getting closer and closer to the ground truth labels, to make our model more accurate. Based on the error, update the weights and bias a tiny bit in the direction of less error. In other words, you are making the predictions a little bit more correct. Then you check your loss again, and rinse and repeat until you decide you have minimized the error enough, or you run out of time to train your model - whichever comes first. &lt;/p&gt;

&lt;p&gt;Now, I know I said I would discuss gradient descent this week, but I am starting to feel that doing it justice probably requires a whole post. Either this week's post will get too long, or I won't be able to unpack the details of gradient descent to a simple enough level. So, please bear with me and stay tuned for next week's Machine Learning Log, in which I &lt;em&gt;will&lt;/em&gt; (promise!) discuss gradient descent in detail.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Machine Learning Log 6: Linear regression explained by a non mathy person (part 1)</title>
      <dc:creator>Madeline</dc:creator>
      <pubDate>Sun, 31 Jan 2021 21:12:05 +0000</pubDate>
      <link>https://forem.com/madeline_pc/machine-learning-log-6-linear-regression-explained-by-a-non-mathy-person-part-1-2f63</link>
      <guid>https://forem.com/madeline_pc/machine-learning-log-6-linear-regression-explained-by-a-non-mathy-person-part-1-2f63</guid>
      <description>&lt;h1&gt;
  
  
  Describing a linear relationship
&lt;/h1&gt;

&lt;p&gt;In this post I will give an explanation of linear regression, which is a good algorithm to use when the relationship between values is, well, linear. We'll imagine a linear relationship where, when the x value (or feature) increases, so does the y value (the label). &lt;/p&gt;

&lt;p&gt;With enough labelled data, we can employ linear regression to predict the labels of new data, based on its features. &lt;/p&gt;

&lt;h2&gt;
  
  
  The equation for linear regression:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This example will only deal with a regression problem with one variable &lt;/p&gt;

&lt;p&gt;y = wx + b&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;y is the dependent variable - the thing you are trying to predict &lt;/li&gt;
&lt;li&gt;x is the independent variable - the feature - when x changes it causes a change in y&lt;/li&gt;
&lt;li&gt;w is the slope of the line - in our machine learning example it is the weight - x will be multiplied by this value to get the value for y &lt;/li&gt;
&lt;li&gt;b is the bias - it tells us where our line intercepts the y-axis - if b = 0 the line passes through the origin, but that won't always fit our data well, which is why setting the bias is useful &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The algorithm's goal
&lt;/h2&gt;

&lt;p&gt;We want to figure out what values to use for the weight and bias so that the predictions are most accurate when compared to the ground truth. &lt;/p&gt;

&lt;p&gt;Remember my violin price predictions example from &lt;a href="https://dev.to/madeline_pc/machine-learning-log-3-a-brief-summary-of-supervised-learning-55f9"&gt;Machine Learning Log 3: a brief overview of supervised learning&lt;/a&gt;? In that article we talked about using different features to predict the change in the price of a violin, where the overall goal was to predict the price of a given violin. &lt;/p&gt;

&lt;p&gt;To make it simpler for this article we are going to consider only one variable for our price prediction. We will predict the price of the violin only depending on the age of the violin. We'll pretend that there is a linear relationship such that when the age of the violin goes up the price also goes up (this is vaguely true in the real world). &lt;/p&gt;

&lt;p&gt;So as we have it now: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;y = violin price &lt;/li&gt;
&lt;li&gt;x = violin age&lt;/li&gt;
&lt;li&gt;w = ?&lt;/li&gt;
&lt;li&gt;b = ?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We want to find out what values to use for the weight w and the bias b so that we can fit the slope of our line to our data. Then we can use this information to predict the price y from a new feature x. &lt;/p&gt;

&lt;h2&gt;
  
  
  Visually
&lt;/h2&gt;

&lt;p&gt;Here's a little graph I drew to represent some possible data points for violins' prices and ages. You can see that when the age goes up so does the price. (This is fictional data, but violins do tend to appreciate in value.) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1612125819752%2FDM5r77kmM.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1612125819752%2FDM5r77kmM.jpeg" alt="graph_small.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  In code terms:
&lt;/h2&gt;

&lt;p&gt;Let's say we have a 100-year-old violin that costs, I don't know, $8000. We have no idea what to use for w and b yet, so to make our prediction we'll start our weight and bias off at random numbers: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fim9y8xcyc42qqvuck3te.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fim9y8xcyc42qqvuck3te.PNG" alt="code example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the weight and bias are started off at random numbers, we're basically taking a stab in the dark guess at this point. &lt;/p&gt;

&lt;p&gt;In this case the result of running my little code above was &lt;code&gt;y_prediction = 56.75&lt;/code&gt; which is &lt;em&gt;very&lt;/em&gt; wrong (since our ground truth is $8000). This isn't a surprise, since we set our weight and bias to random numbers to begin with. Now that we know how wrong our guess was, we can update the weight and bias to gradually get closer to the real world price of $8000. That is where a cost function and an optimizer come in.&lt;/p&gt;
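&lt;p&gt;As a text version of what that screenshot is doing, here's a sketch - the starting values for w and b below are hypothetical, picked so the arithmetic lands on the same 56.75: &lt;/p&gt;

```python
# y = wx + b with a first, uninformed guess for w and b
# (these starting values are hypothetical - chosen so the arithmetic
# lands on the same 56.75 as the screenshot above)
w = 0.5
b = 6.75

x = 100  # the violin's age
y_prediction = w * x + b
print(y_prediction)  # 56.75 - very far from the real $8000
```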

&lt;p&gt;In reality we would be running the predictions with more than just one sample from our dataset of violins. Using just one violin to understand the relationship between age and price won't give us very accurate predictions, but I presented it that way to simplify the example.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cost function: seeing how wrong we are
&lt;/h1&gt;

&lt;p&gt;This is where stuff really gets fun! In order to learn from our mistakes we need to have a way to mathematically represent how wrong we are. We do this by comparing the actual price values to the predicted values we got. We find the difference between those values. We use that difference to adjust the weights and bias to fit our values more closely. We rinse and repeat, getting incrementally less wrong, until our predictions are as close as possible to the real world values.&lt;/p&gt;

&lt;p&gt;Tune in next week as I explain this process! I'll look at a cost function and describe the process of gradient descent. But I think that's enough for this week. I want to keep these posts fairly short so they can be written and consumed more or less within a Sunday afternoon. 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  That's all for now, folks!
&lt;/h2&gt;

&lt;p&gt;I am still learning and welcome feedback. As always, if you notice anything extremely wrong in my explanation, or have suggestions on how to explain something better, please let me know in the comments! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
