<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rishit Dagli</title>
    <description>The latest articles on Forem by Rishit Dagli (@rishitdagli).</description>
    <link>https://forem.com/rishitdagli</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F248871%2Fefe7add7-54b4-4025-b38a-e26b3dceb03d.PNG</url>
      <title>Forem: Rishit Dagli</title>
      <link>https://forem.com/rishitdagli</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rishitdagli"/>
    <language>en</language>
    <item>
      <title>How to Build Better Machine Learning Models</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Wed, 28 Apr 2021 02:19:57 +0000</pubDate>
      <link>https://forem.com/rishitdagli/how-to-build-better-machine-learning-models-3ff0</link>
      <guid>https://forem.com/rishitdagli/how-to-build-better-machine-learning-models-3ff0</guid>
      <description>&lt;p&gt;Hello developers 👋. If you have built Deep Neural Networks before, you might know that it can involve a lot of experimentation.&lt;/p&gt;

&lt;p&gt;In this article, I will share some useful tips and guidelines that you can use to build better deep learning models. These tricks should make it a lot easier for you to develop a good network.&lt;/p&gt;

&lt;p&gt;You can pick and choose which tips you use, as some will be more helpful for the projects you are working on. Not everything mentioned in this article will straight up improve your models’ performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  A high-level approach for Hyperparameter tuning🕹️
&lt;/h2&gt;

&lt;p&gt;One of the more painful things about training Deep Neural Networks is the large number of hyperparameters you have to deal with.&lt;/p&gt;

&lt;p&gt;These could be your learning rate α, the discounting factor ρ, and epsilon ε if you are using the RMSprop optimizer (&lt;a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf" rel="noopener noreferrer"&gt;Hinton et al.&lt;/a&gt;) or the exponential decay rates β₁ and β₂ if you are using the Adam optimizer (&lt;a href="https://arxiv.org/abs/1412.6980" rel="noopener noreferrer"&gt;Kingma et al.&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You also need to choose the number of layers in the network or the number of hidden units for the layers. You might be using learning rate schedulers and would want to configure those features and a lot more 😩! We definitely need ways to better organize our hyperparameter tuning process.&lt;/p&gt;

&lt;p&gt;A common algorithm I tend to use to organize my hyperparameter search process is Random Search. Though there are other algorithms that might be better, I usually end up using it anyway.&lt;/p&gt;

&lt;p&gt;Let’s say for the purpose of this example you want to tune two hyperparameters and you suspect that the optimal values for both would be somewhere between one and five.&lt;/p&gt;

&lt;p&gt;The idea here is that instead of picking twenty-five values to try out like (1, 1) (1, 2) and so on systematically, it would be more effective to select twenty-five points at random.&lt;/p&gt;
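&lt;p&gt;To make the contrast concrete, here is a small sketch (the ranges and seed are illustrative) comparing the two sampling strategies. With a grid, each hyperparameter only ever takes five distinct values; with random sampling, every one of the twenty-five points tries a new value along each dimension:&lt;/p&gt;

```python
import random

random.seed(0)  # illustrative seed, only for reproducibility

# Grid search: 25 fixed combinations; each hyperparameter
# takes only 5 distinct values across all trials
grid = [(a, b) for a in range(1, 6) for b in range(1, 6)]

# Random search: 25 random points in the same square; each
# hyperparameter takes up to 25 distinct values, so the more
# important dimension gets explored far more finely
random_points = [(random.uniform(1, 5), random.uniform(1, 5))
                 for _ in range(25)]
```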

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4y2a099qx0subwq5rgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4y2a099qx0subwq5rgf.png" width="800" height="403"&gt;&lt;/a&gt;&lt;br&gt;Based on Lecture Notes of &lt;a href="https://www.andrewng.org/" rel="noopener noreferrer"&gt;Andrew Ng&lt;/a&gt;
  &lt;/p&gt;

&lt;p&gt;Here is a simple example with TensorFlow where I try to use Random Search on the Fashion MNIST Dataset for the learning rate and the number of units in the first Dense layer:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here I suspect that an optimal number of units in the first Dense layer would be somewhere between 32 and 512, and my learning rate would be one of 1e-2, 1e-3, or 1e-4.&lt;/p&gt;

&lt;p&gt;Consequently, as shown in this example, I set my minimum value for the number of units to be 32 and the maximum value to be 512 and have a step size of 32. Then, instead of hardcoding a value for the number of units, I specify a range to try out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hp_units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp_units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We do the same for our learning rate, but our learning rate is simply one of 1e-2, 1e-3, or 1e-4 rather than a range.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hp_learning_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;1e-2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp_learning_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, we perform Random Search and specify that, among all the models we build, the one with the highest validation accuracy is considered the best model; in other words, getting a good validation accuracy is the goal.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tuner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_builder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;objective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="n"&gt;max_trials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;random_search_starter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;project_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intro_to_kt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;tuner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After doing so, I also want to retrieve the best model and the best hyperparameter choice, though I would like to point out that using &lt;code&gt;get_best_models&lt;/code&gt; is usually considered a shortcut.&lt;/p&gt;

&lt;p&gt;To get the best performance you should retrain your model with the best hyperparameters you get on the full dataset.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Which was the best model?
&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tuner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_best_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# What were the best hyperparameters?
&lt;/span&gt;&lt;span class="n"&gt;best_hyperparameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tuner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_best_hyperparameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I won’t be talking about this code in detail in this article, but you can read about it in &lt;a href="https://towardsdatascience.com/the-art-of-hyperparameter-tuning-in-deep-neural-nets-by-example-685cb5429a38" rel="noopener noreferrer"&gt;this article&lt;/a&gt; I wrote some time back if you want.&lt;/p&gt;
&lt;h2&gt;
  
  
  Use Mixed Precision Training for large networks🎨
&lt;/h2&gt;

&lt;p&gt;The bigger your neural network is, the more accurate your results (in general). As model sizes grow, the memory and compute requirements for training these models also increase.&lt;/p&gt;

&lt;p&gt;The idea behind Mixed Precision Training (NVIDIA, &lt;a href="https://arxiv.org/abs/1710.03740" rel="noopener noreferrer"&gt;Micikevicius et al.&lt;/a&gt;) is to train deep neural networks using half-precision floating-point numbers, which lets you train large networks a lot faster with no or negligible decrease in their performance.&lt;/p&gt;

&lt;p&gt;But, I’d like to point out that this technique should only be used for large models with more than 100 million parameters or so.&lt;/p&gt;

&lt;p&gt;While mixed-precision would run on most hardware, it will only speed up models on recent NVIDIA GPUs (for example Tesla V100 and Tesla T4) and Cloud TPUs.&lt;/p&gt;

&lt;p&gt;I want to give you an idea of the performance gains when using Mixed Precision. When I trained a ResNet model on my GCP Notebook instance (with a Tesla V100), training was almost three times faster, and almost 1.5 times faster on a Cloud TPU instance, with almost no difference in accuracy. The code to measure these speed-ups was taken from &lt;a href="https://www.tensorflow.org/guide/mixed_precision" rel="noopener noreferrer"&gt;this example&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To further increase your training throughput, you could also consider using a larger batch size; since float16 tensors take half the memory, you are less likely to run out of it.&lt;/p&gt;

&lt;p&gt;It is also rather easy to implement Mixed Precision with TensorFlow. With TensorFlow you could easily use the &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/" rel="noopener noreferrer"&gt;tf.keras.mixed_precision&lt;/a&gt; Module that allows you to set up a data type policy (to use float16) and also apply loss scaling to prevent underflow.&lt;/p&gt;

&lt;p&gt;Here is a minimalistic example of using Mixed Precision Training on a network:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
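&lt;p&gt;The gist embedded here did not render in this feed, so below is a hedged sketch of the idea (the layer sizes are illustrative; under this policy, &lt;code&gt;Model.compile&lt;/code&gt; applies loss scaling automatically):&lt;/p&gt;

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Set the global dtype policy: layers compute in float16
# while keeping their variables in float32
mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    # Large Dense layers, so the float16 speedup is actually visible
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(4096, activation='relu'),
    # Override the output layer to float32 to prevent numeric issues
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32'),
])
model.build(input_shape=(None, 784))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```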



&lt;p&gt;In this example, we first set the &lt;code&gt;dtype&lt;/code&gt; policy to be float16 which implies that all of our model layers will automatically use float16.&lt;/p&gt;

&lt;p&gt;After doing so, we build a model but override the data type of the last (output) layer to float32 to prevent any numeric issues. Ideally, your output layers should be float32.&lt;/p&gt;

&lt;p&gt;Note: I’ve built a model with this many units so that we can see some difference in training time with Mixed Precision Training, since it works well for large models.&lt;/p&gt;

&lt;p&gt;If you are looking for more inspiration to use Mixed Precision Training, here is an image demonstrating speedup for multiple models by Google Cloud on a TPU:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd8cy4pt3g9zid538vb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd8cy4pt3g9zid538vb6.png" alt="Speedups on a Cloud TPU" width="800" height="494"&gt;&lt;/a&gt;**&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi96vmr13vamw0341vrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi96vmr13vamw0341vrn.png" width="800" height="494"&gt;&lt;/a&gt;&lt;br&gt;Speedups on a Cloud TPU
  &lt;/p&gt;

&lt;h2&gt;
  
  
  Use Grad Check for backpropagation✔️
&lt;/h2&gt;

&lt;p&gt;In multiple scenarios, I have had to custom implement a neural network. And implementing backpropagation is typically the aspect that’s prone to mistakes and is also difficult to debug.&lt;/p&gt;

&lt;p&gt;With incorrect backpropagation your model could learn something which might look reasonable, which makes it even more difficult to debug. So, how cool would it be if we could implement something which could allow us to debug our neural nets easily?&lt;/p&gt;

&lt;p&gt;I often use Gradient Check when implementing backpropagation to help me debug it. The idea here is to approximate the gradients using a numerical approach. If it is close to the calculated gradients by the backpropagation algorithm, then you can be more confident that the backpropagation was implemented correctly.&lt;/p&gt;

&lt;p&gt;You can use the following expression to get a vector, which we will call &lt;code&gt;dθ[approx]&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6p1fpca2td24ncfv855.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6p1fpca2td24ncfv855.png" width="667" height="147"&gt;&lt;/a&gt;&lt;br&gt;Calculate approx gradients‌‌
  &lt;/p&gt;

&lt;p&gt;In case you are looking for the reasoning behind this, you can find more about it in &lt;a href="https://towardsdatascience.com/debugging-your-neural-nets-and-checking-your-gradients-f4d7f55da167" rel="noopener noreferrer"&gt;this article&lt;/a&gt; I wrote.&lt;/p&gt;

&lt;p&gt;So, now we have two vectors &lt;code&gt;dθ[approx]&lt;/code&gt; and &lt;code&gt;dθ&lt;/code&gt; (calculated by backprop). And these should be almost equal to each other. You could simply compute the Euclidean distance between these two vectors and use this reference table to help you debug your nets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlx2i0bpbdq79041fp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlx2i0bpbdq79041fp5.png" width="789" height="291"&gt;&lt;/a&gt;&lt;br&gt;Reference table
  &lt;/p&gt;
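&lt;p&gt;Here is a minimal NumPy sketch of the whole procedure (the helper name &lt;code&gt;grad_check&lt;/code&gt; and the normalized form of the distance are my choices):&lt;/p&gt;

```python
import numpy as np


def grad_check(f, theta, dtheta, epsilon=1e-7):
    """Compare analytic gradients `dtheta` against a centered-difference
    approximation of the gradient of `f` at `theta`."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += epsilon    # nudge one parameter up...
        minus.flat[i] -= epsilon   # ...and down
        dtheta_approx.flat[i] = (f(plus) - f(minus)) / (2 * epsilon)
    # Normalized Euclidean distance between the two gradient vectors
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den
```

For example, for f(θ) = Σθ² a correct backprop gives dθ = 2θ, and the returned distance should be tiny (well below 1e-7).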

&lt;h2&gt;
  
  
  Cache Datasets💾
&lt;/h2&gt;

&lt;p&gt;Caching datasets is a simple idea but it’s not one I have seen used much. The idea here is to go over the dataset in its entirety and cache it either in a file or in memory (if it is a small dataset).&lt;/p&gt;

&lt;p&gt;This should save you from performing some expensive CPU operations like file opening and data reading during every single epoch.&lt;/p&gt;

&lt;p&gt;This also means that your first epoch will take comparatively more time 📉, since ideally you perform all the expensive operations like opening files and reading data during the first epoch and then cache the results. The subsequent epochs should be a lot faster since they use the cached data.&lt;/p&gt;

&lt;p&gt;This seems like a very simple idea to implement, right? Here is an example with TensorFlow showing how easily you can cache datasets. It also shows the speedup 🚀 from implementing this idea. Find the complete code for the example in &lt;a href="https://gist.github.com/Rishit-dagli/5d06c69c69e990f9e15249e15002bb07" rel="noopener noreferrer"&gt;this gist&lt;/a&gt; of mine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva5iww3hhi0ko8ixrbfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva5iww3hhi0ko8ixrbfp.png" width="800" height="349"&gt;&lt;/a&gt;&lt;br&gt;A simple example of caching datasets and the speedup with it
  &lt;/p&gt;
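&lt;p&gt;For reference, here is a tiny self-contained sketch of the pattern (the toy dataset and the mapping step stand in for expensive file reads):&lt;/p&gt;

```python
import tensorflow as tf

# A toy dataset; in practice the map step is where expensive per-epoch
# CPU work (file opening, decoding, parsing) would normally happen
dataset = tf.data.Dataset.range(5).map(lambda x: x * 2)

# cache() stores elements in memory after the first full pass;
# cache('cache.tmp') would spill to a file for datasets that don't fit
dataset = dataset.cache()

# The first epoch pays the mapping cost and fills the cache;
# later epochs read straight from it
first_epoch = list(dataset.as_numpy_iterator())
second_epoch = list(dataset.as_numpy_iterator())
```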

&lt;h2&gt;
  
  
  How to tackle overfitting ⭐
&lt;/h2&gt;

&lt;p&gt;When you’re working with neural networks, overfitting and underfitting might be two of the most common problems you face. This section talks about some common approaches that I use when tackling these problems.&lt;/p&gt;

&lt;p&gt;You might know this, but high bias will cause you to miss a relationship between features and labels (underfitting) and high variance will cause the model to capture the noise and overfit to the training data.&lt;/p&gt;

&lt;p&gt;I believe the most effective way to solve overfitting is to get more data — though you could also augment your data. A benefit of deep neural networks is that their performance improves as they are fed more and more data.&lt;/p&gt;

&lt;p&gt;But in a lot of situations, it might be too expensive to get more data or it simply might not be possible to do so. In that case, let’s talk about a couple of other methods you could use to tackle overfitting.&lt;/p&gt;

&lt;p&gt;Apart from getting more data or augmenting your data, you could also tackle overfitting either by changing the architecture of the network or by applying some modifications to the network’s weights. Let’s look at these two methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Changing the Model Architecture
&lt;/h3&gt;

&lt;p&gt;A simple way to change the architecture such that it doesn’t overfit would be to use Random Search to stumble upon a good architecture. Or you could try pruning nodes from your model, essentially lowering the capacity of your model.&lt;/p&gt;

&lt;p&gt;We already talked about Random Search, but in case you want to see an example of pruning you could take a look at the &lt;a href="https://www.tensorflow.org/model_optimization/guide/pruning" rel="noopener noreferrer"&gt;TensorFlow Model Optimization Pruning Guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modifying Network Weights
&lt;/h3&gt;

&lt;p&gt;In this section, we will see some methods I commonly use to prevent overfitting by modifying a network’s weights.&lt;/p&gt;

&lt;h4&gt;
  
  
  Weight Regularization
&lt;/h4&gt;

&lt;p&gt;Coming back to what we discussed: “simpler models are less likely to overfit than complex ones”. We put a cap on the complexity of the network by forcing its weights to take only small values.&lt;/p&gt;

&lt;p&gt;To do so we will add to our loss function a term that can penalize our model if it has large weights. Often L₁ and L₂ regularizations are used, the difference being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L1 - The penalty added is ∝ to |weight coefficients|&lt;/li&gt;
&lt;li&gt;L2 - The penalty added is ∝ to |weight coefficients|²&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;where |x| represents absolute values.&lt;/p&gt;

&lt;p&gt;Do you notice the difference between L1 and L2, the square term? Because of it, L1 may push weights all the way to zero, whereas L2 pushes weights toward zero without making them exactly zero.&lt;/p&gt;

&lt;p&gt;In case you are curious about exploring this further, &lt;a href="https://towardsdatascience.com/solving-overfitting-in-neural-nets-with-regularization-301c31a7735f" rel="noopener noreferrer"&gt;this article&lt;/a&gt; goes deep into regularizations and might help.&lt;/p&gt;

&lt;p&gt;This is also the exact reason why I tend to use L2 more than L1 regularization. Let’s see an example of this with TensorFlow.&lt;/p&gt;

&lt;p&gt;Here I show some code to create a simple Dense layer with 3 units and the L2 regularization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_regularizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regularizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;L2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide more clarity on what this does: as we discussed above, this adds a term (0.1 × weight_coefficient_value²) to the loss function, which works as a penalty on very big weights. Also, it is as easy as replacing L2 with L1 in the above code to implement L1 for your layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dropouts
&lt;/h4&gt;

&lt;p&gt;The first thing I do when I am building a model and face overfitting is try using dropouts (&lt;a href="https://jmlr.org/papers/v15/srivastava14a.html" rel="noopener noreferrer"&gt;Srivastava et al.&lt;/a&gt;). The idea here is to randomly drop out or set to zero (ignore) x% of output features of the layer during training.&lt;/p&gt;

&lt;p&gt;We do this to stop individual nodes from relying on the output of other nodes and to prevent them from co-adapting with each other too much.&lt;/p&gt;

&lt;p&gt;Dropouts are rather easy to implement with TensorFlow since they are available as layers. Here is an example of me trying to build a model to differentiate images of dogs and cats with Dropout to reduce overfitting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv2D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMG_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_WIDTH&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MaxPooling2D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv2D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MaxPooling2D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Flatten&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigmoid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see in the code above, you can directly use &lt;code&gt;tf.keras.layers.Dropout&lt;/code&gt; to implement dropout, passing it the fraction of output features to ignore (here, 20%).&lt;/p&gt;

&lt;h4&gt;
  
  
  Early stopping
&lt;/h4&gt;

&lt;p&gt;Early stopping is another regularization method I often use. The idea is to monitor the model's performance on a validation set at every epoch and terminate training when some specified condition on the validation performance is met (for example, stop training when loss &amp;lt; 0.5).&lt;/p&gt;

&lt;p&gt;It turns out that a basic condition like the one above works like a charm when your training and validation errors look something like those in the image below. In that case, Early Stopping would simply stop training when it reaches the red box (drawn for demonstration) and prevent overfitting outright.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It (Early stopping) is such a simple and efficient regularization technique that Geoffrey Hinton called it a “beautiful free lunch”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;— Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r1zz9c69yy7xz2ev56q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r1zz9c69yy7xz2ev56q.png" width="740" height="546"&gt;&lt;/a&gt;&lt;br&gt;Adapted from &lt;a href="https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5" rel="noopener noreferrer"&gt;Lutz Prechelt&lt;/a&gt;
  &lt;/p&gt;

&lt;p&gt;However, in some cases the choice of criterion is not so straightforward, and it can be hard to know when Early Stopping should terminate training.&lt;/p&gt;

&lt;p&gt;For the scope of this article, we will not be talking about more criteria here, but I would recommend that you check out “&lt;a href="https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5" rel="noopener noreferrer"&gt;Early Stopping — But When, Lutz Prechelt&lt;/a&gt;” which I use a lot to help decide criteria.&lt;/p&gt;

&lt;p&gt;Let’s see an example of Early Stopping in action with TensorFlow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="n"&gt;callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([...])&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the example above, we create an Early Stopping callback and specify that we want to monitor the loss. We also specify (via &lt;code&gt;patience&lt;/code&gt;) that training should stop if the loss does not show a noticeable improvement for 3 consecutive epochs. Finally, we pass this callback when fitting the model.&lt;/p&gt;

&lt;p&gt;Also, for the purpose of this example, I show a Sequential model — but this could work in the exact same manner with a model created with the functional API or sub-classed models, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank you for reading!
&lt;/h2&gt;

&lt;p&gt;Thank you for sticking with me until the end. I hope you will benefit from this article and incorporate these tips in your own experiments.&lt;/p&gt;

&lt;p&gt;I am excited to see if they help you improve the performance of your neural nets, too. If you have any feedback or suggestions for me please feel free to &lt;a href="https://twitter.com/rishit_dagli" rel="noopener noreferrer"&gt;reach out to me on Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Art of Hyperparameter Tuning in Deep Neural Nets by Example🎨</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Fri, 22 Jan 2021 13:00:49 +0000</pubDate>
      <link>https://forem.com/rishitdagli/the-art-of-hyperparameter-tuning-in-deep-neural-nets-by-example-194k</link>
      <guid>https://forem.com/rishitdagli/the-art-of-hyperparameter-tuning-in-deep-neural-nets-by-example-194k</guid>
      <description>&lt;p&gt;Hello developers 👋, If you have worked on building Deep Neural Networks earlier you might know that building neural nets can involve setting a lot of different hyperparameters. In this article, I will share with you some tips and guidelines you can use to better organize your hyperparameter tuning process which should make it a lot more efficient for you to stumble upon a good setting for the hyperparameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a hyperparameter anyway?🤷‍♀️
&lt;/h2&gt;

&lt;p&gt;Very simply, a hyperparameter is a parameter that is external to the model: it cannot be learned within the estimator, and you cannot calculate its value from the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Many models have important parameters which cannot be directly estimated from the data. This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;— Page 64, 65, &lt;a href="https://link.springer.com/book/10.1007/978-1-4614-6849-3"&gt;Applied Predictive Modeling&lt;/a&gt;, 2013&lt;/p&gt;

&lt;p&gt;Hyperparameters are often used in the processes that help estimate the model parameters, and they usually have to be specified by you. In most cases, you tune them with some heuristic approach based on your experience, perhaps starting from known-good values, or you find the best values by trial and error for a given problem.&lt;/p&gt;

&lt;p&gt;As I was saying at the start of this article, one of the painful things about training Deep Neural Networks is the large number of hyperparameters you have to deal with. These could be your learning rate α, the discounting factor ρ, and epsilon ϵ if you are using the RMSprop optimizer (&lt;a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf"&gt;Hinton et al.&lt;/a&gt;) or the exponential decay rates β₁ and β₂ if you are using the Adam optimizer (&lt;a href="https://arxiv.org/abs/1412.6980"&gt;Kingma et al.&lt;/a&gt;). You also need to choose the number of layers in the network or the number of hidden units for the layers, you might be using learning rate schedulers and would want to configure that and a lot more😩! We definitely need ways to better organize our hyperparameter tuning process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which hyperparameters are more important?📑
&lt;/h2&gt;

&lt;p&gt;Usually, we can categorize hyperparameters into two groups: the hyperparameters used for &lt;em&gt;training&lt;/em&gt; and those used for &lt;em&gt;model design&lt;/em&gt;🖌️. A proper choice of the training hyperparameters allows a neural network to learn faster and achieve better performance, so the tuning process is definitely something you want to care about. The model design hyperparameters are more related to the structure of the network, a trivial example being the number of hidden layers and the width of these layers, and in most cases they serve as a measure of a model’s learning capacity🧠.&lt;/p&gt;

&lt;p&gt;During the training process, I usually give the most attention to the &lt;strong&gt;learning rate α&lt;/strong&gt; and the &lt;strong&gt;batch size&lt;/strong&gt;, because these determine the speed of convergence, so you should consider tuning them first or giving them more attention. That said, I strongly believe that for most models the learning rate α is the most important hyperparameter to tune, consequently deserving the most attention; later in this article we will discuss ways to select it. Also note that I say &lt;em&gt;“usually”&lt;/em&gt; here: this could certainly change depending on the kind of application you are building.&lt;/p&gt;

&lt;p&gt;Next up, I usually consider tuning the &lt;strong&gt;momentum term β&lt;/strong&gt; in RMSprop and similar optimizers, since it helps reduce oscillation by strengthening the weight updates that point in a consistent direction while damping the changes in other directions. I often suggest β = 0.9, which works as a very good default and is the value most often used.&lt;/p&gt;

&lt;p&gt;After that, I would try to tune the number of hidden units for each layer, followed by the number of hidden layers (both of which change the model structure), followed by the learning rate decay, which we will see soon. Note: the order suggested in this paragraph has worked well for me and was originally suggested by &lt;a href="https://twitter.com/AndrewYNg"&gt;Andrew Ng&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, a really helpful summary about the order of importance of hyperparameters from lecture notes of Ng was compiled by Tong Yu and Hong Zhu in &lt;a href="https://arxiv.org/abs/2003.05689"&gt;their paper&lt;/a&gt; suggesting this order: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Learning rate α&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Momentum β, for RMSprop, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mini-batch size&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Number of hidden layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learning rate decay&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Regularization λ&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common approaches for the tuning process🕹️
&lt;/h2&gt;

&lt;p&gt;The things we talk about in this section are ideas I find important and apply while tuning any hyperparameter. So we will not discuss tuning a specific hyperparameter, but concepts that apply to all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Random Search
&lt;/h3&gt;

&lt;p&gt;Random Search and Grid Search (a precursor of Random Search) are by far the most widely used methods because of their simplicity. It used to be very common to sample points in a grid and then systematically perform an exhaustive search over the hyperparameter set specified by the user; this works well for several hyperparameters with a limited search space. As the diagram below shows, Grid Search asks us to sample the points systematically and try out those values, after which we can choose the one which best suits us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rFoU6JDs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2720/1%2ABF18NQqvpJJefSI5Dg4RYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rFoU6JDs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2720/1%2ABF18NQqvpJJefSI5Dg4RYA.png" alt="Based on Lecture Notes of [Andrew Ng](https://www.andrewng.org/)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To perform hyperparameter tuning for deep neural nets, it is often recommended to instead choose points at random. So in the image above, we choose the same number of points but do not follow a systematic approach like the one on the left side. The reason is that it is difficult to know in advance which hyperparameters will matter most for your problem. Say hyperparameter 1 matters a lot for your problem and hyperparameter 2 contributes very little: with the grid you essentially get to try out just 5 values of hyperparameter 1, and you might find almost the same results across the values of hyperparameter 2 since it does not contribute much. On the other hand, with random sampling you explore the set of possible values much more richly.&lt;/p&gt;

&lt;p&gt;So, we can use random search in the early stage of the tuning process to rapidly narrow down the search space, before we start using a more guided algorithm to obtain finer results, going from a coarse to a fine sampling scheme. Here is a simple example with TensorFlow where I use &lt;em&gt;Random Search&lt;/em&gt; on the &lt;em&gt;Fashion MNIST&lt;/em&gt; Dataset for the learning rate and the number of units:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
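&lt;p&gt;As a minimal illustration of the point above (a NumPy sketch, not the full Keras Tuner example): a 9-trial grid revisits only 3 distinct values of each hyperparameter, while 9 random trials give a fresh value of each hyperparameter on every trial.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

# Grid search: 9 trials, but only 3 distinct values per hyperparameter.
grid_lr = np.repeat([1e-2, 1e-3, 1e-4], 3)
grid_units = np.tile([64, 128, 256], 3)

# Random search: 9 trials, a fresh value of each hyperparameter every time,
# sampling the learning rate on the log scale.
rand_lr = 10.0 ** rng.uniform(-4.0, -2.0, size=9)
rand_units = rng.integers(32, 512, size=9)
```

&lt;p&gt;If hyperparameter 2 barely matters, the grid effectively tested only 3 settings of hyperparameter 1, while the random trials tested 9.&lt;/p&gt;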


&lt;h3&gt;
  
  
  Hyperband
&lt;/h3&gt;

&lt;p&gt;Hyperband (&lt;a href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Li et al.&lt;/a&gt;) is another algorithm I use quite often. It is essentially an improvement over Random Search that incorporates adaptive resource allocation and early stopping to converge quickly on a high-performing model: we train a large number of models for a few epochs and carry forward only the top-performing half of the models to the next round.&lt;/p&gt;

&lt;p&gt;Early stopping is particularly useful for deep learning scenarios where a deep neural network is trained over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.&lt;/p&gt;
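&lt;p&gt;The carry-forward loop at the heart of this can be sketched in a few lines of plain Python. This is a simplified illustration rather than full Hyperband, which also grows the per-round budget and runs several such brackets; the &lt;code&gt;evaluate&lt;/code&gt; scoring function here is hypothetical.&lt;/p&gt;

```python
def successive_halving(configs, evaluate, budget):
    """Score every config on a small budget, keep the top half, repeat."""
    while len(configs) != 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        configs = ranked[: max(1, len(ranked) // 2)]
    return configs[0]

# Toy run: pretend a config's score is simply its value.
best = successive_halving([0.1, 0.5, 0.3, 0.9], lambda c, b: c, budget=2)  # 0.9
```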

&lt;p&gt;Here is a simple example with TensorFlow where I try to use &lt;em&gt;Hyperband&lt;/em&gt; on the &lt;em&gt;Fashion MNIST&lt;/em&gt; Dataset for the learning rate and the number of units:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Choosing a Learning Rate α🧐
&lt;/h2&gt;

&lt;p&gt;I am particularly interested in talking more about choosing an appropriate learning rate α since, for most learning applications, it is the most important hyperparameter to tune and consequently deserves the most attention. Having a constant learning rate is the most straightforward approach and is often set as the default schedule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;optimizer = tf.keras.optimizers.Adam(*learning_rate = 0.01*)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;However, it turns out that with a constant LR the network can often be trained only to a sufficient, but unsatisfactory, accuracy, because the initial value can prove too large, especially in the final few steps of gradient descent. The optimal learning rate depends on the topology of your loss landscape, which in turn depends on both the model architecture and the dataset. An optimal learning rate gives a steep drop in the loss function; decreasing the learning rate below it still decreases the loss, but at a very shallow rate, while increasing it beyond the optimum causes the loss to bounce around the minimum. Here is a figure to sum this up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AZAD75ST--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A6N2b7dtwmTJiMjnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AZAD75ST--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A6N2b7dtwmTJiMjnw.png" alt="Source: [Jeremy Jordan](https://www.jeremyjordan.me/)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You now understand why it is important to choose a learning rate effectively, so let’s talk a bit about updating your learning rate while training, i.e. setting a schedule. A prevalent technique known as &lt;em&gt;learning rate annealing&lt;/em&gt; recommends starting with a relatively high learning rate and then gradually lowering it📉 during training. As an example, I could start with a learning rate of 10⁻²; when the accuracy saturates or I reach a plateau, I could lower the learning rate to, say, 10⁻³, and then to 10⁻⁵ if required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;— Stanford &lt;a href="https://cs231n.github.io/neural-networks-3/#annealing-the-learning-rate"&gt;CS231n&lt;/a&gt; Course Notes by &lt;a href="https://profiles.stanford.edu/fei-fei-li/"&gt;Fei-Fei Li&lt;/a&gt;, &lt;a href="https://ranjaykrishna.com/index.html"&gt;Ranjay Krishna&lt;/a&gt;, and &lt;a href="https://cs.stanford.edu/~danfei/"&gt;Danfei Xu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going back to the example above, I suspect that my learning rate should be somewhere between 10⁻² and 10⁻⁵. If I simply sample my learning rate &lt;em&gt;uniformly&lt;/em&gt; across this range, about 90% of the resources go to the range 10⁻³ to 10⁻², which does not make sense. You should rather sample the LR on the log scale, which lets you devote an equal amount of resources to 10⁻³ to 10⁻² and to 10⁻⁴ to 10⁻³. Now it should be super easy for you😎 to understand exponential decay, a widely used LR schedule (&lt;a href="https://arxiv.org/abs/1910.07454"&gt;Li et al.&lt;/a&gt;). An exponential schedule provides a more drastic decay at the beginning and a gentle decay when approaching convergence.&lt;/p&gt;
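&lt;p&gt;The uniform-versus-log-scale point can be sketched quickly (an illustrative NumPy snippet using the range from the example above):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw the exponent uniformly in [-5, -2], then exponentiate: each decade
# (e.g. 10**-3 to 10**-2 and 10**-4 to 10**-3) gets an equal share of trials.
exponents = rng.uniform(-5.0, -2.0, size=1000)
learning_rates = 10.0 ** exponents
```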

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i69Y6Q1U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjwoVXxV3EozcNnOb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i69Y6Q1U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjwoVXxV3EozcNnOb.jpeg" alt="Exponential decay"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example showing how we can perform exponential decay with TensorFlow:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
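&lt;p&gt;Under the hood the schedule is a single formula; here is a plain-Python sketch of the rule that &lt;code&gt;tf.keras.optimizers.schedules.ExponentialDecay&lt;/code&gt; computes (shown without TensorFlow, with illustrative numbers):&lt;/p&gt;

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    # lr(step) = initial_lr * decay_rate ** (step / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)

# Drastic decay at the beginning, gentle decay approaching convergence.
lr_start = exponential_decay(0.01, 0.96, 1000, 0)     # 0.01
lr_later = exponential_decay(0.01, 0.96, 1000, 1000)  # 0.01 * 0.96
```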



&lt;p&gt;Also, note that the initial value is influential and must be carefully determined; in this case, you might want to use a comparatively large value because it will decay during training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Momentum Term β🧐
&lt;/h2&gt;

&lt;p&gt;To help you understand why I will later suggest some good default values for the momentum term, I would first like to show why a momentum term is used in RMSprop and some other optimizers. The idea behind RMSprop is to accelerate gradient descent, like its precursors Adagrad (&lt;a href="https://jmlr.org/papers/v12/duchi11a.html"&gt;Duchi et al.&lt;/a&gt;) and Adadelta (&lt;a href="https://arxiv.org/abs/1212.5701"&gt;Zeiler et al.&lt;/a&gt;), but it gives superior performance when steps become smaller. RMSprop uses exponentially weighted averages of the squared gradients instead of using ∂w and ∂b directly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T-GEJ77M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3704/1%2A70XNhGRJwZy922kprhoqtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T-GEJ77M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3704/1%2A70XNhGRJwZy922kprhoqtw.png" alt="Adding the β term in RMSprop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have guessed by now, the moving average term β should be a value between 0 and 1. In practice, 0.9 works well most of the time (it is also suggested by &lt;a href="https://www.cs.toronto.edu/~hinton/coursera/lecture15/lec15.pdf"&gt;Geoffrey Hinton&lt;/a&gt;) and I would say it is a really good default. You would often consider trying values between 0.9 (averaging across roughly the last 10 values) and 0.999 (averaging across roughly the last 1000 values). Here is a really wonderful diagram summarizing the effect of β, where the:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;red-colored line represents &lt;em&gt;β = 0.9&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;green-colored line represents &lt;em&gt;β = 0.98&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B9pkmFgm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3234/1%2AanyyXiUMn2kOHclxF6lgfQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B9pkmFgm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3234/1%2AanyyXiUMn2kOHclxF6lgfQ.png" alt="Source: [Andrew Ng](https://www.andrewng.org/)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, with smaller values of β the sequence fluctuates a lot, because we are averaging over a smaller number of examples and are therefore much closer to the noisy data. With bigger values of β, like 0.98, we get a much smoother curve, but it is shifted a little to the right because we average over a larger number of examples. So overall, 0.9 provides a good balance, but as mentioned earlier you would often consider trying values between 0.9 and 0.999.&lt;/p&gt;
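&lt;p&gt;The curves in that diagram come from a simple exponentially weighted average; here is a small sketch (the bias correction for the zero-initialized early steps is a standard addition, included here for completeness):&lt;/p&gt;

```python
def ewa(values, beta):
    """Exponentially weighted average with bias correction."""
    avg, out = 0.0, []
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * v
        out.append(avg / (1.0 - beta ** t))  # bias-corrected estimate
    return out

noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smooth_09 = ewa(noisy, 0.9)    # fluctuates more, closer to the data
smooth_098 = ewa(noisy, 0.98)  # smoother, lags behind
```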

&lt;p&gt;Earlier in this article, we saw that when searching for a good learning rate α it does not make sense to search on the &lt;em&gt;linear scale&lt;/em&gt;; we search on the logarithmic scale instead. Similarly, if you are searching for a good value of β, it would not make sense to sample uniformly at random between 0.9 and 0.999. A simple trick is to instead search for 1 - β in the range 10⁻¹ to 10⁻³, again on the log scale. Here is some sample code to generate such values, which we can then search across:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r = -2 * np.random.rand() # gives us random values between -2, 0
r = r - 1                 # convert them to -3, -1
beta = 1 - 10**r
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Choosing Model Design Hyperparameters🖌️
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Number of Hidden Layers d
&lt;/h3&gt;

&lt;p&gt;I will not give specific rules or suggestions for the number of hidden layers, and the examples below should make it clear why.&lt;/p&gt;

&lt;p&gt;The number of &lt;em&gt;hidden layers d&lt;/em&gt; is a pretty critical parameter for determining the overall structure of a neural network and has a direct influence on the final output. I have almost always seen that deeper networks obtain more complex features and relatively higher accuracy, making added depth a regular approach to achieving better results.&lt;/p&gt;

&lt;p&gt;As an example, the ResNet model (&lt;a href="https://arxiv.org/abs/1512.03385"&gt;He et al.&lt;/a&gt;) can be scaled up from ResNet-18 to ResNet-200 simply by using more layers, repeating the baseline structure as needed for accuracy. Recently, Yanping Huang et al. in &lt;a href="https://arxiv.org/abs/1811.06965"&gt;their paper&lt;/a&gt; achieved 84.3% ImageNet top-1 accuracy by scaling a baseline model up to four times its size!&lt;/p&gt;
&lt;h3&gt;
  
  
  Number of Neurons w
&lt;/h3&gt;

&lt;p&gt;Having covered the number of layers, the number of neurons w in each layer must also be carefully considered. Too few neurons in the hidden layers may cause underfitting because the model lacks complexity. By contrast, too many neurons may result in overfitting and increase training time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nt2WWS8F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATzMis7bbuaU1OE2q64hnbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nt2WWS8F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATzMis7bbuaU1OE2q64hnbg.png" alt="Do the number of neurons matter?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A couple of suggestions by &lt;a href="https://www.heatonresearch.com/2017/06/01/hidden-layers.html"&gt;Jeff Heaton&lt;/a&gt; that have worked like a charm can be a good starting point for tuning the number of neurons. To make them easy to state, here I use wᵢₙₚᵤₜ for the number of neurons in the input layer and wₒᵤₜₚᵤₜ for the number of neurons in the output layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The number of hidden neurons should be between the size of the input layer and the size of the output layer: wₒᵤₜₚᵤₜ &amp;lt; w &amp;lt; wᵢₙₚᵤₜ&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer: w = 2/3 wᵢₙₚᵤₜ + wₒᵤₜₚᵤₜ&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of hidden neurons should be less than twice the size of the input layer: w &amp;lt; 2 wᵢₙₚᵤₜ&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
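&lt;p&gt;As a quick illustration of the second rule, with hypothetical layer sizes (e.g. 784 inputs and 10 outputs for a Fashion MNIST style classifier):&lt;/p&gt;

```python
def heaton_hidden_units(n_inputs, n_outputs):
    # Rule of thumb: two-thirds the size of the input layer,
    # plus the size of the output layer.
    return (2 * n_inputs) // 3 + n_outputs

units = heaton_hidden_units(784, 10)  # 532
```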

&lt;p&gt;And that is it for choosing the model design hyperparameters🖌️. I would also personally suggest looking into tuning the regularization parameter λ, which has a sizable impact on the model weights. If you are interested, I address it in detail in this blog I wrote some time back:&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="https://towardsdatascience.com/solving-overfitting-in-neural-nets-with-regularization-301c31a7735f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d5AEZf3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/fit/c/56/56/1%2A6AzeIrXO1Wuj8LBfGdA7FQ.jpeg" alt="Rishit Dagli"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://towardsdatascience.com/solving-overfitting-in-neural-nets-with-regularization-301c31a7735f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Solving Overfitting in Neural Nets With Regularization | by Rishit Dagli | Towards Data Science&lt;/h2&gt;
      &lt;h3&gt;Rishit Dagli ・ &lt;time&gt;Apr 18, 2020&lt;/time&gt; ・ 10 min read
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KBvj_QRD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/medium_icon-90d5232a5da2369849f285fa499c8005e750a788fdbf34f5844d5f2201aae736.svg" alt="Medium Logo"&gt;
        towardsdatascience.com
      &lt;/div&gt;
    &lt;/h3&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



&lt;p&gt;Yeah! With all of these concepts, I hope you can start tuning your hyperparameters more effectively, build better models, and begin perfecting the “Art of Hyperparameter Tuning”🚀. I hope you liked this article.&lt;/p&gt;

&lt;p&gt;If you liked this article, share it with everyone😄! Sharing is caring! Thank you!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many thanks to &lt;a href="https://www.linkedin.com/in/askingalexander/"&gt;Alexandru Petrescu&lt;/a&gt; for helping me to make this better :)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>💡TensorFlow Tip, use .cache</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Fri, 01 Jan 2021 14:28:55 +0000</pubDate>
      <link>https://forem.com/rishitdagli/tensorflow-tip-use-cache-14h7</link>
      <guid>https://forem.com/rishitdagli/tensorflow-tip-use-cache-14h7</guid>
      <description>&lt;p&gt;💡#TensorFlowTip&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;.cache&lt;/code&gt; to save on ops like opening files and reading data during each epoch&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transformations before cache are only run for 1st epoch&lt;/li&gt;
&lt;li&gt;can cache in-memory or on-disk&lt;/li&gt;
&lt;li&gt;avoids repeatedly performing expensive CPU ops&lt;/li&gt;
&lt;/ul&gt;
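&lt;p&gt;A minimal sketch of where &lt;code&gt;.cache&lt;/code&gt; goes in a &lt;code&gt;tf.data&lt;/code&gt; pipeline (the dataset and map here are toy placeholders, not a real input pipeline):&lt;/p&gt;

```python
import tensorflow as tf

# Toy pipeline: Dataset.range stands in for reading files from disk.
ds = tf.data.Dataset.range(5)
ds = ds.map(lambda x: x * 2)  # pretend this is an expensive transformation
ds = ds.cache()               # map runs once; later epochs read cached elements
ds = ds.shuffle(5).batch(2)   # transformations after .cache run every epoch
```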

&lt;p&gt;See the speedup with &lt;code&gt;.cache&lt;/code&gt; in this image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvxyj87tlaw4xjp3t0kom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvxyj87tlaw4xjp3t0kom.png" alt="Alt Text" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try it out for yourself:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
      <category>tensorflowtip</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tensorflow</category>
    </item>
    <item>
      <title>💡TensorFlow Tip, use .prefetch</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Sat, 12 Dec 2020 13:36:33 +0000</pubDate>
      <link>https://forem.com/rishitdagli/tensorflow-tip-optimize-your-training-46j2</link>
      <guid>https://forem.com/rishitdagli/tensorflow-tip-optimize-your-training-46j2</guid>
      <description>&lt;p&gt;💡 #TensorFlowTip&lt;br&gt;
Use &lt;code&gt;.prefetch&lt;/code&gt; to reduce your step time of training and extracting data&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overlap the preprocessing and model execution&lt;/li&gt;
&lt;li&gt;while the model executes training step &lt;code&gt;n&lt;/code&gt;, the input pipeline is reading the data for step &lt;code&gt;n+1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;reduces the idle time for the GPU and CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the speedup with &lt;code&gt;.prefetch&lt;/code&gt; in this image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FEpBaGTaVEAErATO%3Fformat%3Dpng%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FEpBaGTaVEAErATO%3Fformat%3Dpng%26name%3Dlarge" width="1983" height="863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try it for yourself: &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
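&lt;p&gt;&lt;em&gt;A minimal sketch of the tip above on a toy &lt;code&gt;tf.data&lt;/code&gt; pipeline (a real pipeline would read and preprocess files in the steps before the prefetch):&lt;/em&gt;&lt;/p&gt;

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8).batch(4)
# While the model consumes batch n, the pipeline prepares batch n+1
# in the background; AUTOTUNE picks the buffer size dynamically.
dataset = dataset.prefetch(tf.data.AUTOTUNE)

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
```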


</description>
      <category>tensorflowtip</category>
      <category>machinelearning</category>
      <category>tensorflow</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Computer Vision with TensorFlow</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Sat, 12 Dec 2020 10:30:21 +0000</pubDate>
      <link>https://forem.com/rishitdagli/computer-vision-with-tensorflow-2j8i</link>
      <guid>https://forem.com/rishitdagli/computer-vision-with-tensorflow-2j8i</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/get-started-with-tensorflow-and-deep-learning-part-1-72c7d67f99fc" rel="noopener noreferrer"&gt;Get started with TensorFlow and Deep Learning&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/computer-vision-with-tensorflow-part-2-57e95cd0551" rel="noopener noreferrer"&gt;Computer Vision with TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/using-convolutional-neural-networks-with-tensorflow-part-3-35de28a5621" rel="noopener noreferrer"&gt;Using Convolutional Neural Networks with TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/extending-what-convolutional-nets-can-do-251f3021529c" rel="noopener noreferrer"&gt;Extending what Convolutional Neural Nets can do&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/working-with-complex-image-data-for-cnns-187fb4526893" rel="noopener noreferrer"&gt;Working with Complex Image data for CNNs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All the code used here is available at the GitHub repository &lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow-Blog-series/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Rishit-dagli" rel="noopener noreferrer"&gt;
        Rishit-dagli
      &lt;/a&gt; / &lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="noopener noreferrer"&gt;
        Deep-Learning-With-TensorFlow
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      All the resources and hands-on exercises for you to get started with Deep Learning in TensorFlow
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Deep-Learning-With-TensorFlow &lt;a href="https://twitter.com/intent/tweet?text=Wow:&amp;amp;url=https%3A%2F%2Fgithub.com%2FRishit-dagli%2FDeep-Learning-With-TensorFlow-Blog-series" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/540f6359724858de83d2d79cd10afbb825b354f54ad7073e4dd51dd7995f6c2d/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f75726c3f7374796c653d736f6369616c2675726c3d68747470732533412532462532466769746875622e636f6d2532465269736869742d6461676c69253246446565702d4c6561726e696e672d576974682d54656e736f72466c6f772d426c6f672d736572696573" alt="Twitter URL"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/96889048f8a9014fdeba2a891f97150c6aac6e723f5190236b10215a97ed41f3/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open in Colab button"&gt;&lt;/a&gt;
&lt;a href="https://mybinder.org/v2/gh/Rishit-dagli/Deep-Learning-With-TensorFlow/master" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7861126a7eb56440456a50288331e87b9604edbaa125354195637561fd400014/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667" alt="Binder"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow-Blog-series/blob/master/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b144ea099060d089c19cecad74be811f92b28c4ee3f2dd03f019b5e43f48dd3a/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f5269736869742d6461676c692f446565702d4c6561726e696e672d576974682d54656e736f72466c6f772d426c6f672d736572696573" alt="GitHub license"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Rishit-dagli" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8b6e0053ed6dea02f85865dd0fa1760e3794223224682ba39e708f0596472a88/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f666f6c6c6f776572732f5269736869742d6461676c693f6c6162656c3d466f6c6c6f77267374796c653d736f6369616c" alt="GitHub followers"&gt;&lt;/a&gt;
&lt;a href="https://twitter.com/intent/follow?screen_name=rishit_dagli" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e3c2d6c4abe6ec314d64dc1c772b52228d249369b3cbab53179b06847b28fb9a/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f666f6c6c6f772f7269736869745f6461676c693f7374796c653d736f6369616c" alt="Twitter Follow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Consider leaving a ⭐ if you like this series of tutorials.&lt;/p&gt;
&lt;p&gt;This repo gets you started with Deep Learning with TensorFlow. It is all about coding Machine Learning and Deep Learning algorithms. I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. For each part you first have a blog to read, then one or more Jupyter Notebooks to try out, and exercises to do. These blogs are also available as a blog series on &lt;a href="http://bit.ly/dlwithtf" rel="nofollow noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More parts coming soon!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Table of Contents&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%201-Getting%20started%20with%20TensorFlow%20and%20Deep%20Learning" rel="noopener noreferrer"&gt;Getting started with TensorFlow and Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%202-Computer%20Vision%20with%20TensorFlow" rel="noopener noreferrer"&gt;Computer Vision with TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%203-Using%20Convolutional%20Neural%20Networks%20with%20TensorFlow" rel="noopener noreferrer"&gt;Using Convolutional Neural Networks with TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%204-Extending%20what%20Convolutional%20Neural%20Nets%20can%20do" rel="noopener noreferrer"&gt;Extending what Convolutional Neural Nets can do&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%205-Working%20with%20Complex%20Image%20data%20for%C2%A0CNNs" rel="noopener noreferrer"&gt;Working with Complex Image data for CNNs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Do you need any setup?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;No! You would not need any hardware or software setup to work your…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;This is the second part of the series where I post about TensorFlow for Deep Learning and Machine Learning. In the earlier blog post, you learned how Machine Learning and Deep Learning form a new programming paradigm. This time you’re going to take that to the next level by beginning to solve problems of computer vision with just a few lines of code! I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I recommend you play around with these exercises, change the hyperparameters, and experiment with the code. We will also work with some real-life data sets and apply the discussed algorithms to them. If you have not read the previous article, consider reading it before this one &lt;a href="https://medium.com/@rishit.dagli/get-started-with-tensorflow-and-deep-learning-part-1-72c7d67f99fc" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Fitting straight lines seems like the “Hello, World” of learning algorithms, the most basic implementation. But one of the most amazing things about machine learning is that the same core idea, fitting a relationship between x and y, is what lets us do remarkable things: have a computer look at a picture and recognize an activity, or tell us whether it shows a dress, a pair of pants, or a pair of shoes. Computer vision is a really hard problem to solve, because how would you write rules for that? How would you say, if this pixel then it’s a shoe, if that pixel then it’s a dress? It’s really hard to do, so learning from labeled samples is the right way to go. One of the non-intuitive things about vision is that while it’s easy for a person to look at you and say you’re wearing a shirt, it’s very hard for a computer to figure that out.&lt;/p&gt;

&lt;p&gt;Because it’s so easy for humans to recognize objects, it’s almost difficult to understand why this is a complicated thing for a computer to do. The computer only sees numbers, all the pixel brightness values, and it has to learn that a particular pattern of those numbers corresponds to a black shirt. It’s amazing that with machine learning and deep learning, computers are getting really good at this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computer Vision
&lt;/h2&gt;

&lt;p&gt;Computer vision is the field of having a computer understand and label what is present in an image. When you look at the image below, you can interpret what a shirt is or what a shoe is, but how would you program for that? If an extraterrestrial who had never seen clothing walked into the room with you, how would you explain shoes to them? It’s really difficult, if not impossible, right? And it’s the same problem with computer vision. So one way to solve it is to use lots of pictures of clothing, tell the computer what each is a picture of, and then have the computer figure out the patterns that distinguish a shoe from a shirt, a handbag, and a coat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lw9by1ph4lxvr9u8r1t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lw9by1ph4lxvr9u8r1t.jpeg" alt="Image example" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Get started
&lt;/h2&gt;

&lt;p&gt;Fortunately, there’s a data set called Fashion MNIST (not to be confused with the handwritten-digit MNIST data set; that one is your exercise) which gives 70,000 images spread across 10 different items of clothing. These images have been scaled down to 28 by 28 pixels. Now usually, the smaller the better, because the computer has less processing to do. But of course, you need to retain enough information to be sure that the features and the object can still be distinguished. If you look at the image you can still tell the difference between shirts, shoes, and handbags, so this size does seem to be ideal, and it makes it great for training a neural network. The images are also in grayscale, so the amount of information is also reduced. Each pixel is represented by a value from zero to 255, so it’s only one byte per pixel. With 28 by 28 pixels in an image, only 784 bytes are needed to store the entire image. Despite that, we can still see what’s in the image below, and in this case, it’s an ankle boot, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq3e4vrdtyo7w6nmlhkd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq3e4vrdtyo7w6nmlhkd.jpeg" alt="An image from the data set" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can learn more about the Fashion MNIST data set at this GitHub repository &lt;a href="https://github.com/zalandoresearch/fashion-mnist" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can also download the data set from there.&lt;/p&gt;

&lt;p&gt;So what will handling this look like in code? In the previous blog post, you learned about TensorFlow and Keras, and how to define a simple neural network with them. Here, you are going to use them to go a little deeper but the overall API should look familiar. The one big difference will be in the data. The last time you had just your six pairs of numbers, so you could hard code it. This time you have to load 70,000 images off the disk, so there will be a bit of code to handle that. Fortunately, it’s still quite simple because Fashion MNIST is available as a data set with an API call in TensorFlow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fashion_mnist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fashion_mnist&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;              &lt;span class="n"&gt;fashion_mnist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Consider the code &lt;code&gt;fashion_mnist.load_data()&lt;/code&gt;. What we are doing here is getting the Fashion MNIST data set object from Keras and calling its &lt;code&gt;load_data()&lt;/code&gt; method, which returns four arrays: &lt;code&gt;train_images&lt;/code&gt;, &lt;code&gt;train_labels&lt;/code&gt;, &lt;code&gt;test_images&lt;/code&gt; and &lt;code&gt;test_labels&lt;/code&gt;. Now, what are these, you might wonder? When building a neural network like this, it’s a good strategy to use some of your data to train the network and hold out similar data that the model hasn’t yet seen to test how good it is at recognizing the images. So in the Fashion MNIST data set, 60,000 of the 70,000 images are used to train the network, and the remaining 10,000 images, which it hasn’t previously seen, can be used to test just how well or how badly the model is performing. This code gives you those sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz41z6o830acxq5too0mr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz41z6o830acxq5too0mr.jpeg" alt="An image from data set and the code used" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So for example, the training data will contain images like this one, and a label that describes the image like this. While this image is an ankle boot, the label describing it is number nine. Now, why do you think that is? There are two main reasons. First, of course, is that computers do better with numbers than they do with texts. Second, importantly, is that this is something that can help us reduce bias. If we labeled it as an ankle boot, we would be of course biased towards English speakers. But with it being a numeric label, we can then refer to it in our appropriate language be it English, Hindi, German, Mandarin, or here, even Irish. You can learn more about bias and techniques to avoid it &lt;a href="https://developers.google.com/machine-learning/fairness-overview/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
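&lt;p&gt;&lt;em&gt;For illustration, the mapping from numeric labels to names (as listed in the Fashion MNIST repository linked above) is just a list indexed by the label, so label 9 resolves to an ankle boot in whichever language you write the list:&lt;/em&gt;&lt;/p&gt;

```python
# Fashion MNIST class names, in label order (0-9).
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

label = 9                   # the numeric label of the ankle boot above
print(class_names[label])   # -> Ankle boot
```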
&lt;h2&gt;
  
  
  Coding a Neural Network for this
&lt;/h2&gt;

&lt;p&gt;Okay. So now we will look at the code for the neural network definition. Remember last time we had a sequential with just one layer in it. Now we have three layers.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;

    &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The important things to look at are the first and the last layers. The last layer has 10 neurons in it because we have ten classes of clothing in the data set; those numbers should always match. The first layer is a &lt;code&gt;Flatten&lt;/code&gt; layer with an input shape of 28 by 28. Now, if you remember, our images are 28 by 28, so we’re specifying that this is the shape we should expect the data to be in. Flatten takes this 28 by 28 square and turns it into a simple linear array. The interesting stuff happens in the middle layer, sometimes also called a hidden layer. It has 128 neurons in it, and I’d like you to think about these as variables in a function. Maybe call them x1, x2, x3, etc. Now, there exists a rule that incorporates all of these and turns the 784 values of an ankle boot into the value nine, and similarly for all of the other 70,000 images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqnvawjjitxfxzfdncye.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqnvawjjitxfxzfdncye.jpeg" alt="A neural net layer" width="637" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, what the neural net does is figure out weights w1, w2, … , w128 such that (x1 * w1) + (x2 * w2) + … + (x128 * w128) = y&lt;/p&gt;

&lt;p&gt;You’ll see that it’s doing something very, very similar to what we did earlier when we figured out y = 2x - 1. In that case, the 2 was the weight of x; here we have y = (w1 * x1) + (w2 * x2), and so on. The important thing now is to get the code working, so you can see a classification scenario for yourself. You can also tune the neural network by adding, removing, and changing layer sizes to see the impact.&lt;/p&gt;
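&lt;p&gt;&lt;em&gt;As a toy illustration of that weighted-sum view (plain Python, not part of the original notebook), a single neuron with weight 2 and bias -1 reproduces the y = 2x - 1 rule from the previous part:&lt;/em&gt;&lt;/p&gt;

```python
# One neuron: a weighted sum of its inputs plus a bias term.
def neuron(inputs, weights, bias):
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# With a single input, weight 2 and bias -1, this is exactly y = 2x - 1.
print(neuron([3.0], [2.0], -1.0))  # -> 5.0
```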

&lt;p&gt;As discussed earlier, to finish this example and write the complete code we will use TensorFlow 2.x. Before that, we will explore a few Google Colaboratory tips, as that is what you might be using. However, you can also use Jupyter Notebooks, preferably in your local environment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Colab Tips (OPTIONAL)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Map your Google Drive&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Colab notebooks you can access your Google Drive as a network mapped drive in the Colab VM runtime.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You can access to your Drive files using this path "/content
# /gdrive/My Drive/"
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;
&lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/content/gdrive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Work with your files transparently on your computer&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can sync a Google Drive folder to your computer. Combined with the previous tip, the files on your computer will also be available in your Colab notebook.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Clean your root folder on Google Drive&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are some resources from Google that explain that having a lot of files in your root folder can slow down the process of mapping the drive. If you have a lot of files in your root folder on Drive, create a new folder and move all of them there. Having a lot of folders in the root folder will likely have a similar impact.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hardware acceleration&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For some applications, you might need a hardware accelerator like a GPU or a TPU. Say you are building a CNN: one epoch might take 90–100 seconds on a CPU but just 5–6 seconds on a GPU, and even less on a TPU. So this is definitely helpful. You can go to:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime &amp;gt; Change Runtime Type &amp;gt; Select your hardware accelerator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A fun feature&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called power level. Power level is an April Fools’ joke feature that adds sparks and combos to cell editing. Access it using:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tools &amp;gt; Settings &amp;gt; Miscellaneous &amp;gt; Select Power
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpqbl1ewt45wxq7pyfru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpqbl1ewt45wxq7pyfru.png" alt="Example of high power" width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Walk through the notebook
&lt;/h2&gt;

&lt;p&gt;The notebook is available &lt;a href="https://github.com/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Computer_vision_examples_Notebook.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you are using a local development environment, download this notebook; if you are using Colab click the open in colab button. We will also see some exercises in this notebook.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;First, we use the above code to import TensorFlow 2.x. If you are using a local development environment, you do not need lines 1–5.&lt;/p&gt;

&lt;p&gt;Then, as discussed we use this code to get the data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fashion_mnist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fashion_mnist&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fashion_mnist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We will now use &lt;code&gt;matplotlib&lt;/code&gt; to view a sample image from the dataset.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can change the 0 to other values to get other images, as you might have guessed. You’ll notice that all of the pixel values are between 0 and 255. If we are training a neural network, for various reasons it’s easier if we treat all values as between 0 and 1, a process called &lt;strong&gt;normalizing&lt;/strong&gt;, and fortunately in Python it’s easy to normalize a list like this without looping. You do it like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_images&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;span class="n"&gt;test_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_images&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now in the next code block in the notebook we define the same neural net we discussed earlier. You might have noticed one change: we use the softmax function. Softmax takes a set of values and turns them into probabilities that sum to 1, effectively picking out the biggest one. So, for example, if the output of the last layer looks like [0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05], it saves you from fishing through it looking for the biggest value and turns it into something very close to [0,0,0,0,1,0,0,0,0]. The goal is to save a lot of coding!&lt;/p&gt;
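&lt;p&gt;&lt;em&gt;A plain-Python sketch of what softmax does to that example output (just an illustration, not the notebook’s code):&lt;/em&gt;&lt;/p&gt;

```python
import math

def softmax(values):
    # Exponentiate, then normalize so the outputs sum to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05]
probs = softmax(scores)
# The largest score dominates, so the predicted class is index 4.
print(max(range(len(probs)), key=probs.__getitem__))  # -> 4
```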

&lt;p&gt;We can then fit the training images to the training labels. We’ll just do it for five epochs to be quick. We spend about 50 seconds training over those five epochs and end up with a loss of about 0.205. That means it’s pretty accurate at guessing the relationship between the images and their labels. That’s not great, but considering it was done in just 50 seconds with a very basic neural network, it’s not bad either. A better measure of performance comes from the test data: images that the network has not yet seen. You would expect performance to be somewhat worse, but if it’s much worse, you have a problem. As you can see, the loss is about 0.32, meaning the model is a little less accurate on the test set. That’s not great either, but we know we’re doing something right.&lt;/p&gt;
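&lt;p&gt;&lt;em&gt;The compile-and-fit step described above can be sketched as below. To keep the sketch self-contained and quick it trains on random placeholder arrays; with the real data you would pass &lt;code&gt;train_images&lt;/code&gt; and &lt;code&gt;train_labels&lt;/code&gt; from &lt;code&gt;load_data()&lt;/code&gt; and train for more epochs:&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Placeholder data shaped like Fashion MNIST: 64 grayscale 28x28 "images".
x = np.random.rand(64, 28, 28).astype("float32")
y = np.random.randint(0, 10, size=(64,))

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax),
])
# Labels are plain integers (0-9), hence sparse categorical crossentropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=1, verbose=0)
loss, accuracy = model.evaluate(x, y, verbose=0)
```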

&lt;p&gt;I have some questions and exercises for you, eight in all, and I recommend going through all of them; you will also explore the same example with more neurons and other changes. So have fun coding!&lt;/p&gt;

&lt;p&gt;You just built a complete Fashion MNIST classifier that can predict the category of fashion items with pretty good accuracy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Callbacks (Very important)
&lt;/h2&gt;

&lt;p&gt;How can I stop training when I reach the point I want to be at? Why do I always have to hard-code a certain number of epochs? The answer is callbacks: at the end of every epoch, you can call back to a code function and check the metrics. If they’re where you want them to be, you can cancel the training at that point.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
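&lt;p&gt;In case the embedded gist doesn’t render, here is a rough, self-contained sketch of that pattern. The stand-in classes are only there so it runs without TensorFlow; in the notebook the callback subclasses tf.keras.callbacks.Callback, and Keras supplies the model object.&lt;/p&gt;

```python
from operator import lt  # lt(a, b) means "a is less than b"

class Callback:
    """Stand-in for tf.keras.callbacks.Callback, so this runs without TensorFlow."""
    def __init__(self):
        self.model = None

class FakeModel:
    """Stand-in exposing the stop_training flag that a Keras model has."""
    def __init__(self):
        self.stop_training = False

class myCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss = logs.get('loss')
        # Cancel training once the loss drops below 0.7.
        if loss is not None and lt(loss, 0.7):
            self.model.stop_training = True

# Simulate Keras calling the hook at the end of each epoch.
cb = myCallback()
cb.model = FakeModel()
stopped_at = None
for epoch, loss in enumerate([1.2, 0.9, 0.65]):
    cb.on_epoch_end(epoch, {'loss': loss})
    if cb.model.stop_training:
        stopped_at = epoch
        break

print(stopped_at)  # 2: training cancels when loss first dips below 0.7
```

&lt;p&gt;With the real Keras class, setting stop_training is what makes model.fit halt at the end of that epoch.&lt;/p&gt;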



&lt;p&gt;It’s implemented as a separate class, but it can be in-line with your other code; it doesn’t need to be in a separate file. In it, we implement the on_epoch_end function, which Keras calls whenever an epoch ends. It also receives a logs object containing lots of useful information about the current state of training. For example, the current loss is available in the logs, so we can check it against a threshold: here I’m checking whether the loss is less than 0.7 and, if so, canceling the training. Now that we have our callback, let’s return to the rest of the code. There are two modifications we need to make. First, we instantiate the class that we just created, which we do with this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;myCallback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, in my &lt;code&gt;model.fit&lt;/code&gt;, I use the callbacks parameter and pass it this instance of the class:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And now the callback object is passed to the &lt;code&gt;callbacks&lt;/code&gt; argument of &lt;code&gt;model.fit()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use this notebook to explore more and see this code in action &lt;a href="https://github.com/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Callbacks_example_Notebook.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This notebook contains all the modifications we talked about.&lt;/p&gt;
&lt;h2&gt;
  
  
  Exercise 2
&lt;/h2&gt;

&lt;p&gt;You learned how to do classification using Fashion MNIST, a data set containing items of clothing. There’s another, similar dataset called MNIST which has items of handwriting — the digits 0 through 9.&lt;/p&gt;

&lt;p&gt;Write an MNIST classifier that trains to 99% accuracy or above, and does it without a fixed number of epochs — i.e. you should stop training once you reach that level of accuracy.&lt;/p&gt;

&lt;p&gt;Some notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It should succeed in less than 10 epochs, so it is okay to set epochs to 10, but nothing larger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When it reaches 99% or greater it should print out the string “Reached 99% accuracy so canceling training!”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use this code line to get the MNIST handwriting data set:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mnist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mnist&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here’s a Colab notebook with the question and some starter code already written — &lt;a href="https://github.com/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Excercise2_question.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  My Solution
&lt;/h2&gt;

&lt;p&gt;Wonderful! 😃 You just coded a handwriting recognizer that reaches 99% accuracy (that’s good) in less than 10 epochs. Let’s explore my solution.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;I will just go through the important parts.&lt;/p&gt;

&lt;p&gt;The callback class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;myCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Callback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_epoch_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{}):&lt;/span&gt;
    &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/nReached 99% accuracy so cancelling training!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_training&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main neural net code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
                                                  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sparse_categorical_crossentropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                                                                              &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, all you had to do was play around with the code, and the model reaches the target in just five epochs.&lt;/p&gt;

&lt;p&gt;The solution notebook is available &lt;a href="https://gist.github.com/Rishit-dagli/e4be0c08c07fced4f720c0527d8f5eb6" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous Blog&lt;/strong&gt; in this series &lt;a href="https://medium.com/@rishit.dagli/get-started-with-tensorflow-and-deep-learning-part-1-72c7d67f99fc" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;Hi everyone, I am Rishit Dagli.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/rishit_dagli" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rishit.tech/" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to ask me some questions, report a mistake, suggest improvements, or give feedback, you are free to do so by emailing me at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:rishit.dagli@gmail.com"&gt;rishit.dagli@gmail.com&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:hello@rishit.tech"&gt;hello@rishit.tech&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tensorflow</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Get started with TensorFlow and Deep Learning</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Sat, 12 Dec 2020 08:51:09 +0000</pubDate>
      <link>https://forem.com/rishitdagli/get-started-with-tensorflow-and-deep-learning-4kh</link>
      <guid>https://forem.com/rishitdagli/get-started-with-tensorflow-and-deep-learning-4kh</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/get-started-with-tensorflow-and-deep-learning-part-1-72c7d67f99fc" rel="noopener noreferrer"&gt;Get started with TensorFlow and Deep Learning&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/computer-vision-with-tensorflow-part-2-57e95cd0551" rel="noopener noreferrer"&gt;Computer Vision with TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/using-convolutional-neural-networks-with-tensorflow-part-3-35de28a5621" rel="noopener noreferrer"&gt;Using Convolutional Neural Networks with TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/extending-what-convolutional-nets-can-do-251f3021529c" rel="noopener noreferrer"&gt;Extending what Convolutional Neural Nets can do&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@rishit.dagli/working-with-complex-image-data-for-cnns-187fb4526893" rel="noopener noreferrer"&gt;Working with Complex Image data for CNNs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All the code used here is available at the GitHub repository &lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow-Blog-series/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Rishit-dagli" rel="noopener noreferrer"&gt;
        Rishit-dagli
      &lt;/a&gt; / &lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="noopener noreferrer"&gt;
        Deep-Learning-With-TensorFlow
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      All the resources and hands-on exercises for you to get started with Deep Learning in TensorFlow
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Deep-Learning-With-TensorFlow &lt;a href="https://twitter.com/intent/tweet?text=Wow:&amp;amp;url=https%3A%2F%2Fgithub.com%2FRishit-dagli%2FDeep-Learning-With-TensorFlow-Blog-series" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/540f6359724858de83d2d79cd10afbb825b354f54ad7073e4dd51dd7995f6c2d/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f75726c3f7374796c653d736f6369616c2675726c3d68747470732533412532462532466769746875622e636f6d2532465269736869742d6461676c69253246446565702d4c6561726e696e672d576974682d54656e736f72466c6f772d426c6f672d736572696573" alt="Twitter URL"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/96889048f8a9014fdeba2a891f97150c6aac6e723f5190236b10215a97ed41f3/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open in Colab button"&gt;&lt;/a&gt;
&lt;a href="https://mybinder.org/v2/gh/Rishit-dagli/Deep-Learning-With-TensorFlow/master" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7861126a7eb56440456a50288331e87b9604edbaa125354195637561fd400014/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667" alt="Binder"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow-Blog-series/blob/master/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b144ea099060d089c19cecad74be811f92b28c4ee3f2dd03f019b5e43f48dd3a/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f5269736869742d6461676c692f446565702d4c6561726e696e672d576974682d54656e736f72466c6f772d426c6f672d736572696573" alt="GitHub license"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Rishit-dagli" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8b6e0053ed6dea02f85865dd0fa1760e3794223224682ba39e708f0596472a88/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f666f6c6c6f776572732f5269736869742d6461676c693f6c6162656c3d466f6c6c6f77267374796c653d736f6369616c" alt="GitHub followers"&gt;&lt;/a&gt;
&lt;a href="https://twitter.com/intent/follow?screen_name=rishit_dagli" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e3c2d6c4abe6ec314d64dc1c772b52228d249369b3cbab53179b06847b28fb9a/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f666f6c6c6f772f7269736869745f6461676c693f7374796c653d736f6369616c" alt="Twitter Follow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Consider leaving a ⭐ if you like this series of tutorials.&lt;/p&gt;
&lt;p&gt;This repo gets you started with Deep Learning with TensorFlow. It is all about coding Machine Learning and Deep Learning algorithms. I believe in hands-on coding,
so we will have many exercises and demos which you can try yourself too. For each part you first have a blog to read, one or more Jupyter Notebooks to try, and
exercises to do. These blogs are also available as a blog series on &lt;a href="http://bit.ly/dlwithtf" rel="nofollow noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More parts coming soon!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Table of Contents&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%201-Getting%20started%20with%20TensorFlow%20and%20Deep%20Learning" rel="noopener noreferrer"&gt;Getting started with TensorFlow and Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%202-Computer%20Vision%20with%20TensorFlow" rel="noopener noreferrer"&gt;Computer Vision with TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%203-Using%20Convolutional%20Neural%20Networks%20with%20TensorFlow" rel="noopener noreferrer"&gt;Using Convolutional Neural Networks with TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%204-Extending%20what%20Convolutional%20Neural%20Nets%20can%20do" rel="noopener noreferrer"&gt;Extending what Convolutional Neural Nets can do&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow/tree/master/Part%205-Working%20with%20Complex%20Image%20data%20for%C2%A0CNNs" rel="noopener noreferrer"&gt;Working with Complex Image data for CNNs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Do you need any setup?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;No! You would not need any hardware or software setup to work your…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Rishit-dagli/Deep-Learning-With-TensorFlow" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;This is the first part of a series where I will be posting many blog posts about coding Machine Learning and Deep Learning algorithms. I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I recommend playing around with these exercises, changing the hyperparameters, and experimenting with the code; that is the best way to learn and to code efficiently. If you don’t know what hyperparameters are, don’t worry. We will walk through TensorFlow’s core capabilities and then also explore its high-level APIs, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this series for me?
&lt;/h2&gt;

&lt;p&gt;If you are looking to get started with TensorFlow, Machine Learning, and Deep Learning but don’t have any Machine Learning knowledge, this series is for you.&lt;br&gt;
If you already know about Machine Learning and Deep Learning but not about TensorFlow, you can follow along too. If you are coming from another framework or library such as Caffe, Theano, CNTK, Apple ML, or PyTorch and already know the math behind it, you can skip the sections marked OPTIONAL.&lt;br&gt;
If you know the basics of TensorFlow, you can skip some of the blog posts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Is any software setup needed?
&lt;/h2&gt;

&lt;p&gt;No!!&lt;/p&gt;

&lt;p&gt;Not at all. You don’t need to install any software to complete the exercises and follow the code. We will be using &lt;a href="http://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colaboratory&lt;/a&gt;. Colaboratory is a research tool for machine learning education and research: a Jupyter notebook environment that requires no setup to use. Colaboratory works with most major browsers and is most thoroughly tested with the latest versions of &lt;a href="https://www.google.com/chrome/browser/desktop/index.html" rel="noopener noreferrer"&gt;Chrome&lt;/a&gt;, &lt;a href="https://www.mozilla.org/en-US/firefox/" rel="noopener noreferrer"&gt;Firefox&lt;/a&gt;, and &lt;a href="https://www.apple.com/safari/" rel="noopener noreferrer"&gt;Safari&lt;/a&gt;. What’s more, it is completely free to use. All Colaboratory notebooks are stored in &lt;a href="https://drive.google.com/" rel="noopener noreferrer"&gt;Google Drive&lt;/a&gt; and can be shared just as you would share Google Docs or Sheets: simply click the Share button at the top right of any Colaboratory notebook, or follow these Google Drive &lt;a href="https://support.google.com/drive/answer/2494822?co=GENIE.Platform%3DDesktop&amp;amp;hl=en" rel="noopener noreferrer"&gt;file sharing instructions&lt;/a&gt;. And before you ask: yes, Colaboratory is reliable and fast, too.&lt;/p&gt;

&lt;p&gt;However, you can also use a local development environment, preferably Jupyter Notebooks, but we will not cover setting up a local environment in these blogs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Which Version of TensorFlow?
&lt;/h2&gt;

&lt;p&gt;The answer is that in this blog series we will cover both TensorFlow 1.x and 2.x. If you are using Colab notebooks, add this code.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Coding has been the bread and butter for developers since the dawn of computing. We’re used to creating applications by breaking down requirements into composable problems that can then be coded against. So, for example, if we have to write an application that figures out a stock analytic, maybe the price divided by the earnings, we can usually write code to get the values from a data source, do the calculation, and return the result. Or if we’re writing a game, we can usually figure out the rules. For example, if the ball hits the brick then the brick should vanish and the ball should rebound, but if the ball falls off the bottom of the screen then maybe the player loses their life.&lt;/p&gt;

&lt;p&gt;We can represent that with this diagram: rules and data go in, answers come out. Rules are expressed in a programming language, and data can come from a variety of sources, from local variables up to databases. Machine learning rearranges this diagram: we put answers and data in, and we get rules out. So instead of us as developers figuring out the rules (when should the brick be removed, when should the player’s life end, or what’s the desired analytic for any other concept), we gather a bunch of examples of what we want to see and have the computer figure out the rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rekyfq4mykyxtretlu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rekyfq4mykyxtretlu.jpeg" alt="Traditional Programming vs Machine Learning" width="514" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So consider this example: activity recognition. If I’m building a device that detects whether somebody is, say, walking, and I have data about their speed, I might write code like this. If they’re running, well, that’s a faster speed, so I could adapt my code, and if they’re biking, that’s not too bad either; I can adapt my code again. But then I have to do golf recognition too, and now my approach breaks down. Not only that, recognizing activity by speed alone is quite naive. We walk and run at different speeds uphill and downhill, and other people walk and run at different speeds than us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuhedclbudd518ygxk28.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuhedclbudd518ygxk28.jpeg" alt="A traditional programming logic" width="609" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new paradigm is that I get lots and lots of examples and then I have labels on those examples and I use the data to say this is what walking looks like, this is what running looks like, this is what biking looks like and yes, even this is what golfing looks like. So, then it becomes answers and data in with rules being inferred by the machine. A machine-learning algorithm then figures out the specific patterns in each set of data that determines the distinctiveness of each. That’s what’s so powerful and exciting about this programming paradigm. It’s more than just a new way of doing the same old thing. It opens up new possibilities that were infeasible to do before.&lt;/p&gt;

&lt;p&gt;So, now I am going to show you the basics of creating a neural network for doing this type of pattern recognition. A neural network is just a slightly more advanced implementation of machine learning and we call that deep learning. But fortunately it's actually very easy to code. So, we're just going to jump straight into deep learning. We'll start with a simple one and then we'll move on to one that does computer vision in about 10 lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hello Neural Networks
&lt;/h2&gt;

&lt;p&gt;To show how that works, let’s take a look at a set of numbers and see if you can determine the pattern between them. Okay, here are the numbers.&lt;/p&gt;

&lt;p&gt;x = -1, 0, 1, 2, 3, 4&lt;/p&gt;

&lt;p&gt;y = -3, -1, 1, 3, 5, 7&lt;/p&gt;

&lt;p&gt;You can easily figure this out: y = 2x - 1.&lt;/p&gt;

&lt;p&gt;You probably tried that out with a couple of other values and saw that it fits. Congratulations, you’ve just done the basics of machine learning in your head.&lt;/p&gt;
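&lt;p&gt;You can sanity-check the pattern you spotted with a couple of lines of plain Python:&lt;/p&gt;

```python
xs = [-1, 0, 1, 2, 3, 4]
ys = [-3, -1, 1, 3, 5, 7]

# Every pair fits y = 2x - 1.
assert [2 * x - 1 for x in xs] == ys
print("y = 2x - 1 fits all six pairs")
```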

&lt;p&gt;Okay, here’s our first line of code. This is written using Python and TensorFlow and an API in TensorFlow called keras. Keras makes it really easy to define neural networks. A neural network is basically a set of functions which can learn patterns.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = keras.Sequential([keras.layers.Dense(units = 1, input_shape = [1])])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The simplest possible neural network is one that has only one neuron in it, and that’s what this line of code does. In Keras we use the word Dense to define a layer of connected neurons. There is only one Dense here, so there is only one layer, and it has a single unit, so there is only one neuron. Successive layers in Keras are defined in a sequence, hence the word Sequential. You define the shape of what’s input to the neural network in the first (and in this case only) layer, and you can see that our input shape is super simple: it’s just one value. You’ve probably heard that for machine learning you need to know and use a lot of math: calculus, probability, and the like. It’s really good to understand that when you want to optimize your models, but the nice thing for now about TensorFlow and Keras is that a lot of that math is implemented for you in functions. There are two function roles that you should be aware of, though: loss functions and optimizers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.compile(optimizer = 'sgd', loss = 'mean_squared_error')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This line defines those for us, so let’s understand what they do.&lt;/p&gt;

&lt;p&gt;Think of it like this: the neural network has no idea of the relationship between X and Y, so it makes a random guess, say y = x + 3. It then uses the data that it knows about, the set of Xs and Ys we’ve already seen, to measure how good or how bad its guess was. The loss function measures this and gives the result to the optimizer, which figures out the next guess. The optimizer weighs how good or how bad the guess was using the data from the loss function, and the logic is that each guess should be better than the one before. As the guesses get better and better and accuracy approaches 100 percent, we use the term convergence.&lt;/p&gt;

&lt;p&gt;Here we have used mean squared error as the loss function and SGD, or stochastic gradient descent, as the optimizer.&lt;/p&gt;
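&lt;p&gt;To make the guess, measure, adjust loop concrete, here is a hand-rolled NumPy sketch of the same idea: mean squared error as the loss and a plain gradient-descent step as the optimizer. It illustrates the mechanism; it is not what Keras executes internally.&lt;/p&gt;

```python
import numpy as np

# The data from the example above.
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0   # the initial guess
lr = 0.05         # learning rate: how big each adjustment step is

for _ in range(500):
    pred = w * xs + b                 # make a guess
    error = pred - ys
    loss = (error ** 2).mean()        # measure it: mean squared error
    grad_w = 2 * (error * xs).mean()  # how the loss changes with w
    grad_b = 2 * error.mean()         # ... and with b
    w -= lr * grad_w                  # the "optimizer": step downhill
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges to about 2.0 and -1.0
```

&lt;p&gt;After 500 of these guess-and-adjust rounds, the weight and bias settle very close to the 2 and -1 of the true relationship.&lt;/p&gt;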

&lt;p&gt;Our next step is to represent the known data. These are the Xs and the Ys that you saw earlier. np.array comes from a Python library called NumPy, which makes representing data, particularly as lists, much easier. So here you can see we have one list for the Xs and another one for the Ys. The training takes place in the fit command. Here we’re asking the model to figure out how to fit the X values to the Y values. Setting epochs to 500 means that it will go through the training loop 500 times: make a guess, measure how good or how bad the guess is with the loss function, then use the optimizer and the data to make another guess, and repeat. When the model has finished training, it will give you back values using the predict method.&lt;/p&gt;

&lt;p&gt;Here’s the code of what we talked about.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xs = np.array([-1.0,  0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)

model.fit(xs, ys, epochs=500)
print(model.predict([10.0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So, what output do you expect? 19, right?&lt;/p&gt;

&lt;p&gt;But when you try this in the notebook yourself, you will see that it gives a value close to 19, not exactly 19. You might receive something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[18.987219]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Why does this happen, when the equation is y = 2x - 1?&lt;/p&gt;

&lt;p&gt;There are two main reasons:&lt;/p&gt;

&lt;p&gt;The first is that you trained it with very little data: there are only six points. Those six points are linear, but there is no guarantee that for every X the relationship will be Y equals 2X minus 1. There is a very high probability that Y equals 19 for X equals 10, but the neural network isn't positive, so it figures out a realistic value for Y. The second main reason is that neural networks deal in probability as they try to figure out the answers to everything. You'll see that a lot, and you'll have to adjust how you handle answers to fit. Keep that in mind as you work through the code. Okay, enough talking. Now let's get hands-on, write the code that we just saw, and run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-on Hello with Neural Nets
&lt;/h2&gt;

&lt;p&gt;If you have used Jupyter before you will find Colab easy to use. If you have not used Colab before consider watching this intro to it.&lt;/p&gt;



&lt;p&gt;Now you know how to use Colab so let’s get started.&lt;/p&gt;

&lt;p&gt;If you are using Jupyter Notebooks in your local environment download the code file &lt;a href="https://github.com/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Hello_Neural_Nets.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are using Colab, follow the link &lt;a href="https://colab.research.google.com/github/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Hello_Neural_Nets.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You will need to sign in with your Google account to run the code on a hosted runtime.&lt;/p&gt;

&lt;p&gt;I recommend spending some time with the notebook: try changing the epochs, observe the effect on loss and accuracy, and experiment with the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exercise — 1
&lt;/h2&gt;

&lt;p&gt;In this exercise you'll try to build a neural network that predicts the price of a house according to a simple formula. Imagine house pricing were as easy as: a house costs 50k plus 50k per bedroom, so that a 1-bedroom house costs 100k, a 2-bedroom house costs 150k, and so on. How would you create a neural network that learns this relationship, so that it predicts a 7-bedroom house as costing close to 400k?&lt;/p&gt;

&lt;p&gt;Hint: your network might work better if you scale the house prices down. You don't have to give the answer 400; it might be better to create something that predicts the number 4, with your answer then in ‘hundreds of thousands’.&lt;/p&gt;

&lt;p&gt;Note: The maximum error allowed is 400 dollars&lt;/p&gt;

&lt;p&gt;Make a new Colab Notebook and write your solution for this exercise and try it with some numbers the model has not seen.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Solution
&lt;/h2&gt;

&lt;p&gt;Wonderful, you just created a neural net all by yourself and some of you must be feeling like this (it’s perfectly ok)-&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to3im6o067sm6edx9yh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to3im6o067sm6edx9yh.jpeg" alt="Source: me.me" width="205" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s see my solution to this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def house_model(y_new):
    xs=[]
    ys=[]
    for i in range(1,10):
        xs.append(i)
        ys.append((1+float(i))*50)

    xs=np.array(xs,dtype=float)
    ys=np.array(ys, dtype=float)

model = keras.Sequential([keras.layers.Dense(units = 1, input_shape = [1])])
    model.compile(optimizer='sgd', loss='mean_squared_error')    

    model.fit(xs, ys, epochs = 4500)
    return (model.predict(y_new)[0]/100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And this gives me an answer within the required tolerance limit.&lt;/p&gt;

&lt;p&gt;Here’s the Solution to this exercise — &lt;a href="https://github.com/Rishit-dagli/Artificial-Intelligence_resources-and-notebooks/blob/master/Excercise_1_Solution.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to try this out in Colab, you can go to the link and click the “Open in Colab” button.&lt;/p&gt;

&lt;p&gt;That was it: the basics. You even created a neural net all by yourself and made pretty accurate predictions too. Next, we will look at building image classifiers with TensorFlow, so stay tuned and explore till then.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some useful resources for further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Learn more about Colab &lt;a href="https://colab.research.google.com/notebooks/welcome.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorFlow website &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorFlow docs &lt;a href="https://github.com/tensorflow/docs" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub repository &lt;a href="https://github.com/tensorflow/tensorflow" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A wonderful Neural Nets research paper &lt;a href="https://www.sciencedirect.com/science/article/abs/pii/S0893608014002135" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A wonderful visualization tool by TensorFlow &lt;a href="http://projector.tensorflow.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorFlow models and data sets &lt;a href="https://www.tensorflow.org/resources/models-datasets" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some TensorFlow research papers&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Large scale Machine Learning with TensorFlow &lt;a href="http://download.tensorflow.org/paper/whitepaper2015.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Deep Level Understanding of Recurrent Neural Network &amp;amp; LSTM with Practical Implementation in Keras &amp;amp; Tensorflow &lt;a href="https://www.academia.edu/41209114/A_Deep_Level_Understanding_of_Recurrent_Neural_Network_and_LSTM_with_Practical_Implementation_in_Keras_and_Tensorflow" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Next Blog&lt;/strong&gt; in this series &lt;a href="https://medium.com/@rishit.dagli/computer-vision-with-tensorflow-part-2-57e95cd0551" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;Hi everyone, I am Rishit Dagli.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/rishit_dagli" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rishit.tech/" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to ask me questions, report any mistake, suggest improvements, or give feedback, you are free to do so by emailing me at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:rishit.dagli@gmail.com"&gt;rishit.dagli@gmail.com&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:hello@rishit.tech"&gt;hello@rishit.tech&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tensorflow</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Build K-Means from scratch in Python</title>
      <dc:creator>Rishit Dagli</dc:creator>
      <pubDate>Sat, 12 Dec 2020 03:48:36 +0000</pubDate>
      <link>https://forem.com/rishitdagli/build-k-means-from-scratch-in-python-2140</link>
      <guid>https://forem.com/rishitdagli/build-k-means-from-scratch-in-python-2140</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to K-means Clustering
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;K&lt;/em&gt;-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable &lt;em&gt;K&lt;/em&gt;. The algorithm works iteratively to assign each data point to one of &lt;em&gt;K&lt;/em&gt; groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the &lt;em&gt;K&lt;/em&gt;-means clustering algorithm are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The centroids of the &lt;em&gt;K&lt;/em&gt; clusters, which can be used to label new data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Labels for the training data (each data point is assigned to a single cluster)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The “Choosing K” section below describes how the number of groups can be determined.&lt;/p&gt;

&lt;p&gt;This story covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Why use K-means&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The steps involved in running the algorithm&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A from-scratch Python example with a toy data set&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why use K-Means ?
&lt;/h2&gt;

&lt;p&gt;The algorithm can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.&lt;/p&gt;

&lt;p&gt;This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:&lt;/p&gt;

&lt;p&gt;Behavioral segmentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Segment by purchase history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Segment by activities on application, website, or platform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define personas based on interests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create profiles based on activity monitoring&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inventory categorization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Group inventory by sales activity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Group inventory by manufacturing metrics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sorting sensor measurements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detect activity types in motion sensors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Group images&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate audio&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify groups in health monitoring&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Detecting bots or anomalies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Separate valid activity groups from bots&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Group valid activity to clean up outlier detection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, monitoring if a tracked data point switches between groups over time can be used to detect meaningful changes in the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Algorithm
&lt;/h2&gt;

&lt;p&gt;K-Means is actually one of the simplest unsupervised clustering algorithms. Assume we have input data points x1, x2, x3, …, xn and a value of K (the number of clusters needed). We follow the procedure below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pick K points as the initial centroids from the data set, either randomly or the first K.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find the Euclidean distance of each point in the data set from each of the K identified points, the cluster centroids.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign each data point to the closest centroid using the distance found in the previous step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find the new centroid by taking the average of the points in each cluster group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repeat steps 2 to 4 for a fixed number of iterations or until the centroids stop changing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let a = (a₁, a₂) and b = (b₁, b₂) be two points, and let D be the Euclidean distance between them; then&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mm6ublhfpe4cfvs0etd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mm6ublhfpe4cfvs0etd.gif" width="267" height="23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mathematically speaking, if each cluster centroid is denoted by cᵢ, then each data point x is assigned to a cluster based on&lt;/p&gt;

&lt;p&gt;arg min (cᵢ ∈ C) D(cᵢ, x)²&lt;/p&gt;
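&lt;p&gt;This assignment rule is just an argmin over distances, which is a one-liner in NumPy. A minimal sketch with made-up centroids:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # hypothetical c_i
x = np.array([1.0, 2.0])                          # one data point

# Euclidean distance D(c_i, x) to every centroid, then pick the closest
distances = np.linalg.norm(centroids - x, axis=1)
cluster = int(np.argmin(distances))
print(cluster)  # 0: x is closer to (0, 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;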

&lt;p&gt;For finding the new centroid from clustered group of points —&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyicddttpxabw6kniqqvs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyicddttpxabw6kniqqvs.gif" width="127" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1ic71gomacj4brn9j8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1ic71gomacj4brn9j8h.png" alt="Original data points (scatter plot)" width="378" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb62r8j8g3whljvzzajm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb62r8j8g3whljvzzajm.png" alt="After K-Means" width="373" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right K
&lt;/h2&gt;

&lt;p&gt;Choosing the wrong K is a real problem, and this is what I mean:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv14ez8pepmsggfdmnzf3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv14ez8pepmsggfdmnzf3.jpeg" alt="K = 4" width="301" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was a clustering problem with 3 natural clusters, so K = 3, but if I choose K = 4 I get something like the figure above, and that will definitely not give me the kind of prediction accuracy I want from my model. Again, if I choose K = 2, I get something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flms25n6946nr0402soyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flms25n6946nr0402soyl.png" alt="K = 2" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now this isn't a good prediction boundary either. You or I can figure out the ideal K easily in this data as it is 2-dimensional, and we could figure out the K for 3-dimensional data too. But say you have 900 features and reduce them to around 100 features: you cannot visually spot the ideal K (such problems commonly occur in gene data). So you want a proper way of figuring out, or I would say guessing, an approximate K. For this we have many methods, like cross-validation, the bootstrap, AUC, and BIC, but we will discuss one of the simplest and easiest-to-understand methods, called the &lt;strong&gt;“elbow point”.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The basic idea behind this method is to plot the cost for varying values of &lt;em&gt;K&lt;/em&gt;. As the value of &lt;em&gt;K&lt;/em&gt; increases, there will be fewer elements in each cluster, so the average distortion will decrease; fewer elements means each element is closer to its centroid. The point beyond which the distortion declines much more slowly is the &lt;strong&gt;elbow point&lt;/strong&gt;, and that will be our optimal K.&lt;/p&gt;
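&lt;p&gt;Such a cost-vs-K curve can be computed with a small from-scratch helper. A minimal sketch, assuming toy two-blob data and first-K initialization (the kmeans_sse helper here is hypothetical, written just to produce the plot data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def kmeans_sse(data, k, iters=100):
    # first-K initialization, as in step 1 of the algorithm
    centroids = data[:k].copy()
    for _ in range(iters):
        # assign every point to its nearest centroid
        d = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(d, axis=1)
        # recompute each centroid as the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    # SSE: squared distance of each point to its own centroid
    return float(np.sum((data - centroids[labels]) ** 2))

# two well-separated toy blobs
data = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                 [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

sse = {k: kmeans_sse(data, k) for k in range(1, 5)}
print(sse)  # the sharp drop from k=1 to k=2 is the elbow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;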

&lt;p&gt;For the same data shown above, this is the &lt;em&gt;cost (SSE, sum of squared errors) vs K&lt;/em&gt; graph created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw44whh9eq13r4eocm8k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw44whh9eq13r4eocm8k.jpeg" alt="cost vs K graph elbow at K = 3" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, our ideal K is 3. Let us take another example where the ideal K is 4. This is the data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom715i8rgh6cdw4zs1oc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom715i8rgh6cdw4zs1oc.jpeg" alt="Our data" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, this is the cost vs K graph created-&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttoygdnka7nwm9al9w4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttoygdnka7nwm9al9w4m.png" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, we easily infer that K is 4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Implementation
&lt;/h2&gt;



&lt;p&gt;Here we will use two libraries: Matplotlib to create visualizations and NumPy to perform the calculations.&lt;/p&gt;

&lt;p&gt;So first we will just plot our data —&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import numpy as np

from matplotlib import style
style.use('ggplot')

X = np.array([[1, 2],
              [1.5, 1.8],
              [5, 8 ],
              [8, 8],
              [1, 0.6],
              [9,11]])

plt.scatter(X[:,0], X[:,1], s=150)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpigabgvbfo81lg7lsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpigabgvbfo81lg7lsp.png" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To keep our work organized we will build a class and proceed as in the steps above, also defining a tolerance: a threshold on how much the centroids may still move between iterations before we declare convergence.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self,data):

        self.centroids = {}

        for i in range(self.k):
            self.centroids[i] = data[i]

        for i in range(self.max_iter):
            self.classifications = {}

            for i in range(self.k):
                self.classifications[i] = []

            for featureset in data:
                distances = [np.linalg.norm(featureset-self.centroids[centroid]) for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classifications[classification].append(featureset)

            prev_centroids = dict(self.centroids)

            for classification in self.classifications:
                self.centroids[classification] = np.average(self.classifications[classification],axis=0)

            optimized = True

            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                # converged once every centroid's percentage movement is within the tolerance
                # (take the absolute value so movement in the negative direction also counts)
                if np.sum(np.abs((current_centroid-original_centroid)/original_centroid*100.0)) &amp;gt; self.tol:
                    print(np.sum(np.abs((current_centroid-original_centroid)/original_centroid*100.0)))
                    optimized = False

            if optimized:
                break

    def predict(self,data):
        distances = [np.linalg.norm(data-self.centroids[centroid]) for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we can do something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = K_Means()
model.fit(X)

for centroid in model.centroids:
    plt.scatter(model.centroids[centroid][0], model.centroids[centroid][1],
                marker="o", color="k", s=150, linewidths=5)

for classification in model.classifications:
    color = colors[classification]
    for featureset in model.classifications[classification]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6l40f3c6cqa0itp7f62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6l40f3c6cqa0itp7f62.png" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, you can also use the predict method on new data points. So that was K-Means, and we are done with the code 😀&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;Hi everyone, I am Rishit Dagli.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/rishit_dagli" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rishit.tech/" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to ask me questions, report any mistake, suggest improvements, or give feedback, you are free to do so by emailing me at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:rishit.dagli@gmail.com"&gt;rishit.dagli@gmail.com&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="//mailto:hello@rishit.tech"&gt;hello@rishit.tech&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
