Forem: Divyanshu Katiyar

Hyperparameter Optimisation for Machine Learning - Bayesian Optimisation

Divyanshu Katiyar — Wed, 01 Feb 2023 16:50:29 +0000

When building a machine learning or deep learning model, have you ever run into the dilemma of how to set certain parameters that directly affect your model? How many layers should you stack? How many units or filters should each layer have? What activation function should you use? These architecture-level parameters are called hyperparameters, and they play an important role in deep learning: choosing the right configuration is what lets the model reach its best performance.
In this blog we will cover some of the concepts describing how Bayesian optimisation works and how its speed compares with the random search and grid search hyperparameter optimisation methods.

Introduction

Algorithms today serve as proxies for decision-making processes traditionally performed by humans. As we automate these processes, we must repeatedly ask whether our algorithms are behaving as intended. If not, how far is the algorithm from state-of-the-art standards? Sometimes even the best trained model is not good enough, because it is still too inaccurate when run on the validation data. Hyperparameter optimisation is the process of tuning the hyperparameters of a machine learning model to improve its performance. Hyperparameters are not learned from the data, but set manually by the user. For example, if the model in question is a neural network (NN), the learning rate that the user sets for the NN counts as a hyperparameter. More complex machine learning models usually require more hyperparameters to be fine-tuned. The goal is to find the set of hyperparameters that results in the best performance of the model.

Grid Search

One of the most common methods for this kind of problem is grid search, where a pre-defined set of values is used to train and evaluate the model. The process is to train the model over and over again for every combination of values of the hyperparameters and choose the combination which results in the model delivering the best performance. However, a consequence of this is that it can be extremely time-consuming when the number of hyperparameters and the number of candidate values are large. For example, suppose we have 3 hyperparameters, each of which takes a list of 3 values. The number of combinations it has to process would be 3 x 3 x 3 = 27. More generally, if 'm' is the number of hyperparameters to be optimised and each of them contains 'n' values in a list, then the number of combinations would be
$n^m$, which grows exponentially with the number of hyperparameters. We need a more efficient way to reduce the time complexity of this fine-tuning.
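To make the growth concrete, here is a minimal sketch that enumerates the grid for the 3 x 3 x 3 example above. The hyperparameter names and candidate values are hypothetical and only serve as an illustration.

```python
from itertools import product

# Hypothetical grid: three hyperparameters with three candidate values each.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
    "num_layers": [1, 2, 3],
}

# Every combination that grid search would have to train and evaluate.
combinations = list(product(*grid.values()))

# The count is the product of the list lengths: 3 * 3 * 3.
num_combinations = len(combinations)
print(num_combinations)
```

Adding a fourth hyperparameter with three values would triple the number of training runs again, which is why grid search quickly becomes impractical.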

Random Search

Random search is another method where random values are chosen from a pre-defined range for the hyperparameters. The model is trained and evaluated for each set of random values. It is usually much faster than grid search since it uses random sampling. However, it can still be very time-consuming and might not find the best set of hyperparameters. Let us assume a hypervolume (a measure of the size of the feasible region in a multi-dimensional space) $v_{\epsilon}$ where a function takes values within $\epsilon$ of its maximum. A random sample will then fall inside this hypervolume with probability

P(\epsilon) = \frac{v_{\epsilon}}{V}

where 'V' is the total search space volume. If the total search space has volume $R^d$ and the hypervolume $v_{\epsilon}$ has volume $r^d$ ('d' being the input dimension), the random search would have to process a number of samples of the order of $\left(\frac{r}{R}\right)^{-d}$. If $r \ll R$ then this becomes really expensive!
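As a rough illustration of how quickly this grows with the dimension, the following sketch evaluates the order-of-magnitude estimate for a few values of 'd'. The ratio r/R = 0.1 is an assumed value chosen only for the example.

```python
def samples_needed(r: float, big_r: float, d: int) -> float:
    """Order of magnitude of random samples needed: (r / R) ** (-d)."""
    return (r / big_r) ** (-d)


# Assumed ratio r / R = 0.1: the good region spans a tenth of each axis.
for d in (1, 2, 5, 10):
    print(d, samples_needed(r=0.1, big_r=1.0, d=d))
```

Each extra hyperparameter multiplies the estimate by R/r, so the cost grows exponentially with the number of dimensions.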

Bayesian Optimisation

To mitigate the aforementioned problem, there is yet another method, called Bayesian optimisation. It is an alternative that typically works faster and more efficiently than the grid search and random search algorithms. It uses Bayesian inference to model the unknown function that maps the hyperparameter values to the performance of the model. The major advantage here is that it uses the information from past iterations to inform the next set of iterations. If you recall Bayes' theorem, the conditional probability of an event A, given that event B has already occurred, is given by:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

Since the normalising value P(B) does not depend on A, we can drop it and are left with a proportionality:

P(A|B) \propto P(B|A) P(A)

What we have calculated here is known as the posterior probability. It is calculated using the reverse conditional probability P(B|A), also called the likelihood, and the prior probability P(A). Suppose we have some sample values

x = \{x_1, x_2, \ldots, x_n\}

that we evaluate using an objective function f(x), and we create a dataset D out of the samples and the values returned by the function on those samples. The likelihood in this case is defined as the probability of observing the data given the function, P(D|f). Combining it with a prior P(f) gives the posterior P(f|D), which is updated after every evaluation and tells us where the best hyperparameters are most likely to be found.
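The posterior update can be illustrated with a small discrete example. This is only a sketch of Bayes' theorem itself, not of a full Bayesian optimiser: the two hypotheses ("good" and "bad" hyperparameter region) and all the probabilities are made-up numbers.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of hypotheses."""
    # Unnormalised posterior: likelihood times prior for each hypothesis.
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    # The evidence P(B) is the normalising value that makes the result sum to 1.
    evidence = sum(unnormalised.values())
    return {h: value / evidence for h, value in unnormalised.items()}


# Assumed prior belief that a region of the search space is good or bad.
prior = {"good": 0.2, "bad": 0.8}
# Assumed probability of observing a high validation score under each hypothesis.
likelihood = {"good": 0.9, "bad": 0.3}

updated = posterior(prior, likelihood)
print(updated)
```

After observing a high score, the belief that the region is good rises above its prior value, and that updated belief is what would guide the choice of the next point to evaluate.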

Applications in NLP and bricks

In NLP, hyperparameter optimisation can be used to tune the hyperparameters of a word embedding model, such as word2vec, GloVe, etc. We can train the embedding models with different sets of hyperparameters and evaluate the performance on basic NLP tasks like text classification, named entity recognition, sentiment analysis, etc. It will soon be available to be used as an active learner, along with the other active learners like random search and grid search in bricks. Since it is an active learner, there is no live runtime environment for it in bricks. Instead, one has to copy the code from the code snippet and paste it in refinery. In refinery, we use sentence embeddings (one vector per sentence) to train the ML models, which can be obtained from large language models such as BERT, or even simpler methods like TF-IDF/BoW. Word2Vec/GloVe produce word embeddings (one vector per word), which are more useful if you are interested in learning about the relationships between words and phrases. You can find more information about active transfer learning and its applications in refinery in this blog.

Here is an example which shows an implementation of Bayesian optimisation for a text classification task:

import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization

def bayesian_optimization(sentences, labels, n_iter=10, cv=5, random_state=42):
    """
    Perform Bayesian optimization to tune the hyperparameters of a word2vec model for text classification.
    `sentences` is a list of tokenised sentences (lists of words) and `labels` holds one label per sentence.
    """

    def sentence_vector(model, sentence):
        # average the vectors of the words that made it into the vocabulary
        vectors = [model.wv[word] for word in sentence if word in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)

    def cross_validation_score(dim, window, negative):
        """
        Function to evaluate the performance of the word2vec model on the text classification task
        """
        # the optimizer proposes floats, so cast them to the integers word2vec expects
        model = Word2Vec(sentences, vector_size=int(dim), window=int(window), negative=int(negative))
        X = np.array([sentence_vector(model, s) for s in sentences])  # one vector per sentence
        clf = RandomForestClassifier(random_state=random_state)
        score = cross_val_score(clf, X, labels, cv=cv).mean()
        return score

    optimizer = BayesianOptimization(
        f=cross_validation_score,
        pbounds={
            "dim": (50, 150),  # each bound is a (min, max) range to search
            "window": (2, 6),
            "negative": (5, 15)  # the hyperparameters used
        },
        random_state=random_state
    )
    optimizer.maximize(init_points=5, n_iter=n_iter)  # maximize the cross-validation score
    # cast the best (float) values back to integers before returning them
    return {name: int(value) for name, value in optimizer.max["params"].items()}

To conclude, grid search is great if you already know which hyperparameters work well and their sample space is not too big. Random search works well if you have no knowledge of the hyperparameters to fine-tune, but try not to use it on search spaces that are too large, for the reasons explained above. Bayesian optimisation is a powerful method and can be used more efficiently and effectively than the random search and grid search methods. In comparison, it can be almost 10 times faster than the grid search method, although the exact speed-up depends on the problem and the search space. It serves many crucial use cases in the domain of natural language processing. With its flexibility in the choice of the acquisition function, it can be tailored to a wide range of problems.


The theory behind Image Captioning

Divyanshu Katiyar — Mon, 09 Jan 2023 10:17:38 +0000

Introduction

One of the most challenging tasks in artificial intelligence is automatically describing the content of an image. This requires knowledge of both computer vision, using artificial neural networks, and natural language processing. It can have a great impact in many different domains, be it making it easier for visually impaired people to understand the contents of images on the web, or speeding up tedious tasks like data labelling where the data is in the form of images. In this article, we will walk through the basic concepts that are needed in order to create your own image captioning model.

Textual description of an image

In principle, converting an image into text is a very hard task. The description should not only contain the objects highlighted in the image but also the context of the image. On top of that, the output has to be expressed in a natural language like English, German, etc., so a language model is also needed to complete the picture.

Consider an image of people on a vacation hiking in the foothills of a mountain range. Let us say that we want to generate a text describing this image. The image is used as input I and is fed to the model (called the Show and Tell model, developed by Google), which is trained to maximise the likelihood p(S|I) of producing a sequence of words S = {S₁, S₂, ..., Sₙ} that describes the image accurately, where each word Sₖ comes from a given dictionary.
In order to process the input data, we use Convolutional Neural Networks (CNNs) as "encoders", and the output of the CNN is fed to a type of recurrent neural network called a Long Short-Term Memory (LSTM) network, which is responsible for generating natural language outputs. Before describing the model, let us briefly look into CNNs and LSTMs.

Convolutional neural networks

A CNN is a type of neural network which is used mainly for image classification and recognition. As the name suggests, it uses a mathematical operation called convolution to process the data. The CNN consists of an input layer, single or multiple hidden layers, and an output layer. The middle layers are called hidden because their inputs and outputs are masked by the activation function.

Convolutions operate over 3D tensors called feature maps, with two spatial axes and a channel axis. The hidden layers (convolutional layers) are made up of multiple convolutions that scan the input data and apply filters to extract output features. The output is also a 3D tensor, which is passed through a non-linear activation function in order to induce non-linearity.
The output of the convolutional layers is passed through a pooling layer, which aggressively downsamples the feature maps and reduces the computational complexity. Eventually, the output of the last pooling layer is passed through fully connected dense layers, which compute the final prediction.
Below is an example of how to instantiate a convolutional neural network in Python:

from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers

model = models.Sequential()
# four blocks of 3x3 convolution followed by 2x2 max pooling
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(120, 120, 10)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
# flatten the feature maps and classify with dense layers
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))  # single output for binary classification

model.compile(loss="binary_crossentropy",
              optimizer=optimizers.RMSprop(learning_rate=1e-3),
              metrics=['acc'])
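To see how aggressively the pooling layers shrink the feature maps, the spatial sizes through the model above can be traced by hand. The sketch below assumes the Keras defaults used in that model: 'valid' padding and stride 1 for the convolutions, and non-overlapping 2x2 pooling. The helper function names are hypothetical.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


size = 120  # height and width of the input
shapes = []
for filters in (32, 64, 128, 128):
    # each block is one convolution followed by one pooling layer
    size = pool_output_size(conv_output_size(size))
    shapes.append((size, size, filters))

# number of values handed to the first dense layer after Flatten
flattened_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)
print(flattened_units)
```

Calling model.summary() on the Keras model is the easiest way to check these shapes against what the library actually builds.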

The model above assumes that the user already has the pre-processed input data. Let us assume that this data is split into training and validation sets. In the next step, we can fit the model and save it.

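# training_data and validation_data are assumed to be data generators prepared beforehand;
# model.fit accepts generators directly (fit_generator is deprecated)
fitting = model.fit(
            training_data,
            steps_per_epoch=80,
            epochs=25,
            validation_data=validation_data,
            validation_steps=70
          )

model.save("output_data.h5")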

Long-short term memory

LSTM networks are a type of recurrent neural network (RNN) that is well suited for modelling long-term dependencies in data. The name "long short-term memory" refers to a short-term memory that can persist over many time steps: the network can remember information for long periods of time, but it can also forget information that is no longer relevant.
RNNs were developed to address the shortcomings of feedforward neural networks.
What are the problems with feedforward neural networks?

They are not designed for sequences and time series.
They do not model memory, in the sense that they do not retain information from previous data points when processing new data.

RNNs solve this issue conveniently. The recursive formula, defined as:

S_t = F_w(S_{t-1}, X_t)

states that the new state at time 't' is a function of the old state at 't-1' and input at time 't'. This makes the RNNs different from other neural nets (NNs) since NNs learn from backpropagation and RNNs learn from backpropagation through time!
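The recurrence can be sketched in a few lines of plain Python. This is a toy scalar version under stated assumptions: the function F_w is taken to be a tanh of a weighted sum, and the weight values are arbitrary illustrative numbers rather than trained parameters.

```python
import math


def rnn_step(prev_state, x, w_state=0.5, w_input=1.0):
    """One step of S_t = F_w(S_{t-1}, X_t), with F_w assumed to be tanh of a weighted sum."""
    return math.tanh(w_state * prev_state + w_input * x)


def run_rnn(inputs, initial_state=0.0):
    """Feed a sequence through the recurrence and collect every state."""
    states = []
    state = initial_state
    for x in inputs:
        state = rnn_step(state, x)  # the same weights are reused at every time step
        states.append(state)
    return states


# Only the first input is non-zero, yet later states are still non-zero:
# the state carries a memory of earlier inputs.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Because the same weights are applied at every step, the gradient of the loss has to be propagated back through all the steps, which is what backpropagation through time refers to.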

The output from this network is now used to calculate the loss.

In the image shown above, we describe a recurrent neural network which is run for, say, 100 time steps. Our aim is to compute the loss. Let us assume that at each state, our gradient is 0.01. As we go back 100 time steps, the update in our weights is

\Delta w = (0.01)^{100} \approx 0

which is negligible. Thus, the neural network won't learn at all! This is known as the vanishing gradient problem. In order to solve this, we need to add some extra interactions to the RNN. This gives rise to the Long Short-Term Memory (LSTM) network.
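The arithmetic behind this can be checked directly. The sketch below multiplies the assumed per-step gradient of 0.01 over an increasing number of time steps; the function name is hypothetical.

```python
def backpropagated_factor(per_step_gradient, steps):
    """Product of the per-step gradients accumulated when going back `steps` time steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= per_step_gradient
    return factor


# The factor shrinks by two orders of magnitude with every extra step.
for steps in (1, 10, 100):
    print(steps, backpropagated_factor(0.01, steps))
```

A weight update scaled by such a factor is far too small to change the weights in any meaningful way, so the earliest time steps effectively stop contributing to learning.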
LSTM, like any other NN, consists of three main components: input layer, single or multiple hidden layers, and the output layer. What makes it different are the operations happening in the hidden layers. The hidden layer consists of three gates - input gate, forget gate and output gate - and one cell state. The memory cells are responsible for storing and manipulating the information over time. Each memory cell is connected to a gate which decides what information stays and what information is forgotten. Gosh these machines are getting smart!
To describe the functionality of the gates mathematically, we can look at this expression:

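g_t = \sigma (W_g S_{t-1} + U_g X_t)

where 'g' represents either the input (i), forget (f), or output (o) gate; 'W_g' and 'U_g' denote the respective weights that each gate applies to the previous state and to the input, 'S' denotes the state at time step 't-1', 'X' is the input at time step 't', and σ is the sigmoid function, which squashes the result to a value between 0 and 1.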
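A gate can be sketched with scalars to show how it controls what is kept and what is forgotten. The weights, state and input values below are arbitrary illustrative numbers, and using separate weights for the previous state and for the input is an assumption of this sketch.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(prev_state, x, w_state, w_input):
    """Gate activation: sigmoid of a weighted sum of the previous state and the input."""
    return sigmoid(w_state * prev_state + w_input * x)


old_cell = 5.0  # information currently stored in the memory cell

# Strongly negative weights drive the gate towards 0: almost everything is forgotten.
closed = gate(prev_state=0.8, x=1.0, w_state=-2.0, w_input=-3.0)
# Strongly positive weights drive the gate towards 1: almost everything is kept.
opened = gate(prev_state=0.8, x=1.0, w_state=2.0, w_input=3.0)

print(closed * old_cell)
print(opened * old_cell)
```

In a real LSTM these quantities are vectors and matrices and the weights are learned during training, but the principle is the same: the gate value scales how much information passes through.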

The above image shows the functionality of the LSTM network. The important part is that the network can decide what information to discard and what to keep. This resolves the vanishing gradient problem that we face in a normal RNN.
Here is a simple implementation of the LSTM network in Keras:

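from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Embedding, LSTM

# size of the vocabulary; an assumed value, set it to match your tokenizer
max_features = 10000

model = models.Sequential()
model.add(Embedding(max_features, 32))  # map each word index to a 32-dimensional vector
model.add(LSTM(32))  # 32 LSTM units read the sequence
model.add(Dense(1, activation='sigmoid'))  # single output for binary classification

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

# input_train (padded sequences of word indices) and y_train (binary labels)
# are assumed to be prepared beforehand
model_fitting = model.fit(input_train, y_train,
                    epochs=30,
                    batch_size=128,
                    validation_split=0.2)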

Back to the textual description

Now that we have gone through the concepts behind the required tools, we can see that by using CNNs to process the image inputs and an LSTM to produce the natural language output, we can build rather accurate models to generate image captions. For that, we use a CNN as an encoder, by pre-training it first for image classification and using its last hidden layer as an input to the RNN "decoder" that generates sentences.
The model is trained to maximise the likelihood of generating the correct description given the image, which is given as:

\theta^* = \arg\max_{\theta} \sum_{(I,S)} \log p(S|I;\theta)

where θ are the parameters of the model, I is the input image, and S is the correct description, made up of the words S₁ to S_N. The loss is described as the negative log likelihood of the correct word at each step:

L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
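As a small numeric illustration of this loss, the sketch below sums the negative log probabilities that a model assigns to the correct word at each step. The probability values are made up for the example.

```python
import math


def caption_loss(word_probabilities):
    """Negative log likelihood of the correct words: -sum_t log p_t(S_t)."""
    return -sum(math.log(p) for p in word_probabilities)


# Assumed probabilities of the correct word at each of three steps.
confident = caption_loss([0.9, 0.8, 0.95])  # the model is mostly right
unsure = caption_loss([0.2, 0.1, 0.3])      # the model is mostly wrong

print(confident)
print(unsure)
```

The more probability the model puts on the correct words, the closer the loss gets to zero, so minimising this loss is the same as maximising the likelihood of the correct description.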

Once the loss is minimized and the likelihood maximized, we have to consider the epoch where the validation loss is minimum. And tada! We have our model ready. All you would have to do is to input the images and the expected output should be a sentence describing that image.

This article is more about the in-depth knowledge of the tools used to build this use case. Once you are proficient enough, you can create your own use case and build your own models for it.
