<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Paul Karikari</title>
    <description>The latest articles on Forem by Paul Karikari (@paulkarikari).</description>
    <link>https://forem.com/paulkarikari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F329686%2F5828c24f-8fdd-42fc-a4f3-81dbf9adf741.jpeg</url>
      <title>Forem: Paul Karikari</title>
      <link>https://forem.com/paulkarikari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/paulkarikari"/>
    <language>en</language>
    <item>
      <title>When Software Is Not Soft Anymore: The Nature of Software Complexity.</title>
      <dc:creator>Paul Karikari</dc:creator>
      <pubDate>Thu, 23 Apr 2020 15:43:12 +0000</pubDate>
      <link>https://forem.com/paulkarikari/when-software-is-not-soft-anymore-the-nature-of-software-complexity-5fc4</link>
      <guid>https://forem.com/paulkarikari/when-software-is-not-soft-anymore-the-nature-of-software-complexity-5fc4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;One morning you show up to work and your boss walks over and asks you to implement a feature that clients have been demanding in an application you and your team developed some time ago. &lt;br&gt;
As usual, your boss asks for an estimated delivery time and, with much enthusiasm, you promise the feature in a couple of weeks (say, 3 weeks).&lt;br&gt;
A week goes by and you haven’t been able to make any substantial contribution to the project; you spend more time than usual staring at your screen and alternating between multiple open tabs.&lt;br&gt;
Managers keep asking for updates on your progress, but you can barely give meaningful feedback. All you can say is “I’m working on it.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkG-89irfL6FwQN9MlUzRlw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkG-89irfL6FwQN9MlUzRlw.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You realize it’s now more difficult to add new features or make changes than it was when the project was freshly developed. But wait! Isn’t software supposed to be soft?&lt;br&gt;
 As in, shouldn’t software be easy to reshape into any form by making changes or adding features?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AhkhljBUO2G7ziBI-" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AhkhljBUO2G7ziBI-"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the early days, software has been hailed for how easily it can be changed, in contrast to hardware, which cannot be altered once manufactured and has to be replaced when changes are required. The ability to change software as requirements demand has always been its selling point, yet there comes a time when changing a piece of software seems close to impossible, and sometimes the project has to be rewritten from scratch, much like hardware has to be replaced.&lt;br&gt;
This resistance to adding features or making changes usually comes at an unexpected cost, given that software is supposed to accommodate change by design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AWkZJk4p-yR3ZJk0t" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AWkZJk4p-yR3ZJk0t" alt="[https://www.computerhistory.org/revolution/birth-of-the-computer/4/78](https://www.computerhistory.org/revolution/birth-of-the-computer/4/78)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes adding features or changes to software more difficult?
&lt;/h2&gt;

&lt;p&gt;Now, what causes this hindrance? What makes adding features more difficult?&lt;br&gt;
The &lt;strong&gt;complexity&lt;/strong&gt; of software is what makes it difficult to change.&lt;br&gt;
Complexity comes in many forms, and almost every developer experiences it at some point in their career.&lt;br&gt;
The ability to recognize and reduce &lt;strong&gt;complexity&lt;/strong&gt; in software is a very important skill, and it is what distinguishes a great developer from the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding software complexity.
&lt;/h2&gt;

&lt;p&gt;Complexity in software can be defined as &lt;strong&gt;anything related to the structure of the software system that makes it difficult to understand and modify.&lt;/strong&gt;&lt;br&gt;
The various forms of software complexity are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Difficulty in understanding what a piece of code does.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It takes too much effort to make a small improvement, or it’s unclear which part of the system to modify in order to make it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When fixing one bug introduces or creates another bug.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When software is difficult to understand and modify, it is considered complicated or &lt;strong&gt;complex&lt;/strong&gt;; when it’s easy to understand and modify, it’s &lt;strong&gt;simple&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Size Doesn’t Matter.
&lt;/h3&gt;

&lt;p&gt;The word &lt;strong&gt;complex&lt;/strong&gt; is often used to describe large software systems with very sophisticated features, but for the purpose of this article, a large system that is easy to understand and modify is not complex.&lt;br&gt;
Conversely, a small software system that is difficult to understand and takes too much effort to modify is considered complex.&lt;br&gt;
Complexity is what you face as a developer at a particular moment when you are trying to achieve a goal. It doesn’t relate to the overall size or functionality of the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Read More Code Than You Write.
&lt;/h3&gt;

&lt;p&gt;If you’ve been in software development for a while, I bet you’ve already come to the realization that developers read far more code than they write.&lt;br&gt;
The complexity of a piece of code is most obvious to its readers. If your own code seems simple to you but others find it difficult to understand, then it’s complex. Your job as a developer is not just to write code that works, but to write code that others can understand and work with easily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attributes Of Software Complexity
&lt;/h2&gt;

&lt;p&gt;Generally, complexity manifests in three ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change amplification&lt;/strong&gt;: This happens when multiple parts of a software system have to be modified to satisfy one simple requirement.&lt;br&gt;
e.g. Consider a web application with multiple frontend templates where colors are defined as inline CSS on each page. When the theme or color palette of the application changes, every page has to be updated.&lt;/p&gt;
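As a rough sketch in Python (the function and constant names here are hypothetical), hard-coding the same value in several places forces every copy to change together, while centralizing it removes the amplification:

```python
# Change amplification: the same brand color is hard-coded in several
# renderers, so a theme change must touch every one of these functions.
def render_header():
    return '<header style="color:#ff0000">Home</header>'

def render_footer():
    return '<footer style="color:#ff0000">Contact</footer>'

# Centralizing the value means one change restyles the whole application.
THEME_COLOR = "#ff0000"

def render_header_simple():
    return f'<header style="color:{THEME_COLOR}">Home</header>'

def render_footer_simple():
    return f'<footer style="color:{THEME_COLOR}">Contact</footer>'
```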

&lt;p&gt;&lt;strong&gt;Cognitive Load&lt;/strong&gt;: This refers to the amount of information a developer has to know about the system in order to modify it. A system that requires a developer to spend a long time learning a lot of information before accomplishing a task is said to have a high cognitive load. This can lead to unwanted bugs when a developer misses some vital information about the system.&lt;br&gt;
e.g. Using a resource-intensive class that has no built-in mechanism to free the acquired resources, but expects the developer to know when to free them.&lt;/p&gt;
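A minimal Python sketch (the `Connection` class is hypothetical): a class that supports the with-statement frees its own resources, so callers no longer have to carry that knowledge in their heads:

```python
# High cognitive load: callers must remember to call close() themselves;
# forgetting it leaks the resource.
class Connection:
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

    # Supporting the with-statement moves that knowledge into the class.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

with Connection() as conn:
    assert conn.open
assert not conn.open  # freed automatically; nothing for the caller to remember
```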

&lt;p&gt;&lt;strong&gt;Unknown unknowns&lt;/strong&gt;: This is when a developer doesn’t know which part of the software system to modify, or doesn’t know what information is needed to accomplish a task.&lt;br&gt;
e.g. When a developer is tasked with making changes to a system that uses a poorly documented library, it is very difficult to even know where to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Causes of Complexity
&lt;/h2&gt;

&lt;p&gt;Complexity is mostly caused by &lt;strong&gt;dependencies&lt;/strong&gt; and &lt;strong&gt;obscurity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;: &lt;br&gt;
A dependency exists when a piece of code can’t work in isolation but depends on other pieces of code or other parts of the software system to function properly.&lt;br&gt;
Technically, anything that a piece of software requires to do what it is intended to do can be classified as a dependency. &lt;br&gt;
Dependencies are inevitable and exist in almost every software system.&lt;br&gt;
Whenever you call a function in your code, you create a dependency between your code and the implementation of that function. When a new parameter is added to the function or its implementation changes, your code is affected directly or indirectly (remember &lt;em&gt;change amplification&lt;/em&gt;?).&lt;/p&gt;
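A tiny Python illustration (the names are hypothetical): each call site is a dependency on the function’s signature, so changing that signature would ripple out to all of them:

```python
# Every caller of apply_discount depends on its signature.
def apply_discount(price, rate):
    return price * (1 - rate)

# Two call sites -> two dependencies. If a required `currency` parameter
# were added to apply_discount, both lines below would have to change too.
checkout_total = apply_discount(100.0, 0.1)
invoice_total = apply_discount(250.0, 0.2)
```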

&lt;p&gt;&lt;strong&gt;Obscurity&lt;/strong&gt;:&lt;br&gt;
This happens when an important piece of information is not obvious. e.g. Not using meaningful variable or function names, or documentation that is so sparse it does not state the information needed to use a function properly.&lt;/p&gt;
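For example, in Python (a contrived sketch), the same logic can be obscure or obvious depending purely on naming:

```python
# Obscure: the vague names hide what this function computes.
def f(a, b):
    return a + a * b

# The identical logic with meaningful names makes the intent obvious.
def price_with_tax(price, tax_rate):
    return price + price * tax_rate

# Both compute price + price * tax_rate; only the second says so.
assert f(100, 0.05) == price_with_tax(100, 0.05)
```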

&lt;h3&gt;
  
  
  Complexity is Incremental
&lt;/h3&gt;

&lt;p&gt;Complexity doesn’t just happen. It accumulates over time.&lt;br&gt;
It builds up because many small dependencies and obscurities pile on top of each other. Eventually, this makes the system difficult to understand and modify, and tasks that are supposed to take little time end up taking far too long to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s All About Complexity
&lt;/h2&gt;

&lt;p&gt;There has been a wave of experienced engineers and industry experts giving talks, authoring books, and producing other resources with one common goal: to help other developers build maintainable software systems.&lt;br&gt;
Most of the ideas they share are techniques or methods for reducing complexity in software systems.&lt;br&gt;
Ideas like &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" rel="noopener noreferrer"&gt;DRY&lt;/a&gt;, &lt;a href="https://martinfowler.com/bliki/Yagni.html" rel="noopener noreferrer"&gt;YAGNI&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/SOLID" rel="noopener noreferrer"&gt;SOLID&lt;/a&gt;, and books like &lt;em&gt;Clean Code&lt;/em&gt; and &lt;em&gt;Clean Architecture&lt;/em&gt; by Robert C. Martin and &lt;em&gt;Refactoring&lt;/em&gt; by Martin Fowler, all have one thing in common: building software that is easy to understand and modify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A combination of dependencies and obscurity leads to complexity in software systems, which manifests as change amplification, cognitive load, and unknown unknowns.&lt;br&gt;
Building software systems with the intention of making them easy to understand and modify yields long-term benefits that might not be obvious at the beginning of a project, but it’s worth putting in the extra effort to keep things as simple as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Philosophy of Software Design by John Ousterhout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clean Code by Robert C. Martin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clean Architecture: A Craftsman’s Guide to Software Structure and Design by Robert C. Martin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Refactoring by Martin Fowler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code Simplicity: The Fundamentals of Software by Max Kanat-Alexander&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>codenewbie</category>
      <category>codequality</category>
      <category>leadership</category>
      <category>career</category>
    </item>
    <item>
      <title>Build, Train and Deploy Tensorflow Deep Learning Models on Amazon SageMaker: A Complete Workflow Guide.</title>
      <dc:creator>Paul Karikari</dc:creator>
      <pubDate>Wed, 15 Apr 2020 21:58:54 +0000</pubDate>
      <link>https://forem.com/paulkarikari/build-train-and-deploy-tensorflow-deep-learning-models-on-amazon-sagemaker-a-complete-workflow-guide-495i</link>
      <guid>https://forem.com/paulkarikari/build-train-and-deploy-tensorflow-deep-learning-models-on-amazon-sagemaker-a-complete-workflow-guide-495i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Machine learning (ML) projects mostly follow a workflow that involves generating example data, training a model, and deploying the model.&lt;br&gt;
These steps have subtasks and are iterative.&lt;br&gt;
More often than not, ML engineers and data scientists need an environment where they can experiment and prototype ideas quickly.&lt;br&gt;
After prototyping, deploying and scaling machine learning models remains a mystery known to few.&lt;/p&gt;

&lt;p&gt;It would be ideal and convenient if, without any tiresome setup, ML engineers and data scientists could easily go from experimentation or prototyping to deploying production-ready, scalable ML models. This is where &lt;strong&gt;Amazon SageMaker&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AeH3exQLyFHE8Swdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AeH3exQLyFHE8Swdp.png" alt="machine-learning workflow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Amazon SageMaker:
&lt;/h2&gt;

&lt;p&gt;SageMaker was built to provide a platform that supports the development and deployment of machine learning models.&lt;br&gt;
Quoting the official website:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.&lt;br&gt;
 Traditional ML development is a complex, expensive, iterative process made even harder because there are no integrated tools for the entire machine learning workflow. You need to stitch together tools and workflows, which is time-consuming and error-prone. SageMaker solves this challenge by providing all of the components used for machine learning in a single toolset so models get to production faster with much less effort and at lower cost.&lt;br&gt;
&lt;em&gt;source: &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Features of Amazon SageMaker:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SageMaker provides customizable Amazon ML instances with a developer-friendly notebook environment preloaded with ML frameworks and libraries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seamless integration with AWS storage services (such as S3, RDS, DynamoDB, Redshift, etc.) for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SageMaker provides 15+ most commonly used &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html" rel="noopener noreferrer"&gt;ML algorithms&lt;/a&gt; and also supports building custom algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AJgyk2EiCrwtl2FSzCPNs6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AJgyk2EiCrwtl2FSzCPNs6w.png" alt="SageMaker features"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How SageMaker Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AgXfeKtFap01dpxVk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AgXfeKtFap01dpxVk.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To train models on SageMaker, you create a training job by specifying the path to your training data on S3, the training script or built-in algorithm, and the EC2 container to train in.&lt;/p&gt;

&lt;p&gt;After training, the model artifacts are uploaded to S3. From these artifacts, a model can be created and deployed on EC2 containers with an endpoint configuration for prediction or inference.&lt;/p&gt;
&lt;h2&gt;
  
  
  What we will build
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we will build an ML model to predict the sentiment of a text.&lt;br&gt;
The details of processing the data and building the model are well explained in my previous &lt;a href="https://medium.com/datadriveninvestor/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt;; here we will focus on training and deploying the model on Amazon SageMaker. &lt;br&gt;
Optionally, you can upload the &lt;a href="https://github.com/paulkarikari/sentiment_analysis_deployment/blob/master/sentiment_analysis_tensorflow_keras_LSTM.ipynb" rel="noopener noreferrer"&gt;complete notebook&lt;/a&gt; that accompanies this tutorial to your SageMaker notebook instance and run it alongside.&lt;/p&gt;

&lt;p&gt;We are building a custom model, so it’s much more convenient to use the &lt;a href="https://sagemaker.readthedocs.io/en/stable/index.html" rel="noopener noreferrer"&gt;SageMaker Python SDK&lt;/a&gt; for training and deployment.&lt;br&gt;
The same tasks can be accomplished through the SageMaker web UI, mostly when using built-in algorithms.&lt;/p&gt;
&lt;h3&gt;
  
  
  Steps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Step 1: Create an Amazon S3 Bucket&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 2: Create an Amazon SageMaker Notebook Instance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 3: Create a Jupyter Notebook&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 4: Download, Explore, and Transform the Training Data (refer to the previous &lt;a href="https://medium.com/datadriveninvestor/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 5: Train a Model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 6: Deploy the Model to Amazon SageMaker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 7: Validate the Model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 8: Integrating Amazon SageMaker Endpoints into Internet-facing Applications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 9: Clean Up&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Create an Amazon S3 Bucket:
&lt;/h3&gt;

&lt;p&gt;First, we create an S3 bucket. This is where we will store the training data and where the model artifacts will be saved later.&lt;br&gt;
Create a bucket called tensorflow-sentiment-analysis (S3 bucket names may only contain lowercase letters, numbers, hyphens, and dots, so underscores are not allowed).&lt;/p&gt;
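A minimal sketch of creating the bucket programmatically with boto3 (assuming configured AWS credentials; the `is_valid_bucket_name` helper is just for illustration). Since S3 bucket names may not contain underscores, a hyphenated name is used:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    # S3 bucket names: 3-63 characters; lowercase letters, digits, hyphens, dots
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name) is not None

def create_training_bucket(name: str):
    # Requires boto3 and configured AWS credentials to actually run.
    import boto3
    if not is_valid_bucket_name(name):
        raise ValueError(f"invalid S3 bucket name: {name}")
    boto3.client("s3").create_bucket(Bucket=name)

# Underscores are not allowed in bucket names, so hyphens are used instead.
assert is_valid_bucket_name("tensorflow-sentiment-analysis")
assert not is_valid_bucket_name("tensorflow_sentiment_analysis")
```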
&lt;h3&gt;
  
  
  Create an Amazon SageMaker Notebook Instance:
&lt;/h3&gt;

&lt;p&gt;Go to SageMaker in the AWS console; on the left panel, click on &lt;em&gt;Notebook instances&lt;/em&gt; (1) and then click on &lt;em&gt;Create notebook instance&lt;/em&gt; (2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3472%2F1%2AcG8NIywcetTbjk0l9HCMbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3472%2F1%2AcG8NIywcetTbjk0l9HCMbQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the next page, enter a name for the notebook; any name of your choice will work. You can leave the rest at their defaults for the purpose of this tutorial. After that, click on &lt;em&gt;Create notebook instance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AEqTCgsP8fbg_gg07m4vgZw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AEqTCgsP8fbg_gg07m4vgZw.png" alt="Creating a notebook instance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook will be created; its status will be Pending for a short period of time and then switch to InService. At this stage, you can click on either Open Jupyter or Open JupyterLab; the two differ only in their UI.&lt;br&gt;
I prefer JupyterLab because it has a file explorer, supports multiple tabs for open files, and feels more like an IDE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3280%2F1%2AU5sP3Hq5HHFNkMRFUeYCDA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3280%2F1%2AU5sP3Hq5HHFNkMRFUeYCDA.png" alt="Pending status"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2768%2F1%2AluUaJwx-zMUONyL1_8DXbA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2768%2F1%2AluUaJwx-zMUONyL1_8DXbA.png" alt="InService status"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Download, Explore, and Transform the Training Data
&lt;/h3&gt;

&lt;p&gt;Download the &lt;a href="https://www.kaggle.com/crowdflower/twitter-airline-sentiment" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; and upload it to your notebook instance. Refer to this &lt;a href="https://medium.com/datadriveninvestor/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt; for the explanation of the exploration and transformation of data.&lt;/p&gt;

&lt;p&gt;The data is transformed and saved to S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-iMWq64MDPbe3w28dGoCxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-iMWq64MDPbe3w28dGoCxg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before you can use the SageMaker SDK API, you have to create a session.&lt;br&gt;
You then call upload_data with the name of the data file and a key prefix, which is the path inside the S3 bucket.&lt;br&gt;
This returns the complete S3 path of the data file, which you can query to verify, as shown above.&lt;/p&gt;
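A sketch of what that looks like with the SageMaker Python SDK (the bucket name, key prefix, and file name are illustrative; `s3_uri` is a hypothetical helper showing the shape of the returned path):

```python
def s3_uri(bucket: str, key_prefix: str, filename: str) -> str:
    # The shape of the URI that upload_data returns for a given file
    return f"s3://{bucket}/{key_prefix}/{filename}"

def upload_training_data(local_path: str, key_prefix: str) -> str:
    # Runs inside a SageMaker notebook instance, where the SDK is preinstalled.
    import sagemaker
    session = sagemaker.Session()
    return session.upload_data(path=local_path, key_prefix=key_prefix)

assert s3_uri("tensorflow-sentiment-analysis", "data", "train.csv") == \
    "s3://tensorflow-sentiment-analysis/data/train.csv"
```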
&lt;h3&gt;
  
  
  Training the Model
&lt;/h3&gt;

&lt;p&gt;To train a TensorFlow model, you use the TensorFlow estimator from the SageMaker SDK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFnbw0g2aBwHda-E7HcWuMg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFnbw0g2aBwHda-E7HcWuMg.png" alt="TensorFlow estimator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;entry_point&lt;/em&gt;:&lt;/strong&gt; This is the script that defines and trains your model. This script will be run in a container (more on this later).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;role:&lt;/em&gt;&lt;/strong&gt; The role assigned to the running notebook. You get it by running the code &lt;code&gt;role = sagemaker.get_execution_role()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;train_instance_count&lt;/em&gt;&lt;/strong&gt;: The number of container instances to spin up for training the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;train_instance_type:&lt;/em&gt;&lt;/strong&gt; The &lt;a href="https://aws.amazon.com/sagemaker/pricing/instance-types/" rel="noopener noreferrer"&gt;instance type&lt;/a&gt; of container to be used for training the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;framework_version:&lt;/em&gt;&lt;/strong&gt; The TensorFlow version used in the training script. You get it by running &lt;code&gt;tf_version = tf.__version__&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;py_version&lt;/em&gt;:&lt;/strong&gt; Python version used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;script_mode&lt;/em&gt;:&lt;/strong&gt; If set to True, the estimator will use the Script Mode containers (default: False). This will be ignored if py_version is set to ‘py3’.&lt;br&gt;
Script mode allows running arbitrary script code in a container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;hyperparameters&lt;/em&gt;&lt;/strong&gt;: These are the parameters passed to the training script.&lt;/p&gt;
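Putting the parameters together, constructing the estimator looks roughly like this (a sketch using the SDK v1 parameter names described above; the instance type, version strings, and hyperparameter values are illustrative assumptions):

```python
# Illustrative hyperparameters matching the argparse flags in the training script
hyperparameters = {"epochs": 10, "batch-size": 100, "learning-rate": 0.01}

def build_estimator(role: str):
    # Runs inside a SageMaker notebook instance, where the SDK is preinstalled.
    from sagemaker.tensorflow import TensorFlow
    return TensorFlow(
        entry_point="train.py",        # the training script shown below
        role=role,                     # role = sagemaker.get_execution_role()
        train_instance_count=1,
        train_instance_type="ml.m5.large",  # assumed instance type
        framework_version="1.15",           # assumed TF version
        py_version="py3",
        script_mode=True,
        hyperparameters=hyperparameters,
    )
```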

&lt;p&gt;Now that you know what each parameter means, let’s understand the content of the training script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

  &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
  &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
  &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.preprocessing.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tokenizer&lt;/span&gt;
  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.preprocessing.sequence&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pad_sequences&lt;/span&gt;
  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;
  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dropout&lt;/span&gt;
  &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# hyperparameters sent by the client are passed as command-line    arguments to the script.
&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pochs&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;learning&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                      &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SM_NUM_GPUS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# input data and model directories
&lt;/span&gt;     &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--model-dir'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                     &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SM_MODEL_DIR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

     &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SM_CHANNEL_TRAIN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

     &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_known_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

     &lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;
     &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;
     &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;
     &lt;span class="n"&gt;gpu_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gpu_count&lt;/span&gt;
     &lt;span class="n"&gt;model_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;
     &lt;span class="n"&gt;training_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;

     &lt;span class="n"&gt;training_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_dir&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
     &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;airline_sentiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;

     &lt;span class="n"&gt;num_of_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
     &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_of_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_on_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;vocab_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# 1 is added due to 0 index
&lt;/span&gt;
     &lt;span class="n"&gt;tweet_sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;texts_to_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
     &lt;span class="n"&gt;padded_tweet_sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pad_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet_sequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                                     &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="c1"&gt;# Build the model
&lt;/span&gt;     &lt;span class="n"&gt;embedding_vector_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
     &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_vector_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   
                                            &lt;span class="n"&gt;input_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="n"&gt;sigmoid&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="n"&gt;adam&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                               &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 

     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padded_tweet_sequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;validation_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                        &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saved_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;simple_save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
     &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
     &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line is a command that writes the contents of the cell to a file named train.py.&lt;/p&gt;

&lt;p&gt;Because SageMaker imports your training script, you should put your training code in a main guard (&lt;code&gt;if __name__ == '__main__':&lt;/code&gt;) so that SageMaker does not inadvertently run your training code at the wrong point in execution.&lt;/p&gt;

&lt;p&gt;All hyperparameters are passed to the script as command-line arguments.&lt;br&gt;
The training script also has access to environment variables set in the training container instance, such as the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SM_MODEL_DIR: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SM_NUM_GPUS: An integer representing the number of GPUs available to the host.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the Tensorflow estimator’s fit call, named ‘train’ and ‘test’, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
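&lt;p&gt;Outside of a SageMaker training container these SM_* variables are not set, so when testing the script locally it helps to fall back to defaults. A minimal sketch (the SM_* names are the real SageMaker variables; the fallback paths are placeholders of ours, not SageMaker conventions):&lt;/p&gt;

```python
import os

# SM_* are the variable names SageMaker sets inside the container; the
# fallback values are local placeholders for running the script elsewhere.
model_dir = os.environ.get("SM_MODEL_DIR", "/tmp/model")
training_dir = os.environ.get("SM_CHANNEL_TRAIN", "./data/train")
gpu_count = int(os.environ.get("SM_NUM_GPUS", "0"))
```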

&lt;p&gt;The midsection of the script is the usual model definition and training.&lt;br&gt;
The last part of the script saves the model artifacts to the S3 path provided. Take note of how the path is created by appending a numeric version to it.&lt;/p&gt;
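&lt;p&gt;The numeric suffix matters because the model is exported in SavedModel format and the SageMaker TensorFlow inference container uses TensorFlow Serving, which loads models from numbered version subdirectories. A small sketch of the path construction (the model_dir value below is illustrative):&lt;/p&gt;

```python
import os

model_dir = "/opt/ml/model"  # illustrative value of SM_MODEL_DIR
export_path = os.path.join(model_dir, "1")  # "1" is treated as the model version
print(export_path)  # /opt/ml/model/1
```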

&lt;p&gt;To start training, call the fit method and pass it the training data path. This creates a training job on SageMaker; you can check the training jobs section of the console to see the job created.&lt;/p&gt;
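&lt;p&gt;For reference, this is roughly what the estimator setup and fit call look like with the SageMaker Python SDK of that era. This is a hedged sketch, not the notebook's exact code: the role, bucket path, and instance type are placeholders, and the constructor and fit calls are commented out because they need an AWS session.&lt;/p&gt;

```python
# Hyperparameters arrive in train.py as the command-line arguments parsed above.
hyperparameters = {"epochs": 10, "batch-size": 100, "learning-rate": 0.01}

# from sagemaker.tensorflow import TensorFlow
# estimator = TensorFlow(entry_point="train.py",
#                        role=role,                     # IAM role with SageMaker permissions
#                        train_instance_count=1,
#                        train_instance_type="ml.p2.xlarge",
#                        framework_version="1.15.0",
#                        py_version="py3",
#                        hyperparameters=hyperparameters)
# estimator.fit({"train": "s3://your-bucket/sentiment/train"})  # creates the training job
```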

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Amgs7iiVBzGxe6XAhkzcIyA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Amgs7iiVBzGxe6XAhkzcIyA.png" alt="Start Training"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3838%2F1%2ABNhyQdBA9eFBvGY9ETx_fA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3838%2F1%2ABNhyQdBA9eFBvGY9ETx_fA.png" alt="Training Job in progress"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If everything goes well, you should see the output below in the last section of the output logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2494%2F1%2AEDdEkvMRDAJJNCi0gmzqAQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2494%2F1%2AEDdEkvMRDAJJNCi0gmzqAQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy the Model to Amazon SageMaker
&lt;/h3&gt;

&lt;p&gt;To deploy, we call the deploy method on the estimator, passing it the following parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;initial_instance_count:&lt;/em&gt;&lt;/strong&gt; The initial number of inference instances to launch.&lt;br&gt;
This can be scaled up if the request load increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;instance_type:&lt;/em&gt;&lt;/strong&gt; The instance type for the inference container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;endpoint_name:&lt;/em&gt;&lt;/strong&gt; A unique name for the model endpoint.&lt;/p&gt;
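&lt;p&gt;Put together, the deploy call might look like the sketch below; the instance type and endpoint name are placeholder values, and the call itself is commented out because it needs the trained estimator and an AWS session.&lt;/p&gt;

```python
deploy_kwargs = {
    "initial_instance_count": 1,                  # start with one inference instance
    "instance_type": "ml.m5.large",               # placeholder instance type
    "endpoint_name": "tweet-sentiment-endpoint",  # placeholder unique name
}
# predictor = estimator.deploy(**deploy_kwargs)
```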

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIp00HazJKdG8V5_LeT-OHA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIp00HazJKdG8V5_LeT-OHA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Validating the Model
&lt;/h3&gt;

&lt;p&gt;Calling the deploy method returns the endpoint for the model, which can be used to validate the model with test data, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2106%2F1%2AkdSGr6FcXXnf512Yr1FSMQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2106%2F1%2AkdSGr6FcXXnf512Yr1FSMQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2380%2F1%2A1IpefzCURpeOPdKJtDbeWA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2380%2F1%2A1IpefzCURpeOPdKJtDbeWA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating Amazon SageMaker Endpoints into Internet-facing Applications
&lt;/h3&gt;

&lt;p&gt;The end use of an ML model is for applications to send it requests for inference/prediction. This can be accomplished using API Gateway and a Lambda function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4tal3jzl3C459XnAb37q3Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4tal3jzl3C459XnAb37q3Q.png" alt="architecture for application integration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applications make requests to the API endpoint, which triggers a Lambda function. The Lambda function preprocesses the data into the input the model expects, i.e. it converts the text input to a numeric representation, and then sends this to the model for prediction. &lt;br&gt;
The prediction result is received by the Lambda function, which returns it to API Gateway to be sent back to the user.&lt;/p&gt;
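&lt;p&gt;The preprocessing inside the Lambda function has to mirror what was done at training time: map each word to the tokenizer's index and left-pad to the same length. A minimal sketch, with the endpoint name hypothetical and the boto3 call commented out because it needs AWS credentials (in practice the fitted tokenizer's word index would be loaded from S3 rather than passed in the event):&lt;/p&gt;

```python
import json

MAX_LEN = 200  # must match the max_len used at training time

def preprocess(text, word_index, max_len=MAX_LEN):
    # Mirror Tokenizer.texts_to_sequences + pad_sequences: words -> indices,
    # unknown words dropped, sequence left-padded with zeros.
    seq = [word_index[w] for w in text.lower().split() if w in word_index]
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

def handler(event, context=None):
    word_index = event["word_index"]  # hypothetical: load the fitted tokenizer from S3 instead
    padded = preprocess(event["text"], word_index)
    # import boto3
    # runtime = boto3.client("sagemaker-runtime")
    # response = runtime.invoke_endpoint(
    #     EndpointName="tweet-sentiment-endpoint",  # hypothetical endpoint name
    #     ContentType="application/json",
    #     Body=json.dumps({"instances": [padded]}))
    # score = json.loads(response["Body"].read())["predictions"][0][0]
    return {"statusCode": 200, "body": json.dumps({"sequence_length": len(padded)})}
```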

&lt;h3&gt;
  
  
  Clean Up
&lt;/h3&gt;

&lt;p&gt;Make sure to call end_point.delete_endpoint() to delete the model endpoint.&lt;br&gt;
Afterwards, go ahead and delete any files uploaded by SageMaker from your S3 bucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you learned how to train and deploy deep learning models on Amazon SageMaker.&lt;br&gt;
Here is a &lt;a href="https://github.com/paulkarikari/sentiment_analysis_deployment/blob/master/sentiment_analysis_tensorflow_keras_LSTM.ipynb" rel="noopener noreferrer"&gt;link&lt;/a&gt; to the complete Notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/tfx/serving/serving_basic" rel="noopener noreferrer"&gt;https://www.tensorflow.org/tfx/serving/serving_basic&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://sagemaker.readthedocs.io/en/stable/index.html#" rel="noopener noreferrer"&gt;https://sagemaker.readthedocs.io/en/stable/index.html#&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/crowdflower/twitter-airline-sentiment" rel="noopener noreferrer"&gt;https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fcrowdflower%2Ftwitter-airline-sentiment&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/datadriveninvestor/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626" rel="noopener noreferrer"&gt;https://medium.com/datadriveninvestor/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>Deep Learning LSTM for Sentiment Analysis in Tensorflow with Keras API</title>
      <dc:creator>Paul Karikari</dc:creator>
      <pubDate>Thu, 13 Feb 2020 14:16:38 +0000</pubDate>
      <link>https://forem.com/paulkarikari/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-b7</link>
      <guid>https://forem.com/paulkarikari/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-b7</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis is the process of determining whether language reflects a positive, negative, or neutral sentiment.&lt;br&gt;
Analyzing the sentiment of customers has many benefits for businesses. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A company can filter customer feedback based on sentiments to identify things they have to improve about their services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A company can manage its online reputation easily by monitoring the sentiment of comments customers write about its products.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, we will build a &lt;a href="https://en.wikipedia.org/wiki/Deep_learning"&gt;Deep learning&lt;/a&gt; model to classify text as either negative or positive.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Requirements&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: The data used is a collection of tweets about major U.S. airlines, available on &lt;a href="https://www.kaggle.com/crowdflower/twitter-airline-sentiment"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org"&gt;Tensorflow&lt;/a&gt; version 1.15.0 or higher with &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras"&gt;Keras&lt;/a&gt; API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://numpy.org/"&gt;Numpy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;p&gt;Let’s see what the data looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Tweets.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rVfthQah--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/1%2Af-66q6ix-gByre1QZ2U-fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rVfthQah--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/1%2Af-66q6ix-gByre1QZ2U-fg.png" alt="Data preview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Steps to prepare the data:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select relevant columns:
The data columns needed for this project are the &lt;strong&gt;&lt;em&gt;airline_sentiment&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;text&lt;/em&gt;&lt;/strong&gt; columns. We are solving a classification problem, so &lt;strong&gt;text&lt;/strong&gt; will be our features and &lt;strong&gt;&lt;em&gt;airline_sentiment&lt;/em&gt;&lt;/strong&gt; will be the labels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning models work best when inputs are numerical, so we will convert all the chosen columns to the numerical formats they need.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transform the &lt;strong&gt;&lt;em&gt;airline_sentiment&lt;/em&gt;&lt;/strong&gt; column to a numerical category.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transform the &lt;strong&gt;&lt;em&gt;text&lt;/em&gt;&lt;/strong&gt; column to a vector of numbers. (&lt;em&gt;more on this later&lt;/em&gt;)&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;#select relavant columns
&lt;/span&gt;    &lt;span class="n"&gt;tweet_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'airline_sentiment'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uu2VqZth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A9Z0sgkVK_OoZI1Waqxb7lA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uu2VqZth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A9Z0sgkVK_OoZI1Waqxb7lA.png" alt="Selected relevant columns"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to classify tweets as either negative or positive, so we will filter out rows with neutral sentiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;tweet_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tweet_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'airline_sentiment'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;'neutral'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H3Vdqm-H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Ax2GyENOsZ6UrE2ENSt5mHA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H3Vdqm-H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Ax2GyENOsZ6UrE2ENSt5mHA.png" alt="Data without neutral sentiment"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# convert airline_seentiment to numeric
&lt;/span&gt;    &lt;span class="n"&gt;sentiment_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;airline_sentiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SIFZ0-FR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Ar7MDtV8mxqDAaAGlClT0JQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SIFZ0-FR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Ar7MDtV8mxqDAaAGlClT0JQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Calling the &lt;strong&gt;factorize&lt;/strong&gt; method returns an array of numeric codes and an index of the categories. In this case, code 0 corresponds to positive sentiment and code 1 to negative sentiment.&lt;/p&gt;
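&lt;p&gt;On a small example, factorize assigns codes in order of first appearance:&lt;/p&gt;

```python
import pandas as pd

s = pd.Series(["positive", "negative", "negative", "positive"])
codes, index = s.factorize()
print(list(codes))  # [0, 1, 1, 0]
print(list(index))  # ['positive', 'negative']
```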

&lt;h3&gt;
  
  
  Preparing Text for NLP
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, inputs to machine learning models need to be in a numeric format.&lt;br&gt;
This can be achieved in the following ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Assign a number to each word in the sentences and replace each word with its assigned number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use word &lt;a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa"&gt;embeddings&lt;/a&gt;, which are capable of capturing the context of a word in a sentence or document.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.preprocessing.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tokenizer&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.preprocessing.sequence&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pad_sequences&lt;/span&gt;

    &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_on_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vocab_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;encoded_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;texts_to_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;padded_sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pad_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;From the above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We get the actual tweet texts from the data frame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We initialize the tokenizer with a 5000-word limit. This is the number of most frequent words we would like to encode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We call &lt;strong&gt;fit_on_texts&lt;/strong&gt; to create associations between words and numbers, as shown in the image below.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p9BOV4pt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2484/1%2AAFVz2GXT-HNF7KUzCDjUlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p9BOV4pt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2484/1%2AAFVz2GXT-HNF7KUzCDjUlw.png" alt="word index"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calling &lt;strong&gt;texts_to_sequences&lt;/strong&gt; replaces the words in each sentence with their associated numbers, transforming each sentence into a sequence of numbers.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded_docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N-bn8Pjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ARo12i0MMmfv7iEr6zGiFTQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N-bn8Pjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ARo12i0MMmfv7iEr6zGiFTQ.png" alt="A tweet and it’s encoded version"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above result, you can see that the tweet is encoded as a sequence of numbers; e.g. &lt;strong&gt;&lt;em&gt;to&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;the&lt;/em&gt;&lt;/strong&gt; are converted to &lt;strong&gt;1&lt;/strong&gt; and &lt;strong&gt;2&lt;/strong&gt; respectively.&lt;br&gt;
Check the word index above to verify.&lt;/p&gt;

&lt;p&gt;The sentences or tweets contain different numbers of words, so the resulting sequences of numbers have different lengths.&lt;br&gt;
Our model requires inputs of equal length, so we pad each sequence to the chosen length by calling the &lt;strong&gt;pad_sequences&lt;/strong&gt; method with a maximum length of 200.&lt;br&gt;
All input sequences will then have a length of 200.&lt;br&gt;
&lt;/p&gt;
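&lt;p&gt;With its default settings, &lt;strong&gt;pad_sequences&lt;/strong&gt; left-pads shorter sequences with zeros and keeps only the last 200 numbers of longer ones. A minimal pure-Python sketch of that behavior (the helper name &lt;strong&gt;pad_left&lt;/strong&gt; is ours, for illustration only):&lt;/p&gt;

```python
def pad_left(sequence, maxlen, value=0):
    # Mimics Keras pad_sequences with padding='pre', truncating='pre':
    # keep the last `maxlen` items, or left-pad with `value`.
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_left([5, 9, 3], 6))     # [0, 0, 0, 5, 9, 3]
print(pad_left([1, 2, 3, 4], 2))  # [3, 4]
```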

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padded_sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GRudbBtr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AZW2I34dsu5CT-bABfElEWQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GRudbBtr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AZW2I34dsu5CT-bABfElEWQ.png" alt="Padded Sequence."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Model
&lt;/h2&gt;

&lt;p&gt;Now that the inputs are processed, it's time to build the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Build the model
&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SpatialDropout1D&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Embedding&lt;/span&gt;

    &lt;span class="n"&gt;embedding_vector_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_vector_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     
                                         &lt;span class="n"&gt;input_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpatialDropout1D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recurrent_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'sigmoid'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'binary_crossentropy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'adam'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                               &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'accuracy'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SDkrQJFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am4owvpCTfcXdqUF8NoM8qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SDkrQJFL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am4owvpCTfcXdqUF8NoM8qg.png" alt="Model Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ojGf3cx5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcWfWK1xZQB9EtDvukRGHRA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ojGf3cx5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcWfWK1xZQB9EtDvukRGHRA.png" alt="Model Structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where we get to use the &lt;strong&gt;LSTM&lt;/strong&gt; layer. The model consists of an embedding layer, an &lt;strong&gt;LSTM&lt;/strong&gt; layer, and a Dense layer, which is a fully connected layer with sigmoid as the &lt;a href="https://medium.com/datadriveninvestor/a-gentle-introduction-to-activation-functions-in-deep-learning-5d5402fcb033"&gt;activation function&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Dropouts are added between the layers and also on the &lt;strong&gt;LSTM&lt;/strong&gt; layer itself to reduce overfitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  LSTM
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7zbWBI3w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3960/0%2ACXFcIW6nFgIlP2j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7zbWBI3w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3960/0%2ACXFcIW6nFgIlP2j4.png" alt="source: [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They were introduced by &lt;a href="http://www.bioinf.jku.at/publications/older/2604.pdf"&gt;Hochreiter &amp;amp; Schmidhuber (1997)&lt;/a&gt;, and were refined and popularized by many people in following work.&lt;a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/#fn1"&gt;1&lt;/a&gt; They work tremendously well on a large variety of problems, and are now widely used.&lt;br&gt;
 LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!&lt;br&gt;
 &lt;strong&gt;source&lt;/strong&gt;: &lt;a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/"&gt;http://colah.github.io/posts/2015-08-Understanding-LSTMs/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Train Model
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padded_sequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sentiment_label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                      &lt;span class="n"&gt;validation_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--anBrikkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2192/1%2A2JQYyPccGG25fahc_daffA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--anBrikkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2192/1%2A2JQYyPccGG25fahc_daffA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is trained for 5 epochs and attains a validation accuracy of ~92%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;Your results may vary slightly due to the stochastic nature of training; run it a few times and you should get roughly the same validation accuracy.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Model
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;test_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"This is soo sad"&lt;/span&gt;

    &lt;span class="n"&gt;tw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;texts_to_sequences&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;test_word&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pad_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;sentiment_label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9KerZuXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVb7lXWn4-C5fak4okU7KYg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9KerZuXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVb7lXWn4-C5fak4okU7KYg.png" alt="Prediction result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is tested with a sample text, and we can see that it predicts the correct sentiment for the sentence.&lt;/p&gt;
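&lt;p&gt;To make the last step explicit: &lt;strong&gt;model.predict&lt;/strong&gt; returns a sigmoid probability, which is rounded to a class index and looked up in the factorized categories. A minimal NumPy sketch (the probability here is made up, standing in for the model output):&lt;/p&gt;

```python
import numpy as np

# Categories in the order factorize() discovered them
# (index 0 = positive, index 1 = negative for this dataset).
categories = np.array(["positive", "negative"])

# Made-up sigmoid output standing in for model.predict.
probability = 0.93

prediction = int(round(probability))  # round to the nearest class index
print(categories[prediction])         # negative
```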

&lt;p&gt;You can run the entire notebook on Google Colab &lt;a href="https://colab.research.google.com/github/paulkarikari/LSTM-sentiment-analysis-with-tensorflow-keras-api/blob/master/Tutorial_sentiment_analysis.ipynb"&gt;here&lt;/a&gt; or view it on &lt;a href="https://github.com/paulkarikari/LSTM-sentiment-analysis-with-tensorflow-keras-api/blob/master/Tutorial_sentiment_analysis.ipynb"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/"&gt;http://colah.github.io/posts/2015-08-Understanding-LSTMs/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://keras.io/examples/imdb_lstm/"&gt;https://keras.io/examples/imdb_lstm/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://keras.io/layers/recurrent/"&gt;https://keras.io/layers/recurrent/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, you learned how to use a deep learning LSTM model for sentiment analysis in TensorFlow with the Keras API.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computerscience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What is Data: A beginner's guide to understanding what Data means.</title>
      <dc:creator>Paul Karikari</dc:creator>
      <pubDate>Thu, 13 Feb 2020 09:44:43 +0000</pubDate>
      <link>https://forem.com/paulkarikari/what-is-data-a-beginner-s-guide-to-understanding-what-data-means-49ha</link>
      <guid>https://forem.com/paulkarikari/what-is-data-a-beginner-s-guide-to-understanding-what-data-means-49ha</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WXQSo1nL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/9620/0%2AUnQtEWgSIakQbxHK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WXQSo1nL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/9620/0%2AUnQtEWgSIakQbxHK" alt="Photo by [Luke Chesser](https://unsplash.com/@lukechesser?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You’ve probably heard the word &lt;strong&gt;“data”&lt;/strong&gt; many times: in school, from the news, in your daily work or profession, or while browsing the Internet. And if you are a data scientist, your entire profession depends on it.&lt;/p&gt;

&lt;p&gt;Data is limitless and present everywhere in the universe, yet the term can be confusing because nearly everyone has their own idea of what it means.&lt;br&gt;
[My data is not your data 😃]&lt;/p&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;In computing, data may take the form of text, documents, images, audio, and video. At its most rudimentary level, data is a bunch of ones and zeros.&lt;/p&gt;

&lt;p&gt;In statistics, data is defined as facts or figures from which conclusions can be drawn.&lt;/p&gt;

&lt;p&gt;IT professionals describe data in terms of entities and attributes.&lt;/p&gt;

&lt;p&gt;In layman’s terms, data describes a person, place, object, event, or concept in the user’s context or environment, with its meaning dependent on how it is organized.&lt;br&gt;
For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In computing, different arrangements of &lt;strong&gt;1’s&lt;/strong&gt; and &lt;strong&gt;0’s&lt;/strong&gt; mean different things,&lt;br&gt;
&lt;strong&gt;[0001 = 1 and 0010 = 2]&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In biology, different sequences of genome bases &lt;strong&gt;(A, C, G, and T)&lt;/strong&gt; result in different genetic codes, which represent different individuals or species.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A customer’s purchase history, linked to their identity, represents that individual’s purchasing habits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your tweets could be a random arrangement of any of the 26 English letters and spaces, yet you choose to arrange them in a way that conveys meaning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If data is not put into context, it is of no value to humans or computers. Context is key.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In the context of computing, &lt;strong&gt;0001&lt;/strong&gt; is the binary representation of &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the context of Italian, your tweet in English means nothing, even though both languages draw on the same set of characters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
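&lt;p&gt;The computing example can be checked directly; a tiny Python sketch interpreting the same characters in the binary-number context:&lt;/p&gt;

```python
# Read in the context of binary numbers, these strings are the
# integers 1 and 2.
print(int("0001", 2))  # 1
print(int("0010", 2))  # 2

# Outside that context, "0001" is just a four-character string.
print(len("0001"))     # 4
```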

&lt;p&gt;Some say that “facts” are things that can be shown to be true, to exist, or to have happened.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  Ideally, data can be defined as the factual representation of the attributes of anything.
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, I say “Ideally” because data is not always factual. Simply put, data can be wrong. Part or all of a dataset can sometimes represent something entirely different from what you expect or intend to measure, e.g. &lt;a href="http://www.bbc.com/news/av/science-environment-39355424/nasa-error-schoolboy-finds-data-flaw"&gt;Schoolboy finds a flaw in Nasa Data&lt;/a&gt; and &lt;a href="http://www.washingtonpost.com/wp-dyn/content/article/2009/01/11/AR2009011102287.html"&gt;Math Error To Cost Maryland $31 Million&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data that is factual, true, or serves the needs of the problem domain is sometimes referred to as good data, or signal.&lt;br&gt;
Data that is false, invalid, or does not serve the needs of the problem domain is sometimes called bad data, or noise.&lt;/p&gt;

&lt;p&gt;Data that describes other data is called metadata, and a set of data is often referred to as a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of data
&lt;/h2&gt;

&lt;p&gt;Let’s consider a scenario (a particular circumstance or experiment) in which you want to learn about the kinds of passengers who board the same bus or train as you at your local station. So you gather some information about each individual, and that becomes your dataset. [stalker 😏]&lt;/p&gt;

&lt;p&gt;Datasets are typically displayed in tables, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GUpuh2Uh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AA6LkgRT-h_uuKsqI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GUpuh2Uh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AA6LkgRT-h_uuKsqI.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A dataset is a set of data identified with a particular experiment, scenario, subject, or circumstance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the table, rows represent individuals and columns represent variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y9i1lnnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ayf9QDi1o450E-1fB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y9i1lnnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ayf9QDi1o450E-1fB.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we can say that:&lt;br&gt;
&lt;strong&gt;Data&lt;/strong&gt; are pieces of information about &lt;strong&gt;individuals&lt;/strong&gt;, organized into &lt;strong&gt;variables&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By an &lt;strong&gt;individual&lt;/strong&gt;, we mean a particular person or object.&lt;br&gt;
In our scenario, the passengers are the individuals.&lt;br&gt;
&lt;strong&gt;Individuals&lt;/strong&gt; are sometimes called observations, cases, vectors, or feature vectors.&lt;/p&gt;

&lt;p&gt;By a &lt;strong&gt;variable&lt;/strong&gt;, we mean a particular characteristic of the individual. In our scenario, the variables are Age, Height, Seat Number, Gender, and Class.&lt;br&gt;
&lt;strong&gt;Variables&lt;/strong&gt; are sometimes called attributes, dimensions, or features.&lt;/p&gt;

&lt;p&gt;Each row gives us all of the information about a particular individual (in this case a passenger), and each column gives us information about a particular characteristic of all of the passengers.&lt;/p&gt;
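&lt;p&gt;The passenger table can be sketched with pandas (the values below are made up purely for illustration):&lt;/p&gt;

```python
import pandas as pd

# A made-up passenger dataset: rows are individuals, columns are variables.
passengers = pd.DataFrame({
    "Age":         [34, 22, 41],
    "Height":      [170, 165, 180],  # in cm
    "Seat Number": ["12A", "7C", "3B"],
    "Gender":      ["F", "M", "F"],
    "Class":       ["Economy", "First", "Economy"],
})

print(passengers.shape)   # (3, 5): 3 individuals, 5 variables
print(passengers.loc[0])  # all variables for one individual (a row)
print(passengers["Age"])  # one variable for all individuals (a column)
```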

&lt;h2&gt;
  
  
  Types of data
&lt;/h2&gt;

&lt;p&gt;Data can be classified in many ways and from different perspectives, a topic that deserves its own post. In short, data can be classified as &lt;strong&gt;raw&lt;/strong&gt; or &lt;strong&gt;processed&lt;/strong&gt;, &lt;strong&gt;structured&lt;/strong&gt; or &lt;strong&gt;unstructured&lt;/strong&gt;, and as &lt;strong&gt;qualitative&lt;/strong&gt; or &lt;strong&gt;quantitative&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Names, Names, and More Names
&lt;/h2&gt;

&lt;p&gt;If you have been following carefully, you will have noticed that there are often several names for the same thing, stemming from the field of study, personal preference, or mere convention. This can be overwhelming for a beginner or someone new to a particular field, but don’t be discouraged: you might already know what a term means under another name. It’s all a matter of familiarity. Don’t be afraid to ask, or to search the Internet.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>A Gentle Introduction To Activation Functions in Deep Learning.</title>
      <dc:creator>Paul Karikari</dc:creator>
      <pubDate>Mon, 03 Feb 2020 20:34:18 +0000</pubDate>
      <link>https://forem.com/paulkarikari/a-gentle-introduction-to-activation-functions-in-deep-learning-3ja6</link>
      <guid>https://forem.com/paulkarikari/a-gentle-introduction-to-activation-functions-in-deep-learning-3ja6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uMl2o8M1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A3tLHUJWOjUrL5aZlWo56yQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMl2o8M1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A3tLHUJWOjUrL5aZlWo56yQ.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you get started with deep learning, you will definitely come across the term &lt;strong&gt;&lt;em&gt;activation functions&lt;/em&gt;&lt;/strong&gt;, also known as neural transfer functions.&lt;br&gt;
In this blog, I will explain what activation functions are and why they are used in deep learning models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; &lt;em&gt;I assume you have a basic understanding of neural networks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The goal of machine learning and deep learning algorithms is to recognize patterns in data. From a mathematical point of view, such a pattern can be considered a function. The ability of machine learning algorithms to approximate the underlying function in given data is what makes them so powerful.&lt;br&gt;
Recognizing this function or pattern makes it possible for the model to predict the output for new data.&lt;/p&gt;

&lt;p&gt;The pattern or function underlying data can be simple, such as a linear relation, or complex, such as a non-linear relation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a9EZXLOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AGeyQBtPVgcurFjY1LQdlxw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a9EZXLOB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AGeyQBtPVgcurFjY1LQdlxw.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A Simple Artificial Neuron&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deep learning models usually consist of many neurons stacked in layers.&lt;br&gt;
Let’s consider a single neuron for simplicity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vrY7HOVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AJsm0NBsPuVUKQvEjdpUOpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vrY7HOVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AJsm0NBsPuVUKQvEjdpUOpg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The operations performed by a neuron are basically multiplications and a summation, which are linear and produce an intermediate result.&lt;br&gt;
An activation function is then applied to this intermediate result to produce the final output of the neuron.&lt;/p&gt;
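&lt;p&gt;In code, a single neuron's forward pass might look like this minimal sketch (weights, bias, and inputs are made-up values; sigmoid is used as the activation):&lt;/p&gt;

```python
import math

def neuron(inputs, weights, bias):
    # Linear part: weighted sum of the inputs plus a bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear part: apply an activation function (sigmoid here).
    return 1 / (1 + math.exp(-z))

# z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, then sigmoid(0.1)
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
print(round(out, 4))  # 0.525
```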

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YW_U4w49--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AZ5mWm24o4cDVZ5UI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YW_U4w49--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AZ5mWm24o4cDVZ5UI.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without the activation function, the neuron is just a linear function mapping inputs to outputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kNpmlXdY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AY3b42mQ4XMOglknn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kNpmlXdY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AY3b42mQ4XMOglknn.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such a neuron can only approximate linear functions, so the model cannot recognize complex patterns in data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why are activation functions needed?
&lt;/h2&gt;

&lt;p&gt;In order for neural networks to approximate non-linear or complex functions, there has to be a way to introduce non-linearity into the computation.&lt;br&gt;
Activation functions serve exactly this purpose: they introduce non-linearity into the model, which makes it possible for deep learning models to find complex patterns in data.&lt;/p&gt;
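&lt;p&gt;A quick way to see why this matters: stacking linear layers without activations collapses into a single linear layer. The weights below are arbitrary illustrative values:&lt;/p&gt;

```python
# Two stacked linear "layers" without activations collapse into one linear
# map: w2*(w1*x + b1) + b2 == (w2*w1)*x + (w2*b1 + b2).
w1, b1 = 3.0, 1.0
w2, b2 = -2.0, 0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def single_linear_layer(x):
    return (w2 * w1) * x + (w2 * b1 + b2)

for x in [-1.0, 0.0, 2.5]:
    # Identical outputs: depth added no expressive power.
    assert two_linear_layers(x) == single_linear_layer(x)
print("no activation means still linear")
```

No matter how many such layers you stack, the network can still only represent a linear function.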

&lt;h3&gt;
  
  
  Can any non-linear function be used as an activation function?
&lt;/h3&gt;

&lt;p&gt;No. Before a function can be considered a good candidate for deep learning models, it should have the following properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-linear&lt;/strong&gt;&lt;br&gt;
This is required to introduce non-linearity in the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monotonic&lt;/strong&gt;&lt;br&gt;
A function that is either entirely non-increasing or non-decreasing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Differentiable&lt;/strong&gt;&lt;br&gt;
Deep learning algorithms update their weights via an algorithm called &lt;a href="https://en.wikipedia.org/wiki/Backpropagation"&gt;backpropagation&lt;/a&gt;. This algorithm only works when the activation function is differentiable, i.e., its derivatives can be calculated.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
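&lt;p&gt;Differentiability is what backpropagation relies on. As a sketch, sigmoid has a convenient closed-form derivative, which we can sanity-check against a numerical finite-difference estimate (the test point 0.7 is arbitrary):&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Closed-form derivative used during backpropagation:
# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Numerical finite-difference estimate of the same derivative at x = 0.7.
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(round(sigmoid_grad(x), 6), round(numeric, 6))  # both ~0.221713
```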

&lt;h2&gt;
  
  
  Types of Activation Functions.
&lt;/h2&gt;

&lt;p&gt;Most useful activation functions are non-linear. The following are commonly used activation functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Tanh or hyperbolic tangent function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EEvw1s7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A2Ltoo51YGOjC4Dlo" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EEvw1s7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A2Ltoo51YGOjC4Dlo" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This function has an upper bound of 1 and a lower bound of -1, so it produces outputs in the range -1 to 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Sigmoid or logistic function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cw69Sbt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A8qWWTH6z50LP28jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cw69Sbt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A8qWWTH6z50LP28jd.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This function outputs values in the range (0, 1), producing 0.5 at an input of 0. Unlike tanh, it is not zero-centered.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Relu (Rectified Linear Unit)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EbtrZ-QO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AFVQSRNxLvditSAqD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EbtrZ-QO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AFVQSRNxLvditSAqD.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This function produces values in the range 0 to infinity: negative inputs are mapped to 0, and positive inputs pass through unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Leaky Relu&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FUk2TyTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjdLsPwZWmhZNPtd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FUk2TyTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjdLsPwZWmhZNPtd1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a variation of &lt;strong&gt;&lt;em&gt;ReLU&lt;/em&gt;&lt;/strong&gt;. Unlike ReLU, &lt;strong&gt;&lt;em&gt;Leaky ReLU&lt;/em&gt;&lt;/strong&gt; does not zero out negative inputs; it multiplies them by a small slope (commonly 0.01) instead.&lt;br&gt;
As a result, its outputs range over all real numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Softmax&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oZsLcBw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ai20dHyZAjUFzsL_s" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oZsLcBw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ai20dHyZAjUFzsL_s" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This function is mostly used for multiclass classification problems: it outputs a probability for each class given an input, and the probabilities sum to 1.&lt;/p&gt;
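&lt;p&gt;The five functions above can be sketched in a few lines of pure Python (the 0.01 slope for Leaky ReLU is a common default, not a fixed constant):&lt;/p&gt;

```python
import math

def tanh(x):
    return math.tanh(x)               # range (-1, 1)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))     # range (0, 1)

def relu(x):
    return max(0.0, x)                # range [0, infinity)

def leaky_relu(x, slope=0.01):
    # max(x, slope*x) picks x for positive inputs and slope*x for negative ones.
    return max(x, slope * x)          # range (-infinity, infinity)

def softmax(zs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]  # probabilities summing to 1

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```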

&lt;h2&gt;
  
  
  &lt;strong&gt;Which Activation Function Should I use?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Activation functions have strengths and weaknesses based on how well they allow the model to learn features that generalize.&lt;/p&gt;

&lt;p&gt;The choice of activation function also depends on the problem you're trying to solve.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ReLU&lt;/strong&gt; is commonly used for hidden layers, while &lt;strong&gt;sigmoid / softmax&lt;/strong&gt; are commonly used for the output layer:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;sigmoid&lt;/em&gt;&lt;/strong&gt; for binary classification problems and &lt;strong&gt;&lt;em&gt;softmax&lt;/em&gt;&lt;/strong&gt; for multiclass classification problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sigmoid&lt;/strong&gt; and &lt;strong&gt;tanh&lt;/strong&gt; are often avoided in deep networks because of the &lt;strong&gt;vanishing gradient&lt;/strong&gt; problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the network suffers from &lt;strong&gt;dead neurons&lt;/strong&gt; (a known weakness of ReLU), &lt;strong&gt;leaky ReLU&lt;/strong&gt; is a good choice.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
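&lt;p&gt;Putting these choices together, here is a minimal sketch of a hypothetical two-layer binary classifier: ReLU in the hidden layer and sigmoid on the output. All weights and inputs are made-up values:&lt;/p&gt;

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: one ReLU unit per (weight, bias) pair.
    hidden = [relu(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    # Output layer: weighted sum of hidden activations, then sigmoid
    # to squash the result into a probability.
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(z)

p = forward(1.5, [0.4, -0.6], [0.0, 0.2], [1.0, 1.0], -0.3)
print(round(p, 3))  # a probability for the positive class
```

For a multiclass problem, the single sigmoid output would be replaced by one unit per class followed by softmax.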

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/"&gt;https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.datastuff.tech/machine-learning/why-do-neural-networks-need-an-activation-function/"&gt;http://www.datastuff.tech/machine-learning/why-do-neural-networks-need-an-activation-function/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Activation_function"&gt;https://en.wikipedia.org/wiki/Activation_function&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@abhigoku10/activation-functions-and-its-types-in-artifical-neural-network-14511f3080a8"&gt;https://medium.com/@abhigoku10/activation-functions-and-its-types-in-artifical-neural-network-14511f3080a8&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you learned what activation functions are, why they are needed in deep learning models, and which activation functions are commonly used.&lt;br&gt;
I hope it served as a useful introduction to activation functions.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
