<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mansi Saxena</title>
    <description>The latest articles on Forem by Mansi Saxena (@saxenamansi).</description>
    <link>https://forem.com/saxenamansi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F669545%2Fcf56f138-9fb0-4b5f-9345-ccc5dc825935.jpeg</url>
      <title>Forem: Mansi Saxena</title>
      <link>https://forem.com/saxenamansi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/saxenamansi"/>
    <language>en</language>
    <item>
      <title>Dive into ML with this RoadMap!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Fri, 20 Aug 2021 19:23:11 +0000</pubDate>
      <link>https://forem.com/saxenamansi/roadmap-to-dive-into-the-world-of-machine-learning-5179</link>
      <guid>https://forem.com/saxenamansi/roadmap-to-dive-into-the-world-of-machine-learning-5179</guid>
      <description>&lt;p&gt;If you've been reading about the amazing advancements in the world of Artificial Intelligence and Machine Learning but feel overwhelmed by its complexity, this post is just for you! After reading this blog, you should have a clear understanding of how to embark on this journey of learning Machine Learning the right way, so stick with me till the end.&lt;/p&gt;

&lt;h2&gt;
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;First things first: what are the pre-requisites for learning Machine Learning? Knowing a programming language alone is not enough; you must also understand the mathematics behind each algorithm. The important topics to familiarize yourself with are: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linear Algebra &lt;/li&gt;
&lt;li&gt;Calculus&lt;/li&gt;
&lt;li&gt;Statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you do not have a mathematical background, &lt;a href="https://www.khanacademy.org/" rel="noopener noreferrer"&gt;Khan Academy&lt;/a&gt; is a good place to get started on the basics. The Coursera Specialization &lt;a href="https://www.coursera.org/specializations/mathematics-machine-learning" rel="noopener noreferrer"&gt;Mathematics for Machine Learning&lt;/a&gt; is also a good resource if you can devote long hours to MOOCs. Other resources are listed in the links below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://analyticsindiamag.com/7-top-linear-algebra-resources-for-machine-learning-beginners/" rel="noopener noreferrer"&gt;For Linear Algebra&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.analyticsvidhya.com/resource-statitics/" rel="noopener noreferrer"&gt;For Statistics&lt;/a&gt;&lt;br&gt;
&lt;a href="https://machinelearningmastery.com/calculus-books-for-machine-learning/" rel="noopener noreferrer"&gt;For Calculus&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are unfamiliar with Calculus, I recommend going through one of the books in the link above and learning the basics of differentiation and integration; they are essential on the path to becoming a Machine Learning expert. &lt;/p&gt;

&lt;h2&gt;
  Coding fundamentals
&lt;/h2&gt;

&lt;p&gt;Once you are confident with the math, shift your focus to coding. Many languages are used for writing Machine Learning programs, such as Python, R and Java. However, Python is the most recommended because its many libraries and frameworks simplify the task of writing complex code. Another reason I recommend Python is that far more Machine Learning tutorials are written in Python than in R, so it is easier for a Python programmer to get help from the data science community. That said, R is known for its excellent data visualization libraries, so there is no harm in learning both languages and using the best features of each. You can always learn one and move to the next; for a beginner, I recommend Python. &lt;/p&gt;

&lt;p&gt;There are several resources for learning basic Python, but my favorite is the Coursera specialization &lt;a href="https://www.coursera.org/specializations/python" rel="noopener noreferrer"&gt;Python for Everybody&lt;/a&gt; by Charles Russell Severance of the University of Michigan. If this does not suit you, you may try the other resources in this &lt;a href="https://mikkegoes.com/learn-python-online-best-resources/" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And voila, you finally have all the pre-requisites you need to get started on your journey into the world of Machine Learning!&lt;/p&gt;

&lt;h2&gt;
  Taking the first step
&lt;/h2&gt;

&lt;p&gt;The first step is to complete these two renowned courses. One, taught by none other than the great Andrew NG, covers the mathematics behind each Machine Learning algorithm; the other focuses on the programming side. You may choose to do them simultaneously. Take your time with them, as they lay the foundations for this field. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.coursera.org/learn/machine-learning?" rel="noopener noreferrer"&gt;Machine Learning by Andrew NG&lt;/a&gt;, Stanford University. You do not have to buy the course; you may audit it too. Just focus on watching all the videos in this course. There is also a YouTube playlist with all the videos in this course which I will link &lt;a href="https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you really are a ML-nerd, you'll be hooked on this course! (PS: if you do not know who Andrew NG is, google him RIGHT NOW, you won't regret it ;). &lt;/li&gt;
&lt;li&gt;The second course is &lt;a href="https://dev.toPython%20for%20Data%20Science%20and%20Machine%20Learning%20Bootcamp"&gt;Python for Data Science and Machine Learning Bootcamp&lt;/a&gt; by Jose Portilla. This course can be a little heavy, as it introduces you to all the major programming concepts used in Machine Learning. Take your time with it, and keep trying out the code and functions taught in the course yourself; just watching the videos will not help much until you get your hands dirty. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  Getting your hands dirty
&lt;/h2&gt;

&lt;p&gt;While coding, if you get stuck on an error you cannot solve, search for it on &lt;a href="https://stackoverflow.com/" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt;. Someone has almost certainly been in your shoes before and suffered through the same error. Read the answers and solutions that others have suggested. On the off chance that no one else has encountered your error, post your own question. Don't be shy; you'd be surprised at how beginner-friendly and helpful the data science community is. After all, everyone was once a beginner. &lt;/p&gt;

&lt;h2&gt;
  Applying what you learnt - Starting with some baby projects
&lt;/h2&gt;

&lt;p&gt;After completing these courses, you can safely say that you have a good understanding of the classical algorithms! You can now start working on some baby projects. Find datasets on Kaggle that interest you and put your newly learnt Python skills to use. You may also try reading code that other developers have written; some of it might be too complex, so do not be too hard on yourself if you cannot understand all of it. With each dataset you work with, you will pick up new functions and concepts: data cleaning, data augmentation, preprocessing, data encoding and so on. &lt;/p&gt;

&lt;p&gt;The code for some of the baby projects I made is on my GitHub. It should be easy to follow and not too complex. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/saxenamansi/HR_Analytics_Employee_Retention" rel="noopener noreferrer"&gt;HR Analytics Employee Retention using Logistic Regression&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Breast_Cancer_DecisionTree_Classifier" rel="noopener noreferrer"&gt;Breast Cancer Classification using Decision Trees&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Data_Cleaning_Preprocessing/blob/main/DataCleaningPreprocessing.ipynb" rel="noopener noreferrer"&gt;Cleaning Student Profile Data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Healthcare_dataset_pandas_preprocessing" rel="noopener noreferrer"&gt;Preprocessing and Cleaning Stroke Data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Recognizing_Hand_Written_Digits" rel="noopener noreferrer"&gt;Recognizing Hand Written Digits using PCA and SVM techniques&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Credit_Card_Data_Clustering" rel="noopener noreferrer"&gt;Clustering Credit Card Data using Gaussian Mixtures and PCA&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/KMeans_Clustering_Of_GeoLocationsns" rel="noopener noreferrer"&gt;Clustering Geo-Locations using K-Means clustering&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/ImageProcessing_using_Numpy_Matplotlib" rel="noopener noreferrer"&gt;Using Numpy and Matplotlib for Image Processing&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Australian-fires%20(Visualisation).ipynb" rel="noopener noreferrer"&gt;Data Visualization of Australian Wildfires&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Mushroom%20classification%20-%20project.ipynb" rel="noopener noreferrer"&gt;Comparing the classification algorithms for Mushroom Classification&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/C4-projects/blob/master/CreditCard%20fraud%20-%20classification.ipynb" rel="noopener noreferrer"&gt;Comparing the classification algorithms for Credit Card Frauds&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/C4-projects/blob/master/Household%20-electricity-consumption.ipynb" rel="noopener noreferrer"&gt;Data Visualization and Comparing the classification algorithms for Household Electricity Consumption&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Math_Portugese_course.ipynb" rel="noopener noreferrer"&gt;Data Visualization and Comparing the classification algorithms for grades of Maths and Portuguese class students&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would urge you to first try them yourself, and then check my code for reference. Whenever you come across a new function, read its documentation and see what it does. Make sure you understand all of it. &lt;/p&gt;

&lt;p&gt;And that's it! &lt;/p&gt;

&lt;p&gt;With this, you should now have a concrete understanding of the Machine Learning algorithms and how to use them. You should also be fairly acquainted with some data cleaning, data preprocessing and data visualization techniques. &lt;/p&gt;

&lt;h2&gt;
  Deep Learning - the path after Machine Learning
&lt;/h2&gt;

&lt;p&gt;If you have found the journey up to this point interesting, you may dive into Deep Learning as well. The best way to do so is with this in-depth &lt;a href="https://www.coursera.org/specializations/deep-learning" rel="noopener noreferrer"&gt;Deep Learning specialization by Andrew NG&lt;/a&gt;. It requires some dedication, as it consists of 5 courses, but it is very thorough and you will not need any material beyond it. &lt;/p&gt;

&lt;h2&gt;
  Your path from here to becoming a Data Scientist
&lt;/h2&gt;

&lt;p&gt;When you start down the path of Data Science, be aware that in this domain, learning never stops. Once you complete the above specialization, you can continue by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Participating in Kaggle competitions.&lt;/li&gt;
&lt;li&gt;Reading the best research papers in the topics that interest you.&lt;/li&gt;
&lt;li&gt;Doing more MOOCs on Coursera. I would recommend the courses from DeepLearning.AI.&lt;/li&gt;
&lt;li&gt;Working on your own projects. Try developing them into products for everyday users, or publishing your work in a reputed journal.&lt;/li&gt;
&lt;li&gt;Sharing your knowledge with the world: help other beginners on Stack Overflow and write blogs.&lt;/li&gt;
&lt;li&gt;Pushing your work to GitHub for others to learn from.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do like this post if it helped you. If you have any other suggestions or recommendations, let me know in the comments below.&lt;/p&gt;

&lt;p&gt;Happy Learning! &amp;lt;3 &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Java Question Bank with Solutions</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Wed, 11 Aug 2021 07:09:42 +0000</pubDate>
      <link>https://forem.com/saxenamansi/java-question-bank-with-solutions-30o3</link>
      <guid>https://forem.com/saxenamansi/java-question-bank-with-solutions-30o3</guid>
      <description>&lt;p&gt;Want to practice those newly learned Java concepts, but do not have a question bank with solutions? Look no further! &lt;a href="https://github.com/saxenamansi/Java-Beginner-To-Intermediate" rel="noopener noreferrer"&gt;This&lt;/a&gt;  GitHub repository has it all!&lt;/p&gt;

&lt;p&gt;The concepts covered are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic Java Questions&lt;/li&gt;
&lt;li&gt;Array Questions&lt;/li&gt;
&lt;li&gt;String Questions&lt;/li&gt;
&lt;li&gt;Object Oriented Questions - Classes and Objects&lt;/li&gt;
&lt;li&gt;Object Oriented Questions - Interfaces, Inheritance, Abstract Classes, Packages&lt;/li&gt;
&lt;li&gt;Exception Handling Questions &lt;/li&gt;
&lt;li&gt;Multi-Threading Questions&lt;/li&gt;
&lt;li&gt;File Handling Questions&lt;/li&gt;
&lt;li&gt;Collections Questions- ArrayList&lt;/li&gt;
&lt;li&gt;JDBC Questions (with concepts)&lt;/li&gt;
&lt;li&gt;JavaFX Questions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If these solutions help you, let me know, and reach out to me with any questions or doubts. Happy Learning!&lt;/p&gt;

</description>
      <category>java</category>
      <category>learning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Starting a beginner-friendly Machine Learning Series!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Thu, 29 Jul 2021 08:00:06 +0000</pubDate>
      <link>https://forem.com/saxenamansi/starting-a-machine-learning-series-4ihh</link>
      <guid>https://forem.com/saxenamansi/starting-a-machine-learning-series-4ihh</guid>
      <description>&lt;p&gt;Planning to start a Machine Learning and Deep Learning series, 1-2 posts every week.&lt;/p&gt;

&lt;p&gt;Some of the topics I plan on covering are -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why learn it?&lt;/li&gt;
&lt;li&gt;Curated list of the best resources.&lt;/li&gt;
&lt;li&gt;Basic installations.&lt;/li&gt;
&lt;li&gt;Numpy and pandas basics.&lt;/li&gt;
&lt;li&gt;Matplotlib and Seaborn basics.&lt;/li&gt;
&lt;li&gt;Data Modelling.&lt;/li&gt;
&lt;li&gt;Linear Regression.&lt;/li&gt;
&lt;li&gt;Classification algorithms.&lt;/li&gt;
&lt;li&gt;Clustering algorithms.&lt;/li&gt;
&lt;li&gt;Model evaluation.&lt;/li&gt;
&lt;li&gt;Bias and Variance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me know if you want me to cover some specific topics!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Make the first step towards that project!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Tue, 27 Jul 2021 05:01:44 +0000</pubDate>
      <link>https://forem.com/saxenamansi/what-s-a-project-idea-you-ve-had-on-your-mind-for-quite-some-time-but-haven-t-quite-been-able-to-start-yet-162e</link>
      <guid>https://forem.com/saxenamansi/what-s-a-project-idea-you-ve-had-on-your-mind-for-quite-some-time-but-haven-t-quite-been-able-to-start-yet-162e</guid>
      <description>&lt;p&gt;What's that project idea you've had on your mind for quite some time, but haven't quite been able to start yet?&lt;/p&gt;

&lt;p&gt;I know I've had quite a few ideas lurking in the back of my mind every now and then. &lt;/p&gt;

&lt;p&gt;So, here's your reminder to take your first step towards it. Jot down those ideas, set a date and time and get started! 🥂&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Text preprocessing and email classification using basic Python only</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Mon, 26 Jul 2021 20:19:09 +0000</pubDate>
      <link>https://forem.com/saxenamansi/classifying-spam-emails-using-basic-python-2m70</link>
      <guid>https://forem.com/saxenamansi/classifying-spam-emails-using-basic-python-2m70</guid>
      <description>&lt;p&gt;Classifying emails as spam and non spam? Isn't that the "hello world" of Natural Language Processing? Hasn't every other developer worked on it?&lt;/p&gt;

&lt;p&gt;Well, yes. But what about writing the code from scratch, without using inbuilt libraries? This blog is for those who have used the inbuilt Python libraries but aren't quite sure what goes on behind them. Find the full code &lt;a href="https://github.com/saxenamansi/Email_Classification/blob/main/Emailclassifier_without_nltk.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. After reading this blog, you will have a better understanding of the entire pipeline. So let's jump right in!&lt;/p&gt;

&lt;p&gt;The basic steps in this problem are -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing the emails&lt;/li&gt;
&lt;li&gt;Finding a list of all the unique words in the emails&lt;/li&gt;
&lt;li&gt;Extracting feature vectors for each email &lt;/li&gt;
&lt;li&gt;Applying a Naive Bayes Classifier (using an inbuilt library)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For demonstration purposes, I have made a basic dataset. Spam emails are labelled positive and the rest negative: &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g55boe0nuqcj2upo6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g55boe0nuqcj2upo6z.png" alt="dataset" width="727" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, read the emails and store them in a list, as shown below using Python's csv reader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;emaildataset.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
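&lt;p&gt;To make the snippet above self-contained, here is a minimal stand-in for emaildataset.csv, written out and read back with the same loop (the two rows are invented purely for illustration):&lt;/p&gt;

```python
import csv

# hypothetical two-row stand-in for emaildataset.csv
with open('emaildataset.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ["Congratulations, you've won a prize", 'positive'],
        ['Meeting moved to 3 pm', 'negative'],
    ])

# read each email back as a (text, label) tuple
emails = []
with open('emaildataset.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        emails.append((row[0].strip(), row[1].strip()))

print(emails[1])  # ('Meeting moved to 3 pm', 'negative')
```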



&lt;h2&gt;
  1. Preprocessing
&lt;/h2&gt;

&lt;p&gt;We can now move on to the preprocessing stage. The emails are first converted to lowercase and then split into tokens. We then apply three basic preprocessing steps to the tokens: punctuation removal, stopword removal and stemming. Let us go over these in detail.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Punctuation Removal&lt;/em&gt;&lt;br&gt;
This step removes all punctuation from a string, which we do using Python's string method replace(). The function below takes a string as input, replaces each punctuation mark with an empty string, and returns the string without punctuation. More punctuation marks can be added to the list, or a regex can be used instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;punctuations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;punc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;punctuations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;punc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
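&lt;p&gt;A quick sanity check of this function (repeated here so the snippet runs on its own; the sample string is made up):&lt;/p&gt;

```python
def punctuation_removal(data_string):
    # strip a small, hard-coded set of punctuation marks
    punctuations = [",", ".", "?", "!", "'", "+", "(", ")"]
    for punc in punctuations:
        data_string = data_string.replace(punc, "")
    return data_string

print(punctuation_removal("congratulations, you've been selected!"))
# congratulations youve been selected
```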



&lt;p&gt;&lt;em&gt;Stopword Removal&lt;/em&gt;&lt;br&gt;
This step removes the common words that make a sentence grammatically correct without adding much meaning. The function below takes a list of tokens, checks each one against a specified list of stopwords, and returns the list with the stopwords removed. More stopwords can be added to the list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;stopwords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;am&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;was&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
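&lt;p&gt;For example, with the stopword list above (the input tokens are made up):&lt;/p&gt;

```python
def stopword_removal(tokens):
    # drop common words that carry little meaning on their own
    stopwords = ['of', 'on', 'i', 'am', 'this', 'is', 'a', 'was']
    filtered_tokens = []
    for token in tokens:
        if token not in stopwords:
            filtered_tokens.append(token)
    return filtered_tokens

print(stopword_removal(['this', 'is', 'a', 'free', 'offer']))  # ['free', 'offer']
```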



&lt;p&gt;&lt;em&gt;Stemming&lt;/em&gt;&lt;br&gt;
This is the last step in the preprocessing pipeline. Here, we convert our tokens into their base form: words like "eating", "ate" and "eaten" are converted to "eat". For this, we use a Python dictionary whose keys are base-form tokens and whose values are lists of the word's other forms, e.g. {"eat": ["ate", "eaten", "eating"]}. (Strictly speaking, this dictionary lookup is a simple form of lemmatization rather than true stemming.) This helps normalize the words in our data/corpus. &lt;/p&gt;

&lt;p&gt;We check each token against these lists of non-base forms; if a token appears in one, the corresponding base form is used instead. This is demonstrated in the function below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;root_to_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;you have&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;youve&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;it is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;its&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;move&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;moving&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photos&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;base_form&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_list&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_to_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;token_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
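&lt;p&gt;A self-contained check with a trimmed-down dictionary. Note the break, and the else attached to the for loop rather than the if: together they ensure each token is appended exactly once, whether or not it matches a variant.&lt;/p&gt;

```python
def stemming(filtered_tokens):
    # maps a base form to its known variants (trimmed for illustration)
    root_to_token = {'select': ['selected', 'selection'],
                     'photo': ['photos']}
    base_form_tokens = []
    for token in filtered_tokens:
        for base_form, token_list in root_to_token.items():
            if token in token_list:
                base_form_tokens.append(base_form)
                break  # matched: move on to the next token
        else:  # no variant matched: keep the token as-is
            base_form_tokens.append(token)
    return base_form_tokens

print(stemming(['selected', 'photos', 'won']))  # ['select', 'photo', 'won']
```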



&lt;p&gt;Now, using the functions defined above, we form a main preprocessing pipeline, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Finding unique words
&lt;/h2&gt;

&lt;p&gt;After the emails are converted to lists of tokens in their base form, free of punctuation and stopwords, we apply the set() function to keep only the unique words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Extracting feature vectors
&lt;/h2&gt;

&lt;p&gt;We define each feature vector to be of the same length as the list of unique words. For each unique word, if it is present in the particular email, a 1 is added to the vector; otherwise, a 0 is added. For example, for the email "Hey, it's betty!" with the list of unique words being ["hello", "hey", "sandwich", "i", "it's", "show"], the feature vector is [0, 1, 0, 0, 1, 0]. Note that "betty" is not present in the list of unique words, so it is ignored in the final result. &lt;/p&gt;

&lt;p&gt;This is demonstrated in the code snippet below, where the feature vector is a Python dictionary whose keys are the unique words and whose values indicate whether each word is present in the email. The label for each email is also stored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
&lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;#email[1] is the label for each email
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
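&lt;p&gt;The worked example above ("Hey, it's betty!") can be checked with a short plain-Python sketch; the word lists here are the hypothetical ones from the example, not real data:&lt;/p&gt;

```python
unique_words = ["hello", "hey", "sandwich", "i", "it's", "show"]
email_tokens = ["hey", "it's", "betty"]  # tokens of "Hey, it's betty!"

# Dict-style feature vector, matching the format used in the snippets here:
# each unique word maps to True/False depending on its presence in the email.
feature_vec = {word: word in email_tokens for word in unique_words}
print([int(v) for v in feature_vec.values()])  # → [0, 1, 0, 0, 1, 0]
```

Note that "betty" never enters the feature vector, since the keys come only from the unique-word list.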



&lt;p&gt;This way, we generate our training data. The complete pipeline up to this stage is given in the code snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;word_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feature_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
    &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
    &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Applying Naive Bayes Classifier
&lt;/h2&gt;

&lt;p&gt;The Naive Bayes Classifier is imported from the nltk module. We can now find the feature vector for any email (say, "test_features") and classify whether it is spam or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete pipeline for testing is given below -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;word_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email_str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, you now know the ins and outs of any basic natural language processing pipeline. Hope this helped!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>nlp</category>
      <category>preprocessing</category>
    </item>
    <item>
      <title>Logistic Regression at a glance</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Wed, 21 Jul 2021 19:56:15 +0000</pubDate>
      <link>https://forem.com/saxenamansi/logistic-regression-at-a-glance-5h50</link>
      <guid>https://forem.com/saxenamansi/logistic-regression-at-a-glance-5h50</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Logistic Regression?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In problems where a discrete value (0, 1, 2, ...) is to be predicted based on some input values, Logistic Regression can be very handy. Examples of such problems are detecting whether a student will be selected for a graduate program based on their profile, or whether an Instagram account has been hacked based on its recent activity. These problems can be solved by "Supervised Classification Models", one of which is Logistic Regression. &lt;/p&gt;

&lt;p&gt;To build such a model, we need to supply it with some training data, i.e., samples of various input values and their corresponding discrete-valued outputs. The input can be defined in terms of several independent features on which the output depends. For instance, if we take the problem of predicting whether an Instagram account has been hacked, we can define independent features such as "activity time", "5 recent texts", "5 recent comments", "10 recently liked posts" and so on. Using this training data, the model essentially "learns" what the traits of a hacked Instagram account are, and uses this knowledge to make predictions on other accounts to check if they are hacked. &lt;/p&gt;

&lt;p&gt;However, you and I both know it is not that simple. So what goes on behind this black box?  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diving into the math!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let us set some notations. &lt;/p&gt;

&lt;p&gt;If we have "n" features and "m" training samples, they can be arranged in an "n*m" matrix consisting of training samples as column vectors horizontally stacked together as given in the image below. Let us call this matrix X. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rs0szss5ef4e569omgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rs0szss5ef4e569omgj.png" alt="Training Matrix" width="420" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has a corresponding vector which contains the discrete valued outputs for each training sample. It is a single column vector of dimension m*1. Let us call this vector Y. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfo7wgklqnf5dnqd8cot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfo7wgklqnf5dnqd8cot.jpg" alt="Alt Text Output Labels" width="176" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the notations set and out of the way, let's get to the heart of logistic regression! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The equations&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We first calculate the probability that the output value for a particular input is 1 (given that the set of output labels is {0, 1}), denoted as shown below - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4288biofb66tr6w6liw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4288biofb66tr6w6liw.png" alt="image" width="224" height="41"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;First, a hypothesis value Z is calculated by multiplying the transpose of a weight parameter W (a column vector of dimensions n*1) with the matrix X (of dimensions n*m), and then adding a bias parameter b (a row vector of dimension 1*m). This gives us Z, a row vector of dimension 1*m. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwau1diey64hlf7ulc4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwau1diey64hlf7ulc4q.png" alt="image" width="165" height="43"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Then, a non-linear activation function called the "sigmoid" is applied to Z to give us the predicted probability for that particular input set. It outputs a value between 0 and 1, as shown in the figure below. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ujn3m07be35vygbskj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ujn3m07be35vygbskj.png" alt="Sigmoid" width="235" height="154"&gt;&lt;/a&gt;&lt;br&gt;
The equation for the sigmoid function is - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q70xfmmkbdl0bivx4af.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q70xfmmkbdl0bivx4af.png" alt="image" width="200" height="77"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Thus, our final equation becomes - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53w95a1a5ce56in69jra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53w95a1a5ce56in69jra.png" alt="Alt Text" width="246" height="85"&gt;&lt;/a&gt;&lt;br&gt;
This gives us a row vector of dimension 1*m containing the predicted probabilities of the m training samples. When the probability is greater than 0.5, the sample is classified as output 1; otherwise, it is classified as output 0. &lt;/p&gt;
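&lt;p&gt;As a minimal sketch, the forward pass for a single sample can be written in plain Python (the weights here are hypothetical, chosen for illustration; a real implementation would vectorize this over the whole matrix X):&lt;/p&gt;

```python
import math

def sigmoid(z):
    # Squashes any real z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    # z = w . x + b for a single sample x; w and x are plain lists here.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical weights: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, sigmoid(0.1) ≈ 0.52
print(predict([1.0, 2.0], [0.5, -0.25], 0.1))  # → 1
```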

&lt;p&gt;Here, the parameters W and b are trained and set to optimal values that give the highest accuracy in predicting the probability that the output is 1. A loss value is calculated for each training example, and depending on the value, the parameters are adjusted to give better results and reduce this loss value. This is essentially what is referred to as "training" a model. A low loss value suggests that the model has been successfully trained (or that the model is overfitting, but that is a concept for another blog 😁). This loss value is calculated by the equation - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjclegn71q8et5h4ak3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjclegn71q8et5h4ak3k.png" alt="image" width="542" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, we see that - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx5f46or44l50visahwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx5f46or44l50visahwi.png" alt="image" width="478" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the loss function, we calculate the cost function, which averages the loss function values over all m training examples. It is calculated using the formula below - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr0jctkf6x78j52lzsk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr0jctkf6x78j52lzsk6.png" alt="image" width="392" height="90"&gt;&lt;/a&gt; &lt;/p&gt;
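&lt;p&gt;A minimal plain-Python sketch of the cross-entropy loss and the averaged cost (the predicted probabilities below are made up for illustration):&lt;/p&gt;

```python
import math

def loss(y, a):
    # Cross-entropy loss for one example with true label y (0 or 1)
    # and predicted probability a: -(y*log(a) + (1-y)*log(1-a)).
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def cost(labels, probs):
    # Cost J: average of the per-example losses over all m training samples.
    return sum(loss(y, a) for y, a in zip(labels, probs)) / len(labels)

# Confident, correct predictions give a small cost; confident, wrong ones blow it up.
print(cost([1, 0], [0.9, 0.2]))  # ≈ 0.164
print(cost([1, 0], [0.1, 0.9]))  # ≈ 2.303
```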

&lt;p&gt;Now, to adjust the values of the parameters W and b, we use the famous gradient descent algorithm (which is also for another blog 😁). This formula is given below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwmaoyre9oqlj6wa2r54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwmaoyre9oqlj6wa2r54.png" alt="image" width="270" height="146"&gt;&lt;/a&gt;&lt;br&gt;
This formula comes from the gradient descent algorithm. Here, the parameter alpha is called the learning rate. A large learning rate causes large adjustments to the parameters, while a small learning rate causes smaller adjustments. It can be tuned according to our requirements. &lt;/p&gt;
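&lt;p&gt;One gradient-descent iteration can be sketched in plain Python as follows; the gradient formulas are the standard ones for logistic regression with the cross-entropy loss, and the toy dataset is invented for illustration:&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(X, Y, w, b, alpha):
    # For logistic regression, dJ/dw_j = (1/m) * sum_i (a_i - y_i) * x_ij and
    # dJ/db = (1/m) * sum_i (a_i - y_i), where a_i is the predicted probability.
    m = len(X)
    preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
    errors = [a - y for a, y in zip(preds, Y)]
    dw = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(len(w))]
    db = sum(errors) / m
    # Update rule: w := w - alpha * dw, b := b - alpha * db
    return [wj - alpha * dwj for wj, dwj in zip(w, dw)], b - alpha * db

# Toy data: feature 0 indicates class 1, feature 1 indicates class 0.
w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b = gradient_step([[1, 0], [0, 1]], [1, 0], w, b, alpha=0.5)
print(w)  # first weight pushed positive, second pushed negative
```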

&lt;p&gt;And voila! That wraps up one iteration of training our Logistic Regression Model! Connect enough of these units together, with slight modifications, and we get a neural network!&lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading this, thank you for reading till the end!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
