Forem: Apoorva Dave

Scrape Instagram using python

Apoorva Dave — Sun, 13 Jun 2021 10:21:47 +0000

This post shows how to connect to Instagram using Python and extract your list of followers, the list of people you follow, and the list of people you may want to unfollow (people you follow who don't follow you back 😛).

We can easily do this using a Python package called instaloader (a third-party package, installable with pip). If you want to jump straight to the code and see it in action, here it is - https://github.com/apoorva-dave/instagram-scraper

Dependencies

Python3
Instaloader
Numpy

Code

Create a file insta_scraper.py which will handle the below 4 steps

Create a session

In the code block below, we create an Instaloader instance and log in with the username and password provided by the user. Once that is done, a Profile instance is created to fetch the profile's metadata.

    def create_session(self):

        L = instaloader.Instaloader()
        L.login(self.username, self.password)  # Login or load session
        self.profile = instaloader.Profile.from_username(L.context, self.username)  # Obtain profile metadata

Get list of followers

    def scrape_followers(self):

        for follower in self.profile.get_followers():
            self.followers_list.append(follower.username)

Get list of following

    def scrape_following(self):

        for followee in self.profile.get_followees():
            self.following_list.append(followee.username)

Get unfollow list

This generates an unfollowers_<USERNAME>.txt file in your current directory containing the list of people you follow who don't follow you back.

    def generate_unfollowers_list(self):

        # people who are in the following list but not in the followers list
        unfollow_list = np.setdiff1d(self.following_list, self.followers_list)
        print("People to unfollow: ", unfollow_list)
        filename = "unfollowers_" + self.username + ".txt"
        # the with-block closes the file even if a write fails
        with open(filename, "w") as file:
            for person in unfollow_list:
                file.write(person + "\n")
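
For intuition, np.setdiff1d returns the sorted, unique values that are in the first list but not in the second. Here is a minimal pure-Python sketch of the same idea. The helper name find_non_followers and the sample usernames are made up for illustration and are not part of the repository.

```python
def find_non_followers(following, followers):
    """Return the accounts in `following` that are absent from `followers`, sorted."""
    return sorted(set(following) - set(followers))


# Made-up usernames, only to show the behaviour
following = ["carol", "alice", "bob", "alice"]
followers = ["bob", "dave"]
print(find_non_followers(following, followers))  # ['alice', 'carol']
```

Duplicates are removed and the result is sorted, so the output does not depend on the order in which Instagram returns the accounts.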

The code can then be executed from a runner script, main.py, which invokes create_session() with the user's username and password. The design ensures that the credentials are needed only when creating the session; after that, we can directly call methods such as scrape_followers() as required.

Instaloader is a very efficient package. We can do much more using it. Please check the documentation here for more details - https://instaloader.github.io/as-module.html

You can find the complete working code in the GitHub repository linked above, along with a README that explains how to run it.

This is it for this article. Happy learning!! <3

Guide to Problem Solving

Apoorva Dave — Tue, 10 Mar 2020 06:03:21 +0000

Hello coders! This short post is for people who are looking for a job change and are stuck in the problem-solving rounds of interviews. While studying for these rounds myself, I found that there are not many resources to help us out, so I decided to make YouTube videos while practicing. This helps me build confidence, and the videos can be used by other folks, and by me, for future reference. Below is the link to my video.

Please like and subscribe if you find the video helpful 😊

Let me know in the comments in case you want me to make videos on some particular topic. 😊

Nevertheless, Apoorva continued to code

Apoorva Dave — Sat, 07 Mar 2020 12:01:44 +0000

It is at this time of the year again when people come forward and share their stories. Each woman has a story, and so do I. It doesn't matter if it's a story of success or failure. For people who don't know me, I am Apoorva, a Software Engineer working in the domain of Machine Learning. 😎

It always seems impossible until it’s done.

This quote suits me really well, and I guess most people can relate to it. The problem is the START. How, what, and when? It is always difficult to start something new or something we are not comfortable with.

When I was in college, I used to dream of working in the best tech firms and of being so good that people would want to hire me to be a part of their company. But as I grew up, I saw people who were way ahead of me, after which I started losing my confidence. This fear, in fact, stopped me to such an extent that the thought of coding used to scare me because I was very sure that I would fail badly.

I was inspired by Katie Bouman, the woman behind the first black hole image. 😲

I started appearing for interviews to get a job. In a few of them, I used to feel interviewers tend to think less of you because they see a girl sitting in front of them. And a girl who codes didn't sound familiar to them. This motivated me to actually start practicing and somehow prove them wrong. It was then I realized, it is not that complicated after all. I started solving easier questions (data structures and algorithms mostly) and then moved on to medium ones. I tracked my progress in a GitHub repository and made sure that I at least checked in 1 commit every day.

I improved a lot but still, there is a long way to go. In this process, I started feeling more equipped. If you work hard it is possible to do what you think is impossible. 😊

It does not matter how slowly you go as long as you do not stop

This is what motivates me to keep trying no matter how much time it takes :)

Environmental Sound Classification

Apoorva Dave — Sun, 15 Sep 2019 13:50:17 +0000

We have seen the basics of Machine Learning, Classification and Regression. In this article, we will dive a little deeper and work on audio classification. We will train a Convolutional Neural Network, a Multi Layer Perceptron and an SVM for this task. The same code can be easily extended to train other classification models as well. I strongly recommend going through the previous articles on the basics of Classification if you haven't already.

The main question here is how we can handle audio files and convert them into a form that we can feed into our neural networks.

It will take less than an hour to set up and get your first working audio classifier! So let's get started! 😉

Dependencies

We will be using Python. Before we begin coding, we need the modules below, which can be easily installed using pip.

keras
librosa
sounddevice
SoundFile
scikit-learn
matplotlib

Dataset

We are going to use the ESC-10 dataset for sound classification. It is a labeled set of 400 environmental recordings (10 classes, 40 clips per class, 5 seconds per clip). It is a subset of the larger ESC-50 dataset.

Each class contains 40 .ogg files. The ESC-10 and ESC-50 datasets have been prearranged into 5 uniformly sized folds so that clips extracted from the same original source recording are always contained in a single fold.

Visualize Dataset

Before we extract features and train our model, let's visualize the waveform for the different classes present in our dataset.

import os
import sys
import wave

import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

The visualize_wav() function below takes an ogg file and reads it with the soundfile module, which returns the audio data and the sample rate. We then use sf.write() to write a wav file corresponding to the ogg file. Using matplotlib, we plot the signal against time and save the plot.

def visualize_wav(oggfile):

    data, samplerate = sf.read(oggfile)

    os.makedirs('sample_wav', exist_ok=True)

    # write the wav file, then open that same path again for reading
    wav_path = 'sample_wav/new_file.wav'
    sf.write(wav_path, data, samplerate)
    spf = wave.open(wav_path)
    signal = spf.readframes(-1)
    # np.fromstring is deprecated for binary data; frombuffer reads the 16-bit PCM frames
    signal = np.frombuffer(signal, dtype=np.int16)

    if spf.getnchannels() == 2:
        print('just mono files. not stereo')
        sys.exit(0)

    # plotting x axis in seconds. create time vector spaced linearly with size of audio file. divide size of signal by frame rate to get stop limit
    Time = np.linspace(0, len(signal)/samplerate, num=len(signal))
    plt.figure(1)
    plt.title('Signal Wave Vs Time(in sec)')
    plt.plot(Time, signal)
    plt.savefig('sample_wav/sample_waveplot_Fire.png', bbox_inches='tight')
    plt.show()

Waveplot for a dog's sound

You can run the same code to generate wave plot for different classes and visualize the difference.

Feature Extraction

For each audio file in the dataset, we will extract MFCCs (mel-frequency cepstral coefficients, which give a compact, image-like representation of each audio sample) along with its classification label. For this we will use Librosa’s mfcc() function, which generates MFCCs from time series audio data.

get_features() takes an .ogg file and extracts mfcc using Librosa library.

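def get_features(file_name):

    # read the audio as float32 samples along with its sample rate
    X, sample_rate = sf.read(file_name, dtype='float32')

    # mfcc (mel-frequency cepstral coefficients), one row per coefficient and one column per frame
    mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40)
    # average over time so that every clip becomes a fixed-length vector of 40 values
    mfccs_scaled = np.mean(mfccs.T,axis=0)
    return mfccs_scaled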
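
To see what the averaging step does, here is a small pure-Python sketch with a made-up matrix standing in for the MFCC output. The helper mean_over_time and the toy values are illustrative assumptions. The real array has 40 rows and as many columns as there are audio frames.

```python
def mean_over_time(coefficients):
    """Average each coefficient row across its frames (columns)."""
    return [sum(row) / len(row) for row in coefficients]


# Toy "MFCC" matrix: 3 coefficients (rows) x 4 frames (columns)
toy_mfccs = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],
]
print(mean_over_time(toy_mfccs))  # [2.5, 1.0, 5.0]
```

Whatever the clip length, the result has one value per coefficient, which is why the network later expects an input of exactly 40 numbers.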

The dataset is downloaded inside a folder "dataset". We will iterate through the subdirectories (each class) and extract features from their ogg files. Finally we will create a dataframe with mfcc feature and corresponding class label.

def extract_features():

    # path to dataset containing 10 subdirectories of .ogg files
    sub_dirs = os.listdir('dataset')
    sub_dirs.sort()
    features_list = []
    for label, sub_dir in enumerate(sub_dirs):  
        for file_name in glob.glob(os.path.join('dataset',sub_dir,"*.ogg")):
            print("Extracting file ", file_name)
            try:
                mfccs = get_features(file_name)
            except Exception as e:
                print("Extraction error:", e)
                continue
            features_list.append([mfccs,label])

    features_df = pd.DataFrame(features_list,columns = ['feature','class_label'])
    print(features_df.head())    
    return features_df

Train model

Once we have extracted the features, we need to convert them into numpy arrays so that they can be fed into the neural network.

def get_numpy_array(features_df):

    X = np.array(features_df.feature.tolist())
    y = np.array(features_df.class_label.tolist())
    # encode classification labels
    le = LabelEncoder()
    # one hot encoded labels
    yy = to_categorical(le.fit_transform(y))
    return X,yy,le
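
As an illustration of what the label encoding and one-hot steps produce, here is a pure-Python sketch. The function one_hot is a stand-in written for this example, not the Keras or scikit-learn implementation.

```python
def one_hot(labels):
    """Map each label to an integer index, then to a one-hot row."""
    classes = sorted(set(labels))  # sorted unique labels, like LabelEncoder learns
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows, classes


encoded, classes = one_hot([2, 0, 1, 2])
print(classes)  # [0, 1, 2]
print(encoded)  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each row has a single 1.0 at the position of its class, which is the target format that categorical_crossentropy expects.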

X and yy are split into train and test data in an 80-20 ratio.

def get_train_test(X,y):

    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)
    return  X_train, X_test, y_train, y_test

Now we will define our model architecture. We will use Keras for creating our Multi Layer Perceptron network.

def create_mlp(num_labels):

    model = Sequential()
    model.add(Dense(256,input_shape = (40,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(256))  # input shape is inferred from the previous layer
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(num_labels))
    model.add(Activation('softmax'))
    return model
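
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. This sketch assumes num_labels is 10 (the ten ESC-10 classes). The helper dense_params is written for this example only.

```python
def dense_params(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units


# 40 MFCC inputs, two hidden layers of 256 units, 10 output classes
layer_sizes = [40, 256, 256, 10]
per_layer = [dense_params(i, u) for i, u in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [10496, 65792, 2570]
print(sum(per_layer))  # 78858
```

The Activation and Dropout layers add no parameters, so the total should match what model.summary() reports for the Dense layers.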

Once the model is defined, we need to compile it by defining the loss, metrics and optimizer. The model is then fitted to the training data X_train and y_train. Our model is trained for 100 epochs with a batch size of 32. The trained model is finally saved to disk as an .h5 file. This model can be loaded later for prediction.


def train(model,X_train, X_test, y_train, y_test,model_file):    

    # compile the model 
    model.compile(loss = 'categorical_crossentropy',metrics=['accuracy'],optimizer='adam')

    print(model.summary())

    print("training for 100 epochs with batch size 32")

    model.fit(X_train,y_train,batch_size= 32, epochs = 100, validation_data=(X_test,y_test))

    # save model to disk
    print("Saving model to disk")
    model.save(model_file)

This is it! We have trained our Environmental Sound Classifier!! 😃

Compute Accuracy

Now obviously we want to check how well our model is performing 😛

def compute(X_test,y_test,model_file):

    # load model from disk
    loaded_model = load_model(model_file)
    score = loaded_model.evaluate(X_test,y_test)
    return score[0],score[1]*100

Test loss 1.5628961682319642
Test accuracy 78.7

Make Predictions

We can also predict the class label for any input file we provide using below code -

def predict(filename,le,model_file):

    model = load_model(model_file)
    prediction_feature = extract_features.get_features(filename)
    if model_file == "trained_mlp.h5":
        prediction_feature = np.array([prediction_feature])
    elif model_file == "trained_cnn.h5":    
        prediction_feature = np.expand_dims(np.array([prediction_feature]),axis=2)

    # predict_classes() and predict_proba() are no longer available in recent Keras releases;
    # predict() returns the class probabilities and argmax picks the most likely class
    predicted_proba_vector = model.predict(prediction_feature)
    predicted_vector = np.argmax(predicted_proba_vector, axis=1)
    predicted_class = le.inverse_transform(predicted_vector)
    print("Predicted class",predicted_class[0])

    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

This function loads our pre-trained model, extracts the MFCC features from the ogg file you provide and outputs the predicted probability for each class. The class with the maximum probability is our predicted class! 😃

For a sample ogg file of the dog class, the following were the probability predictions -

Predicted class 0
0                :  0.96639919281005859375000000000000
1                :  0.00000196780410988139919936656952
2                :  0.00000063572736053174594417214394
3                :  0.00000597824555370607413351535797
4                :  0.02464177832007408142089843750000
5                :  0.00003698830187204293906688690186
6                :  0.00031352625228464603424072265625
7                :  0.00013375715934671461582183837891
8                :  0.00846461206674575805664062500000
9                :  0.00000165236258453660411760210991

The class predicted is 0 which was the class label for Dog.
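
The step from probabilities to a predicted class is just an argmax. Here is a small sketch that uses the values printed above. The variable names are chosen for this example.

```python
# Probabilities copied from the sample output above (index = class label)
probabilities = [
    0.96639919281005859375,
    0.00000196780410988139919936656952,
    0.00000063572736053174594417214394,
    0.00000597824555370607413351535797,
    0.02464177832007408142089843750000,
    0.00003698830187204293906688690186,
    0.00031352625228464603424072265625,
    0.00013375715934671461582183837891,
    0.00846461206674575805664062500000,
    0.00000165236258453660411760210991,
]

# Index of the largest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_class)               # 0
print(round(sum(probabilities), 4))  # 1.0
```

The probabilities come from a softmax layer, so they add up to 1 apart from floating-point rounding.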

Conclusion

Working with audio files is not as tough as it sounded at first. Audio files can easily be represented as time series data, and Python has ready-made libraries that make our task simpler.

You can also check the entire code for this in my GitHub repo. There I have trained SVM, MLP and CNN models on the same dataset, and the code is arranged in separate files, which makes it easy to understand.

https://github.com/apoorva-dave/Environmental-Sound-Classification

Though I have trained 3 different models for this, there was very little variance in accuracy among them. Do leave comments if you find any way to improve this score.

If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸

Getting started with Tensorflow 2.0

Apoorva Dave — Sat, 18 May 2019 16:08:57 +0000

Tensorflow is an open source platform for machine learning. Using tensorflow, we can easily code, build and deploy our machine learning models.

Tensorflow 2.0 focuses on simplicity and ease of use, featuring updates like:

Easy model building with Keras.
Robust model deployment in production on any platform.
Powerful experimentation for research.
Simplifying the API by cleaning up deprecated APIs and reducing duplication.

This article is for those who want to know how they can start with Tensorflow 2.0. It will help you to create your own image classification model in less than an hour! So let’s get started 😃

Setting up Tensorflow 2.0

Install Tensorflow 2.0 package using pip -

pip install tensorflow==2.0.0-alpha0

To verify if it is installed correctly, try importing tensorflow and checking its version. (It should point to 2.0.0-alpha0)
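
One way to do this check from a script is sketched below. The helper installed_version is a hypothetical name that relies on the standard importlib.metadata module (Python 3.8+). Inside an interactive session, importing tensorflow as tf and printing tf.__version__ does the same job.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if tensorflow is installed, otherwise None
print(installed_version("tensorflow"))
```

Returning None instead of raising makes it easy to print a friendly "please install" message in a setup script.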

Keras Overview

Before we start with tensorflow, we should have a brief overview of what Keras is. Keras is a high-level neural networks API, written in Python, that is capable of running on top of TensorFlow, CNTK, or Theano. Keras is extremely user-friendly and helps you build a model in no time.

The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the Sequential model, a linear stack of layers.

Sequential model is defined as:

from keras.models import Sequential
model = Sequential()

Stacking layers using .add():

from keras.layers import Dense
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

Once the model building is complete, its learning process can be configured with .compile():

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

To know more on Keras follow the link

Classification on Fashion MNIST dataset

We are ready to build our very own classification model! This is like “Hello World” in tensorflow 😄

We have taken Fashion MNIST dataset that has 70,000 grayscale images in 10 categories. Each image in the dataset is a type of clothing garment in a resolution of 28 by 28 pixels.

Do the necessary imports

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

The dataset can be directly loaded from keras.datasets

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

The training set train_images has 60k images while test_images has 10k images. Each image is a 28x28 array, with pixel values ranging from 0 to 255. The labels are an array of integers, ranging from 0 to 9 which correspond to the class of clothing. 0 corresponds to T-shirt/top, 1 to trouser and so on. These 10 classes of clothing type are mapped to class_names

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Scale the train and test images so that the pixel values of both sets lie between 0 and 1.

train_images = train_images / 255.0
test_images = test_images / 255.0

You can plot the first 10 images to check the data.

plt.figure(figsize=(10,10))
for i in range(10):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.show()

Fashion MNIST (first 10 images of train set)

Building a Sequential model using keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

While dealing with images in neural networks we need to flatten the 2D array of 28 by 28 to a 1D array (of 28 * 28 = 784 pixels). tf.keras.layers.Flatten layer performs this task. After this layer, there are two dense layers (fully connected) with 128 and 10 neurons respectively. Softmax activation function in the last layer returns an array of 10 values which correspond to the probability scores that sum to 1. The class for which we get the highest probability is assigned to the input image.
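
To make the softmax step concrete, here is a minimal pure-Python sketch. The raw scores are made up and only three classes are used instead of ten. This is not the Keras implementation, only an illustration of the formula.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # made-up raw outputs for 3 classes
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # 0
```

The largest raw score always gets the largest probability, so the predicted class is the same before and after softmax. What softmax adds is a normalized confidence for each class.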

Compiling model

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Loss function — This measures how accurate the model is during training. We want to minimize this function.
Optimizer — This is how the model is updated based on the data it sees and its loss function.
Metrics — Used to monitor the training and testing steps. accuracy measures the fraction of the images that are correctly classified.
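
For a single image, the loss and the metric above boil down to a few lines of arithmetic. The following is a sketch with made-up probabilities and labels. The function names mirror the Keras option names but are written here for illustration only.

```python
import math


def sparse_categorical_crossentropy(probabilities, label):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probabilities[label])


def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


print(sparse_categorical_crossentropy([0.1, 0.8, 0.1], 1))  # about 0.223
print(accuracy([9, 2, 1, 1], [9, 2, 1, 6]))                 # 0.75
```

The "sparse" variant takes the label as an integer index, which is why the labels from load_data() can be used directly without one-hot encoding.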

Training the model

model.fit(train_images, train_labels, epochs=5)

To start training, we call the model.fit method to fit the model to training data.

Evaluating accuracy on test_images

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('\nTest accuracy:', test_acc)

With this simple model, we achieve an accuracy of ~0.86

Making predictions on test images

predictions = model.predict(test_images)
predictions[0]

predictions[0] gives the probability scores for the 1st test image. We can check the maximum probability or the class which our model has predicted.

np.argmax(predictions[0])

It gives a value 9 which shows that the model has identified this image to be an ankle boot, or class_names[9]. We can verify this result by checking test_labels[0] which also has a value 9. This shows that our model was able to predict the correct value for the first test image :)

This is it! 😛 You have your classification model built with Tensorflow 2.0 and Keras up and running in no time. You can play with the code to check the impact of changing the loss function, the number of epochs, or the optimizer. If needed, you can have a look at the code for this in my GitHub repo.

Google Colab

Colaboratory allows us to use and share Jupyter notebooks with others without having to download, install, or run anything on our own computer other than a browser. All our notebooks are saved on Google Drive. In Colab, the code is executed in a virtual machine dedicated to our account. Virtual machines are recycled when idle for a while and have a maximum lifetime enforced by the system. We often face issues when training models on our own machines; training can be done easily on Colab, as it provides free GPU and TPU. To set up Colab you can follow the link 😺

This is a very brief overview of getting started with Tensorflow 2.0. I am in the process of learning myself; as I learn, I will be writing about how we can do regression, text classification, saving models, transfer learning, tensors and operations 😸 Do show some ❤ if you found the article helpful. Stay tuned for more tensorflow! 😄

Classification from scratch — Mammographic Mass Classification

Apoorva Dave — Sun, 17 Mar 2019 06:32:15 +0000

In our previous article, we discussed the classification technique in theory. It’s time to play with the code 😉 Before we can start coding, the following libraries need to be installed in our system:

Pandas: pip install pandas
Numpy: pip install numpy
scikit-learn: pip install scikit-learn

The task here is to classify Mammographic Masses as benign or malignant using different Classification algorithms including SVM, Logistic Regression and Decision Trees. A benign tumor doesn’t invade other tissues, whereas a malignant one does spread. Mammography is the most effective method for breast cancer screening available today.

Dataset

The dataset used in this project is “Mammographic masses” which is a public dataset from UCI repository (https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

It can be used to predict the severity (benign or malignant) of a mammographic mass from BI-RADS attributes and the patient’s age. Number of Attributes: 6 (1 goal field: severity, 1 non-predictive: BI-RADS, 4 predictive attributes)

Attribute Information:

BI-RADS assessment: 1 to 5 (ordinal)
Age: patient’s age in years (integer)
Shape (mass shape): round=1, oval=2, lobular=3, irregular=4 (nominal)
Margin (mass margin): circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
Density (mass density): high=1, iso=2, low=3, fat-containing=4 (ordinal)
Severity: benign=0 or malignant=1 (binomial)

Screenshot of top 10 rows of the dataset

So we talked a lot about the theory behind it. It’s fairly simple to build a classification model. Follow the below steps and get your own model in an hour 😃 So let’s get started!

Approach

Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import svm
from sklearn import linear_model

Read the data using pandas into a dataframe. To check the top 5 rows of the dataset, use df.head() . You can specify the number of rows as an argument to this function in case you want to check different number of rows. BI-RADS attribute has been given as non-predictive in the dataset and so it won’t be taken into consideration.

input_file = 'mammographic_masses.data.txt'
masses_data = pd.read_csv(input_file,names =['BI-RADS','Age','Shape','Margin','Density','Severity'],usecols = ['Age','Shape','Margin','Density','Severity'],na_values='?')
masses_data.head(10)

You can get a description of the data like values of count, mean, standard deviation etc as masses_data.describe()

As you might have observed, there are missing values in the dataset. Handling missing data is something very important in data preprocessing. We fill out the empty values using the mean or mode of the column depending on the data analysis. For simplicity, as of now, you can drop the null values from the data.

masses_data = masses_data.dropna()
features = list(masses_data.columns[:4])
X = masses_data[features].values
print(X)
labels = list(masses_data.columns[4:])
y = masses_data[labels].values
y = y.ravel()
print(y)

The vector X contains the input features from column 1 to 4 except the target variable. Their values will be used for training. The target variable i.e Severity is stored in the vector y.
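
As an alternative to dropping rows, the mean-based filling mentioned above can be sketched in plain Python. The function fill_with_mean and the sample ages are made up for illustration. For nominal columns such as Shape, the mode would be the more natural choice.

```python
from statistics import mean


def fill_with_mean(column):
    """Replace None entries with the mean of the known values."""
    known = [v for v in column if v is not None]
    column_mean = mean(known)
    return [column_mean if v is None else v for v in column]


ages = [67, None, 58, 28, None, 57]  # made-up values; None marks a missing entry
print(fill_with_mean(ages))          # [67, 52.5, 58, 28, 52.5, 57]
```

Filling keeps every row in the dataset, at the cost of inventing values, so it is worth comparing model accuracy with both approaches.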

Scale the input features to normalize the data within a particular range. Here we are using StandardScaler() which transforms the data to have a mean value 0 and standard deviation of 1.

scaler  = StandardScaler()
X = scaler.fit_transform(X)
print(X)
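
The transformation applied to each column is z = (x - mean) / standard deviation, using the population standard deviation. Here is a minimal sketch for a single column. The function standardize is written for this example and is not the scikit-learn code.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]


scaled = standardize([2.0, 4.0, 6.0])
print(scaled)  # roughly [-1.22, 0.0, 1.22]
```

After scaling, features measured on very different ranges (such as Age and Density) contribute on a comparable scale, which matters for SVM and logistic regression.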

Create training and testing set using train_test_split. 25% of the data is used for testing and 75% for training.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=0)

To build a Decision Tree Classifier from the training set, we just need to use DecisionTreeClassifier(). It has a number of parameters, which you can read about in the scikit-learn documentation. For now, we will just use the default value of each parameter. Use predict() on the test input features X_test to get the predicted values y_pred. The score() function can be used directly to compute the accuracy of the predictions on the test samples.

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_pred)
clf.score(X_test, y_test)

The DecisionTreeClassifier() without any tuning gives a result around 77% which we can say is not the worst.

To build an SVM classifier, the classes provided by scikit-learn include SVC, NuSVC, and LinearSVC. We will build a classifier using SVC class and linear kernel. (To know the difference between SVC with linear kernel and LinearSVC you can go to the link — https://stackoverflow.com/questions/45384185/what-is-the-difference-between-linearsvc-and-svckernel-linear/45390526)

svc = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(svc,X,y,cv=10)
print(scores)
print(scores.mean())

In this section, I am trying to show you a different approach for evaluating a classifier. The svc classifier object is created using the SVC class. The cross_val_score() function then evaluates it using cross-validation on the full dataset X and y. Cross-validation gives a more reliable estimate of how the model generalizes than a single train/test split, which helps us detect overfitting. k-fold cross-validation means that k-1 folds of the data are used for training and 1 fold for testing, rotating until every fold has been used for testing once. The score obtained using this is around 79.5%.

Cross-Validation (source:https://scikit-learn.org/stable/modules/cross_validation.html)
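
The fold rotation can be sketched in a few lines of plain Python. This is a simplified illustration with contiguous folds. The generator k_fold_indices is a made-up helper, and scikit-learn itself uses stratified folds for classifiers, so its exact splits will differ.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Every sample lands in the test fold exactly once, and the mean of the k scores is what scores.mean() reports.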

Similar to the Decision Tree Classifier, we can also create Logistic Regression classifier. The function LogisticRegression() is used. The classifier is fitted on the training set and similarly used to predict target values for the test set. It gives a mean score of 80.5%

clf = linear_model.LogisticRegression(C=1e5)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = model_selection.cross_val_score(clf,X,y,cv=10)
print(scores)
print(scores.mean())

Thus, if we want to build a single classifier we can do it in just 10 lines of code😄. And in no effort, we achieved an accuracy of 80%. You can create your own classification models (there are plenty of options) or fine-tune any of these. Also if you are interested you can give a shot to Artificial Neural Networks as well 😍. For me, I got the best accuracy of 84% with ANNs. To get the entire code, please use this link

If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸

Nevertheless, Apoorva Coded

Apoorva Dave — Sat, 09 Mar 2019 06:52:47 +0000

The title itself gives me goosebumps! 😄 I never thought I would be writing a post about why I am in this field. But when I went through the inspiring stories of the other great techie women, I understood it surely does have an impact. You read some and get motivated, you relate to some, and you feel yes, I am not the only one who faced it. And the feeling is just amazing, and at the same time satisfying.

I am the Best!

Well for me it all began after I did an internship in the 3rd year of my college. During the first 3 years, I was (most of the time) on top of the world thinking I worked hard to get admission in such a big college and now my life is set. Nothing can stop me from becoming the next Bill Gates. Yes, I was that stupid!

Am I?

During my internship, I met such amazing people who just blew my mind. Some were good in web development, a few in Java, ML and what not! And I sat there thinking what do I have? Did I stand out from other people? The answer was a big NO!

I can never be

And then began my journey of coding. It was very difficult initially. The only thoughts I had were

Is it too late to start?
Have I missed the only opportunity I had?
People are already at least 3 years ahead of me!

THE advice

All these made things worse, I could hardly focus. And then one day, I got life-changing advice from one of my friends,

Apoorva, I try to do one good thing every day. Read, code or try a new hobby! The thing which makes me happy and satisfied. Making me believe that I haven't wasted the entire day.

And this is what clicked for me! Start small and it will automatically become larger.

Read, Learn and Code

I started exploring different fields and finally landed on to Machine Learning. I began to read and code and in no time I started liking it. I used to feel satisfied before going to bed as there were no thoughts haunting me 😛 It brings discipline in life. And when the time comes to enjoy, you party like an animal! Believe me! 😃

Every time I learn something new, it makes me feel there is still more to learn and I feel more motivated than ever. And if I could do it, I know anyone can do it. There is never a perfect time to start something new. We can always find a reason not to do it. But once we start, we will never stop 👼

Classification in Machine Learning - Part 4

Apoorva Dave — Sat, 02 Mar 2019 06:15:59 +0000

In the last few posts, we covered the basics of machine learning and the regression technique. Here we will discuss classification, which is also a supervised problem similar to regression. But instead of predicting continuous values, we predict discrete values. It is used to categorize different objects. The output variable here is a "category" such as red or blue, yes or no. Take an example where we need to classify emails as spam or not spam. There will be a set of features which will be used to classify the emails. Each email is assigned to one of the categories depending upon the values of these features. Once the model is trained on seen (labeled) data, it predicts the labels for unseen data.

Let's get started with the types of classification algorithm 😃

SVM

In SVM or Support Vector Machines, we differentiate between the categories by separating the classes with an optimal hyperplane. Optimal hyperplane is the plane which will have the maximum margin.

In the figure, there could have been multiple hyperplanes separating the classes but the optimal plane is the one with maximum margin as shown above. The points that are closest to the hyperplane which define the margin of hyperplane are called Support vectors. So even if all the other points in the dataset are removed (except the support vectors), the position of hyperplane doesn’t change. This is because the distance of the plane from support vectors remains the same. To summarize, in SVM we want an optimal hyperplane with the maximum margin that separates the classes.

Many times the data is not linearly separable and thus defining a linear hyperplane is not possible. In this figure, the data points belong to two classes inner ring and outer ring. As we can see, these classes are not linearly separable i.e. we cannot define a straight line to separate them. However elliptical or circular “hyperplane” can easily separate the two classes. The features x and y can be used to create a new feature z defined as z = sqrt(x²+y²).

The kernel function is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable. These functions can be of different types, for example linear, polynomial, radial basis function (RBF), and sigmoid.
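To make the ring example concrete, here is a minimal sketch of the new feature z = sqrt(x²+y²). The ring radii, the number of points and the threshold are made-up values for illustration, not taken from the figure.

```python
import math

def radial_feature(x, y):
    """New feature z = sqrt(x^2 + y^2), i.e. the distance from the origin."""
    return math.hypot(x, y)

def ring_points(radius, count):
    """Evenly spaced toy points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / count),
             radius * math.sin(2 * math.pi * i / count)) for i in range(count)]

inner_ring = ring_points(1.0, 8)  # hypothetical "inner ring" class
outer_ring = ring_points(3.0, 8)  # hypothetical "outer ring" class

inner_z = [radial_feature(x, y) for x, y in inner_ring]
outer_z = [radial_feature(x, y) for x, y in outer_ring]

# No straight line separates the rings in the (x, y) plane,
# but on the z axis a single threshold is enough.
threshold = 2.0  # assumed value lying between the two radii
separable = max(inner_z) < threshold < min(outer_z)
print(separable)
```

A kernel does this kind of mapping implicitly, so we never have to build the new feature by hand.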

Decision Trees

Decision trees have a flowchart-like structure. These classifiers work by identifying a splitting node at each step.

The split is decided by Information Gain. The attribute with maximum information gain is identified as the splitting node. The higher the information gain, the lower the entropy after the split. Entropy measures the impurity of the data, i.e. how mixed the classes are. A set’s entropy is zero when it contains instances of only one class.

The steps to build a decision tree classifier are briefly described below:

Calculate the entropy of the target variable.
The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain or decrease in entropy.
The attribute with maximum information gain is selected as splitting attribute and the process is repeated on every branch.
A branch with entropy 0 represents the leaf node. Branch with entropy more than 0 requires further splitting.
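The steps above can be sketched in a few lines of plain Python. The tiny email dataset and its attribute names (has_link, all_caps) are hypothetical and only there to show the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy of the branches."""
    before = entropy([row[target] for row in rows])
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row[target])
    after = sum(len(labels) / len(rows) * entropy(labels)
                for labels in branches.values())
    return before - after

# Hypothetical toy data: has_link separates the classes perfectly, all_caps does not.
emails = [
    {"has_link": "yes", "all_caps": "yes", "label": "spam"},
    {"has_link": "yes", "all_caps": "no", "label": "spam"},
    {"has_link": "no", "all_caps": "yes", "label": "not-spam"},
    {"has_link": "no", "all_caps": "no", "label": "not-spam"},
]

gains = {attribute: information_gain(emails, attribute, "label")
         for attribute in ("has_link", "all_caps")}
best_attribute = max(gains, key=gains.get)
print(best_attribute)  # has_link
```

Both has_link branches have entropy 0, so they would become leaf nodes right away.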

Decision trees are very much prone to overfitting. To fit training data perfectly, splitting is sometimes done to a huge extent. This causes the classifier to lose its generalization capability. And the model performs poorly on test dataset (unseen data).

To handle this overfitting, one of the approaches used is Pruning. It can start either at roots or leaves and involves cutting off some branches of decision trees, thus reducing complexity.

Reduced error pruning: Starting at the leaves, each node is replaced with its most popular class. This change is kept if it doesn’t deteriorate accuracy.

Cost complexity pruning: Here the overall goal is to minimize the cost-complexity function. A sequence of trees T0, ..., Tn is created, where T0 is the whole tree and Tn is the tree consisting only of the root node. At step i, the tree is created by removing a subtree from tree i-1 and replacing it with a leaf node. In each step, the subtree chosen for removal is the one whose removal increases the error the least per leaf removed, which makes it the weakest link of the tree.

Handling of missing values is also incorporated in the decision tree algorithm itself. While finding the candidate for the split, instances with a missing value for an attribute do not produce any information gain for that attribute, so they can be safely ignored. Now consider an example where gender is the splitting node with 2 possible values, male and female. If an instance has a missing value in the gender column, how should we decide to which branch the instance belongs (interesting right? :p). This situation is handled by sending the instance with the missing value to all the child nodes, but with a diminished weight. If there are 10 instances having ‘male’, 30 instances having ‘female’ and 5 instances having a missing value, then all 5 instances with a missing value are sent to both the male and female child nodes, with their weights multiplied by 10/40 for the ‘male’ node and 30/40 for the ‘female’ node.

When at prediction time we encounter a node in the decision tree which tests a variable A, and for that variable we have a missing value, then all the possibilities are explored.
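The weights in the gender example are just the share of each branch among the instances whose value is known. A small sketch of that arithmetic (the function name is my own, not from any library):

```python
def missing_value_weights(branch_counts):
    """Weight given to each branch for an instance whose value is missing."""
    known_total = sum(branch_counts.values())
    return {branch: count / known_total
            for branch, count in branch_counts.items()}

# 10 'male' and 30 'female' instances have a known value.
weights = missing_value_weights({"male": 10, "female": 30})
print(weights)  # {'male': 0.25, 'female': 0.75}
```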

Logistic Regression

Logistic Regression is one of the most used classification algorithms. Don’t get confused by the term regression. It is a classification method where the probability of the default class is predicted. Consider an example where we need to predict an email as spam or not spam (default class being spam). It is a binary classification problem. When we apply logistic regression, the probability that the email is spam is predicted (ranging between 0 and 1). If the value is close to 0, it implies the probability for the email to be spam is close to 0 and hence the class is given as not-spam. A threshold value is taken for determining the class. If the probability is less than the threshold, the email is identified as not-spam, else spam (as the primary class is spam).

The raw value produced by the linear part of the model can be anywhere between -infinity and +infinity, so a sigmoid/logistic function is applied to it to squash it into the range [0,1].

Logistic Function

Logistic regression can be written as ln(p(X) / (1 - p(X))) = b0 + b1*X, where the RHS is a linear equation (similar to linear regression) with b0 and b1 as the coefficients and X as the input feature. It is these coefficients which are learned during the training process. The output variable (y) on the left is the log of the odds of the default class. Odds are calculated as the ratio of the probability of the event to the probability of not the event. p(X) is the probability of the default class.

The coefficients are calculated using Maximum Likelihood Estimation. It searches for the coefficient values that maximize the likelihood of the training data (equivalently, that minimize the negative log-likelihood cost). The best coefficients would result in a model that predicts a value very close to 1 (e.g. spam) for the default class and a value very close to 0 (e.g. not-spam) for the other class.
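Here is a small sketch of the squashing step and the threshold rule. The coefficients b0 and b1 and the feature value are invented for illustration; in practice they come out of training.

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def classify(linear_score, threshold=0.5):
    """Return the predicted class and the probability of the default class (spam)."""
    probability = sigmoid(linear_score)
    label = "spam" if probability >= threshold else "not-spam"
    return label, probability

b0, b1 = -1.5, 1.0     # hypothetical learned coefficients
x = 3.0                # hypothetical feature value
score = b0 + b1 * x    # linear part, can be any real number
label, p = classify(score)

# The log of the odds gives back the linear score.
log_odds = math.log(p / (1 - p))
print(label, round(p, 3), round(log_odds, 3))
```

The 0.5 threshold is only a common default. It can be moved if one kind of mistake is more costly than the other.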

Points to remember while applying logistic regression:

Outliers and misclassified data should be removed from training set.
Remove features that are highly correlated as it would result in overfitting of model.

Naive Bayes

Naive Bayes is a classification algorithm based on Bayes’ Theorem. It makes 2 main assumptions i) each feature is independent of the other feature ii) each feature is given same weight/importance.

Bayes Theorem

Bayes’ theorem states that P(A|B) = P(B|A) * P(A) / P(B). By P(A|B) we mean the probability of event A given that event B is true. Some popular naive Bayes’ classification algorithms are:

Gaussian Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
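As a quick worked example of Bayes’ theorem, the sketch below computes the probability that an email is spam given that it contains a certain word. All the probabilities are invented numbers, chosen only to make the arithmetic easy to follow.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded over A and not-A."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 20% of emails are spam, the word appears in
# 60% of spam emails and in 5% of non-spam emails.
p_spam_given_word = posterior(prior_a=0.2,
                              likelihood_b_given_a=0.6,
                              likelihood_b_given_not_a=0.05)
print(p_spam_given_word)
```

Here P(B) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16, so the posterior is 0.12 / 0.16, i.e. about 0.75.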

Random Forests

This classification algorithm builds an ensemble of decision trees, which helps in giving a more accurate prediction. Each decision tree has a vote for classifying the input variable and the majority class is assigned to the input. As multiple decision trees with different sets of features are involved, the overfitting problem can be reduced.

The steps for building random forest classifier are:

Select k random features, where k < m (m is the total number of features).
Build a decision tree on these k features by finding the best split node at each step.
Choose n, the number of decision tree classifiers to be created, and repeat steps 1 and 2 to create the n classification trees.
To predict an input variable, take votes from each decision tree and assign the class with maximum votes.

Random forests usually perform well in all kinds of classification problems. They are able to handle missing features and categorical features.
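The voting step can be sketched as below. The three "trees" are hand-written rules standing in for trained decision trees, and the feature names are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Ask every tree for its vote and return the majority class."""
    return majority_vote([tree(sample) for tree in trees])

# Stand-ins for trained trees, each looking at a different feature.
trees = [
    lambda s: "spam" if s["links"] > 2 else "not-spam",
    lambda s: "spam" if s["caps_ratio"] > 0.5 else "not-spam",
    lambda s: "spam" if s["length"] < 20 else "not-spam",
]

sample = {"links": 5, "caps_ratio": 0.7, "length": 120}
prediction = forest_predict(trees, sample)
print(prediction)  # spam (2 votes against 1)
```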

KNN

KNN stands for K nearest neighbor. Here ‘K’ is the number of nearest neighbors which are to be considered while classifying input variable. The k nearest neighbors are identified and the majority class is assigned to the input variable.

KNN algorithm

To identify the class of the small circle with K=3, the 3 nearest points are taken into consideration. As red cross marks are in the majority among them, the class of the small circle is set as red cross.
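A minimal KNN sketch in plain Python (math.dist needs Python 3.8+). The coordinates and the second class name are made up, as the original figure's data is not available.

```python
import math
from collections import Counter

def knn_predict(training_points, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(training_points,
                         key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled points: (coordinates, class).
training_points = [
    ((1.0, 1.0), "red cross"),
    ((1.5, 2.0), "red cross"),
    ((2.0, 1.0), "red cross"),
    ((5.0, 5.0), "blue circle"),
    ((6.0, 5.5), "blue circle"),
]

predicted = knn_predict(training_points, (1.8, 1.6), k=3)
print(predicted)  # red cross
```

Euclidean distance is used here, but other distance measures can be plugged in the same way.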

Phew! If you managed to read the entire article and come till this point, you are doing fantastic 😄. This post was a brief explanation of classification algorithms. You will surely have to dig into the math behind these algorithms to get a better understanding. In case you have any feedback or points to add, feel free to comment. The next post will be a classification project similar to what we did for regression. Stay tuned for more!😸

Regression from scratch - Wine quality prediction

Apoorva Dave — Sat, 23 Feb 2019 06:16:30 +0000

In our previous posts, we covered the basics of machine learning and types of regression. In this article, we will do our first Machine Learning project. This would give an idea of how we can implement regression on different datasets. It will take just an hour to set up, understand and code. So let’s get started! 😃

The task here is to predict the quality of red wine on a scale of 0–10 given a set of features as inputs. I have solved it as a regression problem using Linear Regression.

The dataset used is Wine Quality Data set from UCI Machine Learning Repository. You can check the dataset here

Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. And the output variable (based on sensory data) is quality (score between 0 and 10). Below is a screenshot of the top 5 rows of the dataset.

Top 5 rows of Wine Quality dataset

Dependencies

The code is in python. Other than this, please install the following libraries using pip.

Pandas: pip install pandas
matplotlib: pip install matplotlib
numpy: pip install numpy
scikit-learn: pip install scikit-learn

And that’s it! You are halfway through 😄. Next, follow the below steps in order to build a linear regression model in no time!

Approach

Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.

import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn import metrics 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

Read the data using pandas into a dataframe. To check the top 5 rows of the dataset, use df.head()

df = pd.read_csv('winequality-red.csv')
df.head()

Finding correlations between each attribute of dataset using corr()

# there are no categorical variables. each feature is a number. Regression problem. 
# Given the set of values for features, we have to predict the quality of wine. 
# finding correlation of each feature with our target variable - quality
correlations = df.corr()['quality'].drop('quality')
print(correlations)

Correlations between each attribute and target variable — quality

To draw a heatmap and get a detailed diagram of correlation, insert the below code.

sns.heatmap(df.corr())
plt.show()

Heatmap

Define a function get_features() which outputs only those features whose correlation is above a threshold value (passed as an input parameter to function).

def get_features(correlation_threshold):
    # keep the features whose absolute correlation with quality exceeds the threshold
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
    return high_correlations

Create two vectors, x containing input features and y containing the quality variable. In x, we get all the features except residual sugar. The threshold value can be increased if you want.

# taking features with correlation more than 0.05 as input x and quality as target variable y 
features = get_features(0.05) 
print(features) 
x = df[features] 
y = df['quality']

Create training and testing set using train_test_split. 25% of the data is used for testing and 75% for training. You can check the size of the dataset using x_train.shape

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=3)

Once the training and testing sets are created, it is time to build your Linear Regression model. You can simply use the built-in function to create a model and then fit to training data. Once trained, coef_ gives the values of the coefficients for each feature.

# fitting linear regression to training data
regressor = LinearRegression()
regressor.fit(x_train,y_train)
# this gives the coefficients of the 10 features selected above. 

print(regressor.coef_)

To predict the quality of wine with this model, use predict().

train_pred = regressor.predict(x_train)
print(train_pred)
test_pred = regressor.predict(x_test) 
print(test_pred)

Calculating Root mean squared error for training as well as testing set. The root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model and the values actually observed. The RMSE for training and test sets should be very similar if we have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that we’ve badly overfit the data.

# calculating rmse (scikit-learn expects the true values first, then the predictions)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print(train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print(test_rmse)
# rounding off the predicted values for test set
predicted_data = np.round(test_pred)
print(predicted_data)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))
# displaying coefficients of each feature
coefficients = pd.DataFrame(regressor.coef_, features)
coefficients.columns = ['Coefficient']
print(coefficients)

Coefficients of each feature

These numbers mean that, holding all other features fixed, a 1 unit increase in sulphates will lead to an increase of 0.8 in the quality of wine.
Likewise, holding all other features fixed, a 1 unit increase in volatile acidity will lead to a decrease of 0.99 in the quality of wine, and similarly for the other features.

Thus, with few lines of code, we were able to build a Linear regression model to predict the quality of wine with RMSE scores of 0.65 and 0.63 for training and testing set respectively. This is just an idea to help you start with regression. You can play with the threshold value, other regression models and try feature engineering as well 😍.

To get the entire code, please use this link to my repository. The dataset is also uploaded :) Clone the repository and run the notebook to see the results.

The next articles would be on Classification and a similar small project on it. Stay tuned for more! Till then happy learning 😸

Regression in Machine Learning - Part 2

Apoorva Dave — Sun, 17 Feb 2019 08:55:27 +0000

In our earlier post, we discussed Machine Learning, its types and a few important terminologies. Here we are going to talk about Regression. Regression models are used to predict a continuous value. Predicting the price of a house given features of the house such as its size is one of the common examples of Regression. It is a supervised technique (where we have labelled training data).

Types of Regression

Simple Linear Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression

Simple Linear Regression

This is one of the most common and interesting types of Regression technique. Here we predict a target variable Y based on the input variable X. A linear relationship exists between the target variable and predictor and so comes the name Linear Regression.

Consider predicting the salary of an employee based on his/her age. We can easily see that there seems to be a correlation between an employee’s age and salary (the higher the age, the higher the salary). The hypothesis of linear regression is Y = a + bX.

Y represents salary, X is employee’s age and a and b are the coefficients of the equation. So in order to predict Y (salary) given X (age), we need to know the values of a and b (the model’s coefficients).

Linear Regression

While training and building a regression model, it is these coefficients which are learned and fitted to training data. The aim of training is to find the best fit line such that cost function is minimized. The cost function helps in measuring the error. During the training process, we try to minimize the error between actual and predicted values and thus minimizing the cost function.

In the figure, the red points are the actual data points and the blue line is the predicted line for this training data. To get the predicted value, these data points are projected on to the line.

To summarize, our aim is to find such values of coefficients which will minimize the cost function. The most common cost function is Mean Squared Error (MSE) which is equal to the average squared difference between an observation’s actual and predicted values. The coefficient values can be calculated using a Gradient Descent approach which will be discussed in detail in later articles. To give a brief understanding, in Gradient descent we start with some random values of coefficients, compute the gradient of cost function on these values, update the coefficients and calculate the cost function again. This process is repeated until we find a minimum value of cost function.
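To give a feel for the loop described above, here is a minimal gradient descent sketch for Y = a + bX with MSE as the cost function. The data, learning rate and number of steps are my own picks for illustration, and the coefficients start at zero rather than at random values so that the result is reproducible.

```python
def mean_squared_error(xs, ys, a, b):
    """Average squared difference between actual and predicted values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit Y = a + bX by repeatedly stepping against the gradient of the MSE."""
    a, b = 0.0, 0.0  # starting values (zero here for reproducibility)
    n = len(xs)
    for _ in range(steps):
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        grad_a = 2 * sum(errors) / n                             # d(MSE)/da
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n  # d(MSE)/db
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Toy data generated from y = 1 + 2x, so we know the answer in advance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a, b = gradient_descent(xs, ys)
print(round(a, 4), round(b, 4), mean_squared_error(xs, ys, a, b))
```

If the learning rate is too large the updates overshoot and the cost grows instead of shrinking, so it usually needs some tuning.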

Polynomial Regression

In polynomial regression, we transform the original features into polynomial features of a given degree and then apply Linear Regression on them. The above linear model Y = a+bX is transformed into something like Y = a + bX + cX².

It is still a linear model, because it is linear in the coefficients a, b and c, but the curve is now quadratic rather than a line. Scikit-Learn provides the PolynomialFeatures class to transform the features.

Polynomial Regression

If we increase the degree to a very high value, the curve becomes overfitted as it learns the noise in the data as well.
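The idea of "transform the features, then apply a linear model" can be sketched without any library. The helper names and the coefficient values below are hypothetical. In practice Scikit-Learn's PolynomialFeatures does the transformation and the coefficients are learned.

```python
def polynomial_features(x, degree):
    """Expand a single feature x into [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]

def predict(coefficients, x):
    """Linear model applied on the polynomial features of x."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(c * f for c, f in zip(coefficients, features))

coefficients = [1.0, 2.0, 0.5]  # hypothetical a, b, c in Y = a + bX + cX^2
print(polynomial_features(2.0, 2))  # [1.0, 2.0, 4.0]
print(predict(coefficients, 2.0))   # 1 + 2*2 + 0.5*4 = 7.0
```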

Support Vector Regression

In SVR, we identify a hyperplane with maximum margin such that the maximum number of data points are within that margin. SVR is very similar to the SVM classification algorithm. We will discuss the SVM algorithm in detail in my next article.

Instead of minimizing the error rate as in simple linear regression, we try to fit the error within a certain threshold. Our objective in SVR is to basically consider the points that are within the margin. Our best fit line is the hyperplane that has the maximum number of points within its margin.

Data points within the boundary line
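The "error within a certain threshold" idea is usually expressed with an epsilon-insensitive loss: errors smaller than the threshold epsilon cost nothing. Below is a small sketch with an assumed line and made-up data. It only evaluates a given line and does not train an SVR.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the margin, grows linearly with the error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

def points_within_margin(xs, ys, slope, intercept, epsilon):
    """Count the points whose error against the line is at most epsilon."""
    return sum(1 for x, y in zip(xs, ys)
               if abs(y - (slope * x + intercept)) <= epsilon)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.4, 8.0]  # made-up observations around y = 2x

inside = points_within_margin(xs, ys, slope=2.0, intercept=0.0, epsilon=0.25)
print(inside)  # 3 of the 4 points lie within the margin
```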

Decision Tree Regression

Decision trees can be used for classification as well as regression. In decision trees, at each level we need to identify the splitting attribute. In case of regression, the ID3 algorithm can be used to identify the splitting node by reducing standard deviation (in classification information gain is used).

A decision tree is built by partitioning the data into subsets containing instances with similar values (homogenous). Standard deviation is used to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.

The steps for finding the splitting node are briefly described below:

Calculate standard deviation of target variable using below formula.

Standard Deviation

Split the dataset on different attributes and calculate standard deviation for each branch (standard deviation for target and predictor). This value is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

The attribute with the largest standard deviation reduction is chosen as the splitting node.

The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.

To avoid overfitting, the Coefficient of Variation (CV, the standard deviation divided by the mean) is used to decide when to stop branching. Finally the average of each branch is assigned to the related leaf node (in regression the mean is taken, whereas in classification the mode of the leaf node is taken).
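The standard deviation reduction from the steps above can be computed as in this sketch. The house data and attribute names are hypothetical, and the population standard deviation (statistics.pstdev) is assumed.

```python
import statistics

def std_reduction(rows, attribute, target):
    """Standard deviation before the split minus the weighted one after it."""
    before = statistics.pstdev([row[target] for row in rows])
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row[target])
    after = sum(len(values) / len(rows) * statistics.pstdev(values)
                for values in branches.values())
    return before - after

# Hypothetical toy data: size explains the price far better than garden.
houses = [
    {"size": "small", "garden": "yes", "price": 100},
    {"size": "small", "garden": "no", "price": 110},
    {"size": "large", "garden": "yes", "price": 300},
    {"size": "large", "garden": "no", "price": 310},
]

reductions = {attribute: std_reduction(houses, attribute, "price")
              for attribute in ("size", "garden")}
best_attribute = max(reductions, key=reductions.get)
print(best_attribute)  # size
```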

Random Forest Regression

Random forest is an ensemble approach where we take into account the predictions of several regression decision trees.

Select K random data points from the training set.
Build a regression decision tree on these K points.
Choose n, the number of decision tree regressors to be created, and repeat steps 1 and 2 to create the n regression trees.
The average of each branch is assigned to the leaf node in each decision tree.
To predict the output for a variable, the average of the predictions of all the decision trees is taken.

Random Forest prevents overfitting (which is common in decision trees) by creating random subsets of the features and building smaller trees using these subsets.
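A heavily simplified sketch of the sampling and averaging steps is given below. To keep it short, each "tree" is a single leaf that predicts the mean of its random sample, which is an assumption of this sketch. A real random forest grows full regression trees on each sample.

```python
import random
import statistics

def bootstrap_sample(rows, rng):
    """Pick len(rows) random points from the training set, with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_single_leaf_tree(sample):
    """Stand-in for a regression tree: one leaf holding the mean target value."""
    leaf_value = statistics.mean([y for _, y in sample])
    return lambda x: leaf_value

def forest_predict(trees, x):
    """Average the predictions of all the trees."""
    return statistics.mean([tree(x) for tree in trees])

rng = random.Random(42)  # fixed seed so the run is repeatable
training_data = [(1.0, 10.0), (2.0, 12.0), (3.0, 15.0), (4.0, 19.0)]  # made-up (x, y)

n_trees = 10
trees = [fit_single_leaf_tree(bootstrap_sample(training_data, rng))
         for _ in range(n_trees)]
prediction = forest_predict(trees, 2.5)
print(prediction)
```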

The above explanation is a brief overview of each regression type. You might have to dig into it to get a clear understanding :) Do feel free to give inputs in the comments. This will help me to learn as well 😃. Thank you for reading my post and if you like it stay tuned for more. Happy Learning 😃

Beginning with Machine Learning - Part 1

Apoorva Dave — Wed, 13 Feb 2019 04:18:13 +0000

This question pops into the head of almost everyone who wants to play with this new technology. I myself wondered where I should begin, what I should cover and how I could learn quickly!

I am not here to give you a list of articles from where you can read or explore. But I will help you through it by giving a basic understanding of almost every important concept, so that you can dig deeper on your own. Let’s get started!

What is Machine Learning?
Important types of Machine Learning
Classification Algorithms
Regression Algorithms
Clustering
Cost Functions
Collinearity
PCA
Gradient Descent
Some projects on ML to help you get started

Image Credits : http://qingkaikong.blogspot.com/2017/04/machine-learning-10-funny-pictures.html

The above list of topics will be covered in almost 5 articles to help you start with ML.

What is Machine Learning?

In ML you learn from data, as simple as that. We don’t have to write any custom code specific to the problem. Instead, we feed data to algorithms and it builds its own logic based on the data.

Consider you want to identify which fruit is an apple and which is not. You cannot go on writing specific dimensions, color or size of your apple, as apples might look similar but they don’t have the exact same dimensions. This is one of the most basic use cases of ML. Here we will provide the algorithm with all types of apples, that is, a set of features of different types of apples. Our algorithm learns these features and classifies a fruit as an apple or not an apple!

Types of Machine Learning

Supervised: In this approach, we have a labeled dataset. Our model can learn from this labeled data and help in classification, prediction etc. In our above example of apples, when we provide our model with a set of features, each row of the dataset is labeled as to whether those features constitute an apple or not. Classification and Regression problems are supervised.
Unsupervised: Here we have an unlabeled dataset. That is we do not know what all features will constitute an apple. An example is clustering where we cluster or create groups of similar types of objects.
Reinforcement Learning: In this, the agent learns from the environment by interacting with it and receiving rewards for performing actions. It tries to move to a state by performing an action, and it learns from the positive or negative reward it receives for each action.

Before jumping on to classification and regression algorithms, I will list out a set of terms which will help us have a better understanding.

Model: People often get confused by the term model. It is simply an artifact that is created by the training process. You provide training data to a machine learning algorithm, the data is learned and we get a trained model.
Training and Testing Data: The data provided to the algorithm for learning is called the training set. The predictions are made on a separate dataset called testing data. It is on this data that we check the accuracy of our trained model.
Overfitting and Underfitting: A model is said to be overfitted if it learns the training data very well but is unable to generalize. That is, even though it gives good results on the training data, it does not provide good predictions on the test data. A model is said to be underfitted if it is unable to learn the training data itself 😄. The underfit model won’t perform well on the seen data, let alone the unseen one 😝.
Bias and Variance: Many people (including me) wondered what these errors actually mean. So in simple terms, Bias is the error which occurs because of making wrong assumptions. It results in an underfit model. We might make an assumption that data is linear but in fact, it is quadratic. This causes underfitting. On the other hand, Variance causes overfitting. It is due to the model’s excessive sensitivity to small variations in training data. There is always a trade-off between Bias and Variance, as reducing one error tends to increase the other.

There are numerous articles on ML which are better. But here it is an effort to consolidate all the important stuff as I also learn and develop my ML skills. Personally, I prefer using Python and Scikit-Learn. There are other languages and libraries like R, Keras, Tensorflow which we might explore as we go further.

Stay tuned for the next article in the series where we will learn about Regression Algorithms. Till then happy learning! 😃
