<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mansi Saxena</title>
    <description>The latest articles on Forem by Mansi Saxena (@saxenamansi).</description>
    <link>https://forem.com/saxenamansi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F669545%2Fcf56f138-9fb0-4b5f-9345-ccc5dc825935.jpeg</url>
      <title>Forem: Mansi Saxena</title>
      <link>https://forem.com/saxenamansi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/saxenamansi"/>
    <language>en</language>
    <item>
      <title>Dive into ML with this RoadMap!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Fri, 20 Aug 2021 19:23:11 +0000</pubDate>
      <link>https://forem.com/saxenamansi/roadmap-to-dive-into-the-world-of-machine-learning-5179</link>
      <guid>https://forem.com/saxenamansi/roadmap-to-dive-into-the-world-of-machine-learning-5179</guid>
      <description>&lt;p&gt;If you've been reading about the amazing advancements in the world of Artificial Intelligence and Machine Learning but feel overwhelmed by its complexity, this post is just for you! After reading this blog, you should have a clear understanding of how to embark on this journey of learning Machine Learning the right way, so stick with me till the end.&lt;/p&gt;

&lt;h2&gt;
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;First things first: what are the pre-requisites for learning Machine Learning? Knowing a programming language alone is not enough; you must also understand the mathematics behind each algorithm. The important topics to familiarize yourself with are: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linear Algebra &lt;/li&gt;
&lt;li&gt;Calculus&lt;/li&gt;
&lt;li&gt;Statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you do not have a mathematical background, &lt;a href="https://www.khanacademy.org/" rel="noopener noreferrer"&gt;Khan Academy&lt;/a&gt; is a good place to get started on the basics. The Coursera Specialization &lt;a href="https://www.coursera.org/specializations/mathematics-machine-learning" rel="noopener noreferrer"&gt;Mathematics for Machine Learning&lt;/a&gt; is also a good resource if you can devote long hours to MOOCs. Other resources are listed in the links below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://analyticsindiamag.com/7-top-linear-algebra-resources-for-machine-learning-beginners/" rel="noopener noreferrer"&gt;For Linear Algebra&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.analyticsvidhya.com/resource-statitics/" rel="noopener noreferrer"&gt;For Statistics&lt;/a&gt;&lt;br&gt;
&lt;a href="https://machinelearningmastery.com/calculus-books-for-machine-learning/" rel="noopener noreferrer"&gt;For Calculus&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are unfamiliar with Calculus, I recommend going through one of the books in the link above and learning the basics of differentiation and integration; they are essential on the path to becoming a Machine Learning expert. &lt;/p&gt;

&lt;h2&gt;
  Coding fundamentals
&lt;/h2&gt;

&lt;p&gt;Once you are confident with the math, shift your focus to coding. Many languages are used for writing Machine Learning programs, such as Python, R and Java. However, Python is the most recommended because its many libraries and frameworks simplify the task of writing complex code. Another reason I recommend Python is that far more Machine Learning tutorials are written in Python than in R, so it is easier for a Python programmer to get help from the data science community. That said, R is known for its excellent data visualization libraries, so there is no harm in learning both languages and using the best features of each. You can always learn one and move to the next; for a beginner, I recommend Python. &lt;/p&gt;

&lt;p&gt;There are several resources for learning basic Python, but my favorite is the Coursera specialization &lt;a href="https://www.coursera.org/specializations/python" rel="noopener noreferrer"&gt;Python for Everybody&lt;/a&gt; by Charles Russell Severance of the University of Michigan. If this does not suit you, you may try the other resources in this &lt;a href="https://mikkegoes.com/learn-python-online-best-resources/" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And voila, you finally have all the pre-requisites you need to get started on your journey into the world of Machine Learning!&lt;/p&gt;

&lt;h2&gt;
  Taking the first step
&lt;/h2&gt;

&lt;p&gt;The first step is to complete these two renowned courses. One, taught by none other than the great Andrew NG, covers the mathematics behind each Machine Learning algorithm; the other focuses on the programming side. You may choose to do them simultaneously. Take your time with them, as they lay the foundations for this field. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.coursera.org/learn/machine-learning?" rel="noopener noreferrer"&gt;Machine Learning by Andrew NG&lt;/a&gt;, Stanford University. You do not have to buy the course; you may audit it too. Just focus on watching all the videos in this course. There is also a YouTube playlist with all the videos in this course which I will link &lt;a href="https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you really are a ML-nerd, you'll be hooked on this course! (PS: if you do not know who Andrew NG is, google him RIGHT NOW, you won't regret it ;). &lt;/li&gt;
&lt;li&gt;The second course is &lt;a href="https://dev.toPython%20for%20Data%20Science%20and%20Machine%20Learning%20Bootcamp"&gt;Python for Data Science and Machine Learning Bootcamp&lt;/a&gt; by Jose Portilla. This course can be a little heavy, as it introduces you to all the major programming concepts used in Machine Learning. Take your time with it, and keep trying out the code and functions taught in the course yourself; just watching the videos will not help much until you get your hands dirty. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  Getting your hands dirty
&lt;/h2&gt;

&lt;p&gt;While coding, if you get stuck on an error you cannot solve, search for it on &lt;a href="https://stackoverflow.com/" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt;. Someone has almost certainly been in your shoes before and suffered through the same error. Read the answers and solutions that others have suggested. On the off chance that no one else has encountered your error, post your own question. Don't be shy; you'd be surprised at how beginner-friendly and helpful the data science community is. After all, everyone was once a beginner. &lt;/p&gt;

&lt;h2&gt;
  Applying what you learnt - Starting with some baby projects
&lt;/h2&gt;

&lt;p&gt;After completing these courses, you can safely say that you have a good understanding of the classical algorithms! You can now start working on some baby projects. Find datasets on Kaggle that interest you and put your newly learnt Python skills to use. You may also try reading code that other developers have written; some of it might be too complex, so do not be too hard on yourself if you cannot understand all of it. With each dataset you work with, you will pick up new functions and concepts: data cleaning, data augmentation, preprocessing, data encoding and so on. &lt;/p&gt;

&lt;p&gt;The code for some of the baby projects I made is on my GitHub. It should be easy to follow and not too complex. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/saxenamansi/HR_Analytics_Employee_Retention" rel="noopener noreferrer"&gt;HR Analytics Employee Retention using Logistic Regression&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Breast_Cancer_DecisionTree_Classifier" rel="noopener noreferrer"&gt;Breast Cancer Classification using Decision Trees&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Data_Cleaning_Preprocessing/blob/main/DataCleaningPreprocessing.ipynb" rel="noopener noreferrer"&gt;Cleaning Student Profile Data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Healthcare_dataset_pandas_preprocessing" rel="noopener noreferrer"&gt;Preprocessing and Cleaning Stroke Data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Recognizing_Hand_Written_Digits" rel="noopener noreferrer"&gt;Recognizing Hand Written Digits using PCA and SVM techniques&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/Credit_Card_Data_Clustering" rel="noopener noreferrer"&gt;Clustering Credit Card Data using Gaussian Mixtures and PCA&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/KMeans_Clustering_Of_GeoLocationsns" rel="noopener noreferrer"&gt;Clustering Geo-Locations using K-Means clustering&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/ImageProcessing_using_Numpy_Matplotlib" rel="noopener noreferrer"&gt;Using Numpy and Matplotlib for Image Processing&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Australian-fires%20(Visualisation).ipynb" rel="noopener noreferrer"&gt;Data Visualization of Australian Wildfires&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Mushroom%20classification%20-%20project.ipynb" rel="noopener noreferrer"&gt;Comparing the classification algorithms for Mushroom Classification&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/C4-projects/blob/master/CreditCard%20fraud%20-%20classification.ipynb" rel="noopener noreferrer"&gt;Comparing the classification algorithms for Credit Card Frauds&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/C4-projects/blob/master/Household%20-electricity-consumption.ipynb" rel="noopener noreferrer"&gt;Data Visualization and Comparing the classification algorithms for Household Electricity Consumption&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saxenamansi/MSTC_DataScience_Tasks/blob/master/Projects/Math_Portugese_course.ipynb" rel="noopener noreferrer"&gt;Data Visualization and Comparing the classification algorithms for grades of Maths and Portuguese class students&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would urge you to first try them yourself, and then check my code for reference. Whenever you come across a new function, read its documentation and see what it does. Make sure you understand all of it. &lt;/p&gt;

&lt;p&gt;And that's it! &lt;/p&gt;

&lt;p&gt;With this, you should now have a concrete understanding of the Machine Learning algorithms and how to use them. You should also be fairly acquainted with some data cleaning, data preprocessing and data visualization techniques. &lt;/p&gt;

&lt;h2&gt;
  Deep Learning - the path after Machine Learning
&lt;/h2&gt;

&lt;p&gt;If you have found the journey up to this point interesting, you may dive into Deep Learning as well. The best way to do so is with this in-depth &lt;a href="https://www.coursera.org/specializations/deep-learning" rel="noopener noreferrer"&gt;Deep Learning specialization by Andrew NG&lt;/a&gt;. It requires some dedication, as it consists of 5 courses, but it is very thorough and you will not need any material beyond it. &lt;/p&gt;

&lt;h2&gt;
  Your path from here to becoming a Data Scientist
&lt;/h2&gt;

&lt;p&gt;When you start down the path of Data Science, be aware that in this domain, learning never stops. Once you complete the above specialization, you can continue by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Participating in Kaggle competitions.&lt;/li&gt;
&lt;li&gt;Reading the best research papers in the topics that interest you.&lt;/li&gt;
&lt;li&gt;Doing more MOOCs on Coursera. I would recommend the courses from DeepLearning.AI.&lt;/li&gt;
&lt;li&gt;Working on your own projects. Try developing them into products for everyday users, or publishing your work in a reputed journal.&lt;/li&gt;
&lt;li&gt;Sharing your knowledge with the world: help other beginners on Stack Overflow and write blogs.&lt;/li&gt;
&lt;li&gt;Pushing your work to GitHub for others to learn from.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do like this post if it helped you. If you have any other suggestions or recommendations, let me know in the comments below.&lt;/p&gt;

&lt;p&gt;Happy Learning! &amp;lt;3 &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Java Question Bank with Solutions</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Wed, 11 Aug 2021 07:09:42 +0000</pubDate>
      <link>https://forem.com/saxenamansi/java-question-bank-with-solutions-30o3</link>
      <guid>https://forem.com/saxenamansi/java-question-bank-with-solutions-30o3</guid>
      <description>&lt;p&gt;Want to practice those newly learned Java concepts, but do not have a question bank with solutions? Look no further! &lt;a href="https://github.com/saxenamansi/Java-Beginner-To-Intermediate" rel="noopener noreferrer"&gt;This&lt;/a&gt;  GitHub repository has it all!&lt;/p&gt;

&lt;p&gt;The concepts covered are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic Java Questions&lt;/li&gt;
&lt;li&gt;Array Questions&lt;/li&gt;
&lt;li&gt;String Questions&lt;/li&gt;
&lt;li&gt;Object Oriented Questions - Classes and Objects&lt;/li&gt;
&lt;li&gt;Object Oriented Questions - Interfaces, Inheritance, Abstract Classes, Packages&lt;/li&gt;
&lt;li&gt;Exception Handling Questions &lt;/li&gt;
&lt;li&gt;Multi-Threading Questions&lt;/li&gt;
&lt;li&gt;File Handling Questions&lt;/li&gt;
&lt;li&gt;Collections Questions- ArrayList&lt;/li&gt;
&lt;li&gt;JDBC Questions (with concepts)&lt;/li&gt;
&lt;li&gt;JavaFX Questions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If these solutions help you, let me know, and reach out to me with any questions or doubts. Happy Learning!&lt;/p&gt;

</description>
      <category>java</category>
      <category>learning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Starting a beginner-friendly Machine Learning Series!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Thu, 29 Jul 2021 08:00:06 +0000</pubDate>
      <link>https://forem.com/saxenamansi/starting-a-machine-learning-series-4ihh</link>
      <guid>https://forem.com/saxenamansi/starting-a-machine-learning-series-4ihh</guid>
      <description>&lt;p&gt;Planning to start a Machine Learning and Deep Learning series, 1-2 posts every week.&lt;/p&gt;

&lt;p&gt;Some of the topics I plan on covering are -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why learn it?&lt;/li&gt;
&lt;li&gt;Curated list of the best resources.&lt;/li&gt;
&lt;li&gt;Basic installations.&lt;/li&gt;
&lt;li&gt;Numpy and pandas basics.&lt;/li&gt;
&lt;li&gt;Matplotlib and Seaborn basics.&lt;/li&gt;
&lt;li&gt;Data Modelling.&lt;/li&gt;
&lt;li&gt;Linear Regression.&lt;/li&gt;
&lt;li&gt;Classification algorithms.&lt;/li&gt;
&lt;li&gt;Clustering algorithms.&lt;/li&gt;
&lt;li&gt;Model evaluation.&lt;/li&gt;
&lt;li&gt;Bias and Variance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me know if you want me to cover some specific topics!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Make the first step towards that project!</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Tue, 27 Jul 2021 05:01:44 +0000</pubDate>
      <link>https://forem.com/saxenamansi/what-s-a-project-idea-you-ve-had-on-your-mind-for-quite-some-time-but-haven-t-quite-been-able-to-start-yet-162e</link>
      <guid>https://forem.com/saxenamansi/what-s-a-project-idea-you-ve-had-on-your-mind-for-quite-some-time-but-haven-t-quite-been-able-to-start-yet-162e</guid>
      <description>&lt;p&gt;What's that project idea you've had on your mind for quite some time, but haven't quite been able to start yet?&lt;/p&gt;

&lt;p&gt;I know I've had quite a few ideas lurking in the back of my mind every now and then. &lt;/p&gt;

&lt;p&gt;So, here's your reminder to take your first step towards it. Jot down those ideas, set a date and time and get started! 🥂&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Text preprocessing and email classification using basic Python only</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Mon, 26 Jul 2021 20:19:09 +0000</pubDate>
      <link>https://forem.com/saxenamansi/classifying-spam-emails-using-basic-python-2m70</link>
      <guid>https://forem.com/saxenamansi/classifying-spam-emails-using-basic-python-2m70</guid>
      <description>&lt;p&gt;Classifying emails as spam and non spam? Isn't that the "hello world" of Natural Language Processing? Hasn't every other developer worked on it?&lt;/p&gt;

&lt;p&gt;Well, yes. But what about writing the code from scratch, without using inbuilt libraries? This blog is for those who have used the inbuilt Python libraries but aren't quite sure what goes on behind them. Find the full code &lt;a href="https://github.com/saxenamansi/Email_Classification/blob/main/Emailclassifier_without_nltk.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. After reading this blog, you will have a better understanding of the entire pipeline. So let's jump right in!&lt;/p&gt;

&lt;p&gt;The basic steps in this problem are -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing the emails&lt;/li&gt;
&lt;li&gt;Finding a list of all the unique words in the emails&lt;/li&gt;
&lt;li&gt;Extracting feature vectors for each email &lt;/li&gt;
&lt;li&gt;Applying a Naive Bayes Classifier (using an inbuilt library)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For demonstration purposes, I have made a basic dataset. Spam emails are labelled positive and the rest negative: &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g55boe0nuqcj2upo6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g55boe0nuqcj2upo6z.png" alt="dataset" width="727" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, read the emails and store them in a list, as shown below using Python's csv reader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;emaildataset.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
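&lt;p&gt;To make the snippet above self-contained, here is a minimal stand-in for emaildataset.csv, written out and read back with the same loop (the two rows are invented purely for illustration):&lt;/p&gt;

```python
import csv

# hypothetical two-row stand-in for emaildataset.csv
with open('emaildataset.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ["Congratulations, you've won a prize", 'positive'],
        ['Meeting moved to 3 pm', 'negative'],
    ])

# read each email back as a (text, label) tuple
emails = []
with open('emaildataset.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        emails.append((row[0].strip(), row[1].strip()))

print(emails[1])  # ('Meeting moved to 3 pm', 'negative')
```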



&lt;h2&gt;
  1. Preprocessing
&lt;/h2&gt;

&lt;p&gt;We can now move on to the preprocessing stage. The emails are first converted to lowercase and then split into tokens. We then apply three basic preprocessing steps to the tokens: punctuation removal, stopword removal and stemming. Let us go over these in detail.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Punctuation Removal&lt;/em&gt;&lt;br&gt;
This step removes all punctuation from a string, which we do using Python's string method replace(). The function below takes a string as input, replaces each punctuation mark with an empty string, and returns the string without punctuation. More punctuation marks can be added to the list, or a regex can be used instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;punctuations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;punc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;punctuations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;punc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
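&lt;p&gt;A quick sanity check of this function (repeated here so the snippet runs on its own; the sample string is made up):&lt;/p&gt;

```python
def punctuation_removal(data_string):
    # strip a small, hard-coded set of punctuation marks
    punctuations = [",", ".", "?", "!", "'", "+", "(", ")"]
    for punc in punctuations:
        data_string = data_string.replace(punc, "")
    return data_string

print(punctuation_removal("congratulations, you've been selected!"))
# congratulations youve been selected
```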



&lt;p&gt;&lt;em&gt;Stopword Removal&lt;/em&gt;&lt;br&gt;
This step removes the common words that make a sentence grammatically correct without adding much meaning. The function below takes a list of tokens, checks each one against a specified list of stopwords, and returns the list with the stopwords removed. More stopwords can be added to the list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;stopwords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;am&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;was&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
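&lt;p&gt;For example, with the stopword list above (the input tokens are made up):&lt;/p&gt;

```python
def stopword_removal(tokens):
    # drop common words that carry little meaning on their own
    stopwords = ['of', 'on', 'i', 'am', 'this', 'is', 'a', 'was']
    filtered_tokens = []
    for token in tokens:
        if token not in stopwords:
            filtered_tokens.append(token)
    return filtered_tokens

print(stopword_removal(['this', 'is', 'a', 'free', 'offer']))  # ['free', 'offer']
```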



&lt;p&gt;&lt;em&gt;Stemming&lt;/em&gt;&lt;br&gt;
This is the last step in the preprocessing pipeline. Here, we convert our tokens into their base form: words like "eating", "ate" and "eaten" are converted to "eat". For this, we use a Python dictionary whose keys are base-form tokens and whose values are lists of the word's other forms, e.g. {"eat": ["ate", "eaten", "eating"]}. (Strictly speaking, this dictionary lookup is a simple form of lemmatization rather than true stemming.) This helps normalize the words in our data/corpus. &lt;/p&gt;

&lt;p&gt;We check each token against these lists of non-base forms; if a token appears in one, the corresponding base form is used instead. This is demonstrated in the function below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;root_to_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;you have&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;youve&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;selection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;it is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;its&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;move&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;moving&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photos&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;base_form&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_list&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root_to_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;token_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
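&lt;p&gt;A self-contained check with a trimmed-down dictionary. Note the break, and the else attached to the for loop rather than the if: together they ensure each token is appended exactly once, whether or not it matches a variant.&lt;/p&gt;

```python
def stemming(filtered_tokens):
    # maps a base form to its known variants (trimmed for illustration)
    root_to_token = {'select': ['selected', 'selection'],
                     'photo': ['photos']}
    base_form_tokens = []
    for token in filtered_tokens:
        for base_form, token_list in root_to_token.items():
            if token in token_list:
                base_form_tokens.append(base_form)
                break  # matched: move on to the next token
        else:  # no variant matched: keep the token as-is
            base_form_tokens.append(token)
    return base_form_tokens

print(stemming(['selected', 'photos', 'won']))  # ['select', 'photo', 'won']
```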



&lt;p&gt;Now, using the functions defined above, we form a main preprocessing pipeline, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Finding unique words
&lt;/h2&gt;

&lt;p&gt;After the emails are converted to lists of tokens in their base form, free of punctuation and stopwords, we apply the set() function to keep only the unique words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_form_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Extracting feature vectors
&lt;/h2&gt;

&lt;p&gt;We define each feature vector to be of the same length as the list of unique words. For each unique word, if it is present in the particular email, a 1 is added to the vector; otherwise, a 0 is added. For example, for the email "Hey, it's betty!" with the list of unique words being ["hello", "hey", "sandwich", "i", "it's", "show"], the feature vector is [0, 1, 0, 0, 1, 0]. Note that "betty" is not present in the list of unique words, so it is ignored in the final result. &lt;/p&gt;

&lt;p&gt;This is demonstrated in the code snippet below, where the feature vector is a Python dictionary whose keys are the unique words and whose values indicate whether each word is present in the email. The label for each email is also stored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
&lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;#email[1] is the label for each email
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
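&lt;p&gt;The worked example above ("Hey, it's betty!") can be checked with a short plain-Python sketch; the word lists here are the hypothetical ones from the example, not real data:&lt;/p&gt;

```python
unique_words = ["hello", "hey", "sandwich", "i", "it's", "show"]
email_tokens = ["hey", "it's", "betty"]  # tokens of "Hey, it's betty!"

# Dict-style feature vector, matching the format used in the snippets here:
# each unique word maps to True/False depending on its presence in the email.
feature_vec = {word: word in email_tokens for word in unique_words}
print([int(v) for v in feature_vec.values()])  # → [0, 1, 0, 0, 1, 0]
```

Note that "betty" never enters the feature vector, since the keys come only from the unique-word list.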



&lt;p&gt;This way, we generate our training data. The complete pipeline up to this stage is given in the code snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;word_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feature_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
    &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
    &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Applying Naive Bayes Classifier
&lt;/h2&gt;

&lt;p&gt;The Naive Bayes Classifier is imported from the nltk module. We can now find the feature vector for any email (say, "test_features") and classify whether it is spam or not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete pipeline for testing is given below -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;word_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email_str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;punctuation_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;filtered_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stopword_removal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base_form_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stemming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_form_tokens&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, you now know the ins and outs of any basic natural language processing pipeline. Hope this helped!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>nlp</category>
      <category>preprocessing</category>
    </item>
    <item>
      <title>Logistic Regression at a glance</title>
      <dc:creator>Mansi Saxena</dc:creator>
      <pubDate>Wed, 21 Jul 2021 19:56:15 +0000</pubDate>
      <link>https://forem.com/saxenamansi/logistic-regression-at-a-glance-5h50</link>
      <guid>https://forem.com/saxenamansi/logistic-regression-at-a-glance-5h50</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Logistic Regression?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In problems where a discrete value (0, 1, 2, ...) is to be predicted based on some input values, Logistic Regression can be very handy. Examples of such problems are detecting whether a student will be selected for a graduate program based on their profile, or whether an Instagram account has been hacked based on its recent activity. These problems can be solved by "Supervised Classification Models", one of which is Logistic Regression. &lt;/p&gt;

&lt;p&gt;To build such a model, we need to supply it with some training data, i.e., samples of various input values and their corresponding discrete-valued outputs. The input can be defined in terms of several independent features on which the output depends. For instance, if we take the problem of predicting whether an Instagram account has been hacked, we can define independent features such as "activity time", "5 recent texts", "5 recent comments", "10 recently liked posts" and so on. Using this training data, the model essentially "learns" what the traits of a hacked Instagram account are, and uses this knowledge to make predictions on other accounts to check if they are hacked. &lt;/p&gt;

&lt;p&gt;However, you and I both know it is not that simple. So what goes on behind this black box?  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diving into the math!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let us set some notations. &lt;/p&gt;

&lt;p&gt;If we have "n" features and "m" training samples, they can be arranged in an "n*m" matrix consisting of training samples as column vectors horizontally stacked together as given in the image below. Let us call this matrix X. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rs0szss5ef4e569omgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rs0szss5ef4e569omgj.png" alt="Training Matrix" width="420" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has a corresponding vector which contains the discrete valued outputs for each training sample. It is a single column vector of dimension m*1. Let us call this vector Y. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfo7wgklqnf5dnqd8cot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfo7wgklqnf5dnqd8cot.jpg" alt="Alt Text Output Labels" width="176" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the notations set and out of the way, let's get to the heart of logistic regression! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The equations&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We first calculate the probability that the output value for a particular input is 1 (given that the set of output labels is {0, 1}), denoted as shown below - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4288biofb66tr6w6liw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4288biofb66tr6w6liw.png" alt="image" width="224" height="41"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;First, a hypothesis value Z is calculated by multiplying the transpose of a weight parameter W (a column vector of dimensions n*1) with the matrix X (of dimensions n*m), and then adding a bias parameter b (a row vector of dimension 1*m). This gives us Z, a row vector of dimension 1*m. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwau1diey64hlf7ulc4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwau1diey64hlf7ulc4q.png" alt="image" width="165" height="43"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Then, a non-linear activation function called the "sigmoid" is applied to Z to give us the predicted probability for that particular input set. It outputs a value between 0 and 1, as shown in the figure below. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ujn3m07be35vygbskj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ujn3m07be35vygbskj.png" alt="Sigmoid" width="235" height="154"&gt;&lt;/a&gt;&lt;br&gt;
The equation for the sigmoid function is - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q70xfmmkbdl0bivx4af.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q70xfmmkbdl0bivx4af.png" alt="image" width="200" height="77"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Thus, our final equation becomes - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53w95a1a5ce56in69jra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53w95a1a5ce56in69jra.png" alt="Alt Text" width="246" height="85"&gt;&lt;/a&gt;&lt;br&gt;
This gives us a row vector of dimension 1*m containing the predicted probabilities of the m training samples. When the probability is greater than 0.5, the sample is classified as output 1; otherwise, it is classified as output 0. &lt;/p&gt;
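&lt;p&gt;As a minimal sketch, the forward pass for a single sample can be written in plain Python (the weights here are hypothetical, chosen for illustration; a real implementation would vectorize this over the whole matrix X):&lt;/p&gt;

```python
import math

def sigmoid(z):
    # Squashes any real z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    # z = w . x + b for a single sample x; w and x are plain lists here.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical weights: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, sigmoid(0.1) ≈ 0.52
print(predict([1.0, 2.0], [0.5, -0.25], 0.1))  # → 1
```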

&lt;p&gt;Here, the parameters W and b are trained and set to optimal values that give the highest accuracy in predicting the probability that the output is 1. A loss value is calculated for each training example, and depending on the value, the parameters are adjusted to give better results and reduce this loss value. This is essentially what is referred to as "training" a model. A low loss value suggests that the model has been successfully trained (or that the model is overfitting, but that is a concept for another blog 😁). This loss value is calculated by the equation - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjclegn71q8et5h4ak3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjclegn71q8et5h4ak3k.png" alt="image" width="542" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, we see that - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx5f46or44l50visahwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx5f46or44l50visahwi.png" alt="image" width="478" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the loss function, we calculate the cost function, which averages the loss function values over all m training examples. It is calculated using the formula below - &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr0jctkf6x78j52lzsk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr0jctkf6x78j52lzsk6.png" alt="image" width="392" height="90"&gt;&lt;/a&gt; &lt;/p&gt;
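&lt;p&gt;A minimal plain-Python sketch of the cross-entropy loss and the averaged cost (the predicted probabilities below are made up for illustration):&lt;/p&gt;

```python
import math

def loss(y, a):
    # Cross-entropy loss for one example with true label y (0 or 1)
    # and predicted probability a: -(y*log(a) + (1-y)*log(1-a)).
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def cost(labels, probs):
    # Cost J: average of the per-example losses over all m training samples.
    return sum(loss(y, a) for y, a in zip(labels, probs)) / len(labels)

# Confident, correct predictions give a small cost; confident, wrong ones blow it up.
print(cost([1, 0], [0.9, 0.2]))  # ≈ 0.164
print(cost([1, 0], [0.1, 0.9]))  # ≈ 2.303
```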

&lt;p&gt;Now, to adjust the values of the parameters W and b, we use the famous gradient descent algorithm (which is also for another blog 😁). This formula is given below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwmaoyre9oqlj6wa2r54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwmaoyre9oqlj6wa2r54.png" alt="image" width="270" height="146"&gt;&lt;/a&gt;&lt;br&gt;
This formula comes from the gradient descent algorithm. Here, the parameter alpha is called the learning rate. A large learning rate causes large adjustments to the parameters, while a small learning rate causes smaller adjustments. It can be tuned according to our requirements. &lt;/p&gt;
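&lt;p&gt;One gradient-descent iteration can be sketched in plain Python as follows; the gradient formulas are the standard ones for logistic regression with the cross-entropy loss, and the toy dataset is invented for illustration:&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(X, Y, w, b, alpha):
    # For logistic regression, dJ/dw_j = (1/m) * sum_i (a_i - y_i) * x_ij and
    # dJ/db = (1/m) * sum_i (a_i - y_i), where a_i is the predicted probability.
    m = len(X)
    preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
    errors = [a - y for a, y in zip(preds, Y)]
    dw = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(len(w))]
    db = sum(errors) / m
    # Update rule: w := w - alpha * dw, b := b - alpha * db
    return [wj - alpha * dwj for wj, dwj in zip(w, dw)], b - alpha * db

# Toy data: feature 0 indicates class 1, feature 1 indicates class 0.
w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b = gradient_step([[1, 0], [0, 1]], [1, 0], w, b, alpha=0.5)
print(w)  # first weight pushed positive, second pushed negative
```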

&lt;p&gt;And voila! That wraps up one iteration of training our Logistic Regression Model! Connect enough of these units together, with slight modifications, and we get a neural network!&lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading this, thank you for reading till the end!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
