<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dulaj Prabasha</title>
    <description>The latest articles on Forem by Dulaj Prabasha (@jdprabasha).</description>
    <link>https://forem.com/jdprabasha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F553990%2F6672d0e9-0614-4f45-a939-79a4a74eb020.png</url>
      <title>Forem: Dulaj Prabasha</title>
      <link>https://forem.com/jdprabasha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jdprabasha"/>
    <language>en</language>
    <item>
      <title>The Chatbot Part 1 : Behind the curtains of NLP</title>
      <dc:creator>Dulaj Prabasha</dc:creator>
      <pubDate>Sat, 30 Apr 2022 10:54:45 +0000</pubDate>
      <link>https://forem.com/jdprabasha/the-chatbot-part-1-behind-the-curtains-of-nlp-e9d</link>
      <guid>https://forem.com/jdprabasha/the-chatbot-part-1-behind-the-curtains-of-nlp-e9d</guid>
      <description>&lt;p&gt;I started this blog series with the intention of documenting all my steps as I built a functional chatbot application. However, at the time of writing this article, I've already connected a working chatbot model served on a Flask backend to a React frontend. It ended up only taking a couple of days. &lt;/p&gt;

&lt;p&gt;My familiarity with React and the Client-Server model allowed me to connect the chatbot model and frontend together rather easily. However, I never bothered to comprehend the logic of building the chatbot itself. This is the biggest pitfall of following tutorials - you get caught up in reaching a finished product as soon as possible. Having GitHub Copilot autocomplete most of the code did not help matters either.&lt;/p&gt;

&lt;p&gt;In the days since, I've had time to reflect on the whole process. Although this article was initially meant to cover all the steps involved in building the chatbot model, doing so would have defeated the purpose of even going through all this. It is very important to separate the theory from the implementation. So we're starting off with the theory. I promise it'll be anything but boring!&lt;/p&gt;

&lt;h2&gt;
  
  
  The basics of NLP
&lt;/h2&gt;

&lt;p&gt;Natural Language Processing - NLP in short - is at the core of an AI-powered chatbot. A functioning chatbot application relies on the model's ability to comprehend the human language and its intricacies. &lt;/p&gt;

&lt;p&gt;Now, computers think and talk in numbers, you see. So how do you translate between human and machine language? This is where we get into some of the core concepts behind NLP - basically all the new things I discovered while trying to break down how the chatbot functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stemming vs Lemmatization
&lt;/h3&gt;

&lt;p&gt;Words are complicated things. Their meaning can change based on context. Take a word away from a sentence and nothing could change. Or everything could. As humans, we've become attuned to distinguishing between all the possible variations without a sliver of thought.&lt;/p&gt;

&lt;p&gt;But how can you teach a machine all these things? The short answer is - you don't. Instead, you teach it to act based on patterns. You help it understand how to talk to you. To respond to you based on the nature of your input. After all, that is the singular purpose a chatbot is built for, is it not?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgph8uxqj28aiqxf9cceh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgph8uxqj28aiqxf9cceh.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now if that is the case, we're free to abstract away all those fine details that make language as intricate as it is. Can't we reduce "generous" and "generosity" to, say, "gener" if all we're deriving from the word is the context it adds to a sentence? Provided we also deal with the possible confusion with words like "general", we indeed could. This is the heart of "Stemming" - you chop a word down to a root form, called its "stem".&lt;/p&gt;

&lt;p&gt;Lemmatization does basically the same thing. However, there is a key difference: rather than just chopping off suffixes, it uses a vocabulary to reduce a word to its dictionary base form - its "lemma". Rather than explain it all myself, I'd like to point you to this answer on Stack Overflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wmne8pfwy6svx3zb1mv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wmne8pfwy6svx3zb1mv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That last point is very important. Based on your use case, you're free to go with one or the other. I'll go over implementation when we get there.&lt;/p&gt;
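&lt;p&gt;To make the distinction concrete, here's a toy sketch - hand-rolled, not a real NLP library, and the suffix rules and mini lemma dictionary are purely illustrative - of what each approach does to a handful of words:&lt;/p&gt;

```python
# Toy illustration of stemming vs lemmatization. The suffix list and
# the lemma dictionary below are made up for demonstration only.

def toy_stem(word):
    """Crude rule-based stemming: chop a known suffix, no dictionary."""
    for suffix in ("osity", "ous", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization consults a vocabulary and returns a real dictionary word.
TOY_LEMMAS = {"generosity": "generous", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["generous", "generosity", "better", "running"]:
    # e.g. "generosity" -> stem "gener", but lemma "generous"
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

&lt;p&gt;Note how the stemmer happily produces non-words like "gener" and "runn", while the lemmatizer always returns an actual word. In practice you'd reach for something like nltk's PorterStemmer and WordNetLemmatizer rather than rolling your own.&lt;/p&gt;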

&lt;h3&gt;
  
  
  Bag of Words
&lt;/h3&gt;

&lt;p&gt;So say we've done our lemmatization (or stemming) on our sample data. Our model now has an idea of what to look for - a vocabulary. Do note that all it knows is what you've trained it on!&lt;/p&gt;

&lt;p&gt;But a chatbot will need to understand complete sentences. These sentences will likely contain multiple words that are not part of its current vocabulary.&lt;/p&gt;

&lt;p&gt;How do we deal with this? Again, remember that the goal is to get the model to assign context to sentences in some way or another - nothing more and nothing less. We need a way to do this with the limited vocabulary it has. This is where the concept of the "Bag of Words" (BoW) kicks in.&lt;/p&gt;

&lt;p&gt;As much as I'd prefer to leave the implementation details to a future post, it'd be rather hard to dive into the concept of BoW without doing so.&lt;/p&gt;

&lt;p&gt;A chatbot application - like the majority of NLP algorithms - performs what is known as classification. And that means exactly what it sounds like: you group an input into one of several possible classes (in our case "contexts") that are prepared in advance. For example, we can build a chatbot that understands only the contexts "greeting" and "farewell". Note that these contexts are purely arbitrary. Knowing this, we can move on to how a Bag of Words becomes intuitive in the scheme of things.&lt;/p&gt;

&lt;p&gt;So you have two things - your vocabulary and your classes, i.e. contexts. As we're dealing with supervised learning (we explicitly train the bot to understand its world), we also have the data we train it with: a set of sentences, each labelled with its context. As an example, say our vocabulary contains the lemmas "hi", "bye", "fly" and "bear", and our classes are "greeting" and "farewell".&lt;/p&gt;

&lt;p&gt;Given that we have labelled data, we take each sentence, count the occurrences of the vocabulary words that appear in it - i.e. we weight our parameters (the words) - and attach the label defined in the training data. Each sentence essentially boils down to a "bag of words".&lt;/p&gt;

&lt;p&gt;For example, say our training data contains the sentence "Hi! How are you this morning?" with the label "greeting". We notice that it contains one word from our vocabulary - "hi". So the sentence becomes the parameter "hi" with a frequency of 1, with the "greeting" class set to 1 i.e. "true", and the "farewell" class set to 0 i.e. "false". Essentially, we're reducing sentences to numbers while trying to retain a sense of context.&lt;/p&gt;
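&lt;p&gt;The worked example above can be sketched in a few lines of Python. The vocabulary, classes and naive tokenization here are simplifying assumptions for illustration, not the exact implementation we'll use later:&lt;/p&gt;

```python
# Bag-of-words sketch for the example above: a fixed vocabulary of
# lemmas and two arbitrary classes, "greeting" and "farewell".
VOCABULARY = ["hi", "bye", "fly", "bear"]
CLASSES = ["greeting", "farewell"]

def bag_of_words(sentence):
    """Turn a sentence into word counts over the known vocabulary."""
    tokens = sentence.lower().replace("!", " ").replace("?", " ").split()
    return [tokens.count(word) for word in VOCABULARY]

def one_hot_label(label):
    """Turn a class label into a 0/1 vector over the known classes."""
    return [1 if c == label else 0 for c in CLASSES]

# The labelled training sentence from the example:
features = bag_of_words("Hi! How are you this morning?")
target = one_hot_label("greeting")
print(features)  # [1, 0, 0, 0] -- "hi" occurs once, nothing else matches
print(target)    # [1, 0]       -- 1 for "greeting", 0 for "farewell"
```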

&lt;p&gt;Confused? You might want to read all that again. Or take a look at this super helpful video:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/UFtXy0KRxVI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Corpus
&lt;/h3&gt;

&lt;p&gt;Admittedly, knowing what &lt;em&gt;corpus&lt;/em&gt; means isn't critical to understanding how NLP or a chatbot works. However, it's one of those terms that tends to come up a lot in projects like this (and one of those fancy words you can casually throw around to hint that you REALLY know your stuff - regardless of whether you actually do!).&lt;/p&gt;

&lt;p&gt;Essentially, a corpus is your labelled text (or audio) data - the data you ultimately use to train your model. The term can also extend to the bodies of text used to lemmatize your words and sentences. Basically, anything that is a labelled body of text (or audio).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I've dissected, as best I understand it, the theory essential to understanding how an NLP-based chatbot works under the hood. In the following post, we can (finally) dive into the implementation and coding specifics. As always, feel free to let me know your thoughts on the article. And seeing as most of this was written from my own understanding of matters, please do correct me if I'm off anywhere!&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>The Chatbot : Another Foray into the Vast World of Machine Learning</title>
      <dc:creator>Dulaj Prabasha</dc:creator>
      <pubDate>Wed, 20 Apr 2022 15:29:48 +0000</pubDate>
      <link>https://forem.com/jdprabasha/the-chatbot-another-foray-into-the-vast-world-of-machine-learning-1f3k</link>
      <guid>https://forem.com/jdprabasha/the-chatbot-another-foray-into-the-vast-world-of-machine-learning-1f3k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Confession :&lt;/strong&gt; This is me trying to dig myself out of tutorial hell. I've lost count of the number of times I've wanted to try something - then start watching a tutorial - lose the plot halfway there - only to blindly follow along and type out commands wanting to look at a final product and feel like I've accomplished something. And that is rarely ever the case. Sound familiar? It's a pretty common problem I hear.&lt;/p&gt;

&lt;p&gt;So what am I doing here? Well right now I find myself wanting to be able to apply TensorFlow in my future projects. In order to achieve this hefty goal, I will be building a fully end-to-end Chatbot application that responds to user input. You can check out the progress I make here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/JDPrabasha" rel="noopener noreferrer"&gt;
        JDPrabasha
      &lt;/a&gt; / &lt;a href="https://github.com/JDPrabasha/chatbot" rel="noopener noreferrer"&gt;
        chatbot
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A chat application that allows you to chat with an AI-powered chatbot
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Chatbot Application&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;This repository contains code for a chatbot application that responds to user input, and is based on the YouTube tutorial by NeuralNine, which you can find here : &lt;a href="https://www.youtube.com/watch?v=1lwddP0KUEg" rel="nofollow noopener noreferrer"&gt;https://www.youtube.com/watch?v=1lwddP0KUEg&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Prerequisites&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;nltk&lt;/li&gt;
&lt;li&gt;TensorFlow&lt;/li&gt;
&lt;li&gt;Numpy&lt;/li&gt;
&lt;li&gt;Flask&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How to Run&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Run chatbot.py to start the Flask server&lt;/p&gt;
&lt;p&gt;Then cd into the Client folder and run&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;&lt;code&gt;npm install&lt;/code&gt;&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;to install the required dependencies.&lt;/p&gt;
&lt;p&gt;Then run&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;&lt;code&gt;npm start&lt;/code&gt;&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;to access the application on your local browser&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/JDPrabasha/chatbot" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;This is not the first time I will be using TensorFlow. Nor is it the first time I will be building a Web Application. Will I be following tutorials? Absolutely. &lt;/p&gt;

&lt;p&gt;The catch - this time I will be documenting all of my thought processes, problems faced, lessons learned and all those other things in-between. Essentially, if all goes to plan, this series of articles should allow anyone to speedrun their way through to achieving a similar goal. Why do it this way you ask? Well, if I can explain that entire process, that should mean all those things must have stuck with me yeah? Fingers crossed 🤞&lt;/p&gt;

&lt;p&gt;Here's the current plan :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Follow along with this tutorial :
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/1lwddP0KUEg"&gt;
&lt;/iframe&gt;

and build the model for the application.&lt;/li&gt;
&lt;li&gt;Connect the model to a Flask backend&lt;/li&gt;
&lt;li&gt;Connect the Flask application to a React frontend&lt;/li&gt;
&lt;li&gt;Add styling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And well that's all I can think of so far I'm afraid. I'll be wrapping up things here for now. Wish me luck!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tensorflow</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>RCB GM Simulator : Running an EDA on the IPL Dataset</title>
      <dc:creator>Dulaj Prabasha</dc:creator>
      <pubDate>Tue, 13 Apr 2021 15:43:14 +0000</pubDate>
      <link>https://forem.com/ucscmozilla/rcb-gm-simulator-running-an-eda-on-the-ipl-dataset-57jb</link>
      <guid>https://forem.com/ucscmozilla/rcb-gm-simulator-running-an-eda-on-the-ipl-dataset-57jb</guid>
      <description>&lt;p&gt;Remember how in the movie Moneyball (2011), Okland A's General Manager Billy Beane hires Economist Peter Brand to use data to change how the game's played? (Seriously though, if you haven't actually watched it, I strongly suggest you do so and come back!) Well, we're about to do the same thing - although operating in the realm of cricket this time. &lt;/p&gt;

&lt;p&gt;What we're about to do is called an Exploratory Data Analysis - EDA for short - a process of analysing datasets to summarise their main characteristics. To do this, we will be using &lt;a href="https://github.com/JDPrabasha/scikit/blob/main/matches.csv" rel="noopener noreferrer"&gt;this IPL Dataset&lt;/a&gt;. First off, let's see what we have shall we?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd 
import numpy as np
df=pd.read_csv("matches.csv")
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ubnzendv3j67ce6kp4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ubnzendv3j67ce6kp4s.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, that's a lot of information! Let's break it down using the Pandas Dataframe info() method :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw02tah3qgbjmyhukyg5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw02tah3qgbjmyhukyg5p.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As General Manager of the RCB, your job is to extract as much information as you can from the data available. And that involves asking the right questions. Looking closely at our dataset, how much of the data is actually useful? Certainly not the umpires. Let's drop 'em, shall we?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.drop(["umpire1", "umpire2", "umpire3"], inplace=True, axis=1)
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2escddn63ouplkxrva4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2escddn63ouplkxrva4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that is something we can work with! The thing about being the GM (and about Data Science in general), however, is that no one tells you what questions you need to ask. And asking the right questions can often be tougher than answering them. Here's a list of five questions that I think this dataset may be able to answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is it more advantageous to bat or field first in a given venue?&lt;/li&gt;
&lt;li&gt;Which venue is each team strongest at?&lt;/li&gt;
&lt;li&gt;Does the chasing team really have the edge in a match affected by rain?&lt;/li&gt;
&lt;li&gt;How crucial is winning the toss at a given venue?&lt;/li&gt;
&lt;li&gt;How is our track record versus different teams?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now then, let's explore our data on a question-by-question basis! (although not necessarily in that exact order 👻)&lt;/p&gt;

&lt;h1&gt;
  
  
  Q1) How crucial is winning the toss at a given venue?
&lt;/h1&gt;

&lt;p&gt;The cool thing about Pandas is how you can chain logic together to get your desired output less verbosely than in, say, SQL. Our dataset gives us the winner of each match. We also know who won the toss in every match. Putting two and two together, these two values must be equal whenever the team that won the toss also won the match. Finally, we need this data on a by-venue basis, and a percentage will be more useful to us than a raw number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[df["toss_winner"]==df["winner"]].value_counts("venue")/df.value_counts("venue")*100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc95up0pwrn2axvaql1b6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc95up0pwrn2axvaql1b6.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See how easy that was? Note that we used the value_counts() method to group our data after filtering, and then divided by the total matches played at each venue. This is another intuitive feature of Pandas - it can align similar data objects using their indices.&lt;/p&gt;

&lt;p&gt;We see that certain venues show a 100 percent dependence on the toss. However, if we look at how many matches were played at these venues, we get a much clearer picture: the more matches played, the closer that number approaches 50.&lt;/p&gt;
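&lt;p&gt;That sanity check is easy to script. Here's a sketch using a tiny made-up stand-in for the matches DataFrame (the real one, loaded from matches.csv, has the same columns among others):&lt;/p&gt;

```python
import pandas as pd

# Tiny synthetic stand-in for the matches data, for illustration only.
toy = pd.DataFrame({
    "venue":       ["Eden", "Eden", "Eden", "Eden", "Green Park"],
    "toss_winner": ["RCB", "RCB", "KKR", "KKR", "RCB"],
    "winner":      ["RCB", "KKR", "KKR", "RCB", "RCB"],
})

# Same computation as above, plus the raw match counts per venue.
toss_win_rate = (toy[toy["toss_winner"] == toy["winner"]].value_counts("venue")
                 / toy.value_counts("venue") * 100)
matches_played = toy.value_counts("venue")
print(toss_win_rate)   # Green Park looks 100% toss-dependent...
print(matches_played)  # ...but only one match was ever played there
```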

&lt;h1&gt;
  
  
  Q2) Is it more advantageous to bat or field first in a given venue?
&lt;/h1&gt;

&lt;p&gt;Now, the process of answering this question follows the same intuition as the first. Although this time, we have to perform a few tweaks to obtain the data in the format we want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fielding_wins=df[((df["toss_winner"]==df["winner"]) &amp;amp; (df["toss_decision"]=="field")) | ((df["toss_winner"]!=df["winner"]) &amp;amp; (df["toss_decision"]=="bat"))].value_counts("venue")
fielding_wins.sort_index(inplace=True)

batting_wins=df[((df["toss_winner"]==df["winner"]) &amp;amp; (df["toss_decision"]=="bat")) | ((df["toss_winner"]!=df["winner"]) &amp;amp; (df["toss_decision"]=="field"))].value_counts("venue")
missing_rows=pd.Series([0,0],index=["Green Park", "Holkar Cricket Stadium"])
batting_wins=pd.concat([batting_wins, missing_rows])
batting_wins.sort_index(inplace=True)

choices=fielding_wins&amp;gt;batting_wins

encoder={True:"Field", False:"Bat"}
choices.map(encoder)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbutq4b66al91jdqzosg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbutq4b66al91jdqzosg4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And there's our results! Although that seemed like a lot of code, the logic isn't that hard to follow. Following a top-down approach, we had to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare the fielding team wins to the batting team wins&lt;/li&gt;
&lt;li&gt;Ensure that the indices of both these sets lined up and had the same number of elements&lt;/li&gt;
&lt;li&gt;Save both subsets in different variables to perform operations on them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, we mapped the truth values to "Bat" and "Field" to make our lives easier. &lt;br&gt;
&lt;em&gt;( Confession: I did not do any of this in one go. Everything you see here packaged into one neat block of code was preceded by a lot of warnings, a lot of head-scratching and plenty of trial and error)&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Q3) Does the chasing team really have the edge in a match affected by rain?
&lt;/h1&gt;

&lt;p&gt;Let's use the power of Pandas to answer this question in one line of code! If you've been following along with the code so far, this one should be a cakewalk!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[(((df["toss_winner"]==df["winner"]) &amp;amp; (df["toss_decision"]=="field")) | ((df["toss_winner"]!=df["winner"]) &amp;amp; (df["toss_decision"]=="bat")))&amp;amp; (df["dl_applied"]==1)].size/df[df["dl_applied"]==1].size*100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dbb8u9q7vi9wt81a45k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dbb8u9q7vi9wt81a45k.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well there we go - the chasing team wins a whopping 81 percent of rain-affected matches!&lt;/p&gt;

&lt;h1&gt;
  
  
  Q4) Which venue is each team strongest at?
&lt;/h1&gt;

&lt;p&gt;Now this one takes a bit of thinking, as well as some domain knowledge. First off, what we require is a count of winner-venue pairs. Naturally, each team will perform best at one given venue - most commonly their home stadium. Counts of these pairs should appear at the top of a sorted list (barring anomalies such as teams that have rebranded multiple times). Using this intuition, we can answer our question with the following code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;team_venue_counts=df.groupby(["winner", "venue"]).agg("count")["id"]
team_venue_counts=team_venue_counts.sort_values(ascending=False)
team_venue_counts.head(9)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8402bz1rqmoxk2rhyxav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8402bz1rqmoxk2rhyxav.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that teams do indeed have the largest number of wins at their home stadium. However, a percentage would be a much better indicator of how strongly teams perform at home. That would require some splicing, appending and rearranging - of both the winner-venue pairs and the venue counts - and the complexity of that process lies beyond the scope of this post. Moving on!&lt;/p&gt;

&lt;h1&gt;
  
  
  Q5) How is our track record versus different teams?
&lt;/h1&gt;

&lt;p&gt;This one includes a few convoluted steps - primarily because our team appears both in the "team1" and "team2" columns. But all in all, the code is again pretty straightforward :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;as_team1_wins=df[(df["team1"]=="Royal Challengers Bangalore") &amp;amp; (df["winner"]=="Royal Challengers Bangalore")].value_counts("team2")
as_team1_games=df[(df["team1"]=="Royal Challengers Bangalore")].value_counts("team2")

as_team2_wins=df[(df["team2"]=="Royal Challengers Bangalore") &amp;amp; (df["winner"]=="Royal Challengers Bangalore")].value_counts("team1")
as_team2_games=df[(df["team2"]=="Royal Challengers Bangalore")].value_counts("team1")

(as_team1_wins+as_team2_wins)/(as_team1_games+as_team2_games)*100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwiculag5mbsu4le43pq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwiculag5mbsu4le43pq.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I have to admit that was a bit of a hack. The different variables used to store the data weren't exactly the same size - leading to those NaNs in between. However, it's still impressive that Pandas knew what to do with the rest of it!&lt;/p&gt;

&lt;p&gt;Secondly, we see that there are two separate entries for "Rising Pune Supergiant" and "Rising Pune Supergiants". This certainly deserves some cleaning up!&lt;/p&gt;

&lt;p&gt;Now it's your turn! Think you can fix those errors? As GM, what other insights might you obtain with this dataset? What other data may you need to augment it with in order to derive better insights? And last but not least, what may be the best methods to visualise our findings? &lt;/p&gt;
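&lt;p&gt;If you'd like a nudge on that first fix, one possible approach (sketched here on a tiny made-up frame with the same column names as our dataset) is to map the rebranded spelling onto a single canonical name before counting:&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in with the dataset's team columns, for illustration only.
toy = pd.DataFrame({
    "team1":  ["Rising Pune Supergiants", "Royal Challengers Bangalore"],
    "team2":  ["Royal Challengers Bangalore", "Rising Pune Supergiant"],
    "winner": ["Rising Pune Supergiants", "Rising Pune Supergiant"],
})

# replace() hits every column, so both spellings collapse into one.
toy = toy.replace("Rising Pune Supergiants", "Rising Pune Supergiant")
print(sorted(toy["winner"].unique()))  # a single entry for the Pune side
```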

&lt;p&gt;Any questions? Anything you think I might have done differently? Do feel free to let me know and I'd be more than happy to respond!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Machine Learning Diaries : The Andrew Ng Machine Learning Course In Review</title>
      <dc:creator>Dulaj Prabasha</dc:creator>
      <pubDate>Sun, 17 Jan 2021 11:48:55 +0000</pubDate>
      <link>https://forem.com/ucscmozilla/machine-learning-diaries-the-andrew-ng-machine-learning-course-in-review-20he</link>
      <guid>https://forem.com/ucscmozilla/machine-learning-diaries-the-andrew-ng-machine-learning-course-in-review-20he</guid>
      <description>&lt;p&gt;&lt;em&gt;A foreword from me from the future :  Are you looking to get started in Machine Learning?  Wondering whether the Stanford Machine Learning Course on Coursera by Andrew Ng is the best place to start? Look no further! In this blog, I will outline my journey through the course along with my thought process and guide to additional resources. Finally, tips on where you might go upon completion. Without further ado, lets rewind back time shall we?&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Week 0
&lt;/h1&gt;

&lt;p&gt;The date is the 21&lt;sup&gt;st&lt;/sup&gt; of November, 2020. I look around the internet for free resources to learn machine learning. A round of intense Googling turns up Stanford’s Machine Learning Course on Coursera as the top pick. Although the course is completely free, I apply for financial aid just because. I make plans to dedicate an hour each day to the course material while in the midst of a busy Freshman Year CS schedule. My background going into the course is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linear Algebra : High-School&lt;/li&gt;
&lt;li&gt;Statistics : High-School&lt;/li&gt;
&lt;li&gt;Programming : C, beginner-level Python&lt;/li&gt;
&lt;li&gt;Machine Learning : NULL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The date is the 23&lt;sup&gt;rd&lt;/sup&gt; of November. I start the course. &lt;/p&gt;

&lt;h1&gt;
  
  
  Week 1
&lt;/h1&gt;

&lt;p&gt;Okay, this first week seems pretty simple. I familiarize myself with the different types of machine learning and am introduced to the concepts of the cost function and gradient descent. High-school calculus makes this week feel like a breeze. Partial differentiation - although new to me - ends up being something I am able to grasp easily. The linear algebra content seems simple enough to skim.&lt;/p&gt;
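&lt;p&gt;For anyone skimming along, the week's big idea fits in a few lines. Here's a toy sketch - using a one-parameter cost function I made up, not the course's multi-feature version - of gradient descent walking downhill to minimize a cost:&lt;/p&gt;

```python
# Toy gradient descent on the made-up cost J(w) = (w - 3)**2,
# whose minimum sits at w = 3.

def gradient(w):
    return 2 * (w - 3)  # dJ/dw

w = 0.0             # initial guess
learning_rate = 0.1

for _ in range(100):
    w -= learning_rate * gradient(w)  # step against the gradient

print(round(w, 4))  # converges to the minimum at w = 3
```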

&lt;p&gt;&lt;em&gt;A note from future me : Dear me, I am proud of you for completing the first week of the course despite how outdated the slides and content may seem in terms of visual quality. You finally have a proper idea of the different types of machine learning and will later discover a third type called “Reinforcement Learning”, but alas that is for the future. For now you are good&lt;/em&gt; 😌 &lt;em&gt;. Also, you will be surprised at how crucial a role the concepts of cost and gradient descent will play in the future weeks and in your overall understanding of Machine Learning.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Week 2&lt;/h1&gt;

&lt;p&gt;Okay, so this seems largely similar to the content of Week 1, but with a broader scope. I like how it all flows naturally from Week 1 to Week 2. I see that we have a programming assignment this week. It seems intimidating but turns out to be pretty simple. Installing Octave (the language used for the course assignments, with Matlab as an alternative) proves straightforward. I complete the programming exercises via the text editor. Oh, and the submission facility seems very intuitive! Color me impressed!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note from future me: Dear me, again, proud of you for maintaining consistency. Although I wish you had not been so intimidated by the first assignment. It is VERY simple. It turns out VSCode has an Octave extension that simplifies the entire process - something you will discover in the following days.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2t7zlvnpnan2gnbxf17g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2t7zlvnpnan2gnbxf17g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Week 3&lt;/h1&gt;

&lt;p&gt;Oh, we’re finally doing something different - classification! Surprisingly, a lot of the concepts from the previous weeks carry over to this week as well. What’s this though - a new word? Sigmoid … such a peculiar name. The strangeness of the term makes the concept stick. Although I have trouble understanding classification in its entirety, I manage to complete the week.&lt;/p&gt;
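
&lt;p&gt;For the curious, the peculiarly named sigmoid is a one-liner - sketched here in Python rather than the course’s Octave:&lt;/p&gt;

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# In logistic regression the hypothesis sigmoid(theta' * x) is read as
# the probability that the example belongs to the positive class.
print(sigmoid(0))    # 0.5 - right on the decision boundary
print(sigmoid(6))    # about 0.998 - confidently positive
print(sigmoid(-6))   # about 0.002 - confidently negative
```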

&lt;p&gt;&lt;em&gt;A note from future me: Oh past me, I wish I could tell you about two things that might have really helped you connect everything together. The first, well, is this picture really:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc0zndgp3o6xfy6hhb0zo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc0zndgp3o6xfy6hhb0zo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;… which is pretty self-explanatory. The second would have been telling you to imagine a circular decision boundary where the circle expands based on the size of the feature x. This would have helped you get to Week 4 faster, without all the meandering and wondering whether you were really ready for it - before diving in nonetheless.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Week 4&lt;/h1&gt;

&lt;p&gt;Neural networks! I finally get to know what those are! I understand them as multiple layers of logistic regression, with the network selecting its own features. But I do not understand the hype behind their ubiquity in modern ML applications. For the first time, the programming assignment seems a lot harder, but by looking through the different tutorials and discussions related to the course, I manage to complete it.&lt;/p&gt;
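
&lt;p&gt;My mental model at the time - every unit being a tiny logistic regression fed by the previous layer’s outputs - can be sketched in a few lines of Python (the weights below are made up purely for illustration):&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each unit computes sigmoid(w . inputs + b), i.e. a tiny
    logistic regression over the previous layer's activations."""
    return [sigmoid(sum(w * a for w, a in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, 0.8]                                             # input features
hidden = layer(x, [[0.2, -0.4], [0.7, 0.1]], [0.0, -0.3])  # two hidden units
output = layer(hidden, [[1.5, -2.0]], [0.1])               # one output unit
print(output)   # a single activation between 0 and 1
```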

&lt;p&gt;&lt;em&gt;A note from future me: No, dear me, a neural network is not multi-step logistic regression. The sigmoid, as you will discover much later, is only one of many available activation functions. Alas, I am afraid you might complete the course with a very shallow understanding of neural networks. This video&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/He4t7Zekob0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;might have helped immensely, had I been able to show it to you.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Week 5&lt;/h1&gt;

&lt;p&gt;Okay, so it looks like we’re going into the deep end of how neural networks work! Backpropagation - a neural network calculating its errors and working backwards, it seems? So I’m guessing that’s the difference between neural networks and regular regression - they use backpropagation instead of gradient descent. Getting to the assignments this week - oh my, are they a mess! How do they deviate so much from how I’d understood it? I barely pass the assignment - and that only thanks to a very detailed guide in the resources section - although I have to admit my understanding is clearly lacking.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note from future me: Yes, past me - that week was indeed a mess no matter how you look at it! Unfortunately, that is one of the pitfalls of a course that is no longer actively maintained. And no, past me - it’s not that neural networks do not use gradient descent; rather, they use backpropagation to compute the gradients that gradient descent then follows. I really wish I could have directed you to this video&lt;/em&gt; &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/khUVIZ3MON8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;which I’ve found to explain backpropagation in the clearest way possible. It would have saved you all those hours of confusion. However, I am glad you stuck with the course, as that is one of only two weeks I’d describe as categorically bad.&lt;/em&gt;&lt;/p&gt;
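
&lt;p&gt;To make that relationship concrete, here is a minimal Python sketch of a single sigmoid unit trained on one toy example: backpropagation (here just the chain rule on one unit) produces the gradient, and gradient descent follows it.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid unit, one toy example, logistic (cross-entropy) loss.
x, y = 1.5, 1.0   # input and target
w, b = 0.0, 0.0   # parameters to learn
alpha = 0.5       # learning rate

for _ in range(200):
    a = sigmoid(w * x + b)        # forward pass
    delta = a - y                 # backward pass: d(loss)/dz for sigmoid + log loss
    w = w - alpha * delta * x     # gradient descent steps along the
    b = b - alpha * delta         # gradients that backpropagation computed

print(round(a, 3))   # a has climbed from 0.5 towards the target 1.0
```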

&lt;h1&gt;Week 6&lt;/h1&gt;

&lt;p&gt;I am very hesitant to start the new week after all the shock of the previous one. But what’s this - looks like we’re done with neural networks! That’s a relief! This week proves to be one of the most interesting, informative and intuitive in the entire course, dealing with the concept of fine-tuning your models. Relieved to have stuck through the hell that was Week 5, I go on to easily complete the week’s assignments - with some help from the resources section, of course.&lt;/p&gt;

&lt;h1&gt;Week 7&lt;/h1&gt;

&lt;p&gt;We move to an entirely different algorithm this week - Support Vector Machines. I have a hard time grasping the concept of kernels. Those ugly feelings from Week 5 are starting to pop up again - oh no! I completely fail the week’s quiz. Fortunately, however, I discover this video series&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/efR1C6CvhmE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;which does a really good job of explaining how kernel functions work. Although I cannot grasp all the math behind it, the intuition proves adequate to get through the quiz. Unfortunately, the programming assignment does not get any easier. A single error on one line costs me an entire day of debugging, as this week’s assignment takes much longer than usual to run in the terminal.&lt;/p&gt;
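
&lt;p&gt;The intuition that finally stuck with me - a Gaussian kernel as a similarity score against a landmark - fits in a few lines of Python:&lt;/p&gt;

```python
import math

def gaussian_kernel(x1, x2, sigma=1.0):
    """Similarity between two points: 1.0 when they coincide, decaying
    towards 0 as they move apart. sigma controls how fast it decays."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2 * sigma ** 2))

landmark = [3.0, 2.0]
print(gaussian_kernel([3.0, 2.0], landmark))   # 1.0 - sitting on the landmark
print(gaussian_kernel([6.0, 7.0], landmark))   # nearly 0 - far from it
```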

&lt;p&gt;&lt;em&gt;A note from future me: You can pat yourself on the back - you have survived the last hell week of the course. It’s smooth sailing from here on out!&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Week 8&lt;/h1&gt;

&lt;p&gt;This week I learn my first unsupervised learning algorithm, K-Means, and thankfully it proves to be pretty straightforward. I also learn about dimensionality reduction - another concept that is simple at face value yet unmistakably important - although I do not completely understand the math behind it. I begin to understand the importance of a better grasp of linear algebra for machine learning.&lt;/p&gt;
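
&lt;p&gt;K-Means really is as straightforward as it sounds: assign each point to its nearest centroid, move each centroid to the mean of its cluster, repeat. A minimal Python sketch on made-up 2-D points:&lt;/p&gt;

```python
import random

def kmeans(points, k, iters=20):
    """Minimal K-Means: alternate between assigning points to their
    nearest centroid and moving centroids to their cluster means."""
    centroids = [list(p) for p in random.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Move each centroid; keep it in place if its cluster is empty
        centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

random.seed(0)
blobs = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # one cluster near the origin
         (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # another near (5, 5)
print(sorted(kmeans(blobs, 2)))   # one centroid per blob
```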

&lt;p&gt;&lt;em&gt;A note from future me: You are absolutely correct, past me - a solid foundation in linear algebra is crucial to machine learning, and as you will discover in the future, so is statistics - especially when you eventually get to reinforcement learning. Not to worry though: your high-school linear algebra will prove adequate to finish off the course.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Week 9&lt;/h1&gt;

&lt;p&gt;I am nearing the end of the course. Accordingly, this week presents more general applications of machine learning - namely anomaly detection and recommender systems. Although the concepts seem simple, I have a tough time wrapping my head around recommender systems. I end up failing the quiz. After failing it a few more times, I discover this helpful video&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/n3RKsY2H-NE"&gt;
&lt;/iframe&gt;
&lt;br&gt;
 with which I finally manage to pass. The programming assignment for the week ends up being one of the shortest - I complete it in an hour, again with help from the resources section, of course.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note from future me: Look at that - that was your final programming assignment! Now then, let’s finish off the course!&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;Weeks 10 and 11&lt;/h1&gt;

&lt;p&gt;These prove to be the shortest weeks - each taking about an hour to complete. I finish the final two weeks in the span of two days. And would you look at that - there’s my certificate! I’m ecstatic to have something to show on LinkedIn for all that effort! The date is the 2&lt;sup&gt;nd&lt;/sup&gt; of January. I seem to have finished well ahead of schedule!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2rny1lvt0l2u07jcb98p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2rny1lvt0l2u07jcb98p.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note from future me: Congrats, me! Your future self is proud of you for sticking with it and maintaining consistency despite a busy schedule. (Although I have to say, the LinkedIn thing sounds kind of pathetic, doesn’t it? Ahh well, as long as it gets you going.)&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In the end, was it worth it? A course not updated since 2011, with little visual appeal. Assignments in a language barely used in modern machine learning. Deep learning and neural networks - at the forefront of machine learning today - merely glossed over, and with little clarity at that. Looking back now, how many stars out of five would I give it as a starting point for someone looking to enter the field of machine learning? A sparkling five stars, of course! Barring those relatively few inconveniences, I believe the course was some of the best use of my time: I can now confidently explore different avenues of machine learning on the solid foundation it laid. Few courses offer this level of theoretical understanding across so many aspects of the field. Believe the hype around this course - it’s real!&lt;/p&gt;

&lt;p&gt;So where to from here? It’s been about two weeks since I finished the course. From exploring the options out there, what I’ve found is that Kaggle’s plethora of mini-courses is the best way to answer that question for yourself!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feeax9ly7egu9kqmktqt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feeax9ly7egu9kqmktqt1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their Intro to Machine Learning, Intro to Deep Learning and Pandas courses are what I’ve found to be the best starting points. You will be amazed at how quickly you can pick up Python’s machine learning libraries - scikit-learn and TensorFlow + Keras - with your newfound foundational knowledge of machine learning!&lt;/p&gt;

&lt;p&gt;At the time of writing, I am trying out scikit-learn and Pandas on different datasets from across the internet, which you can find here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/JDPrabasha" rel="noopener noreferrer"&gt;
        JDPrabasha
      &lt;/a&gt; / &lt;a href="https://github.com/JDPrabasha/scikit" rel="noopener noreferrer"&gt;
        scikit
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Testing scikit-learn
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Scikit-learn Projects&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Welcome!&lt;br&gt;
&lt;br&gt;
This is where I store all of the datasets and notebooks I use to practice with scikit-learn, a free machine learning library for Python. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy (to quote Wikipedia).
&lt;br&gt;
&lt;br&gt;
All datasets are ones I've found after looking through the internet for specific datasets that would help me practice with a specific algorithm/set of algorithms, as well as gain experience with Pandas, a data manipulation and analysis library for Python
&lt;br&gt;
&lt;br&gt;
You are completely free to use the datasets as you please, as well as go through my notebooks to see how I've implemented various algorithms using particular datasets. Click through the links below to go through the associated notebooks.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/JDPrabasha/scikit/blob/main/SalaryPredictor.ipynb" rel="noopener noreferrer"&gt;Salary Predictor&lt;/a&gt;
&lt;br&gt;
&lt;a href="https://github.com/JDPrabasha/scikit/blob/main/HeartDisease.ipynb" rel="noopener noreferrer"&gt;Heart Disease Predictor&lt;/a&gt;
&lt;br&gt;
&lt;a href="https://github.com/JDPrabasha/scikit/blob/main/HousingPricePredictor.ipynb" rel="noopener noreferrer"&gt;Housing&lt;/a&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/JDPrabasha/scikit" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;I am also polishing up my linear algebra knowledge with this course on YouTube:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/aefKXYYXT6I"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;My next goal is to start fastai’s &lt;a href="https://www.fast.ai" rel="noopener noreferrer"&gt;Practical Deep Learning for Coders&lt;/a&gt; at the end of February. You can expect the next peek at my diary in a few months!&lt;/p&gt;

&lt;p&gt;Any questions? Fire away below!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
