<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Siddhant Dubey</title>
    <description>The latest articles on Forem by Siddhant Dubey (@siddhantdubey).</description>
    <link>https://forem.com/siddhantdubey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192696%2F66eea995-9790-41d6-842f-7bf0dbc3d2fa.jpeg</url>
      <title>Forem: Siddhant Dubey</title>
      <link>https://forem.com/siddhantdubey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/siddhantdubey"/>
    <language>en</language>
    <item>
      <title>I got GPT-3 To Make an Anime That Doesn't Exist!</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Mon, 05 Jul 2021 15:55:44 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/i-got-gpt-3-to-make-an-anime-that-doesn-t-exist-5hbc</link>
      <guid>https://forem.com/siddhantdubey/i-got-gpt-3-to-make-an-anime-that-doesn-t-exist-5hbc</guid>
      <description>&lt;p&gt;&lt;a href="https://youtu.be/ssr51BidQOs"&gt;https://youtu.be/ssr51BidQOs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>programming</category>
      <category>vscode</category>
    </item>
    <item>
      <title>An Introduction to Time Complexity and Big O Notation</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Fri, 19 Mar 2021 15:13:35 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/an-introduction-to-time-complexity-and-big-o-notation-12dj</link>
      <guid>https://forem.com/siddhantdubey/an-introduction-to-time-complexity-and-big-o-notation-12dj</guid>
      <description>&lt;p&gt;If you're interested in learning the basics of Time Complexity and Big-O Notation which are the fundamental aspects of algorithmic analysis, something that is really important for theoretical computer science check out this YouTube video I made. Additionally, knowing how to analyze the running time of algorithms helps you write efficient code.&lt;/p&gt;

&lt;p&gt;This video contains animated high-quality explanations of Time Complexity concepts along with multiple examples. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/-sx1fh_alAY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>programming</category>
      <category>algorithms</category>
      <category>tutorial</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>An Introduction to Cybersecurity, Capture the Flag Contests, and Basic Security Concepts</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Thu, 24 Dec 2020 20:32:12 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/an-introduction-to-cybersecurity-capture-the-flag-contests-and-basic-security-concepts-5ga2</link>
      <guid>https://forem.com/siddhantdubey/an-introduction-to-cybersecurity-capture-the-flag-contests-and-basic-security-concepts-5ga2</guid>
      <description>&lt;p&gt;Cybersecurity is important, there’s no dodging that fact. It is also nothing like the hacking that is shown in most popular media.&lt;/p&gt;

&lt;p&gt;However, that does not mean it isn’t interesting; it undoubtedly is. Due to this intrigue, lots of people want to dip their feet into cybersecurity, myself included, and I have found &lt;em&gt;capture the flag events&lt;/em&gt; (CTFs) to be a wonderful way to get a taste of the field.&lt;/p&gt;

&lt;p&gt;Now, by no means do CTFs accurately reflect the day-to-day work of a cybersecurity professional, but they are very educational, they help people develop their cybersecurity skill sets, and they are simply fun to participate in.&lt;/p&gt;

&lt;p&gt;In addition, if you are a programmer, these will give you insight into how you should design your programs so that they are not vulnerable to malevolent users. You don’t want to be the person who stores all their passwords in plain text.&lt;/p&gt;

&lt;p&gt;If you gain value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a CTF?
&lt;/h2&gt;

&lt;p&gt;At this point, you may be asking yourself: “Cool, but what is a CTF?”&lt;/p&gt;

&lt;p&gt;Essentially, it is a team cybersecurity competition of which there are three main types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Jeopardy: These have a collection of tasks in several distinct categories: web exploits, binary exploitation, reverse engineering, forensics, and cryptography. By solving these challenges, you find “flags” which typically follow a standard format like flag{Th1s_1s_a_flag}. Some examples include &lt;a href="https://picoctf.com/"&gt;picoCTF&lt;/a&gt; and &lt;a href="https://www.defcon.org/html/links/dc-ctf.html"&gt;Defcon CTF&lt;/a&gt;’s qualification round.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attack-Defense: In attack-defense competitions, each team is given their own host or service and is tasked with protecting that host from other teams while also trying to exploit other teams’ hosts. Famously, the Defcon CTF final takes this format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mixed: As the name suggests, this type of competition is some combination of jeopardy and attack-defense.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I will be mainly focusing on the jeopardy-type CTF. In the future, I may write another article on attack-defense competitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are All Those Categories?
&lt;/h2&gt;

&lt;p&gt;Before you get into all of the cool categories in jeopardy contests that I mentioned earlier, you need to learn the basics. Most importantly, you need to familiarize yourself with the Linux terminal.&lt;/p&gt;

&lt;p&gt;Here are a couple of commands that you will use over, and over, and over again:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls: Lists the files and subdirectories of the directory you are currently in.

pwd: Prints your current working directory as a full path. If you are in the documents directory, this will print something like /home/you/documents.

cd: Changes into another directory. E.g.: if you have an essays folder inside your documents folder and your current directory is documents, cd essays will take you to the essays folder.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These are the absolute basics for the Linux terminal and there are a lot more commands that we will cover in the rest of this article.&lt;/p&gt;
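&lt;p&gt;As a quick aside, the same three operations are available from Python's standard library, which comes in handy once you start scripting CTF tasks. A minimal sketch using the &lt;code&gt;os&lt;/code&gt; module:&lt;/p&gt;

```python
import os

print(os.getcwd())      # like pwd: prints the full path of the current working directory
print(os.listdir("."))  # like ls: lists the files and subdirectories of the current directory
os.chdir("..")          # like cd ..: moves up one directory
print(os.getcwd())
```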

&lt;p&gt;To succeed in CTFs, it is also important to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A scripting language, most popular of which is &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt;. There are a lot of cool libraries for cybersecurity in Python, including &lt;a href="http://docs.pwntools.com/en/stable/"&gt;pwn&lt;/a&gt; which has a lot of functions that are helpful for CTFs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Number bases. Many challenges involve converting between binary, decimal, and hexadecimal, so understanding how bases work is very helpful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JavaScript: Doing good work in web exploitation requires knowledge of JavaScript, as well as some SQL for SQL injections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is also advisable to have a UNIX-based operating system because of all the amazing tools that are readily available on Linux; you can run one in a virtual machine such as VirtualBox, so there is no need to change your main OS. However, you can still participate in CTFs on Windows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
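&lt;p&gt;Number-base conversions in particular come up constantly, and Python handles them in one line each. A small sketch:&lt;/p&gt;

```python
# Converting between bases, a routine step in many CTF challenges.
n = int("1101", 2)    # parse a binary string: 13
print(n)
print(hex(n))         # integer to hex string: '0xd'
print(bin(255))       # integer to binary string: '0b11111111'
print(int("ff", 16))  # parse a hex string: 255
```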

&lt;p&gt;Time to start digging into some heavier stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cryptography
&lt;/h2&gt;

&lt;p&gt;Cryptography challenges consist of exactly what you would expect: codebreaking. Given a ciphertext, can you decode it into the original message? Can you do the opposite?&lt;/p&gt;

&lt;p&gt;These types of problems include an encrypted message that you have to decrypt. To prepare for these, it is best to learn different types of ciphers and how to decrypt them.&lt;/p&gt;

&lt;p&gt;Here are some common methods of encryption in these challenges: Caesar Ciphers, Vigenère Ciphers, and RSA. For more info on how to decrypt these, &lt;a href="https://ctf101.org/cryptography/overview/"&gt;check out this link.&lt;/a&gt;&lt;/p&gt;
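&lt;p&gt;To make the Caesar cipher concrete: it shifts every letter by a fixed amount, so trying all 26 shifts is guaranteed to recover the message. A minimal sketch (the ciphertext below is just "hello world" under ROT13):&lt;/p&gt;

```python
def caesar_shift(text, shift):
    """Shift each letter by `shift` positions, preserving case and non-letters."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

ciphertext = "uryyb jbeyq"
for shift in range(26):
    print(shift, caesar_shift(ciphertext, shift))  # exactly one line reads as English
```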

&lt;h2&gt;
  
  
  Steganography
&lt;/h2&gt;

&lt;p&gt;Steganography is not cryptography by definition but it does involve hiding messages in plain sight. As a result, many CTF organizers will include steganography challenges in the cryptography section.&lt;/p&gt;

&lt;p&gt;Steganography consists of hiding messages in media files, typically audio and images. It is important to note that steganography has few real-world applications in cybersecurity; its value here is mostly educational.&lt;/p&gt;

&lt;p&gt;There is a multitude of ways to do this and not enough space in this general-purpose article to cover them all, so here is an in-depth article about steganography:&lt;br&gt;
&lt;a href="https://medium.com/@FourOctets/ctf-tidbits-part-1-steganography-ea76cc526b40"&gt;&lt;strong&gt;CTF Tidbits: Part 1 — Steganography&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Binary Exploitation
&lt;/h2&gt;

&lt;p&gt;Binary exploitation involves finding vulnerabilities in a program, typically Linux executables, and then exploiting these vulnerabilities to obtain the flag.&lt;/p&gt;

&lt;p&gt;These exploitations usually involve either using the program to gain control of a shell or just modifying the program to yield the flag. This is an extremely broad field and &lt;a href="https://ctf101.org/binary-exploitation/overview/"&gt;some helpful tips can be found here.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Forensics
&lt;/h2&gt;

&lt;p&gt;Forensics challenges in CTFs typically have the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;File format analysis: Given various files that have something wrong with them, can you fix them? Can you fix a corrupt file to produce a flag?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory dump analysis: Taking a look at the memory of the system and seeing if any important information can be learned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Steganography: Yes, steganography appears in the forensics section as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Packet capture analysis: A packet is a segment of data sent from one device to another device over a network. A lot of information can be gleaned from packets and there are a lot of programs for packet analysis and capture out there. Possibly the most popular is &lt;a href="https://www.wireshark.org/"&gt;Wireshark&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is something that goes into a &lt;a href="https://trailofbits.github.io/ctf/forensics/"&gt;lot of detail on this topic&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web Exploitation
&lt;/h2&gt;

&lt;p&gt;Web exploitation challenges have the contestant retrieve the flag by exploiting websites and web apps. There are a couple of ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SQL injections: Sometimes, the creator of a web app unintentionally makes it so that SQL code can be inputted. This provides a golden opportunity for the exploiter to use SQL to obtain information from the databases of the web app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Just inspecting element: In the easier stages of contests, event organizers may just hide flags in the HTML of the website. They may also have a JavaScript function that needs a certain input to spit out the flag; these can be solved with inspect element and some problem-solving skills.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Directory traversal: If an application takes in a directory as input and this input is not properly checked, the attacker can mess with the directories to their heart’s desire.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;XSS (cross-site scripting): This is when the attacker can send JavaScript that will be executed by the browser of another user of the web app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Command injection: Sometimes, developers forget to properly check for input that goes into a system’s shell. If not properly checked, the attacker can send whatever system commands they want to the web app.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
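&lt;p&gt;To make the SQL injection bullet concrete, here is a hypothetical sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (the table and values are made up): the string-formatted query is exploitable, while the parameterized one treats the attacker's input as plain data.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

attacker_input = "' OR '1'='1"

# Vulnerable: the input is pasted straight into the SQL string,
# so the WHERE clause becomes: name = '' OR '1'='1'  (always true).
query = "SELECT * FROM users WHERE name = '%s'" % attacker_input
print(conn.execute(query).fetchall())  # dumps every row

# Safe: a parameterized query cannot change the query's structure.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (attacker_input,)).fetchall()
print(rows)  # [] -- no user literally has that name
```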

&lt;p&gt;For more in-depth information on the above topics, take a look at this &lt;a href="https://ctf101.org/web-exploitation/overview/"&gt;wonderful resource.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse Engineering
&lt;/h2&gt;

&lt;p&gt;As the name suggests, these types of challenges are based around reverse-engineering a program to figure out how to properly exploit it. The product of a successful exploit is the flag, as desired.&lt;/p&gt;

&lt;p&gt;These could be given in many programming languages but the following, especially the first two, tend to show up more than others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Assembly: Reading this, you may be thinking that nobody codes in Assembly; on the contrary, quite a lot of people do. It is not extremely widespread, but it is used in programming embedded systems, which are very relevant. It may be a bummer to learn, but it is a fairly useful skill to have.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;C: Lots of programs are written in C and its control over memory allocation makes it a valuable programming language. Familiarity with C may help you do well in reverse engineering programs written in C.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Java: Java is a very popular programming language and has easily readable code. Knowing Java will help you tremendously when reverse engineering it, so learning it is recommended if you don’t already know it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that a lot of the time you are not given the actual source code of the program, only the executable.&lt;/p&gt;

&lt;p&gt;To overcome this hurdle, we use &lt;em&gt;decompilers&lt;/em&gt;. These programs try to convert the executable back into source code.&lt;/p&gt;

&lt;p&gt;A great example of a decompiler is &lt;a href="https://www.nsa.gov/resources/everyone/ghidra/"&gt;Ghidra&lt;/a&gt;, which was created by the NSA. It is a very powerful tool and very good at what it does. It is advisable to set it up on your computer.&lt;/p&gt;

&lt;p&gt;For a more in-depth explanation of reverse engineering, take a look at this &lt;a href="https://ctf101.org/reverse-engineering/overview/"&gt;wonderful resource&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beginner-Friendly CTFs
&lt;/h2&gt;

&lt;p&gt;Alright, these CTF things seem cool, how do I participate in one?&lt;/p&gt;

&lt;p&gt;Well, future pwner, here’s a list of CTFs that are great for beginners. Note that not all of them are available right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://picoctf.com/"&gt;picoCTF: Run by Carnegie Mellon University, this is geared towards middle and high school students and is available year-round.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://overthewire.org/wargames/"&gt;overthewire: This is great for beginners to learn and test their skills. Starting with the bandit challenge will help you build from zero.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://hsctf.com/"&gt;HSCTF: Another CTF made for high school students.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ctftime.org/event/list/upcoming"&gt;CTFtime: You can find a lot more CTFs here to practice and learn.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, get out there and capture those flags. Trust me, it is an incredible experience.&lt;/p&gt;

&lt;p&gt;If you gained value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;. Every Sunday, I send out a newsletter that contains the best programming and learning-related content I’ve seen in the past week along with my own thoughts on the events of the week. The main goal of the newsletter is to bring meaningful and thought-provoking ideas to your inbox every Sunday. Consider signing up if you’re interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Giant List of Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://trailofbits.github.io/ctf/web/exploits.html"&gt;Trail of bits&lt;/a&gt;: Lots of good information on CTFs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ctfs.github.io/resources/"&gt;CTFS Resources: Information on cryptography and forensics.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ctf101.org/"&gt;CTF 101: A complete beginner’s guide to all things CTF.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cybersecurity</category>
      <category>tutorial</category>
      <category>computerscience</category>
      <category>security</category>
    </item>
    <item>
      <title>An Introduction to Competitive Programming</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Thu, 24 Dec 2020 20:27:12 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/an-introduction-to-competitive-programming-2ha0</link>
      <guid>https://forem.com/siddhantdubey/an-introduction-to-competitive-programming-2ha0</guid>
      <description>&lt;h3&gt;
  
  
  What is Competitive Programming?
&lt;/h3&gt;

&lt;p&gt;Competitive Programming is an art form. It is creative problem solving at its finest, a combination of hard analytical thinking and creativity. Competitive programmers use their knowledge of algorithms and data structures, along with logical reasoning skills, to solve challenging algorithmic problems in a limited time frame. Java and C++ are extremely popular due to their run-time efficiency relative to a language like Python. C++ is my preferred competitive programming language, as I love its Standard Template Library (STL), which allows for quick write-ups of solutions. Without further ado, let’s get right into it.&lt;/p&gt;

&lt;p&gt;If you happen to gain value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Get Started
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Learn C++
&lt;/h4&gt;

&lt;p&gt;I know most people might not want to hear this, but as I mentioned already, Python will be hard to succeed with in competitions. C++ as a language for competitive programming is fairly easy to pick up, but like any language, understanding the nuances takes a little more time. Here are a couple of resources I found helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/bqi343/USACO"&gt;Ben Qi's Github Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cses.fi/book/book.pdf"&gt;Book With Problem Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.csc.kth.se/~jsannemo/slask/main.pdf"&gt;Principles of Algorithmic Solving&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Learn Algorithm Analysis
&lt;/h4&gt;

&lt;p&gt;Algorithm analysis allows you to look at the solution you came up with and see whether it will run in time for a contest or exceed the time limit imposed by the online judge. Time complexity analysis, as the name suggests, quantifies the amount of time an algorithm takes to run; it is where the famous O(n) notation comes from. This is called big O notation and describes the worst-case running time, that is, the longest the algorithm could take. O(n) means that for n input elements, the algorithm runs in time proportional to n. Similarly, O(n^2) is proportional to n^2 and O(log n) is proportional to log(n). There are other types of complexity analysis that are also handy. Here's a resource to read up on algorithm analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/analysis-algorithms-big-o-analysis/"&gt;Big O Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
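&lt;p&gt;The difference matters in practice: two correct solutions to the same task can differ by orders of magnitude in running time. A small sketch contrasting an O(n^2) and an O(n) duplicate check:&lt;/p&gt;

```python
def has_duplicate_quadratic(a):
    # O(n^2): compares every pair of elements.
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            if a[i] == a[j]:
                return True
    return False

def has_duplicate_linear(a):
    # O(n): a single pass with a hash set.
    seen = set()
    for x in a:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(100000)) + [0]   # the only duplicate is at the very end
print(has_duplicate_linear(data))  # fast even for n = 100001 elements
```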

&lt;h4&gt;
  
  
  Learn how to Brute Force Problems
&lt;/h4&gt;

&lt;p&gt;At the start of your competitive programming journey, you will more often than not encounter problems that can be brute forced with simple algorithms and do not require optimization to solve. In this case, you can usually just code out the solution step by step rather than applying specific algorithms. To get good at brute forcing problems, all you really have to do is practice. I would advise practicing on rating-1000 problems on codeforces.com. Codeforces is a wonderful site to practice your competitive programming skills on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Learn Greedy Algorithms
&lt;/h4&gt;

&lt;p&gt;Greedy algorithms make the locally optimal choice at each stage of the algorithm in the hope of finding the global optimum. These are usually more efficient than brute-force solutions, provided you can come up with a correct greedy rule. They are not suited to every type of problem and may give wrong answers if applied where they shouldn't be used. Here is some great information on greedy algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://brilliant.org/wiki/greedy-algorithm/"&gt;Brilliant's Greedy Algorithms Page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
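&lt;p&gt;A classic example is making change: with coins of 10, 5, and 1, always taking the largest coin that still fits happens to be optimal, but the same rule breaks for a coin set like {4, 3, 1}. A small sketch:&lt;/p&gt;

```python
def greedy_change(amount, coins=(10, 5, 1)):
    """Make change by always taking the largest coin that still fits."""
    result = []
    for coin in coins:  # coins must be given in decreasing order
        while amount >= coin:
            amount -= coin
            result.append(coin)
    return result

print(greedy_change(28))                  # [10, 10, 5, 1, 1, 1]: 6 coins, optimal here
print(greedy_change(6, coins=(4, 3, 1)))  # [4, 1, 1]: 3 coins, but 3+3 needs only 2
```

&lt;p&gt;The second call shows why a greedy rule must be justified before use: it is fast, but not always correct.&lt;/p&gt;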

&lt;h4&gt;
  
  
  Learn Dynamic Programming
&lt;/h4&gt;

&lt;p&gt;Dynamic programming is an optimization of plain recursion. Essentially, it involves solving subproblems and saving those solutions so that you don't have to re-solve them later, as you would with plain recursion. This greatly reduces time complexity and is helpful on recursive-type problems. Again, the best way to get good at this is to solve multiple dynamic programming problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://atcoder.jp/contests/dp"&gt;Educational DP Contest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/dynamic-programming/"&gt;Geeks for Geeks DP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
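&lt;p&gt;The textbook illustration is the Fibonacci sequence: naive recursion re-solves the same subproblems exponentially many times, while caching each answer makes the whole computation linear. A minimal sketch:&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each distinct n is computed once and then served from the cache,
    # turning the O(2^n) naive recursion into O(n).
    if n in (0, 1):
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(90))  # instant; the uncached version would take ages
```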

&lt;h4&gt;
  
  
  Learn Graph Algorithms
&lt;/h4&gt;

&lt;p&gt;Once you move into the upper echelons of programming competitions, you will find a multitude of graph algorithm problems. These include things like finding the shortest path between two nodes of a graph. To succeed at these types of problems, there are quite a few algorithms you need to master.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/depth-first-search-or-dfs-for-a-graph/"&gt;Depth First Search (DFS)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/breadth-first-search-or-bfs-for-a-graph/"&gt;Breadth First Search (BFS)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/dijkstras-shortest-path-algorithm-greedy-algo-7/"&gt;Dijkstra's Algorithm (Used to find shortest path)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/bellman-ford-algorithm-dp-23/"&gt;Bellman-Ford (Used to find shortest path)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/floyd-warshall-algorithm-dp-16/"&gt;Floyd-Warshall (Used to find shortest path)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/kruskals-minimum-spanning-tree-algorithm-greedy-algo-2/"&gt;Kruskal's Algorithm (Used to find minimum spanning tree)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/prims-minimum-spanning-tree-mst-greedy-algo-5/"&gt;Prim's Algorithm (Used to find minimum spanning tree)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering DFS and BFS first will yield great results as you progress in the world of competitive programming. Now that you know what algorithms and techniques you need to learn, you are ready to find out what competitions you can use to practice your skills.&lt;/p&gt;
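&lt;p&gt;As a first step, here is BFS computing the shortest number of edges between two nodes of an unweighted graph (the adjacency list below is made up):&lt;/p&gt;

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return the minimum number of edges from start to goal, or -1 if unreachable."""
    dist = {start: 0}           # node -> distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()  # nodes leave the queue in order of distance
        if node == goal:
            return dist[node]
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return -1

graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
print(bfs_shortest_path(graph, 1, 5))  # 3 (one shortest route is 1 -> 2 -> 4 -> 5)
```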

&lt;h3&gt;
  
  
  Major Competitions / Online Judges
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;a href="http://www.usaco.org/"&gt;United States of America Computing Olympiad(USACO)&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The USACO is a competitive programming contest held every year in January, February, March, and December. It is divided up into four divisions, Bronze, Silver, Gold, and Platinum. Each division gets progressively harder, Platinum being the hardest. They also have a &lt;a href="https://train.usaco.org/usacogate"&gt;training page&lt;/a&gt; which is a great place to practice and learn Competitive Programming. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;a href="//codeforces.com"&gt;Codeforces&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Codeforces is a platform on which a lot of programming contests are held. The problems are usually of very high quality, and there is a large database of past problems covering a wide variety of topics that you can use to practice your skills. Codeforces works on a rating system; if you are just starting out, I would recommend solving 1000-1200-rated problems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;a href="https://atcoder.jp/"&gt;Atcoder&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Atcoder is a wonderful contest platform, especially for beginners. They host beginner contests often, which are a great way for newcomers to get into the world of competitive programming and learn how to succeed. Like Codeforces, Atcoder has a large collection of problems for you to work on and improve your skills with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;p&gt;OK, cool, but I'm never going to have to implement Dijkstra's algorithm on the job, right? I'll just use a built-in function or a library. While these are valid points, the objective of competitive programming is not to have you implement all these fancy algorithms and data structures from scratch; rather, it is to develop problem-solving skills. Solving a lot of competitive programming questions improves those skills. This is why many companies have you solve competitive-programming-style questions during an interview: not to see whether you can implement Dijkstra's algorithm on the job, but to see if you can problem-solve on the fly and take things in your stride.&lt;/p&gt;

&lt;p&gt;If you gained value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;. Every Sunday, I send out a newsletter that contains the best programming and learning-related content I’ve seen in the past week along with my own thoughts on the events of the week. The main goal of the newsletter is to bring meaningful and thought-provoking ideas to your inbox every Sunday. Consider signing up if you’re interested.&lt;/p&gt;

&lt;p&gt;Now get out there and start programming!&lt;/p&gt;

</description>
      <category>technology</category>
      <category>programming</category>
      <category>interview</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Identifying the Gender of a Movie Character with Deep Learning</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Thu, 24 Dec 2020 20:20:50 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/identifying-the-gender-of-a-movie-character-with-deep-learning-n91</link>
      <guid>https://forem.com/siddhantdubey/identifying-the-gender-of-a-movie-character-with-deep-learning-n91</guid>
      <description>&lt;p&gt;If you were given a single line from a movie, would you be able to identify the gender of the character who delivered the line? Unless you've memorized a lot of movie scripts, probably not. Lucky for you, you don't have to do this as long as we have computers! The field of Natural Language Processing (NLP) has us covered. By applying Deep Learning to NLP and creating a text classifier we can train a computer to identify whether a line from a movie was delivered by a male or female character! &lt;/p&gt;

&lt;p&gt;If you happen to gain value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Your Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Colab
&lt;/h3&gt;

&lt;p&gt;Deep learning usually requires a large amount of computing power and a solid GPU, and deep learning for NLP is no exception. This used to be a barrier to entry, but thanks to Google Colaboratory, it no longer is. Google Colab is a platform that lets you train models on GPUs in the cloud with Jupyter notebooks, completely for free! You'll be following along with this tutorial on Colab.&lt;/p&gt;

&lt;p&gt;To get started with Google Colab, all you need to do is go to &lt;a href="https://colab.research.google.com/"&gt;https://colab.research.google.com/&lt;/a&gt; and sign in with your Google account. Once you've done that, you can make Jupyter notebooks and use them like you normally would. Usually, you'll want to read from and write to your Google Drive so that you can actually use data and save your results. To do this, add the following block of code as the first cell in every Colab notebook you create.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/pathtofolderwithfileshere"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a great environment to train your models in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Libraries
&lt;/h3&gt;

&lt;p&gt;One of the best things about Colab is that it comes with all the big data science libraries like PyTorch, TensorFlow, NumPy, Matplotlib, and scikit-learn out of the box! In fact, it also comes with NLTK and spaCy, two of the most important NLP libraries. In short, you don't need to install any libraries to follow along with this tutorial as long as you're using Colab.&lt;/p&gt;

&lt;h3&gt;
  
  
  Obtaining Data
&lt;/h3&gt;

&lt;p&gt;The data that this tutorial uses comes from the Cornell Movie-Dialogs Corpus, which contains information about 617 Hollywood films. The data this article is concerned with is the conversational data: the lines delivered by the characters in each movie and the genders of those characters. For that purpose, I extracted all the relevant data and merged it into one file for easy use, which you can find here: &lt;a href="https://drive.google.com/file/d/1pD6u40QVZ6bHeLgUUmH2eRNZv4F-byz_/view?usp=sharing"&gt;https://drive.google.com/file/d/1pD6u40QVZ6bHeLgUUmH2eRNZv4F-byz_/view?usp=sharing&lt;/a&gt;. This file contains a lot of lines from movies along with information about the characters, including their genders, which is of great importance for building the classifier. Once you download the data file, upload it to a folder named data inside your &lt;code&gt;root_dir&lt;/code&gt; (as defined above) on Google Drive. When you finish, you're ready to move on to the preprocessing stage of the tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocessing Data
&lt;/h2&gt;

&lt;p&gt;The first step in most Data Science projects of any kind is to examine the data you're working with and then preprocess it. For those of you who are unfamiliar with the term, preprocessing just means making the data usable for whatever task you intend to do with it. The specific preprocessing tasks vary by field, and this section covers a few common NLP preprocessing tasks that will serve the larger Deep Learning NLP goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preliminary Analysis of the Dataset
&lt;/h3&gt;

&lt;p&gt;Before you begin a Data Science project, it is always good to take a brief look at what your dataset actually looks like. We know that there are only two classes to classify: male and female. Hence, you'll also need to look for &lt;em&gt;imbalances&lt;/em&gt; between the two classes, which is just a fancy way of saying you need to check whether the number of data points for each class is about the same. Thankfully, a quick overview of the data only takes a little Python! While you conduct the preliminary analysis, you'll also create two new files, &lt;code&gt;male.txt&lt;/code&gt; and &lt;code&gt;female.txt&lt;/code&gt;, which will make things easier when we train our models. Before you follow along with the next steps, create a new notebook in Colab called &lt;code&gt;preprocessing&lt;/code&gt;. Once you do that and add the cell that mounts your drive on Colab, you can follow along with the rest of this section. &lt;/p&gt;

&lt;p&gt;The first thing you should do is open up the &lt;code&gt;collated_data.txt&lt;/code&gt; file and look at its format. You'll notice that it uses "+++$+++" as the delimiter, which is just what separates the different data values; in a CSV, the delimiter is a comma. You'll also notice that there are seven pieces of information in each line. In order, they are: line number, character id, movie id, character name, character gender, line text, and the character's position in the credits. You may also see a "?" in place of the character gender in some places; that is because the people who put the dataset together were unable to ascertain the gender of the character. &lt;/p&gt;
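&lt;p&gt;To make the format concrete, here is a minimal sketch of how one record splits on that delimiter. The sample line below is made up in the corpus format, not actual corpus data, and the fields are stripped the same way the preprocessing code will strip them.&lt;/p&gt;

```python
# A made-up line in the collated_data.txt format; the real file separates
# seven fields with the "+++$+++" delimiter.
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ f +++$+++ They do not! +++$+++ 4"

fields = [part.strip() for part in sample.split("+++$+++")]
line_no, chr_id, mov_id, chr_name, gender, text, credit = fields

print(gender)  # f
print(text)    # They do not!
```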

&lt;p&gt;Now that you've taken a brief look over your data you'll need to separate the data based on the gender of the characters. This can be done in base python without the help of any libraries. Follow along with these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the following cell to your notebook, all it does is initialize a list that will contain the text of the lines delivered by males and then another list for females.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   male_lines = []
   female_lines = []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Next, you'll need to loop through the data file and add the text to either the male or female list depending on the gender of the speaker. You can do this with basic file operations and conditional statements. Note that &lt;code&gt;root_dir&lt;/code&gt; is defined in the first cell discussed in this tutorial and that collated_data.txt should be present in that directory.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open(root_dir+'collated_data.txt', encoding="charmap") as data:
    for line in data:
        line_no, chr_id, mov_id, chr_name, gender, text, credit = line.strip().split("+++$+++")
        if(gender.strip().lower() == 'm'):
            male_lines.append(text)
        elif(gender.strip().lower() == 'f'):
            female_lines.append(text)      
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice that we split the line based on the delimiter and then have variables for each attribute that is on the line. The only two that matter for this tutorial are the character gender and the line text.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You'll now do some preliminary analysis of the dataset which just boils down to looking at the number of male and female lines. This just requires use of the &lt;code&gt;len()&lt;/code&gt; function.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(len(male_lines)) #Output: 170768  
print(len(female_lines)) #Output: 71255
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yikes! There are almost 100,000 more male data points than female data points! That's a massive imbalance, and something that will need to be corrected before a classifier can be constructed. In the meantime, however, we can proceed with writing the male and female lines to separate text files.&lt;/p&gt;
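&lt;p&gt;To put a number on the imbalance, a quick back-of-the-envelope check using the counts printed above:&lt;/p&gt;

```python
# Counts printed in the previous step.
male_count = 170768
female_count = 71255

difference = male_count - female_count  # absolute gap between the classes
ratio = male_count / female_count       # male lines per female line

print(difference)       # 99513
print(round(ratio, 2))  # 2.4
```

&lt;p&gt;Roughly 2.4 male lines for every female line: more than enough skew to bias a naive classifier toward always predicting "male".&lt;/p&gt;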

&lt;ol&gt;
&lt;li&gt;By writing the male and female lines to separate files you'll be doing yourself a favor and making it easier to reuse the data in future projects.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open(root_dir+'male.txt', mode='w+') as male:
    for line in male_lines:
        male.write(line + '\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open(root_dir+'female.txt', mode='w+') as female:
    for line in female_lines:
        female.write(line + '\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code blocks are separate because I encourage you to put them in different cells of your notebook for clarity's sake. Now that you're done with some very basic preprocessing, it's time that you do some preprocessing tasks that are exclusive to NLP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction to NLP Terms and Preprocessing
&lt;/h3&gt;

&lt;p&gt;There's a lot of information that can be gleaned from the words in an English sentence; however, you often don't need the whole sentence to ascertain its meaning. In NLP there's usually a lot of unimportant data that we can clear out to reduce the noise in the inputs to our model. The most common of these preprocessing steps are &lt;strong&gt;tokenization&lt;/strong&gt;, &lt;strong&gt;stopword removal&lt;/strong&gt;, and &lt;strong&gt;stemming&lt;/strong&gt;. However, these steps are not always applied, because sometimes they remove useful data. In fact, no stopword removal or stemming will be applied to this dataset, because of the important information those steps might strip out. Here are definitions of each of these steps.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tokenization
&lt;/h4&gt;

&lt;p&gt;Tokenization is the breaking down of a sentence or document into individual tokens, which are essentially just words. This can be done with NLTK's &lt;code&gt;nltk.word_tokenize()&lt;/code&gt; function. You can also have it done for you when you load data into your model, which is what you will do when you write the LSTM. By turning sentences into individual tokens you create the kind of sequential data that LSTMs thrive on. &lt;/p&gt;
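&lt;p&gt;NLTK and spaCy handle tokenization properly; as a dependency-free illustration of the idea, a deliberately crude regex tokenizer might look like this:&lt;/p&gt;

```python
import re

def crude_tokenize(sentence):
    # Grab runs of word characters, or single punctuation marks.
    # Real tokenizers (NLTK, spaCy) also handle contractions,
    # abbreviations, and much more.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(crude_tokenize("They do not!"))  # ['They', 'do', 'not', '!']
```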

&lt;h4&gt;
  
  
  Stopword Removal
&lt;/h4&gt;

&lt;p&gt;Stopword removal is the process of removing extremely common English words ("the", "is", "a", and so on) from text. It is often done so that models don't weight these ubiquitous words disproportionately compared to the rarer, more informative words in a particular text. However, no stopword removal will be done for this project, since the movie lines are already fairly short and all the tokens are valuable.&lt;/p&gt;
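&lt;p&gt;Even though no stopword removal is applied in this project, the mechanism is worth seeing once. This sketch uses a tiny hand-picked stopword set; in practice you would use the much fuller list from &lt;code&gt;nltk.corpus.stopwords&lt;/code&gt;.&lt;/p&gt;

```python
# A tiny illustrative stopword set; NLTK's English list is far longer.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def remove_stopwords(tokens):
    # Keep only the tokens that are not stopwords (case-insensitive).
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "on", "a", "mat"]))
# ['cat', 'on', 'mat']
```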

&lt;h4&gt;
  
  
  Stemming
&lt;/h4&gt;

&lt;p&gt;As the name suggests, stemming is just turning words into their stems. This is helpful when the tense or form of a word doesn't matter to the task at hand. For this classification task, however, the form of a word may itself carry signal, which is why no stemming is applied here.&lt;/p&gt;
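&lt;p&gt;For completeness, here is what stemming looks like in spirit. The usual tool is NLTK's &lt;code&gt;PorterStemmer&lt;/code&gt;; the naive suffix-stripper below is only a toy stand-in to show the effect, and its outputs differ from a real stemmer's.&lt;/p&gt;

```python
def naive_stem(word):
    # Strip one of a few common suffixes; Porter's algorithm applies
    # many ordered rules and is far more careful than this toy version.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("running"))  # runn
print(naive_stem("talked"))   # talk
print(naive_stem("cat"))      # cat (unchanged)
```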

&lt;h4&gt;
  
  
  Splitting the Data into Training and Testing sets
&lt;/h4&gt;

&lt;p&gt;One of the most important preprocessing steps in Machine Learning in general is dividing your dataset into training and testing sets. Remember, there are a lot more male data points than female data points, which means you'll have to correct this imbalance somehow. Keep this in mind as you begin to divide the data. The following code is still part of the &lt;code&gt;preprocessing&lt;/code&gt; notebook. Your main goal is to create one file containing the training data and one containing the testing data.&lt;/p&gt;

&lt;p&gt;When creating training and testing sets you must keep in mind that the proportion of data points for each class should be roughly the same in both the training and testing sets. The testing set will usually be much smaller than the training set; an 80/20 split of all the data is common. Scikit-learn even has a built-in function that splits data into training and testing sets for you! Before you divide the data, throw your mind back to the imbalance we saw. There are a lot more male data points than female data points. You can combat this by either randomly oversampling or randomly undersampling the training set. Randomly oversampling the train set increases the number of female data points by using some of them multiple times, until the number of female lines matches the number of male lines. Randomly undersampling the train set decreases the number of male data points to match the number of female lines. Oftentimes, random undersampling leads to lower accuracy because there just isn't enough data left, and for that reason you'll be randomly oversampling the train set. &lt;/p&gt;

&lt;p&gt;Now, you may be wondering why only the train set is being randomly oversampled. If we were to randomly oversample the entire dataset before splitting it, it is likely that some lines would end up in both the train set and the test set, which would lead to an inaccurate picture of the model's performance. &lt;/p&gt;
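&lt;p&gt;The leakage argument can be demonstrated in a few lines. In this toy sketch the whole minority class is duplicated (oversampled) before the split, and as a result every test line also appears in the train set:&lt;/p&gt;

```python
minority = [f"line_{i}" for i in range(10)]

# Wrong order: oversample (here, simply duplicate) the whole dataset first...
oversampled = minority + minority

# ...and only then take an 80/20-style split.
split = int(len(oversampled) * 0.8)
train_part, test_part = oversampled[:split], oversampled[split:]

overlap = sorted(set(train_part) & set(test_part))
print(overlap)  # ['line_6', 'line_7', 'line_8', 'line_9']
```

&lt;p&gt;A real oversampler duplicates random lines rather than the whole list, which makes the leak less blatant, but duplicates straddling the split boundary remain likely. Oversampling only the train set, after the split, avoids the problem entirely.&lt;/p&gt;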

&lt;p&gt;Alright, enough theory! It is time for you to write some code. Let us have around 10% of the data be for testing and 90% be for training. To properly split your data into training and testing, follow along with these steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the testing set first by simply taking the first 10,000 lines from both the male and female lists.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;male_test = male_lines[:10000]
female_test = female_lines[:10000]
X_test = male_test + female_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now that the X portion of the testing set exists, the labels need to be constructed. Our labels will be 0 if the line was delivered by a male and 1 if the line was delivered by a female. This can be accomplished with two simple for loops.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Y_test = []
for x in male_test:
    Y_test.append(0)
for x in female_test:
    Y_test.append(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
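&lt;p&gt;The two loops above are equivalent to a pair of list-multiplication expressions, which makes the structure of the labels easy to see (tiny stand-in lists are used here in place of the real test lines):&lt;/p&gt;

```python
male_test = ["line a", "line b", "line c"]  # stand-ins for the 10,000 male test lines
female_test = ["line d", "line e"]          # stand-ins for the 10,000 female test lines

# 0 for every male line, then 1 for every female line,
# in the same order the lines were concatenated into X_test.
Y_test = [0] * len(male_test) + [1] * len(female_test)
print(Y_test)  # [0, 0, 0, 1, 1]
```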



&lt;ol&gt;
&lt;li&gt;The test set is now complete; it is time to create the train set. First, take everything that wasn't used in the test set and put it into two new lists: &lt;code&gt;male_train&lt;/code&gt; and &lt;code&gt;female_train&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;male_train = male_lines[10000:]
female_train = female_lines[10000:]
X_train = male_train + female_train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now you need to create &lt;code&gt;Y_train&lt;/code&gt;, which will contain the labels for the lines in &lt;code&gt;X_train&lt;/code&gt;. This is the same process that was used to make the labels for the test set.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Y_train = []
for x in male_train:
    Y_train.append(0)
for x in female_train:
    Y_train.append(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Since the male lines significantly outnumber the female lines in the train set, you'll need to oversample the female lines. This can be done with the help of a library called imblearn, which is included in your Colab environment. You'll also need to import NumPy. The following code oversamples until the number of female lines equals the number of male lines.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from imblearn.over_sampling import RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_train, Y_train = oversample.fit_resample(np.array(X_train).reshape(-1,1), Y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
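&lt;p&gt;To see what &lt;code&gt;RandomOverSampler&lt;/code&gt; is doing conceptually, here is a dependency-free sketch of random oversampling with replacement on a toy dataset. This is an illustration of the idea, not a replacement for the imblearn call above.&lt;/p&gt;

```python
import random
from collections import Counter

random.seed(42)  # deterministic picks for the illustration

def oversample_minority(X, y):
    """Duplicate random minority-class samples until every class
    has as many samples as the largest class."""
    by_class = {}
    for text, label in zip(X, y):
        by_class.setdefault(label, []).append(text)
    target = max(len(samples) for samples in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [random.choice(samples) for _ in range(target - len(samples))]
        X_out.extend(samples + extra)
        y_out.extend([label] * target)
    return X_out, y_out

X = ["m1", "m2", "m3", "m4", "f1", "f2"]  # 4 "male" lines, 2 "female" lines
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))  # both classes now have 4 samples
```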



&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;X_train&lt;/code&gt; created by the code block above is actually a list of single-element lists, where each element is a movie line. It should just be a flat list of strings. This is an easy conversion with a quick for loop and list indexing.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;male_lines = []
for phrase in X_train:
    male_lines.append(phrase[0].strip())
X_train = male_lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now that both the training and test sets are completely constructed, they need to be converted into pandas dataframes and then saved as CSVs. The dataframes will have two columns: &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt;, where &lt;code&gt;text&lt;/code&gt; is a movie line and &lt;code&gt;target&lt;/code&gt; is either 0 or 1 depending on the gender of the speaker. To do all of this, pandas will need to be imported but the code itself is fairly simple. You will create two dataframes and fill them with the train and test lists that have been created and then save them to &lt;code&gt;root_dir&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
train_df = pd.DataFrame()
test_df = pd.DataFrame()

train_df['text'] = X_train
train_df['target'] = Y_train

test_df['text'] = X_test
test_df['target'] = Y_test

train_df.to_csv(root_dir + 'train.csv')
test_df.to_csv(root_dir + 'test.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Amazing! You now have cleaned data and a training and testing set! The actual creation of the models will take you less time than the preprocessing stage and this is often true of real-life data science projects. Without further ado, it is time to move on to building the classifiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  LSTMs for Text Classification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How does a Recurrent Neural Network (RNN) work?
&lt;/h3&gt;

&lt;p&gt;Long Short Term Memory Networks (LSTMs) are a variant of RNNs, so to properly understand how LSTMs work, one needs to know how RNNs work. What's great about RNNs is that they have internal memory, which other Neural Networks do not. People use RNNs when they're dealing with sequential data, such as language! This explanation of the workings of an RNN assumes that you know how basic feed-forward networks work.&lt;/p&gt;

&lt;p&gt;In an RNN the data cycles through a loop: when it comes time to make a decision, the RNN takes into account the current input and the inputs that came before it. In essence, an RNN has two sources of input: the current input and the recent past. This provides an edge when doing language-related tasks. A plain RNN only has short-term memory, which is one of the reasons LSTMs are needed: they give the network long-term memory as well. Additionally, unlike feed-forward networks, which can only map one input to one output, RNNs can do one-to-many, many-to-many, and many-to-one mappings. This is a brief summary of RNNs; there's a lot of in-depth math one can get into, and I would advise you to read up on it to get a really thorough understanding of the network.&lt;/p&gt;
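&lt;p&gt;That loop can be made concrete with a scalar toy version of the recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}); the weights here are made up purely for illustration.&lt;/p&gt;

```python
import math

def rnn_steps(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Scalar toy RNN: each hidden state mixes the current input
    with the previous hidden state (the network's memory)."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# One nonzero input followed by silence: the state stays nonzero
# (the network "remembers"), but it fades step by step, which is
# exactly the short-term-memory limitation LSTMs address.
states = rnn_steps([1.0, 0.0, 0.0])
print(all(s > 0 for s in states))         # True
print(states[0] > states[1] > states[2])  # True: the memory decays
```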

&lt;h3&gt;
  
  
  GloVe Vectors
&lt;/h3&gt;

&lt;p&gt;GloVe vectors are what we'll be using as the inputs to our model. GloVe stands for Global Vectors for Word Representation, and it is used to create word embeddings. Word embeddings often serve as the inputs for Deep Learning NLP models; they are simply a way to convert textual information, like sentences, into numerical data that deep learning models can understand. This section will walk you through the first few steps of writing the LSTM in PyTorch, which really amount to loading the data and creating GloVe vectors. &lt;/p&gt;
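&lt;p&gt;A word embedding is ultimately just a lookup from token to vector. Here is a tiny sketch with made-up 3-dimensional vectors; the GloVe vectors used later are 200-dimensional and learned from co-occurrence statistics rather than written by hand.&lt;/p&gt;

```python
# Made-up 3-d vectors for illustration only.
embeddings = {
    "movie": [0.2, -0.1, 0.7],
    "line":  [0.5, 0.3, -0.2],
}
UNK = [0.0, 0.0, 0.0]  # out-of-vocabulary words map to a zero vector

def embed(tokens):
    # Replace each token with its vector, falling back to UNK.
    return [embeddings.get(token, UNK) for token in tokens]

vectors = embed(["movie", "line", "xyzzy"])
print(vectors[0])  # [0.2, -0.1, 0.7]
print(vectors[2])  # [0.0, 0.0, 0.0]  ("xyzzy" is out of vocabulary)
```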

&lt;p&gt;First you'll need to create a new notebook on Colab to actually write the LSTM in. You should probably name it something along the lines of &lt;code&gt;GenderClassifierLSTM.ipynb&lt;/code&gt;. Before you type up any lines of code, make sure to change the runtime of your notebook and ensure that it is utilizing a GPU. To do this, click on Runtime &amp;gt; Change Runtime Type and then change the Hardware Accelerator to a GPU.  To load in the data and set the base for your LSTM, follow along with these steps. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mount your Google Drive in Colab
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/pathtoyourdatahere"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Import all the necessary libraries. Don't be scared by everything that is being imported here; you'll know what everything means by the end. The highlights are PyTorch, NumPy, Pandas, and Scikit-Learn. The PyTorch documentation is something you'll need to look at continuously, and you can find it at &lt;a href="https://pytorch.org/docs/stable/index.html"&gt;https://pytorch.org/docs/stable/index.html&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn as nn 
import torch.nn.functional as F
import torchtext 
import numpy as np
import pandas as pd
from torchtext import data  # on torchtext 0.9+, use: from torchtext.legacy import data
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence 
from sklearn.metrics import mean_squared_error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now it is time to load in the data that you have, a fairly easy task.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_df = pd.read_csv(root_dir + 'train.csv')
test_df = pd.read_csv(root_dir + 'test.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Because the CSVs were saved along with their index, loading them back in leaves an extra unnamed column at the beginning. To fix that, you'll reconstruct both &lt;code&gt;train_df&lt;/code&gt; and &lt;code&gt;test_df&lt;/code&gt; by extracting just the relevant columns.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = train_df['text']
Y_train = train_df['target']
X_test = test_df['text']
Y_test = test_df['target']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;One final bit of preprocessing is removing NaN values from your data. An easy way to do this is to remove every entry whose type is float from your lists: NaN is itself a float, and in purely textual data only the NaNs will be floats.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indices = []
for i in range(len(X_train)):
  if (isinstance(X_train[i], float)):
    indices.append(i)

for index in sorted(indices, reverse=True):
    del X_train[index]
    del Y_train[index]

indices = []
for i in range(len(X_test)):
  if (isinstance(X_test[i], float)):
    indices.append(i)

for index in sorted(indices, reverse=True):
    del X_test[index]
    del Y_test[index]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now you will seed your notebook so that you get the same results every time you run it.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now, if you recall, one of the important parts of preprocessing textual data is tokenization. torchtext lets us do this when creating the fields of the model, of which we have two: &lt;code&gt;TEXT&lt;/code&gt; and &lt;code&gt;LABEL&lt;/code&gt;, which are self-explanatory.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we're creating two fields using the built-in field classes from &lt;code&gt;torchtext.data&lt;/code&gt;. The text is tokenized using spaCy, one of the standard text processing libraries for projects like this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When working with Deep Learning NLP in PyTorch and any other type of Deep Learning, you usually need to write classes to accommodate your custom Datasets and make sure you can load it into your model. In this case, you'll be writing a custom class that will represent a dataframe.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class DataFrameDataset(data.Dataset):

    def __init__(self, df, fields, is_test=False, **kwargs):
        examples = []
        for i, row in df.iterrows():
            label = row.target if not is_test else None
            text = row.text
            examples.append(data.Example.fromlist([text, label], fields))

        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    @classmethod
    def splits(cls, fields, train_df, val_df=None, test_df=None, **kwargs):
        train_data, val_data, test_data = (None, None, None)
        data_field = fields

        if train_df is not None:
            train_data = cls(train_df.copy(), data_field, **kwargs)
        if val_df is not None:
            val_data = cls(val_df.copy(), data_field, **kwargs)
        if test_df is not None:
            test_data = cls(test_df.copy(), data_field, True, **kwargs)

        return tuple(d for d in (train_data, val_data, test_data) if d is not None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this code is standard among many projects that I've done before, and you will most likely end up using this class multiple times, so be sure to save it! The most important part of this class is the &lt;code&gt;splits&lt;/code&gt; method, which wraps the train and test dataframes, together with the &lt;code&gt;TEXT&lt;/code&gt; and &lt;code&gt;LABEL&lt;/code&gt; fields, into datasets readable by the model we create. Using it also happens to be the next step in this process. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The next step is to make the train and test datasets readable by the model you'll create, and to do this you'll use the &lt;code&gt;splits&lt;/code&gt; method of the &lt;code&gt;DataFrameDataset&lt;/code&gt; class that you wrote.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields = [('text',TEXT), ('label',LABEL)]
train_ds, test_ds = DataFrameDataset.splits(fields, train_df=train_df, val_df=test_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have your train and test datasets in a readable format, and you are ready to construct GloVe vectors. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When constructing GloVe vectors you have to define the size of your vocabulary, which in this case is capped at the number of lines in the training set, as well as the size of each vector, i.e. how many dimensions it has; 200 dimensions is a common choice. In the following code block, you are building the vocabulary for your TEXT field.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAX_VOCAB_SIZE = len(train_df['text'])

TEXT.build_vocab(train_ds, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = 'glove.6B.200d',
                 unk_init = torch.Tensor.zero_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Having built the vocabulary for the text, you'll need to do the same for your labels but you won't be using GloVe vectors.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LABEL.build_vocab(train_ds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright! You've finished all the preprocessing you need and have set the basis for writing your LSTM. It is time to learn more about the wonder that is a Long Short Term Memory Network.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are LSTMs?
&lt;/h3&gt;

&lt;p&gt;LSTMs are an improvement upon RNNs. It was mentioned earlier that plain RNNs only have short-term memory, which is one of their limitations. LSTMs address this: they are able to maintain memories long-term, which significantly boosts their performance. &lt;/p&gt;

&lt;p&gt;LSTMs are centered around something called the cell state which is commonly thought of as a conveyor belt. This conveyor belt goes through the chain of modules of the neural network. Information usually goes through the chain unchanged and uninterrupted. However, the LSTM can alter the information that the cell state has through the use of "gates". Gates are made of a pointwise multiplication operation and a sigmoid neural net layer. If you're familiar with deep learning you'll know that the sigmoid layer just outputs numbers between zero and one. This corresponds to how much of each component should be let through. As you may surmise, 0 means nothing should be let through and 1 means everything should be let through. LSTMs have three such gates and that is how they control the flow of information in the networks. I'd suggest that you read more about the math behind Long Short Term Memory Networks after you implement one and it will help you gain a better understanding of the network.&lt;/p&gt;
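&lt;p&gt;The gate arithmetic can be sketched with a scalar toy LSTM step. All weights are set to 1.0 purely for illustration; a real LSTM learns separate weight matrices for each gate.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, w=1.0):
    """One scalar toy LSTM step: three sigmoid gates (values in (0, 1))
    control how the cell state, the 'conveyor belt', is updated and read."""
    f = sigmoid(w * x + w * h_prev)          # forget gate: how much old memory to keep
    i = sigmoid(w * x + w * h_prev)          # input gate: how much new info to write
    o = sigmoid(w * x + w * h_prev)          # output gate: how much memory to expose
    c_tilde = math.tanh(w * x + w * h_prev)  # candidate cell-state update
    c = f * c_prev + i * c_tilde             # gated update of the conveyor belt
    h = o * math.tanh(c)                     # hidden state read off the cell state
    return h, c

h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=0.0)
print(0.0 < c < 1.0 and 0.0 < h < 1.0)  # True: gates keep both values bounded
```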

&lt;p&gt;Enough theory! Time to implement this in PyTorch. Follow along with these steps and you'll be golden the next time you want to implement an LSTM.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before you really get into writing the LSTM, there are some housekeeping tasks to do that are common among most PyTorch Neural Network implementations: making sure that you'll be using a GPU to train, and choosing some hyperparameters. You're also declaring a &lt;code&gt;train_iterator&lt;/code&gt; and a &lt;code&gt;valid_iterator&lt;/code&gt;, which will be used during training and testing respectively to, as the name suggests, iterate through the data.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BATCH_SIZE = 256

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_ds, test_ds), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Alright, the only hyperparameter that is defined so far is the BATCH_SIZE. There are a lot of other important hyperparameters that should be discussed. They are all in the code block below with accompanying explanations.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hyperparameters
num_epochs = 25 #This is the number of epochs and dictates how long the model trains for
learning_rate = 0.001 #This essentially determines how quickly a model trains

INPUT_DIM = len(TEXT.vocab) #As the name suggests this is the input dimension
EMBEDDING_DIM = 200 #The GloVe Embedding dimensions which is 200
HIDDEN_DIM = 256 #The number of hidden dimensions
OUTPUT_DIM = 1 #The number of output dimensions: 1 (either 0 or 1)
N_LAYERS = 4 #The number of layers in the neural network.
BIDIRECTIONAL = True #LSTMs are Bidirectional so don't change this hyperparameter
DROPOUT = 0.2 # Dropout is when random neurons are ignored, the higher the dropout the greater percentage of neurons are ignored.
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # padding makes it so that sequences are padded to the maximum length of any one of the sequences, in this case that would be the longest utterance delivered by a movie character.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now comes the exciting part, actually writing the LSTM. You'll be creating a class called &lt;code&gt;LSTM_net&lt;/code&gt; that inherits from PyTorch's &lt;code&gt;nn.Module&lt;/code&gt;. As with any class that one writes in Python, the first thing to do is write the &lt;code&gt;__init__&lt;/code&gt; method.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class LSTM_net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):

        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)

        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)

        self.fc1 = nn.Linear(hidden_dim * 2, hidden_dim)

        self.fc2 = nn.Linear(hidden_dim, 1)

        self.dropout = nn.Dropout(dropout)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you take a look at the parameters that the &lt;code&gt;__init__&lt;/code&gt; method takes, you'll notice that they are the hyperparameters we've already set, and that they're being used to construct the LSTM. The method starts with a classic trait of inheritance in Python: calling &lt;code&gt;super().__init__&lt;/code&gt; to run the init method of the &lt;code&gt;nn.Module&lt;/code&gt; class, whose documentation is worth a look. Next, the embedding for the LSTM is constructed using the vocab size, embedding dimensions, and padding index. This embedding is just a simple lookup table that stores embeddings of a fixed dictionary and size, and it will hold the GloVe word embeddings. &lt;/p&gt;

&lt;p&gt;You'll also notice that an &lt;code&gt;nn.LSTM&lt;/code&gt; layer, stored in &lt;code&gt;self.rnn&lt;/code&gt;, forms the core of the network, built with some of the hyperparameters that have already been defined. You may be confused by the two variables called &lt;code&gt;self.fc1&lt;/code&gt; and &lt;code&gt;self.fc2&lt;/code&gt;, but don't fear: these are the network's two fully connected (linear) layers, with the first one larger than the second. FC is shorthand for fully connected layer. The last variable that is initialized is the dropout of the network, which was discussed earlier. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now it is time to move on to the second of the two methods that this class will have: &lt;code&gt;forward()&lt;/code&gt;, which encodes the forward pass of the network.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def forward(self, text, text_lengths):  
        embedded = self.embedding(text)

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        output = self.fc1(hidden)
        output = self.dropout(self.fc2(output)) 
        #hidden = [batch size, hid dim * num directions] 
        return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The forward pass embeds the text, packs the padded sequences, and runs them through the LSTM; it then concatenates the final forward and backward hidden states, applies dropout, and passes the result through the two fully connected layers (with dropout once more) to produce the method's final output. If you would like to know more about these functions, take a look at the PyTorch documentation.&lt;/p&gt;
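If the hidden-state indexing is confusing, it helps to trace the tensor shapes rather than the values. Here is a rough pure-Python sketch of the shapes flowing through the bidirectional forward pass (no PyTorch needed; the dimension names are assumptions matching the hyperparameters used in this tutorial):

```python
def forward_shapes(seq_len, batch_size, embedding_dim, hidden_dim, n_layers):
    """Trace the tensor shapes through the bidirectional forward pass above."""
    num_directions = 2                                            # bidirectional LSTM
    embedded = (seq_len, batch_size, embedding_dim)               # self.embedding(text)
    hidden = (n_layers * num_directions, batch_size, hidden_dim)  # LSTM hidden stack
    # hidden[-2] and hidden[-1] are the final forward/backward states;
    # torch.cat along dim=1 joins them on the feature axis:
    concat = (batch_size, hidden_dim * num_directions)
    after_fc1 = (batch_size, hidden_dim)                          # hidden_dim * 2 -> hidden_dim
    after_fc2 = (batch_size, 1)                                   # one logit per example
    return embedded, hidden, concat, after_fc1, after_fc2

for shape in forward_shapes(seq_len=50, batch_size=64, embedding_dim=100,
                            hidden_dim=256, n_layers=2):
    print(shape)
```

This also shows why `fc1` was declared with input size `hidden_dim * 2`: the concatenation doubles the feature dimension.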

&lt;ol&gt;
&lt;li&gt;Ok, now that the LSTM class has been created, you'll need to make an instance of the class.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#creating instance of our LSTM_net class

model = LSTM_net(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;It is time to store the embeddings in a variable, conveniently labeled &lt;code&gt;pretrained_embeddings&lt;/code&gt;, and then copy them into the model's embedding layer.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The padding token's embedding should be all zeros, which you can set with the following line of code.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
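To see why zeros are the right choice, here is a toy sketch with invented values: whatever values happen to sit in the pad row would otherwise leak into every padded position of a batch.

```python
EMBEDDING_DIM = 3
PAD_IDX = 2

# Toy embedding table; row PAD_IDX currently holds arbitrary values that
# would contaminate every padded position if left alone.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [9.9, 9.9, 9.9],   # the <pad> row before zeroing
]

# The plain-Python equivalent of:
#   model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
weights[PAD_IDX] = [0.0] * EMBEDDING_DIM

print(weights[PAD_IDX])   # [0.0, 0.0, 0.0]
```

After this, a `<pad>` token always embeds to the zero vector and contributes nothing to the sequence.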



&lt;ol&gt;
&lt;li&gt;To make sure the model trains on the GPU, use the following line of code.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.to(device)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;All neural networks need a loss function and optimizer! Add them with the following block of code.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Loss and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), learning_rate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
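`BCEWithLogitsLoss` is binary cross-entropy applied to raw logits, with the sigmoid folded in. A pure-Python version of the per-example loss (the real PyTorch implementation uses a more numerically stable formula, but the value is the same for moderate logits):

```python
import math

def bce_with_logits(x, y):
    """Per-example binary cross-entropy on a raw logit x against label y (0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-x))   # sigmoid squashes the logit to (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A logit of 0 means "50/50", so the loss is ln(2) whatever the label is.
print(round(bce_with_logits(0.0, 1), 4))   # 0.6931
# Confidently correct gives a small loss; confidently wrong gives a large one.
print(bce_with_logits(4.0, 1) < bce_with_logits(4.0, 0))   # True
```

This is why the model's final layer outputs a single raw number per example: the loss function supplies the sigmoid itself.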



&lt;ol&gt;
&lt;li&gt;Next, you'll be writing a function that will calculate the accuracy of your model's predictions with some basic logic. Pay attention to the comments to understand what's happening!
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
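The same logic in plain Python, with the sigmoid written out, so you can see exactly what the tensor version computes:

```python
import math

def binary_accuracy_py(preds, labels):
    """Plain-Python equivalent of binary_accuracy: sigmoid, round, compare."""
    rounded = [round(1.0 / (1.0 + math.exp(-p))) for p in preds]
    correct = sum(1 for r, y in zip(rounded, labels) if r == y)
    return correct / len(labels)

# Logits 2.0 and -1.0 round to 1 and 0 (both correct); 0.5 rounds to 1
# against a label of 0 (wrong); -0.2 rounds to 0 (correct): 3/4 right.
print(binary_accuracy_py([2.0, -1.0, 0.5, -0.2], [1, 0, 0, 0]))   # 0.75
```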



&lt;h3&gt;
  
  
  Training the LSTM
&lt;/h3&gt;

&lt;p&gt;Whew! You've worked through a lot so far and you're almost at the end of the road! It is time to train and test the model! &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You'll need to write a function that you'll use to train the model.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def train(model, iterator):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        text, text_lengths = batch.text
        optimizer.zero_grad()
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function keeps track of both the accuracy and loss for each epoch of training. For every batch it runs a forward pass, backpropagates, steps the optimizer, and measures the accuracy with the &lt;code&gt;binary_accuracy&lt;/code&gt; function that you wrote earlier. It returns the average loss and accuracy for the epoch when it is done.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; You'll also need a function that you'll use to evaluate the model's performance.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def evaluate(model, iterator):

    epoch_loss = 0
    epoch_acc = 0
    model.eval()

    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text

            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function loops through the evaluation data with gradients disabled, feeds it to the model, and measures the predictions against the actual labels. It then returns the average loss and accuracy for the epoch.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This will be the last block of code you write; it is what actually trains the model. It is fairly plain Python and requires one import, the &lt;code&gt;time&lt;/code&gt; library, which is already included with Colab so you don't have to install anything.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

t = time.time()
loss=[]
acc=[]
val_acc=[]
val_losses=[]

for epoch in range(num_epochs):
    train_loss, train_acc = train(model, train_iterator)
    val_loss, valid_acc = evaluate(model, valid_iterator)
    print("Epoch " + str(epoch) + " :")
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\tVal Loss: {val_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
    print('\n')
    loss.append(train_loss)
    acc.append(train_acc)
    val_acc.append(valid_acc)
    val_losses.append(val_loss)
print(f'time:{time.time()-t:.3f}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code block above keeps track of the loss and accuracy for each epoch and stores them in lists that you can use to graph the model's performance over epochs. It also keeps track of how long training takes. With the current hyperparameters, you'll end up with a validation accuracy of around 70% and a training time of roughly 30 minutes. By adjusting the hyperparameters you can boost the performance of the model, but that may come at the cost of a longer training time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You've learned a lot in this article, mainly how to perform binary text classification on a dataset with PyTorch. The skills you learned here are transferable to any other textual dataset where you want to classify two labels, but the amount of work required will vary. Some datasets come pre-cleaned, in which case you just have to build a model; others are rough, and you'll have to do a lot of textual preprocessing before you even think about making a model. Preprocessing is usually the most time-consuming part of developing a model, besides the training itself. You are now armed with incredibly valuable knowledge, and I advise you to go out, find a dataset, and practice the skills you just learned.&lt;/p&gt;

&lt;p&gt;If you gained value from this post consider following me on &lt;a href="https://twitter.com/sidcodes"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.siddcodes.com/"&gt;subscribing to my email newsletter&lt;/a&gt;. Every Sunday, I send out a newsletter that contains the best programming and learning-related content I’ve seen in the past week along with my own thoughts on the events of the week. The main goal of the newsletter is to bring meaningful and thought-provoking ideas to your inbox every Sunday. Consider signing up if you’re interested.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>#100DaysOfNLP Day 5: Gendered Dialogue in Movies</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Tue, 16 Jun 2020 21:13:27 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/100daysofnlp-day-5-gendered-dialogue-in-movies-1037</link>
      <guid>https://forem.com/siddhantdubey/100daysofnlp-day-5-gendered-dialogue-in-movies-1037</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@andrewtneel?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Andrew Neel&lt;/a&gt; on &lt;a href="https://dev.to/s/photos/research?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: This post was written for work done on the 15th of June, 2020. If you want to catch up with the first four days, which you definitely do not need to do, the articles are available on my &lt;a href="https://www.siddcodes.com/"&gt;blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It has been a while since I last wrote one of these blog posts but that was because I really hadn't learned or done anything of value. However, I have now begun work on an incredibly interesting project that I'll now be writing about every day until it reaches completion. I'm currently being mentored by &lt;a href="https://www.ssriva.com/"&gt;Dr. Srivastava&lt;/a&gt; and &lt;a href="https://cs.unc.edu/people/snigdha-chaturvedi/"&gt;Dr. Chaturvedi&lt;/a&gt; at UNC Chapel Hill and working with one of my classmates, Bhargav Vaduri, on the project. Without further ado, let me get into covering what the project is and what work we've done on it so far. If you want to see the code for the project, &lt;a href="https://github.com/siddhantdubey/GenderedMovieDialogue"&gt;here is our github repository&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research Problem:
&lt;/h2&gt;

&lt;p&gt;What we're attempting to do with the project is analyze lines from movies and then identify the gender of the character who said the line. The first big goal we want to hit with this project is to be able to build a classifier that can identify the gender of the speaker of a line to a fairly high degree of accuracy. Hopefully, we'll be able to hit that goal soon and then move on to trying to hit other goals using the same dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dataset:
&lt;/h2&gt;

&lt;p&gt;Speaking of our dataset, we'll be using the &lt;a href="https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html"&gt;Cornell Movie-Dialogs Corpus&lt;/a&gt;, a massive textual dataset that contains 304,713 lines of dialogue from 617 movies. Since we're focusing on the gender of the characters who said each utterance (line of dialogue), we are only using two of the files in the dataset: movie_lines.txt and movie_characters_metadata.txt. This is only for the time being; as we make progress with the project we will most likely use other data from the dataset to perform other analyses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Events of the Day:
&lt;/h2&gt;

&lt;p&gt;Most of today was spent working on preprocessing the data, formatting it to our needs, and reading the &lt;a href="https://www.cs.cornell.edu/~cristian/papers/chameleons.pdf"&gt;original research paper&lt;/a&gt; that accompanied the corpus: &lt;a href="https://www.cs.cornell.edu/~cristian/papers/chameleons.pdf"&gt;Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs&lt;/a&gt;. I've already mentioned that we wanted to format our data in a specific way that would make it easier to work with. The main thing we're doing here is adding each character's gender and position in the credits to the information present about each line of dialogue in movie_lines.txt. We then put this combined information into a new text file that we are calling collated_data.txt. This file is 304,713 lines long with 7 tokens on each line: the line number in the script, the movie id, the character id, the character's gender, their position in the credits, the character's name, and the text of the utterance. This was a simple task in Python and we got it done in a day. We also did some preliminary analysis of the data. I attempted POS tagging of the textual data today, but formatting it the way I wanted got a little tedious for how tired I was by the end of the day, so I decided to finish it on Tuesday morning. Our results for the day are in the next section; it is pretty sparse since we only really did preprocessing, but Tuesday's results section will be significantly bigger.&lt;/p&gt;
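The collation step above can be sketched in a few lines of Python. The " +++$+++ " string is the field separator the corpus uses; the exact field orders shown below are assumptions based on the corpus README, so check them against your copy of the files:

```python
SEP = " +++$+++ "   # the corpus's field separator

# movie_characters_metadata.txt rows (assumed order):
# characterID, name, movieID, movie title, gender, position in credits
characters_raw = [
    "u0" + SEP + "BIANCA" + SEP + "m0" + SEP + "10 things i hate about you"
    + SEP + "f" + SEP + "4",
]
# movie_lines.txt rows (assumed order): lineID, characterID, movieID, name, utterance
lines_raw = [
    "L1045" + SEP + "u0" + SEP + "m0" + SEP + "BIANCA" + SEP + "They do not!",
]

# Index gender and credit position by character id.
char_info = {}
for row in characters_raw:
    cid, name, movie_id, title, gender, credits = row.split(SEP)
    char_info[cid] = (gender, credits)

# Join each line of dialogue with its speaker's metadata: 7 tokens per line.
collated = []
for row in lines_raw:
    line_id, cid, movie_id, name, text = row.split(SEP)
    gender, credits = char_info[cid]
    collated.append(SEP.join([line_id, movie_id, cid, gender, credits, name, text]))

print(collated[0])
```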




&lt;h2&gt;
  
  
  Results of the Day:
&lt;/h2&gt;

&lt;p&gt;Our preliminary analysis focused on how the data breaks down by gender, and this is what we found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of male characters is 2,049.&lt;/li&gt;
&lt;li&gt;The number of lines spoken by male characters is 170,168.&lt;/li&gt;
&lt;li&gt;The number of female characters is 966.&lt;/li&gt;
&lt;li&gt;The number of lines spoken by female characters is 71,255.&lt;/li&gt;
&lt;li&gt;The number of characters of unknown gender is 6,020.&lt;/li&gt;
&lt;li&gt;The number of lines spoken by characters of unknown gender is 62,690.&lt;/li&gt;
&lt;/ul&gt;
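A quick back-of-the-envelope pass over those counts makes the imbalance concrete: male characters both outnumber female characters and average more lines each.

```python
male_chars, male_lines = 2049, 170168
female_chars, female_lines = 966, 71255

print(round(male_lines / male_chars, 1))       # lines per male character: 83.0
print(round(female_lines / female_chars, 1))   # lines per female character: 73.8
# Share of gender-identified dialogue spoken by male characters:
print(round(male_lines / (male_lines + female_lines), 2))   # 0.7
```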

&lt;p&gt;Like I mentioned at the beginning of today's post, the code for the project is being hosted on &lt;a href="https://github.com/siddhantdubey/GenderedMovieDialogue"&gt;our github repository.&lt;/a&gt; So if you want to check that out, be my guest! Until then, keep coding!&lt;/p&gt;

&lt;p&gt;As always, if you want to keep up to date with my work, consider subscribing to my &lt;a href="https://mailchi.mp/35c069691d2c/newsletter-signup"&gt;newsletter here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>100daysofcode</category>
      <category>datascience</category>
    </item>
    <item>
      <title>5 Great Productivity Tools for Programmers</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Sun, 14 Jun 2020 15:25:53 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/5-great-productivity-tools-for-programmers-3oj7</link>
      <guid>https://forem.com/siddhantdubey/5-great-productivity-tools-for-programmers-3oj7</guid>
      <description>

&lt;p&gt;As a programmer, you probably have a love-hate relationship with your craft. There are times that you love programming and there are other times that you just don’t want to write &lt;code&gt;print("Hello World")&lt;/code&gt;. However, you probably always want to get your work done and make it as enjoyable as possible, and using some productivity apps can have that exact effect. I’ve been using all of the tools I list here for at least a month now and I love each and every single one of them. They’ve made working a lot more fun for me and have definitely allowed me to finish my work quicker. So without further ado, let me show you five great productivity tools that will make your life better.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://kite.com/"&gt;Kite&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NOlVxB6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2A_a0xjNK0D6EE9sF2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NOlVxB6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2A_a0xjNK0D6EE9sF2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kite.com/"&gt;Kite&lt;/a&gt; is an AI-powered autocomplete plugin for Python and Javascript that works for all your favorite text editors. You’ve probably seen sponsored segments in youtube videos about the plugin, but its value speaks for itself. While your standard autocomplete can only complete the word or the line, Kite allows you to complete multiple lines at a time so that you spend less time writing out repetitive boilerplate like code. It also gives you in-depth documentation for Python and Javascript within your code editor so you don’t have to leave to make a quick google search. Best of all, it’s free! If you’re a Python or Javascript developer, I implore you to get this extension for your code editor and I promise you’ll see your productivity increase.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://www.notion.so/login"&gt;Notion&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h6sx6VPt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2ARvUxzduZV4Da23Vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h6sx6VPt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2ARvUxzduZV4Da23Vv.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Picture from Notion’s website.&lt;/p&gt;

&lt;p&gt;If you’ve been on either productivity Twitter or productivity YouTube within the last couple of months, you’ve probably heard about this tool. For the uninitiated, Notion is an all-in-one project management tool for teams and individuals. It allows you to take notes, build wikis and publish them as webpages, mark out your calendar, and track the progress of your projects with a kanban board. It also lets you create dynamic databases that you can use to track your work and notes. It is an incredibly powerful tool that will help you stay on top of your work and make note-taking fun again. It has a free tier that is more than enough for the average user, and you can also upgrade for free if you have a student email. If you want to learn more about using Notion, I suggest Ali Abdaal’s YouTube channel: &lt;a href="https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg"&gt;https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://obsidian.md/"&gt;Obsidian&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uVDZ4GLO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AxFgreKP_VLxdR2lbxt4z4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uVDZ4GLO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AxFgreKP_VLxdR2lbxt4z4g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A graph of my notes from the last week in Obsidian.&lt;/p&gt;

&lt;p&gt;Continuing on the subject of note-taking, Obsidian has changed the way I take notes and write articles. I have only been using it for a week, but I used a very similar tool called Roam Research for much longer and switched when they started charging 15 dollars a month. Obsidian markets itself as your second brain, and honestly? It might as well be. Just like your brain makes connections between things you learn, you can make links between your notes in Obsidian and then visualize them using its graph view. This follows the Zettelkasten method of note-taking, which focuses on connecting all your notes to make it easier to come up with ideas, and on making the archival of your notes easy so you always have access to them. It’s an incredibly rewarding note-taking process that really helps you generate ideas and keep your notes in order. All of your notes in Obsidian are Markdown files, so you get all the robustness of Markdown editing plus some additional features that Obsidian adds. The best part is, the app is totally free. It is currently only available for PC and Mac.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://justgetflux.com/"&gt;F. lux&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3efry6nK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4CIs5eCl1xDgonVEglWqVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3efry6nK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4CIs5eCl1xDgonVEglWqVA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Blue light strains your eyes and makes looking at your screen for long periods of time hard to do, which is rough if you’re a programmer, because you usually have to look at your screen for extended periods every day. f.lux makes your screen warmer so there isn’t as much blue light coming from it, making it easier on your eyes so that you can get your work done in comfort. It is free to install, so go nuts using it!&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;a href="https://clockify.me/developer-time-tracking"&gt;Clockify&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SPAcKqg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2AEcvCtSK5eK-kSxrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SPAcKqg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2AEcvCtSK5eK-kSxrh.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Screenshot from Clockify’s website.&lt;/p&gt;

&lt;p&gt;Clockify is a time tracker that was created for developers. It lets you track how much of your time you spend doing focused work, which helps keep you on track. If you’re a freelance developer, you can use this to make sure you’re being paid properly for the hours you are putting in. Just like a normal time tracker, using Clockify lets you get an overview of your work habits so that you know if you’re spending your time the way you want to spend it.&lt;/p&gt;




&lt;p&gt;Those are my top five productivity tools for programmers, and they’re all free! So go ahead: install them, have fun, and keep coding!&lt;/p&gt;

&lt;p&gt;Follow me on &lt;a href="https://twitter.com/sid_dubey0312"&gt;twitter&lt;/a&gt;.&lt;br&gt;
Sign up for my &lt;a href="https://mailchi.mp/35c069691d2c/newsletter-signup"&gt;newsletter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>vscode</category>
    </item>
    <item>
      <title>How to Make a Cross-platform Image Classifying App with Flutter and Fastai</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Mon, 08 Jul 2019 21:05:00 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/how-to-make-a-cross-platform-image-classifying-app-with-flutter-and-fastai-5fi7</link>
      <guid>https://forem.com/siddhantdubey/how-to-make-a-cross-platform-image-classifying-app-with-flutter-and-fastai-5fi7</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/better-programming/how-to-make-a-cross-platform-image-classifying-app-with-flutter-and-fastai-2a6af6701535?source=rss-65edcdb268ba------2"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M-NlIzP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1249/1%2AsNNHtFJLI41fzaiG_InclA.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I’ll be explaining how to use an API to build a cross-platform mobile app that uses a neural network to classify images…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/better-programming/how-to-make-a-cross-platform-image-classifying-app-with-flutter-and-fastai-2a6af6701535?source=rss-65edcdb268ba------2"&gt;Continue reading on Better Programming »&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>android</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Used a Convolutional Neural Network to Classify Cricket Shots</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Sun, 30 Jun 2019 23:29:19 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/how-i-used-a-convolutional-neural-network-to-classify-cricket-shots-2il3</link>
      <guid>https://forem.com/siddhantdubey/how-i-used-a-convolutional-neural-network-to-classify-cricket-shots-2il3</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/better-programming/how-i-used-a-convolutional-neural-network-to-classify-cricket-shots-d44197e79aff?source=rss-65edcdb268ba------2"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_8B5hrCI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2600/0%2AKGV_CLuZOVfyj5XB" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have recently delved into the world of deep learning, more specifically, image classification. After completing the first lecture in the…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/better-programming/how-i-used-a-convolutional-neural-network-to-classify-cricket-shots-d44197e79aff?source=rss-65edcdb268ba------2"&gt;Continue reading on Better Programming »&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>technology</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Tips and Tricks for Programming Beginners</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Sat, 29 Jun 2019 19:33:22 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/tips-and-tricks-for-programming-beginners-840</link>
      <guid>https://forem.com/siddhantdubey/tips-and-tricks-for-programming-beginners-840</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/@sid12.dubey/tips-and-tricks-for-programming-beginners-f22e0ce24516?source=rss-65edcdb268ba------2"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h7keYvXd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2600/0%2ArXXypb_bGBTqtUrM" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How not to lose your mind while coding&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@sid12.dubey/tips-and-tricks-for-programming-beginners-f22e0ce24516?source=rss-65edcdb268ba------2"&gt;Continue reading on Medium »&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tipsandtricks</category>
      <category>programming</category>
      <category>tips</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>My Competitive Programming Journey: Week 1</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Tue, 25 Jun 2019 20:53:10 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/my-competitive-programming-journey-week-1-19b9</link>
      <guid>https://forem.com/siddhantdubey/my-competitive-programming-journey-week-1-19b9</guid>
      <description>&lt;p&gt;I love programming. I love competition. I always have, I’ve always embraced the joy of learning that is enhanced by the feeling of competition. So when I heard about competitive programming, I was over the moon! This will detail my journey and progress through the world of Competitive Programming. Jump on the coding train with me, it's going to be a fun ride!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/6f0c111d5d11ef070594e20d4c348a09/href" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/6f0c111d5d11ef070594e20d4c348a09/href" rel="noopener noreferrer"&gt;https://medium.com/media/6f0c111d5d11ef070594e20d4c348a09/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was about a year ago, but I really didn’t take it too seriously until this summer. I had participated in some online contests and done fairly well. I even made the Gold Division in the USACO (United States of America Computing Olympiad), which is the second highest division. Throughout this 4–5 month period, I just hoped that I would progress through the ranks without doing any work at all; obviously, that isn’t the greatest idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOidyN31qj865iWUO" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOidyN31qj865iWUO"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@clemensvanlay?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Clemens van Lay&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Competitive Programming?
&lt;/h3&gt;

&lt;p&gt;Competitive Programming is a mind sport. It consists of individuals or teams coming up with solutions to algorithmic problems within a certain time limit. Some of the more famous programming contests are the IOI (International Olympiad in Informatics) and the ACM-ICPC (Association for Computing Machinery International Collegiate Programming Contest).&lt;/p&gt;

&lt;h3&gt;
  
  
  What resources am I going to use?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Textbook:&lt;/em&gt; &lt;a href="https://www.csc.kth.se/~jsannemo/slask/main.pdf" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Principles of Algorithmic Problem Solving by Johan Sannemo&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Text Editor:&lt;/em&gt; &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Visual Studio Code&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Problem Sites&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://codeforces.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Codeforces&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://open.kattis.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Kattis&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cses.fi/problemset/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;CSES&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  What did I do this week?
&lt;/h3&gt;

&lt;p&gt;Alright, here’s what you clicked on the article for. Feel free to follow along at your own pace.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I went through chapters 1 and 2 of the textbook and absorbed as much as I could about C++.&lt;/li&gt;
&lt;li&gt;I learned how most competitive programming problems are formatted and how to cut through to the heart of a problem.&lt;/li&gt;
&lt;li&gt;I did 50-odd problems from the problem sites above.&lt;/li&gt;
&lt;li&gt;I did all the exercises in chapter 2 of the textbook.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  What Did I Learn?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The format of a C++ program&lt;/li&gt;
&lt;li&gt;How to implement basic programs in C++&lt;/li&gt;
&lt;li&gt;The format of Codeforces contests&lt;/li&gt;
&lt;/ul&gt;
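
To make "the format of a program" concrete, here is roughly the shape that most judge-style problems reduce to: read everything from stdin, compute, print the answer. I'm sketching it in Python for brevity, and the task inside `solve` (find the maximum of n integers) is a made-up example, not a specific problem from the book; the same read-solve-print structure carries over directly to the C++ the textbook teaches.

```python
import sys

def solve(tokens):
    # A made-up chapter 2 style task: the first token is n, then
    # n integers follow; the answer is their maximum.
    n = int(tokens[0])
    nums = [int(t) for t in tokens[1:n + 1]]
    return max(nums)

def main():
    # Online judges feed the whole input on stdin and compare
    # stdout character for character, so read it all at once.
    tokens = sys.stdin.read().split()
    print(solve(tokens))
```

To submit something like this you would call `main()` and pipe the input file into the script; keeping `solve` separate makes it easy to test locally against the sample cases.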

&lt;p&gt;I didn’t learn a whole lot this week, since it was mostly a refresher on things I already knew. Next week, however, looks to be a completely different matter.&lt;/p&gt;

&lt;h3&gt;
  Goals for next week
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Finish at least chapter 3 in the textbook.&lt;/li&gt;
&lt;li&gt;Do one Codeforces contest.&lt;/li&gt;
&lt;li&gt;Do 30 CSES problems.&lt;/li&gt;
&lt;li&gt;Do all the exercises in chapter 3 on Kattis.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>computerscience</category>
      <category>competitiveprogramm</category>
      <category>programming</category>
      <category>competition</category>
    </item>
    <item>
      <title>What I Learned from Trying to Make a Lie Detector Using a Neural network</title>
      <dc:creator>Siddhant Dubey</dc:creator>
      <pubDate>Mon, 24 Jun 2019 16:58:58 +0000</pubDate>
      <link>https://forem.com/siddhantdubey/what-i-learned-from-trying-to-make-a-lie-detector-using-a-neural-network-3mjh</link>
      <guid>https://forem.com/siddhantdubey/what-i-learned-from-trying-to-make-a-lie-detector-using-a-neural-network-3mjh</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/@sid12.dubey/what-i-learned-from-trying-to-make-a-lie-detector-using-a-neural-network-b310eb94ef0?source=rss-65edcdb268ba------2" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AAFMzt73_SwlueJFf2cOU3g.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over this weekend I tried to build a lie detector that would take the spectrogram of some audio and then decide whether it was a lie or not.&lt;br&gt;
Going into this experiment, I was quite convinced that there was no way this would actually work. So I did the usual: I collected my data, cleaned it, and made a training and validation set. I will admit that my method of data gathering, recording my voice saying different truths and lies, was not the most scientific, but for a home experiment it worked fine. At this point, all I was focused on was whether or not it would work. This led me to forget the most important question: what happens if it does work?&lt;/p&gt;
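
Roughly, the preprocessing looked like this. This is a simplified sketch, not my exact code: the function name, window size, and hop size here are illustrative, and the sine wave just stands in for a voice clip.

```python
import numpy as np

def magnitude_spectrogram(signal, win=256, hop=128):
    # Split the waveform into overlapping windowed frames and take
    # the magnitude of each frame's FFT. Rows are time steps,
    # columns are frequency bins.
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hanning(win)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A fake one-second "recording" stands in for one of my voice clips.
wave = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 8000))
spec = magnitude_spectrogram(wave)
print(spec.shape)  # one row per frame, win // 2 + 1 frequency bins
```

Each clip became one of these 2D arrays, which is what the network actually trained on, treated like an image.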

&lt;p&gt;Clearly, a lie detector isn't as big of a problem as people bringing dinosaurs back to life, right? Time to answer that question, but first we have to look at the results of the experiment. I trained the network, and the results shocked me: it had an error rate of 0!&lt;/p&gt;

&lt;p&gt;Of course, the error rate was bound to be very low: my dataset was small, and all of the files came from the same source. So I dismissed this as a case of overfitting.&lt;/p&gt;

&lt;p&gt;Unknowingly, my mindset had shifted from wanting this network to succeed to wanting it to fail. Why? Probably because I had realized that building a working lie detector was definitely not an ethical thing to do.&lt;/p&gt;

&lt;p&gt;Now comes the really interesting part. I fed it new audio files of myself telling different lies and truths, and it identified each file correctly every single time. If this were a normal neural network, I would have been absolutely elated. This time, however, I felt an intense apprehension.&lt;/p&gt;

&lt;p&gt;I decided to test it on other people's voices and the results were just a tiny bit better than a human guessing whether something was a lie or not. I felt a lot of relief, but why?&lt;/p&gt;

&lt;p&gt;You might be asking, why is a lie-detecting neural network a problem? I mean, polygraphs exist and those are fine.&lt;/p&gt;

&lt;p&gt;Yes, but polygraphs can't become web apps with always-listening modes on. Polygraphs can't become Alexa skills to infiltrate the homes of people across the world. Polygraphs can't take in information continuously while far, far away from the subject.&lt;/p&gt;

&lt;p&gt;Making neural networks do interesting tasks has become incredibly easy, but with that ease comes a lack of thought about what the network actually does. We rush to build it because of how cool it is, but we forget to ponder the ethics of what we are doing.&lt;/p&gt;

&lt;p&gt;Now, this isn't a Skynet level problem, but just like all tech, neural networks can be used by elements of society that don't exactly have our best interests at heart. That's why policing AI and Machine Learning becomes so important. It would be incredibly easy for someone to wreak havoc with a seemingly harmless app if there isn't a proper way to police neural networks for harmful intent.&lt;/p&gt;

&lt;p&gt;Of course, although I overfit this version of a lie detector, other versions have been built by researchers before, and more will follow. It isn't a question of whether we can do it, because we definitely can; it is a question of how to do it ethically.&lt;/p&gt;

&lt;p&gt;After all, humanity's moral compass is one of its finest traits.&lt;/p&gt;

</description>
      <category>philosophy</category>
      <category>machinelearning</category>
      <category>ethics</category>
      <category>dsintherealworld</category>
    </item>
  </channel>
</rss>
