<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Eugenia </title>
    <description>The latest articles on Forem by Eugenia  (@ugis22).</description>
    <link>https://forem.com/ugis22</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F166594%2F4687a0c2-af4c-4ac1-ae80-d405b456016d.jpeg</url>
      <title>Forem: Eugenia </title>
      <link>https://forem.com/ugis22</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ugis22"/>
    <language>en</language>
    <item>
      <title>Using machine learning to understand customers behavior</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Wed, 29 May 2019 12:01:00 +0000</pubDate>
      <link>https://forem.com/ugis22/using-machine-learning-to-understand-customers-behavior-47po</link>
      <guid>https://forem.com/ugis22/using-machine-learning-to-understand-customers-behavior-47po</guid>
      <description>&lt;p&gt;&lt;a href="https://towardsdatascience.com/using-machine-learning-to-understand-customers-behavior-f41b567d3a50?source=rss-5515433d5913------2"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wuDrLiVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2500/1%2AROmgayIXEU8QKwtqmSTaFg.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not all clients are alike&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/using-machine-learning-to-understand-customers-behavior-f41b567d3a50?source=rss-5515433d5913------2"&gt;Continue reading on Towards Data Science »&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datavisualization</category>
      <category>machinelearning</category>
      <category>tech</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Discovering the essential tools for Named Entities Recognition</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Sat, 11 May 2019 20:06:01 +0000</pubDate>
      <link>https://forem.com/ugis22/discovering-the-essential-tools-for-named-entities-recognition-50ne</link>
      <guid>https://forem.com/ugis22/discovering-the-essential-tools-for-named-entities-recognition-50ne</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*gM1g-IMKeR3yN3KyYcn4wA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*gM1g-IMKeR3yN3KyYcn4wA.jpeg" alt=""&gt;&lt;/a&gt;Image source: &lt;a href="https://unsplash.com/photos/JChRnikx0tM"&gt;&lt;strong&gt;Unsplash&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  It’s all about the names!
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;“The letter is E…. Start!…”&lt;/em&gt; — One of my brothers said.&lt;/p&gt;

&lt;p&gt;We began to crazily write down words that start with E in each category. Everyone wanted to win as many points as possible in that afternoon game.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Stop!!!!!”&lt;/em&gt; — My sister announced suddenly — &lt;em&gt;“I’m already done with all the categories”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We stared at each other with disbelief.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Ok. Let’s start checking!” &lt;/em&gt;— I said.&lt;/p&gt;

&lt;p&gt;One by one, we started to enumerate the words we had entered under each category: fruits, places, names…&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“What about a color name?”&lt;/em&gt; — My other brother asked at one point.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Emerald…”&lt;/em&gt; — my sister said proudly.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;Noo! Noo! That’s not a color name!!!&lt;/em&gt;” — My two brothers complained loudly at the same time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Of course it is! If no one put a color name, I get double points!”&lt;/em&gt; — She replied happily.&lt;/p&gt;

&lt;p&gt;Every time we played &lt;a href="https://en.wikipedia.org/wiki/Scattergories"&gt;Scattergories&lt;/a&gt;, or &lt;em&gt;Tutti Frutti&lt;/em&gt; as it is commonly called in Argentina, we had the same discussion. The rules about color names changed every time.&lt;/p&gt;

&lt;p&gt;It all depended on the players of the day. Something that happens to a lot of people playing that game, as we learned later.&lt;/p&gt;

&lt;p&gt;It is easy for us to put words under categories. We can tell which word is a noun, an adjective or a verb in a text. We can point out the name of a person, an organization or a country.&lt;/p&gt;

&lt;p&gt;This is not an easy task for a machine. However, we have come a long way.&lt;/p&gt;

&lt;p&gt;Now, we don’t need to spend long hours reading long texts anymore. We can use machine learning algorithms to extract useful information from them.&lt;/p&gt;

&lt;p&gt;I remember the first time I read about &lt;em&gt;Natural Language Processing (NLP)&lt;/em&gt;. It was hard for me to picture how an algorithm could recognize words, identify their meaning, or determine which type of words they were.&lt;/p&gt;

&lt;p&gt;I was even confused about how to start. There was so much information. After going around in circles, I started by asking myself what exactly I had to do with NLP.&lt;/p&gt;

&lt;p&gt;I had several corpora of text coming from different websites I had scraped. My main goal was to extract and classify the names of persons, organizations, and locations, among others.&lt;/p&gt;

&lt;p&gt;What I had between my hands was a &lt;em&gt;Named Entity Recognition (NER)&lt;/em&gt; task.&lt;/p&gt;

&lt;p&gt;These names, known as entities, are often represented by proper names. They share common semantic properties. And they usually appear in similar contexts.&lt;/p&gt;

&lt;p&gt;Why did I want to extract these entities? Well, there are many reasons for doing it.&lt;/p&gt;

&lt;p&gt;We use words to communicate with each other. We tell stories. We state our thoughts. We communicate our feelings. We claim what we like, dislike or need.&lt;/p&gt;

&lt;p&gt;Nowadays, we mostly deliver things through written text. We tweet. We write in a blog. The news appears on a website. So written words are a powerful tool.&lt;/p&gt;

&lt;p&gt;Let’s imagine that we have the superpower of knowing which people, organizations, companies, brands or locations are mentioned in every news article, tweet or post on the web.&lt;/p&gt;

&lt;p&gt;We could detect and assign relevant tags to each article or post. This would help us distribute them into defined categories. We could match them specifically with the people interested in reading about that type of entity. So we would act as classifiers.&lt;/p&gt;

&lt;p&gt;We could also do the reverse process. Anyone could ask us a specific question. Using the keywords, we would also be capable of recommending articles, websites or posts very efficiently. Sound familiar?&lt;/p&gt;

&lt;p&gt;We can go even further and recommend products or brands. If someone complains about a particular brand or product, we can assign it to the most suitable department in a matter of seconds. So we could provide excellent customer support.&lt;/p&gt;

&lt;p&gt;As you can guess, NER is a very useful tool. However, everything comes at a price.&lt;/p&gt;

&lt;p&gt;Before an algorithm can recognize entities in a text, it should be able to classify words as verbs, nouns or adjectives. Together these are referred to as &lt;em&gt;parts of speech&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The task of labeling them is called &lt;em&gt;part-of-speech tagging&lt;/em&gt;, or &lt;strong&gt;POS-tagging&lt;/strong&gt;. The method labels a word based on its context and definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*zUTAx-w5Gi9ROFq30vfeyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*zUTAx-w5Gi9ROFq30vfeyw.png" alt=""&gt;&lt;/a&gt;Example of POS-tagging a text.&lt;/p&gt;

&lt;p&gt;Take as an example the sentences: “The new technologies impact all the world” and “In order to reduce global warming impact, we should do something now”. “&lt;em&gt;Impact&lt;/em&gt;” plays a different role in each sentence: a verb in the first and a noun in the second.&lt;/p&gt;

&lt;p&gt;There are several supervised learning algorithms that can be picked for this assignment:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Lexical Based Methods.&lt;/em&gt; They assign the POS tag that most frequently co-occurs with a word in the training set.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Probabilistic Methods.&lt;/em&gt; They consider the probability of occurrence of a specific tag sequence and assign tags based on that.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule-Based Methods.&lt;/em&gt; They create rules to represent the structure of the sequence of words appearing in the training set. The POS is assigned based on these rules.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Deep Learning Methods&lt;/em&gt;. In this case, recurrent neural networks are trained and used for assigning the tags.&lt;/p&gt;

&lt;p&gt;Training a NER algorithm demands suitable, annotated data. This implies that you need training data that matches the type of data you want to analyze.&lt;/p&gt;

&lt;p&gt;The data should also be provided with annotations. This means that the named entities should be identified and classified for the training set in a reliable way.&lt;/p&gt;

&lt;p&gt;Also, we should pick an algorithm. Train it. Test it. Adjust the model….&lt;/p&gt;

&lt;p&gt;Fortunately, there are several tools in Python that make our job easier. Let’s review two of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1&lt;/em&gt; Natural Language Toolkit (NLTK):&lt;/strong&gt; &lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; is the most widely used platform when working with human language data in Python.&lt;/p&gt;

&lt;p&gt;It provides more than 50 corpora and lexical resources. It also has libraries to classify, tokenize, and tag texts, among other functions.&lt;/p&gt;

&lt;p&gt;For the next part, we will get a bit more technical. Let’s start!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;In the code, we imported the module nltk but also the methods word_tokenize and pos_tag.&lt;/p&gt;

&lt;p&gt;The first will help us &lt;em&gt;tokenize&lt;/em&gt; the sentences. This means splitting the sentences into tokens or words.&lt;/p&gt;

&lt;p&gt;You may wonder why we don’t use the Python method .split(). NLTK splits a sentence into words and punctuation, so it is more robust.&lt;/p&gt;

&lt;p&gt;The second method will “tag” our tokens into the different parts of speech.&lt;/p&gt;

&lt;p&gt;First, we are going to make use of two other Python modules: requests and BeautifulSoup. We’ll use them to scrape the Wikipedia page about NLTK and retrieve its text.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Now, our text is in the variable wiki_nltk.&lt;/p&gt;

&lt;p&gt;It’s time to see our main methods in action. We’ll create a method that takes the text as an input. It will use word_tokenize to split the text into tokens. Then, it will tag each token with its part of speech using pos_tag.&lt;/p&gt;

&lt;p&gt;The method will return a list of tuples. What will each tuple consist of? A word along with its tag: the part of speech it corresponds to.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;After that, we apply the method to our text wiki_nltk. For convenience, we will print only the first 20 tuples; 5 per line.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*t0w4HRVkqhwVnBu-R0N1Lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*t0w4HRVkqhwVnBu-R0N1Lg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tags are quite cryptic, right? Let’s decode &lt;a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html"&gt;some of them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DT&lt;/em&gt; indicates that the word is a determiner. &lt;em&gt;NN&lt;/em&gt; a singular noun. &lt;em&gt;NNP&lt;/em&gt; a singular proper noun. &lt;em&gt;CC&lt;/em&gt; a coordinating conjunction. &lt;em&gt;JJR&lt;/em&gt; a comparative adjective. &lt;em&gt;RB&lt;/em&gt; an adverb. &lt;em&gt;IN&lt;/em&gt; a preposition.&lt;/p&gt;

&lt;p&gt;How is it that pos_tag is able to return all of these tags?&lt;/p&gt;

&lt;p&gt;It uses Probabilistic Methods. Particularly, &lt;em&gt;Conditional Random Fields&lt;/em&gt; (CRF) and &lt;em&gt;Hidden Markov Models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;First, the model extracts a set of features of each word called &lt;em&gt;State Features&lt;/em&gt;. It bases the decision on characteristics like capitalization of the first letter, presence of numbers or hyphens, suffixes, and prefixes, among others.&lt;/p&gt;

&lt;p&gt;The model also considers the label of the previous word in a function called a Transition Feature. It determines the weights of the different feature functions so as to maximize the likelihood of the label.&lt;/p&gt;

&lt;p&gt;The next step is to perform entity detection. This task will be carried out using a technique called &lt;strong&gt;&lt;em&gt;chunking&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tokenization extracts only “tokens” or words. Chunking, on the other hand, extracts phrases that may have an actual meaning in the text.&lt;/p&gt;

&lt;p&gt;Chunking requires that our text is first tokenized and POS tagged. It uses these tags as inputs. It outputs “chunks” that can indicate entities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*krodfGw3NWjwRKAW6mtL7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*krodfGw3NWjwRKAW6mtL7g.png" alt=""&gt;&lt;/a&gt;An example of how chunking can be visualized.&lt;/p&gt;

&lt;p&gt;NLTK has several functions that facilitate the process of chunking our text. The mechanism is based on the use of regular expressions to generate the chunks.&lt;/p&gt;

&lt;p&gt;We can first apply noun phrase chunking, or &lt;em&gt;NP-chunking&lt;/em&gt;. We’ll look for chunks matching individual noun phrases. For this, we will customize the regular expressions used in the mechanism.&lt;/p&gt;

&lt;p&gt;We first need to define rules. They will indicate how sentences should be chunked.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Our rule states that our NP chunk should consist of an optional determiner (DT) followed by any number of adjectives (JJ) and then one or more proper nouns (NNP).&lt;/p&gt;

&lt;p&gt;Now, we create a chunk parser using RegexpParser and this rule. We’ll apply it to our POS-tagged words using chunkParser.parse.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;The result is a tree. In this case, we printed only the chunks. We can also display it graphically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*OGXkbemhrCSe9lGa3eGWlQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*OGXkbemhrCSe9lGa3eGWlQ.png" alt=""&gt;&lt;/a&gt;Entities recognized in the text.&lt;/p&gt;

&lt;p&gt;NLTK also provides a pre-trained classifier using the function nltk.ne_chunk(). It allows us to recognize named entities in a text. It also works on top of POS-tagged text.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*jRu5akedFIrc3VkhBQlgDg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*jRu5akedFIrc3VkhBQlgDg.png" alt=""&gt;&lt;/a&gt;Entities recognized in the text.&lt;/p&gt;

&lt;p&gt;As we can see, the results are the same using both methods.&lt;/p&gt;

&lt;p&gt;However, the results are not completely satisfying. Another disadvantage of NLTK is that its POS tagging supports only the English and Russian languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2&lt;/em&gt; &lt;a href="https://github.com/explosion/spaCy"&gt;spaCy&lt;/a&gt;:&lt;/strong&gt; An open-source library in Python. It provides an efficient statistical system for NER by labeling groups of contiguous tokens.&lt;/p&gt;

&lt;p&gt;It is able to recognize a wide variety of named or numerical entities. Among them, we can find company names, locations, product names, and organizations.&lt;/p&gt;

&lt;p&gt;A huge advantage of spaCy is having pre-trained models in several languages: English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.&lt;/p&gt;

&lt;p&gt;These models support tagging, parsing and entity recognition. They have been designed and implemented from scratch specifically for spaCy.&lt;/p&gt;

&lt;p&gt;They can be imported as Python libraries.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;And loaded easily using spacy.load(). In our code, we save it in the variable nlp.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;SpaCy provides a Tokenizer, a POS-tagger and a Named Entity Recognizer, so it’s very easy to use. We just call our model on our text with nlp(text). This will tokenize it, tag it and recognize the entities.&lt;/p&gt;

&lt;p&gt;The attribute .sents will retrieve the sentences. .tag_ the tag for each token. .ents the recognized entities. .label_ the label for each entity. .text just the text for any attribute.&lt;/p&gt;

&lt;p&gt;We define a method for this task as follows.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Now, we apply the defined method to our original Wikipedia text.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*Zqc2pJve9aOdcWZE8jnZog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*Zqc2pJve9aOdcWZE8jnZog.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spacy recognizes not only names but also numbers. Very cool, right?&lt;/p&gt;

&lt;p&gt;One question that probably arises is how spaCy works.&lt;/p&gt;

&lt;p&gt;Its &lt;a href="https://www.youtube.com/watch?time_continue=2099&amp;amp;v=sqDHBH9IjRU"&gt;architecture&lt;/a&gt; is very rich. This results in a very efficient algorithm. Explaining every component of SpaCy model will require another whole post. Even &lt;a href="https://explosion.ai/blog/how-spacy-works"&gt;tokenization&lt;/a&gt; is done in a very novel way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to Explosion AI, Spacy Named Entity Recognition system features a sophisticated word embedding strategy using subword features, a deep convolutional neural network with residual connections, and a novel &lt;a href="https://arxiv.org/abs/1603.01360"&gt;transition-based approach&lt;/a&gt; to named entity parsing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s explain these basic concepts step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Word embedding strategy using subword features.&lt;/em&gt;&lt;/strong&gt; Wow! Very long name and a lot of difficult concepts.&lt;/p&gt;

&lt;p&gt;What does this mean? Instead of working with words, we should represent them using multi-dimensional numerical vectors.&lt;/p&gt;

&lt;p&gt;Each dimension captures the different characteristics of the words. This is also referred to as &lt;em&gt;Word embeddings&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The advantage is that working with numbers is easier than working with words. We can make calculations, apply functions, among other things.&lt;/p&gt;

&lt;p&gt;The huge limitation is that these models normally ignore the morphological structure of the words. In order to correct this, the &lt;em&gt;subword feature&lt;/em&gt; is introduced to include the knowledge about morphological structures of the words.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1603.01360"&gt;&lt;strong&gt;&lt;em&gt;Convolutional neural network with residual connections&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/strong&gt; Convolution networks are mainly used in processing images. The convolutional layer multiplies a kernel or filter (a matrix with weights) by a window or portion of the input matrix.&lt;/p&gt;

&lt;p&gt;The structure of the traditional &lt;a href="https://dev.to/ugis22/understanding-neural-networks-what-how-and-why-4g3o-temp-slug-1978506"&gt;neural networks&lt;/a&gt; is that each layer feeds the next layer.&lt;/p&gt;

&lt;p&gt;A neural network with residual blocks splits a big network into small chunks. These chunks of the network are connected through skip functions or shortcut connections.&lt;/p&gt;

&lt;p&gt;The efficiency of a residual network is given by the fact that the activation function has to be applied fewer times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4"&gt;&lt;strong&gt;&lt;em&gt;Transition-based&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; &lt;a href="https://explosion.ai/blog/parsing-english-in-python"&gt;&lt;strong&gt;&lt;em&gt;approach&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt; This strategy uses sequential steps to add one label or change the state until it reaches the most likely tag.&lt;/p&gt;

&lt;p&gt;Lastly, spaCy provides a function to display a beautiful visualization of the Named Entity annotated sentences: displacy.&lt;/p&gt;

&lt;p&gt;Let’s use it!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*F0cja04vgCHom6Sp6QOjJA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*F0cja04vgCHom6Sp6QOjJA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;… Wrapping up!&lt;/p&gt;

&lt;p&gt;Named Entity Recognition is an ongoing, developing field. A lot has been done regarding this topic. However, there is still room for improvement.&lt;/p&gt;

&lt;p&gt;Natural Language Toolkit — It is a very powerful and widely used tool. It provides many algorithms to choose from for the same task. However, it only supports 2 languages, requires more tuning, and does not support word vectors.&lt;/p&gt;

&lt;p&gt;SpaCy — It is a very advanced tool. It supports 7 languages as well as multi-language models. It is more efficient and object-oriented. However, the algorithms behind it are complex, and it only keeps the best algorithm for each task. It has support for word vectors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Anyhow, just go ahead and try an approach! You will have fun!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>tech</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>namedentityrecogni</category>
    </item>
    <item>
      <title>Visualizing Twitter interactions with NetworkX</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Fri, 19 Apr 2019 22:42:13 +0000</pubDate>
      <link>https://forem.com/ugis22/visualizing-twitter-interactions-with-networkx-53n</link>
      <guid>https://forem.com/ugis22/visualizing-twitter-interactions-with-networkx-53n</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*FEGgjav1t5p1rC16qxoogQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*FEGgjav1t5p1rC16qxoogQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Connections, connections, and connections…
&lt;/h4&gt;

&lt;p&gt;Social media is used every day for many purposes: expressing opinions about different topics such as products and movies, advertising an event, a service or a conference, among other things. But what is most interesting about social media, and particularly for this post, about Twitter, is that it creates connections; networks that can be studied to understand how people interact or how news and opinions get spread.&lt;/p&gt;

&lt;p&gt;Previously, we have used the Twitter API to &lt;a href="https://dev.to/ugis22/how-to-build-a-postgresql-database-to-store-tweets-37oj-temp-slug-1890626"&gt;store tweets&lt;/a&gt; and afterward perform a &lt;a href="https://dev.to/ugis22/learning-how-to-perform-twitter-sentiment-analysis-402i-temp-slug-2003818"&gt;sentiment analysis&lt;/a&gt; to elucidate the public opinion about Avengers. Let’s review the steps needed to stream tweets: First of all, we should go to the &lt;a href="https://developer.twitter.com/en.html"&gt;Twitter developer website&lt;/a&gt;, log in with our Twitter account and ask for approval as a developer. After receiving the approval, we go on and create a new app filling out the details, and lastly, we create the access tokens, keeping them in a safe place. In a Jupyter notebook, we can use the &lt;a href="http://www.tweepy.org/"&gt;Tweepy&lt;/a&gt; Python library to connect with our Twitter credentials and stream real-time tweets related to a term of interest and then save them into a .txt file.&lt;/p&gt;

&lt;p&gt;Now, we are going to read all the data we gathered into a pandas DataFrame.&lt;/p&gt;

&lt;p&gt;We will use this information to graph how the people that tweet about Avengers interact with each other. There are three types of interactions between two Twitter users that we are interested in: retweets, replies, and mentions. According to &lt;a href="https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json#tweetobject"&gt;Twitter documentation&lt;/a&gt;, the JSON file retrieved representing the Tweet object will include a &lt;em&gt;User&lt;/em&gt; object that describes the author of the Tweet, an &lt;a href="https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object"&gt;&lt;em&gt;entities&lt;/em&gt; object&lt;/a&gt; that includes arrays of hashtags and user mentions, among others.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the columns of our DataFrame so we can check how the information was read:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*zT0WXnTO6IyYXKPKAh1uxA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*zT0WXnTO6IyYXKPKAh1uxA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the displayed columns, we are interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Author of the Tweet: Name (screen_name) and Id (id) inside user.&lt;/li&gt;
&lt;li&gt;Twitter users mentioned in the text of the Tweet: Name and Id can be found as screen_name and id in user_mentions inside entities.&lt;/li&gt;
&lt;li&gt;Account taking the retweet action: screen_name and id inside the user object of the retweeted_status.&lt;/li&gt;
&lt;li&gt;User the tweet replies to: in_reply_to_screen_name and in_reply_to_id&lt;/li&gt;
&lt;li&gt;Tweet the tweet replies to: in_reply_to_status_id&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may be wondering how the details collected will help us build a network representing the interactions between Twitter users. In order to answer that, we will need to take a glimpse at Graph Data Structure.&lt;/p&gt;

&lt;p&gt;Let’s start with the basic concepts. What is a graph? A graph is a data structure used to represent and analyze connections between elements. Its two main elements are nodes, or vertices, and edges, the lines that connect two nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/445/1*dRDV6eAolRyZuLAf-Y681w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/445/1*dRDV6eAolRyZuLAf-Y681w.png" alt=""&gt;&lt;/a&gt;Representation of a Graph structure&lt;/p&gt;

&lt;p&gt;Even if two nodes are not directly connected to each other, there may be a sequence of edges, or path, that can be followed to go from one node to the other. The possibility of finding one node by following paths is what makes graphs so powerful for representing different structures or networks. When there are nodes that no path can reach, we are in the presence of a disconnected graph or isolated nodes.&lt;/p&gt;

&lt;p&gt;Graphs can also be classified as directed, when the edges have a specific orientation (normally represented by an arrow to indicate direction), or undirected, when the edges don’t follow any orientation.&lt;/p&gt;

&lt;p&gt;In our analysis, users represent the nodes of our graph or network. If we find any type of interaction (&lt;em&gt;retweet, reply or mention&lt;/em&gt;) between them, an edge will be created to connect both nodes. We could work with a directed graph if we were interested in knowing which user retweets another user. Because we only want to describe the interaction present between two users without caring about its orientation, we are going to use an undirected graph.&lt;/p&gt;

&lt;p&gt;The next question is which tool can be used in our analysis. We will take advantage of &lt;a href="https://networkx.github.io/"&gt;NetworkX&lt;/a&gt;, a Python package for the creation and study of the structure of complex networks, such as a social network.&lt;/p&gt;

&lt;p&gt;First of all, we’ll define a function that will allow us to get a list of the interactions between users. We will iterate over the DataFrame and obtain the user_id and screen_name of the users that the author of each specific tweet mentions, replies to or retweets. The function will return the user of the specific tweet together with a list of the users with whom that user interacted. We need to be careful to discard any None value that may appear if the user doesn’t have any interactions in the three categories mentioned before.&lt;/p&gt;
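
&lt;p&gt;A simplified sketch of such a function could look like this; the exact column names and nesting depend on how the tweet JSON was loaded into the DataFrame, so treat them as placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def get_interactions(row):
    """Return the tweet's author and the set of users they interacted with."""
    user = row["user"]
    if not isinstance(user, dict):
        return (None, None), set()

    interactions = set()

    # Users mentioned in the tweet
    entities = row["entities"] if isinstance(row["entities"], dict) else {}
    for mention in entities.get("user_mentions", []):
        interactions.add((mention.get("id"), mention.get("screen_name")))

    # User being replied to
    interactions.add((row["in_reply_to_user_id"], row["in_reply_to_screen_name"]))

    # User being retweeted
    retweet = row["retweeted_status"]
    if isinstance(retweet, dict):
        retweeted_user = retweet.get("user", {})
        interactions.add((retweeted_user.get("id"), retweeted_user.get("screen_name")))

    # Discard the author themselves and any missing values
    interactions.discard((user["id"], user["screen_name"]))
    interactions.discard((None, None))
    return (user["id"], user["screen_name"]), interactions
&lt;/code&gt;&lt;/pre&gt;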

&lt;p&gt;Now, it’s time to initialize the Graph. We can do this by calling the function .Graph() of NetworkX.&lt;/p&gt;

&lt;p&gt;There are two other important functions to create a graph. The first one is add_node() and the second one is add_edge, both with very descriptive names. Let’s pay attention to the syntax of &lt;a href="https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.add_edge.html"&gt;add_edge&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph.add_edge(&lt;em&gt;u_of_edge&lt;/em&gt;, &lt;em&gt;v_of_edge&lt;/em&gt;, **&lt;em&gt;attr&lt;/em&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;where u, v are the nodes, and attr are keyword arguments that characterize the edge data such as weight, capacity, length, etc.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we add an edge that already exists, the edge data will get updated. Also, if we add an edge between two nodes that are not yet in the graph, the nodes will be created in the process.&lt;/p&gt;

&lt;p&gt;We are going to populate the Graph by calling the function get_interactions that we defined earlier. With this information, we apply the function add_edge to every tuple consisting of the tweet’s user_id and the user_id of the user mentioned, replied to or retweeted, creating the nodes and the edges connecting them. Also, the tweet id will be added as edge data.&lt;/p&gt;
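
&lt;p&gt;Roughly, and reusing the get_interactions sketch from above (column names are again placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import networkx as nx

graph = nx.Graph()

for _, tweet in df.iterrows():
    (user_id, user_name), interactions = get_interactions(tweet)
    if user_id is None:
        continue
    for interaction_id, interaction_name in interactions:
        if interaction_id is None:
            continue
        # Nodes are created implicitly; the tweet id is stored as edge data
        graph.add_edge(user_id, interaction_id, tweet_id=tweet["id"])
&lt;/code&gt;&lt;/pre&gt;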

&lt;p&gt;Now that we have the nodes and edges of the graph created, let’s see the number of nodes and edges present:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*EKDFhzk-gK3jCnvep_vsNA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*EKDFhzk-gK3jCnvep_vsNA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s now explore some other characteristics of a graph. The &lt;strong&gt;degree&lt;/strong&gt; of a node u, denoted as &lt;em&gt;deg(u)&lt;/em&gt;, is the number of edges incident to that node; in simpler words, the number of connections a particular node has. The &lt;strong&gt;maximum degree&lt;/strong&gt; of a graph and the &lt;strong&gt;minimum degree&lt;/strong&gt; of a graph are the maximum and minimum degree of its nodes, respectively.&lt;/p&gt;

&lt;p&gt;In our case, we can obtain the degrees of the Graph:&lt;/p&gt;

&lt;p&gt;and the maximum and minimum degree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*n8iHnuxTtkwa2iFb0ZAG6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*n8iHnuxTtkwa2iFb0ZAG6A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also obtain the average degree and the most frequent degree of the nodes in the Graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*L7DXzOkQgKjqJ3oNnYCEow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*L7DXzOkQgKjqJ3oNnYCEow.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An undirected graph is connected if, for every pair of nodes, there is a path between them. For that to happen, most of the nodes should have a degree of at least two, except for those called leaves, which have a degree of 1. From the characteristics of the graph, we can suspect that the graph is not connected. In order to confirm this, we can use nx.is_connected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*okGtj_9BP7_l8RuRdAYn6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*okGtj_9BP7_l8RuRdAYn6g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Components of a graph are its distinct maximally connected subgraphs. Now that we have confirmed that our graph is not connected, we can check how many connected components it has:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*tEn7gG836ysMYOjCobIr1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*tEn7gG836ysMYOjCobIr1g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this analysis, we are going to work with the largest connected component. Fortunately, NetworkX gives us an easy way to obtain that component by using nx.connected_component_subgraphs, which generates graphs, one for each connected component of our original graph, and the max function, which will help us retrieve the largest component:&lt;/p&gt;
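
&lt;p&gt;A sketch of these checks; note that recent NetworkX releases removed connected_component_subgraphs, so the same idea can be written with connected_components:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;print(nx.is_connected(graph))
print(nx.number_connected_components(graph))

largest_component = max(nx.connected_components(graph), key=len)
largest_subgraph = graph.subgraph(largest_component)
&lt;/code&gt;&lt;/pre&gt;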

&lt;p&gt;Now, we can obtain the characteristics of this new graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*db4cahRQnb_Qh6tCNdkfpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*db4cahRQnb_Qh6tCNdkfpw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if we use the function nx.is_connected we’ll observe that the subgraph is connected as it should be.&lt;/p&gt;

&lt;p&gt;Clustering and transitivity measure the tendency for nodes to cluster together or for edges to form triangles. In our context, they are measures of the extent to which the users interacting with one particular user tend to interact with each other as well. The difference is that transitivity weights nodes with a large degree higher.&lt;/p&gt;

&lt;p&gt;The clustering coefficient, a measure of the number of triangles in a graph, is calculated as the &lt;em&gt;number of triangles connected to node i&lt;/em&gt; divided by the &lt;em&gt;number of sets of two edges connected to node i (triples)&lt;/em&gt;. The transitivity coefficient is calculated as &lt;em&gt;3&lt;/em&gt; multiplied by &lt;em&gt;the number of triangles in the network&lt;/em&gt; divided by the &lt;em&gt;number of connected triples of nodes in the network&lt;/em&gt;. These two parameters are very important when analyzing social networks because they give us an insight into how users tend to create tightly knit groups characterized by relatively dense ties.&lt;/p&gt;

&lt;p&gt;Let’s take a look at what is happening in our analysis using the functions average_clustering and transitivity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*2FkCS750MtZYc7LqpTU2lA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*2FkCS750MtZYc7LqpTU2lA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It appears that in our graph, the users do not tend to form close clusters.&lt;/p&gt;

&lt;p&gt;After that, we’ll investigate some summary statistics, particularly related to distance, or how far away one node is from another random node. The &lt;strong&gt;diameter&lt;/strong&gt; represents the maximum distance between any pair of nodes, while the &lt;strong&gt;average distance&lt;/strong&gt; tells us the average distance between any two nodes in the network. NetworkX provides the functions diameter and average_shortest_path_length to obtain these parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*VhnBLWi_d4_teTDlZ70Z6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*VhnBLWi_d4_teTDlZ70Z6A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we are going to focus on &lt;strong&gt;network centrality&lt;/strong&gt;, which captures the importance of a node’s position in the network considering: &lt;em&gt;degree&lt;/em&gt;, on the assumption that an important node will have many connections; &lt;em&gt;closeness&lt;/em&gt;, on the assumption that important nodes are close to other nodes; and finally, &lt;em&gt;betweenness&lt;/em&gt;, on the assumption that important nodes are well situated and connect other nodes. For this, we are going to use the functions degree_centrality, closeness_centrality and betweenness_centrality, all of which return a list of each node and its centrality score. We will particularly capture the node with the best score in each one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*ALI_b_fSEnVGlnsmVnemlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*ALI_b_fSEnVGlnsmVnemlw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the same node does not always show the maximum in all the centrality measures. However, the node with id &lt;em&gt;393852070&lt;/em&gt; seems to be the one with the most connections and the one best situated to connect other nodes. On the other hand, the node with id &lt;em&gt;2896294831&lt;/em&gt; is the closest to other nodes.&lt;/p&gt;

&lt;p&gt;Now, we can get to see what the graph looks like. For that, we will use nx.drawing.layout to apply node positioning algorithms for the graph drawing. Specifically, we will use spring_layout, which uses &lt;a href="https://en.wikipedia.org/wiki/Force-directed_graph_drawing"&gt;&lt;em&gt;force-directed graph drawing&lt;/em&gt;&lt;/a&gt;, whose purpose is to position the nodes in two-dimensional space so that all the edges are of roughly equal length and there are as few crossing edges as possible. It achieves this by assigning forces among the set of edges and nodes based on their relative positions and then using these to simulate the motion of the edges and nodes. One of the parameters that we can adjust is &lt;em&gt;k&lt;/em&gt;, the optimal distance between nodes; as we increase the value, the nodes will sit farther apart. Once we have the positions, we are also going to create a special list so that we can draw the two nodes with the highest centrality that we found in different colors to highlight them.&lt;/p&gt;

&lt;p&gt;After all that calculation, we’ll use the functions draw_networkx_nodes() and draw().&lt;/p&gt;
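
&lt;p&gt;Put together, the drawing step could look roughly like this; the value of k, the colors and the node sizes are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import matplotlib.pyplot as plt

pos = nx.spring_layout(largest_subgraph, k=0.05)
plt.figure(figsize=(15, 15))

nx.draw(largest_subgraph, pos=pos, node_size=10, node_color="skyblue",
        edge_color="lightgray", with_labels=False)
# Highlight the two most central nodes found above
nx.draw_networkx_nodes(largest_subgraph, pos=pos,
                       nodelist=[393852070, 2896294831],
                       node_size=80, node_color="red")
plt.show()
&lt;/code&gt;&lt;/pre&gt;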

&lt;p&gt;And finally, we have the drawing of the largest connected component of our original Graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*llI2SfgDVHPVjk6PXNh3GQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*llI2SfgDVHPVjk6PXNh3GQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to get to know the entire code for this project, check out my&lt;/em&gt; &lt;a href="https://github.com/ugis22/analysing_twitter"&gt;&lt;em&gt;GitHub Repository&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>datascience</category>
      <category>twitter</category>
      <category>networkanalysis</category>
      <category>networkx</category>
    </item>
    <item>
      <title>Learning how to perform Twitter Sentiment Analysis</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Thu, 31 Jan 2019 18:35:16 +0000</pubDate>
      <link>https://forem.com/ugis22/learning-how-to-perform-twitter-sentiment-analysis-53ba</link>
      <guid>https://forem.com/ugis22/learning-how-to-perform-twitter-sentiment-analysis-53ba</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1004/1*ex3gjAfvaV3Ub9Me7kYPXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1004/1*ex3gjAfvaV3Ub9Me7kYPXQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Keras challenges the Avengers
&lt;/h4&gt;

&lt;p&gt;Sentiment Analysis, also called Opinion Mining, is a useful tool within natural language processing that allows us to identify, quantify, and study subjective information. Because quintillions of bytes of data are produced every day, this technique gives us the possibility to extract attributes of this data, such as a negative or positive opinion about a subject, information about which subject is being talked about, and what characteristics the persons or entities expressing that opinion hold.&lt;/p&gt;

&lt;p&gt;Twitter has been growing in popularity and nowadays, it is used every day by people to express opinions about different topics, such as products, movies, music, politicians, events, and social events, among others. A lot of movies are released every year, but if you are a Marvel fan like I am, you are probably impatient to finally watch the new Avengers movie. Personally, I want to know how people are feeling about this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/ugis22/how-to-build-a-postgresql-database-to-store-tweets-37oj-temp-slug-1890626"&gt;Previously&lt;/a&gt;, we have discussed how we can use the Twitter API to stream tweets and store them in a relational database. Now, we will use that information to perform sentiment analysis. But before that, we should take into consideration some things. First of all, we have streamed our tweets using the term ‘Avengers’ but without any extra consideration. It is highly likely that we have thousands of repeated tweets. In terms of sentiment analysis, processing them will not add any extra value and contrary, it will be computationally expensive. So, we need to access the database and delete duplicated tweets keeping the first occurrence. Second, we have an unlabeled database. For the model to learn during training, we should state if the tweets are positive or negative. The ideal solution would be to manually label the dataset, which is very accurate but requires a lot of time. However, there are &lt;a href="https://www.altexsoft.com/blog/datascience/how-to-organize-data-labeling-for-machine-learning-approaches-and-tools/"&gt;several alternatives&lt;/a&gt; such as using an open-source dataset labeling tool such as &lt;a href="https://stanfordnlp.github.io/CoreNLP/#human-languages-supported"&gt;&lt;em&gt;Stanford CoreNLP&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are a high number of frameworks that can be used for machine learning tasks, however, we are going to use &lt;a href="https://keras.io/"&gt;&lt;strong&gt;&lt;em&gt;Keras&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; because it offers consistent and simple APIs, minimizes the number of user actions required and more importantly, it is easy to learn and use. We will also make use of the &lt;a href="https://www.nltk.org/"&gt;&lt;strong&gt;&lt;em&gt;Natural Language Toolkit&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; (NLTK), that provides many corpora and lexical resources that will come in handy for tagging, parsing, and semantic reasoning, and &lt;a href="https://scikit-learn.org/stable/"&gt;&lt;strong&gt;&lt;em&gt;Scikit-learn&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;, that provides useful tools for data mining and data analysis.&lt;/p&gt;

&lt;p&gt;Ready to start? Let’s see what Keras can learn about Avengers.&lt;/p&gt;

&lt;p&gt;First of all, we need to retrieve the tweets that we have previously stored in our PostgreSQL database. For that aim, we are going to take advantage of &lt;a href="https://www.sqlalchemy.org/"&gt;sqlalchemy&lt;/a&gt;, a Python SQL toolkit and Object Relational Mapper that will allow us to connect to and query the database in an easy way. One of the characteristics of sqlalchemy is that it includes dialect (the system that it uses to communicate with databases) implementations for the most common databases, such as MySQL, SQLite, and PostgreSQL, among others. We’ll use the create_engine() function that produces an Engine object based on a given database URL, whose typical form is: dialect+driver://username:password@host:port/database. In our case, the dialect is postgresql while the driver is psycopg2. After creating the engine object, we’ll use the function read_sql_query from the pandas module to query the database to obtain all the data stored in our tweet table (‘select * from tweet_table’) and gather the information retrieved in a DataFrame:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Before we dig into analyzing the public opinion on ‘Avengers’, there is an important step that we need to take: preprocessing the tweet text. But what does this mean? Text preprocessing includes basic text cleaning following a set of commonly used rules, but also advanced techniques that take syntactic and lexical information into account. In the case of our project, we are going to perform the following steps (a rough sketch of the cleaning code follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Convert tweets to lowercase&lt;/strong&gt; using the .lower() function, in order to bring all tweets to a consistent form. By performing this, we can ensure that further transformations and classification tasks will not suffer from inconsistency or case-sensitivity issues in our data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove ‘RT’, user mentions and links:&lt;/strong&gt; In the tweet text, we can usually see that a sentence contains a reference indicating that it is a retweet (‘RT’), a user mention or a URL. Because these are repeated throughout a lot of tweets and don’t give us any useful information about sentiment, we can remove them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove numbers:&lt;/strong&gt; Likewise, numbers do not contain any sentiment, so it is also common practice to remove them from the tweet text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove punctuation marks and special characters:&lt;/strong&gt; Because this will generate tokens with a high frequency that will cloud our analysis, it is important to remove them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace elongated words:&lt;/strong&gt; an elongated word is defined as a word that contains a repeating character more than two times, for example, &lt;em&gt;‘Awesoooome’&lt;/em&gt;. Replacing those words is very important since the classifier will treat them as different words from the source words, lowering their frequency. However, there are some English words that legitimately contain repeated characters, mostly consonants, so we will use wordnet from NLTK to compare against the English lexicon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removing stopwords:&lt;/strong&gt; Stopwords are function words that appear with high frequency across all tweets. There is no need to analyze them because they do not provide useful information. We can obtain a list of these words from the NLTK stopwords function.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://www.jcomputers.us/vol12/jcp1205-11.pdf"&gt;&lt;strong&gt;Handling negation with antonyms&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;:&lt;/strong&gt; One of the problems that come out when analyzing sentiment is handling negation and its effect on subsequent words. Let’s take an example: Say that we find the tweet “I didn’t like the movie” and we discard the stopwords, we will get rid of “I” and “didn’t” words. So finally, we will get the tokens “like” and “movie”, which is the opposite sense that the original tweet had. There are several ways of handling negation, and also there is a lot of &lt;a href="http://www.jcomputers.us/vol12/jcp1205-11.pdf"&gt;research going on about this&lt;/a&gt;; however, for this project, in particular, we are going to scan our tweets and replace with an antonym (that we’ll get from lemmas in wordnet) of the noun, verb or adjective following our negation word.&lt;/li&gt;
&lt;/ul&gt;
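
&lt;p&gt;The sketch below covers the simpler rules; the elongated-word rule is only approximated, and the wordnet checks and negation handling are omitted, since they are more involved than a few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    """Apply the basic cleaning rules described above to a single tweet."""
    text = text.lower()
    text = re.sub(r"\brt\b", " ", text)                   # retweet markers
    text = re.sub(r"@\w+", " ", text)                      # user mentions
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # links
    text = re.sub(r"\d+", " ", text)                        # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)              # collapse elongated words (rough)
    # Negation handling with wordnet antonyms is not shown here
    tokens = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(tokens)
&lt;/code&gt;&lt;/pre&gt;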

&lt;p&gt;After we have cleaned our data but before we start building our model for sentiment analysis, we can perform an exploratory data analysis to see what are the most frequent words that appear in our ‘Avengers’ tweets. For this part, we will show graphs regarding tweets labeled as positive separated from those labeled as negative.&lt;/p&gt;

&lt;p&gt;We will start by using WordCloud to represent the word usage across all tweets by resizing every word proportionally to its frequency. Even though it might not seem the most appropriate choice for different reasons, this graph provides a textual analysis and a general idea of which types of words are present more frequently in our tweets. Python has a &lt;a href="http://amueller.github.io/word_cloud/"&gt;WordCloud&lt;/a&gt; library that allows us to apply a mask using an image that we upload from our hard drive, and to select the background, the word colormap, the maximum number of words, and the font size, among other characteristics of the graph.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;WordCloud for positive tweets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*z2jnwPAOCVhpubNyWDHjYg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*z2jnwPAOCVhpubNyWDHjYg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WordCloud for negative tweets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*kxSkKv-0uMzOO0hL1E-53w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*kxSkKv-0uMzOO0hL1E-53w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we observe the WordCloud for positive tweets, some of the words that appear in a bigger size do not have a particular connotation and can be interpreted as neutral, such as “Captain Marvel” and “infinity war”. On the other hand, other words, even though some are of smaller size, can be expected in tweets with a positive sense, such as “good”, “great”, “best” and “liked”. On the contrary, the WordCloud for negative tweets shows mostly neutral words such as “movie” and “endgame”, and only very small words with a negative connotation, for example, “never” and “fuck”.&lt;/p&gt;

&lt;p&gt;Afterward, we can present in a graph the 50 most frequent words in co-occurrence with ‘Avengers’ term for positive and negative tweets. We’ll start by using the function CountVectorizer from sklearn which will convert the collection of tweets into a matrix of token counts producing a sparse representation of the counts. Then, we sum all counts for each token and obtain the frequency and store them as a DataFrame.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can now plot these values in a barplot by using matplotlib function bar.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*sfkh5vOUPFOu35a4iadGhQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*sfkh5vOUPFOu35a4iadGhQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*JrULxd1gs8T3J4LKmwdaCg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*JrULxd1gs8T3J4LKmwdaCg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the most frequent words are common to positive and negative tweets: “marvel”, “endgame”. Moreover, most of the words have a neutral connotation, except for words like “good”, “great”, “love”, “favorite” and “best” in the positive tweets.&lt;/p&gt;

&lt;p&gt;We can finally check if there is any correlation between the frequency of the words that appear in the positive and negative tweets. We’ll concatenate both word frequency DataFrames and after that, we’ll use the seaborn regplot graph:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/876/1*EurGOPDo69mdC2lCV4ssDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/876/1*EurGOPDo69mdC2lCV4ssDQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from one or two points that seem related, no meaningful association can be derived from the graph above between words appearing in positive and negative tweets.&lt;/p&gt;

&lt;p&gt;After visualizing our data, the next step is to split our dataset into training and test sets. To do so, we’ll take advantage of the train_test_split functionality of the sklearn package. We will take 20% of the dataset for testing, following the 20–80% rule. From the remaining 80% used for the training set, we’ll save a part for validation of our model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We also need to convert our sentiment column into categories that our model can understand. We’ll use then 0 for negative, 1 for neutral and 2 for positive tweets.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Finally, we need to process our input tweet column using TfidfVectorizer, which will convert the collection of tweets to a matrix of Term Frequency/Inverse Document Frequency (TF-IDF) features. What is very crucial about this function is that it returns normalized frequencies; feature normalization is a key step in building a machine learning algorithm. After several tries, 3500 was the maximum number of features returned that worked best with our model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Now, it’s time to build our model: &lt;em&gt;a neural network&lt;/em&gt;. If you are not familiar with how neural networks work, you can check &lt;a href="https://dev.to/ugis22/understanding-neural-networks-what-how-and-why-4g3o-temp-slug-1978506"&gt;my previous post&lt;/a&gt;. Fortunately, Keras makes building a neural network very simple and easy in a few lines of code.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Let’s dissect our model: Sequential() is a type of network composed of layers that are stacked and executed in the order presented. So, which type of layers do we have? We observe that we have added Dense layers for our three layers (input, hidden and output layer), meaning that every node in a layer receives input from all nodes in the previous layer implementing the following operation: &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense"&gt;output = activation(dot(input, weights) + bias)&lt;/a&gt;. Between them, we have used the Dropout method, which takes a float between 0 and 1 (that we’ll pass as drop and adjust later) representing the fraction of the neurons that will be randomly dropped during training to prevent overfitting. The key layers are the input and the output because they will determine the shape of our network and it is important to correctly know what we expect. Because we’ll use 3500 as the maximum features returned in the vectorization process, we need to use this exact number as the size of the input shape. We’ll also include how many outputs will come out of the first layer (pass as layer1), a parameter that we‘ll modify later in order to make the layer simpler or more complex. In this case, we’ll choose relu as our activation function, which has several benefits over others such as reducing the likelihood of vanishing gradient. For the last layer, we’ll choose three nodes corresponding to the three different outputs and because we want to obtain categorical distributions, we’ll use softmax as an activation function. For the hidden layer, we’ll also pass the size as layer2 that normally is half layer1 and because it's a classification problem, we'll use sigmoid activation (if you want to know more about which activation functions to use, check this &lt;a href="https://www.youtube.com/watch?v=-7scQpJT7uo"&gt;video&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;After that, the optimizer to be used and its parameters should be stated. We’ll use AdamOptimizer with fixed decay and betas, but later we’ll adjust the learning rate and epsilon value. &lt;a href="https://arxiv.org/abs/1412.6980v8"&gt;Adam&lt;/a&gt; is a stochastic optimization which has several advantages such as being straightforward to implement, being computationally efficient, having little memory requirements, being invariant to diagonal rescaling of the gradients, and being well suited for problems that are large in terms of data and/or parameters.&lt;/p&gt;

&lt;p&gt;One of the last steps before training is to compile the network, specifying the loss we want. In this case, we’ll use sparse_categorical_crossentropy because we have categories represented as numbers. We also need to specify the optimizer and the metrics to be evaluated (for us, it’ll be accuracy). Then, we fit the model, stating the X and y sets, the batch size (number of samples propagated through the network at a time), the number of epochs (how many times we scan through all the data, a parameter that we’ll also adjust), the validation split (the percentage of the data held out to validate our results) and whether the data is presented in the same order every time or shuffled (shuffle).&lt;/p&gt;
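&lt;p&gt;Putting the optimizer, the compilation and the fit together might look roughly like this (the numeric values are illustrative starting points, not the final tuned ones):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from keras.optimizers import Adam

adam = Adam(lr=0.001, epsilon=1e-8, decay=0.0)   # lr and epsilon adjusted later

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])

# X is the TF-IDF matrix and y the numeric sentiment labels
model.fit(X, y, batch_size=32, epochs=5,
          validation_split=0.2, shuffle=True)
&lt;/code&gt;&lt;/pre&gt;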

&lt;p&gt;After trying several parameters for dropout, features, shuffle, learning rate, layer size, epsilon, validation_split and epochs, we finally arrived at the following model:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*8sRJoax8gKK7hOfZFgA5GQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*8sRJoax8gKK7hOfZFgA5GQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the validation accuracy was 71.91% after epoch 1 and improved to 76.54% by epoch 5. Increasing the number of epochs even further improved the training accuracy but decreased the validation accuracy.&lt;/p&gt;

&lt;p&gt;Even though we would always like higher accuracy, we can now go on and try to identify the opinions in a new dataset that we have created just like the one we used for training and validation.&lt;/p&gt;

&lt;p&gt;For that, we are going to query our new database, perform the same text preprocessing steps, tokenize our tweets and use our trained model to predict the sentiment about ‘Avengers’ using model.predict(). To make the result easier for humans to read, we can convert the numeric predictions back to our categorical labels ‘positive’, ‘neutral’ and ‘negative’.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Now, the moment of truth! We can plot how many of our tweets fall into each sentiment category using a pie chart:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/814/1*JhGdi93ybbDmbSkGeQDSzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/814/1*JhGdi93ybbDmbSkGeQDSzg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, 53.1% of the tweets have a positive connotation about ‘Avengers’ while the remaining 46.9% are neutral or have a negative connotation. If I were to tweet about this subject, I should be included on the positive side, or at least I can be 76% confident that I would. &lt;em&gt;What about you?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, we are going to use Tweets information to &lt;a href="https://medium.com/@meinzaugarat/visualizing-twitter-interactions-with-networkx-a391da239af5"&gt;visualize user interactions on Twitter by using NetworkX&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to get to know the entire code for this project, check out my&lt;/em&gt; &lt;a href="https://github.com/ugis22/analysing_twitter"&gt;&lt;em&gt;GitHub Repository&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>nlp</category>
      <category>datascience</category>
      <category>sentimentanalysis</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to build a PostgreSQL database to store tweets</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Thu, 27 Dec 2018 15:56:16 +0000</pubDate>
      <link>https://forem.com/ugis22/how-to-build-a-postgresql-database-to-store-tweets-5cm8</link>
      <guid>https://forem.com/ugis22/how-to-build-a-postgresql-database-to-store-tweets-5cm8</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*-_6kuOWGogsZw9Vnr5JTeQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*-_6kuOWGogsZw9Vnr5JTeQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Learning how to stream from Twitter API
&lt;/h4&gt;

&lt;p&gt;Twitter is used every day by people to express their feelings or thoughts, especially about something that is happening at the moment or has just occurred; by companies to promote products or services; by journalists to comment on events or write news, and the list can go on. There is no doubt that analyzing Twitter and tweets is a powerful tool that can give us a sense of public opinion about a topic that we are interested in.&lt;/p&gt;

&lt;p&gt;But how do we perform this type of analysis? Fortunately, Twitter provides us with an API (which stands for ‘&lt;em&gt;Application Programming Interface&lt;/em&gt;’) that we can interact with by creating an app, and in that way access and filter public tweets.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;first step&lt;/em&gt; is to register our app; for that, we need to go to the &lt;a href="https://developer.twitter.com/en.html"&gt;Twitter developer website&lt;/a&gt;, log in with our Twitter account and ask Twitter for approval as developers (&lt;em&gt;as of July 2018, Twitter changed its policies and anyone who wants to access the Twitter API and create an app needs to apply for a developer account, provide detailed information on how they intend to use it and wait for the application to be approved&lt;/em&gt;). After we receive the approval, we can go on and create a new app, filling out the details: &lt;strong&gt;&lt;em&gt;Name&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;(a unique name that no one else has used for their Twitter app)&lt;/em&gt;, &lt;strong&gt;&lt;em&gt;Description&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;Website&lt;/em&gt;&lt;/strong&gt; (it should be the app’s home page, but we can also put our personal website or a GitHub repository URL).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*RGtIAT2HYgyrBigzkCjjiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*RGtIAT2HYgyrBigzkCjjiA.png" alt=""&gt;&lt;/a&gt;Creating an app in Twitter Developer account&lt;/p&gt;

&lt;p&gt;After that, we need to create our access tokens. The access tokens will allow our Twitter app to read Twitter information, such as tweets, mentions, friends, and more:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*LBZQmC3Qh7xqVXoAbFQAdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*LBZQmC3Qh7xqVXoAbFQAdg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the whole analysis, we are going to work with &lt;a href="https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html"&gt;JupyterLab&lt;/a&gt; and Python. Because anyone having this information can use it to authorize the app to connect to Twitter, we are going to create a python file (.py) where we can store the &lt;em&gt;Consumer Key&lt;/em&gt;, &lt;em&gt;Consumer Secret&lt;/em&gt;, &lt;em&gt;OAuth Access Token&lt;/em&gt;, &lt;em&gt;OAuth Access Token Secret&lt;/em&gt; and then call it in our main Jupyter Notebook file.&lt;/p&gt;

&lt;p&gt;Now that we have registered our app, got our tokens/keys and stored them in a separate file, the &lt;em&gt;second step&lt;/em&gt; is to decide where we are going to store our tweets once we get them. Should we store them in a file, in a &lt;a href="https://www.sisense.com/blog/postgres-vs-mongodb-for-storing-json-data/"&gt;NoSQL-type database or in a relational database&lt;/a&gt;? In order to answer this question, we need to understand how the information that we get from the Twitter API is given to us and what we want to do with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json"&gt;Twitter APIs&lt;/a&gt; always return tweets encoded using &lt;em&gt;JavaScript Object Notation&lt;/em&gt; (JSON), an unstructured and flexible type which is based on key-value pairs with attributes and associated values that describe objects. Each tweet contains an author, message, unique ID, a timestamp and creation date when it was posted, among others; each user has a name, id, and the number of followers. Because of this, we will immediately think of storing tweets in a built database management systems (DBMS), like MongoDB, an open source database conceived as a native JSON database. The advantages of this type of database are that they are designed to be agile and scalable while using dynamic schemas without defining the structure first.&lt;/p&gt;

&lt;p&gt;On the other hand, relational databases lean more towards standards compliance and extensibility and, consequently, do not give us the same freedom over how to store the data. They use static, predefined schemas, which help us link data when the data really is relational; because of its unstructured approach, we cannot do this in MongoDB. Some other advantages of using a relational database are that we only need to change data in one of the tables and the change propagates everywhere it is referenced ( &lt;strong&gt;&lt;em&gt;data integrity&lt;/em&gt;&lt;/strong&gt; ), and that no attributes need to be repeated (reducing &lt;strong&gt;&lt;em&gt;data redundancy&lt;/em&gt;&lt;/strong&gt; ). As we said before, relational databases are structured in columns and rows in a way that lets us link information from different tables through keys that uniquely identify any piece of data within a table and are used by other tables to point to it.&lt;/p&gt;

&lt;p&gt;Even though MongoDB has been designed to be fast and has great performance with unstructured data, relational databases such as PostgreSQL also have great performance handling JSON. This fact, together with the possibilities that structuring the data gives us, leads us to use a relational database and specifically &lt;a href="https://www.postgresql.org/"&gt;&lt;strong&gt;PostgreSQL 10&lt;/strong&gt;&lt;/a&gt; (usually mentioned just as Postgres), because it is a free, reliable and efficient SQL database that, most importantly, has a very powerful and useful Python API called &lt;a href="http://initd.org/psycopg/"&gt;&lt;strong&gt;psycopg2&lt;/strong&gt;&lt;/a&gt;. In order to administer our database in a friendlier way, we will also use &lt;a href="https://dev.to/scottw/pgadmin---postgresql-tools-5ejk"&gt;&lt;strong&gt;pgadmin&lt;/strong&gt;&lt;/a&gt;, which will allow us to create a database with a user and password to protect our information. Again, we are going to store these credentials in another Python file so we keep them secret once we push the main file to our git repository.&lt;/p&gt;

&lt;p&gt;The last components that we need to bring into our analysis are &lt;a href="http://www.tweepy.org/"&gt;&lt;strong&gt;Tweepy&lt;/strong&gt;&lt;/a&gt;, a Python library that will become the main player in our code and will help us access the Twitter API, handle the authorization request, and capture and stream tweets, among other things, and &lt;strong&gt;json&lt;/strong&gt;, which will manage the JSON files obtained from the API.&lt;/p&gt;

&lt;p&gt;So, the first thing that we need to do in our code is to import all the libraries and files that we’ve created:&lt;/p&gt;
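&lt;p&gt;As a sketch, assuming the keys live in a file called credentials.py (the file and variable names are illustrative), the imports could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import tweepy
import psycopg2

# Keys and secrets kept out of the main notebook
from credentials import (CONSUMER_KEY, CONSUMER_SECRET,
                         ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
&lt;/code&gt;&lt;/pre&gt;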

&lt;p&gt;After that, we will define a function that will authorize our app to connect to the Twitter API. OAuth is an open standard for access delegation, mainly used by internet users to grant applications or websites access to their information on other websites without giving them their passwords; instead, approved access tokens are allowed to access the protected resources hosted by the resource server. In the case of Twitter, we have already requested the tokens and keys, so we will use Tweepy, which makes OAuth authorization easy for us by handling it with the &lt;strong&gt;&lt;em&gt;tweepy.OAuthHandler&lt;/em&gt;&lt;/strong&gt; class. First, we pass this class the consumer key and the consumer secret, and then we set the access tokens to be the access token and access token secret that we got from Twitter.&lt;/p&gt;
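&lt;p&gt;A minimal version of that helper might look like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def authorize_twitter():
    """Return an OAuth handler ready to be used by Tweepy."""
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    return auth
&lt;/code&gt;&lt;/pre&gt;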

&lt;p&gt;As we have discussed before, relational databases store information in structured tables, so the next step is to decide the schema of our database. There is a very important concept that we need to consider here: &lt;strong&gt;&lt;em&gt;Normalization.&lt;/em&gt;&lt;/strong&gt; Normalizing a database is a process that structures a relational database according to certain normal forms, with the goal of reducing data redundancy and improving data integrity. A normalized database is one in which the relationships among the tables match the relationships that really exist among the data. In simple terms, the rules state that each row should say something about its unique key, that facts which do not relate to the key belong in different tables, and that the tables should not imply relationships that do not exist.&lt;/p&gt;

&lt;p&gt;In our case, we are going to create two tables: the first one will contain information about the Twitter users: the user id, which will be our PRIMARY KEY (a key that is unique for each record), and the user name. We will create a second table to store information about the tweets: the creation date, the text (the tweet itself), the user id, which will be our FOREIGN KEY (a key that references the primary key of another table), relating this table to our user table, and the retweet count.&lt;/p&gt;

&lt;p&gt;We will define a function that, once called, will connect to our database with the credentials (using the command psycopg2.connect) and create a table named after the term we want to search for. For this last step, Postgres gives us the possibility of setting up a &lt;strong&gt;&lt;em&gt;cursor&lt;/em&gt;&lt;/strong&gt; that encapsulates the query and reads its results a few rows at a time instead of executing the whole query at once. So, we will take advantage of that, create a cursor (using &lt;em&gt;database.cursor()&lt;/em&gt;) and then execute our queries to create the user table and then the tweets table. We need to consider some points here: it is important to use the IF NOT EXISTS clause when we perform the CREATE TABLE query, otherwise Postgres can raise an error that the table already exists and stop the code execution; we need to state which type each column contains (VARCHAR, TIMESTAMP, etc.), which column is the primary or foreign key, and in this last case, which column it REFERENCES; and after we have executed the queries it is important to commit them (&lt;em&gt;database.commit()&lt;/em&gt;), otherwise no changes will be persisted, and then close the cursor and the database connection.&lt;/p&gt;
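&lt;p&gt;A sketch of such a function is shown below (the connection details and column names are illustrative, not the exact ones from the project):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def create_tables(term):
    """Create the users table and a tweets table named after the search term."""
    database = psycopg2.connect(host='localhost', dbname='twitter',
                                user='db_user', password='db_password')
    cursor = database.cursor()
    cursor.execute("""CREATE TABLE IF NOT EXISTS twitter_users (
                          user_id VARCHAR PRIMARY KEY,
                          user_name VARCHAR);""")
    cursor.execute("""CREATE TABLE IF NOT EXISTS {} (
                          id SERIAL PRIMARY KEY,
                          created_at TIMESTAMP,
                          tweet TEXT,
                          user_id VARCHAR REFERENCES twitter_users (user_id),
                          retweet_count INT);""".format(term))
    database.commit()
    cursor.close()
    database.close()
&lt;/code&gt;&lt;/pre&gt;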

&lt;p&gt;Afterward, we need to define a function that will help us store the tweets. This function will follow the same logic we used to create the tables (connect to the database, create a cursor, execute the query, commit the query, close the connection), but this time using the &lt;a href="https://www.postgresql.org/docs/10/sql-insert.html"&gt;INSERT INTO&lt;/a&gt; command. When creating the user table, we declared that the user id would be our primary key, so when we store the tweets we need to be careful about how we insert rows into that table. If the same user has two tweets, the second time the function is executed it will raise an error, because it detects that this particular user id is already in the table and primary keys have to be unique. Here we can use the &lt;a href="https://www.postgresql.org/docs/10/sql-insert.html"&gt;ON CONFLICT&lt;/a&gt; clause to tell Postgres that, if the user id is already in the table, it doesn’t have to insert it again. Either way, the tweet will be inserted into the tweets table and will reference that user id in the user table.&lt;/p&gt;
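&lt;p&gt;Following the same pattern, a possible sketch of the storing function:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def store_tweet(term, created_at, tweet, user_id, user_name, retweet_count):
    """Insert the user (if new) and the tweet into their tables."""
    database = psycopg2.connect(host='localhost', dbname='twitter',
                                user='db_user', password='db_password')
    cursor = database.cursor()
    cursor.execute("INSERT INTO twitter_users (user_id, user_name) "
                   "VALUES (%s, %s) ON CONFLICT (user_id) DO NOTHING;",
                   (user_id, user_name))
    cursor.execute("INSERT INTO {} (created_at, tweet, user_id, retweet_count) "
                   "VALUES (%s, %s, %s, %s);".format(term),
                   (created_at, tweet, user_id, retweet_count))
    database.commit()
    cursor.close()
    database.close()
&lt;/code&gt;&lt;/pre&gt;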

&lt;p&gt;There are two ways to capture tweets with Tweepy. The first one is using the REST search API, &lt;em&gt;tweepy.API,&lt;/em&gt; which searches against a sampling of recent public tweets published in the past 7 days. The second one is streaming real-time tweets by using the &lt;em&gt;Twitter streaming API&lt;/em&gt;, which differs from the REST API in that the REST API pulls data from Twitter while the streaming API pushes messages to a persistent session.&lt;/p&gt;

&lt;p&gt;In order to stream tweets in Tweepy, an instance of the class &lt;a href="https://tweepy.readthedocs.io/en/3.7.0/streaming_how_to.html#summary"&gt;&lt;strong&gt;&lt;em&gt;tweepy.Stream&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; establishes a streaming session and sends messages to an instance of &lt;strong&gt;&lt;em&gt;StreamListener&lt;/em&gt;&lt;/strong&gt; class. Inside of this class, there are several methods that handle tweets. Depending on what type of information we want to obtain, we need to override the different methods: if we want only the status, we will then overload &lt;strong&gt;&lt;em&gt;on_status&lt;/em&gt;&lt;/strong&gt; method. Because we want detailed information about the tweet (creation date, user id, user name, retweet count), we will overload the &lt;strong&gt;on_data&lt;/strong&gt; method that is in charge of receiving all messages and calling functions according to the type of the message.&lt;/p&gt;

&lt;p&gt;Consequently, we will create the class &lt;strong&gt;MyStreamListener&lt;/strong&gt;, which will inherit from the &lt;strong&gt;&lt;em&gt;tweepy.StreamListener&lt;/em&gt;&lt;/strong&gt; class, and we will override the &lt;strong&gt;on_data&lt;/strong&gt; method. We will obtain the JSON data containing the tweet (&lt;em&gt;json.loads(raw_data)&lt;/em&gt;) and parse it to store the values in different variables (for example: user_id = data[‘user’][‘id_str’]), to then pass them to the tweet-storing function that we created before.&lt;/p&gt;

&lt;p&gt;It is important to be careful about errors or exceptions that could occur at this point. For this, we will surround the code with a try/except block so that, in case an exception happens, it will be printed and we can be aware of what is going on. Also, there is a limit to the number of attempts to connect to the streaming API, and exceeding it will show error 420. We can handle this error by overloading the &lt;strong&gt;on_error&lt;/strong&gt; method and disconnecting the API in case this error shows up.&lt;/p&gt;
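&lt;p&gt;Putting the last two paragraphs together, a sketch of the listener could be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class MyStreamListener(tweepy.StreamListener):
    """Stream listener that parses each tweet and stores it in Postgres."""

    def on_data(self, raw_data):
        try:
            data = json.loads(raw_data)
            user_id = data['user']['id_str']
            user_name = data['user']['name']
            created_at = data['created_at']
            tweet = data['text']
            retweet_count = data['retweet_count']
            store_tweet('avengers', created_at, tweet,
                        user_id, user_name, retweet_count)
        except Exception as e:
            print(e)
        return True

    def on_error(self, status_code):
        if status_code == 420:      # too many connection attempts
            return False            # returning False disconnects the stream
&lt;/code&gt;&lt;/pre&gt;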

&lt;p&gt;So, what is left? We need to create an api (&lt;em&gt;tweepy.API()&lt;/em&gt;) and, after that, create our stream object (&lt;em&gt;tweepy.Stream()&lt;/em&gt;) passing the authorization and the listener that we have created. We will use the &lt;strong&gt;&lt;em&gt;filter&lt;/em&gt;&lt;/strong&gt; function to stream all tweets containing a word of interest (&lt;em&gt;track = [‘word’]&lt;/em&gt;) and written in English (&lt;em&gt;languages = [‘en’]&lt;/em&gt;).&lt;/p&gt;
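&lt;p&gt;A minimal sketch of these last steps:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;auth = authorize_twitter()
api = tweepy.API(auth)

listener = MyStreamListener()
stream = tweepy.Stream(auth=api.auth, listener=listener)

# Stream English tweets containing the word of interest
stream.filter(track=['avengers'], languages=['en'])
&lt;/code&gt;&lt;/pre&gt;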

&lt;p&gt;Now, it’s time to start streaming the tweets!! I’m particularly interested in knowing what people are feeling about &lt;strong&gt;the Avengers&lt;/strong&gt;. So I will use “avengers” as my term of interest and start capturing real-time tweets to create a nice database that will help my later sentiment analysis that you can read &lt;a href="https://towardsdatascience.com/keras-challenges-the-avengers-541346acb804"&gt;here&lt;/a&gt; as well as to visualize Twitter interactions with Networkx that you can find &lt;a href="https://medium.com/@meinzaugarat/visualizing-twitter-interactions-with-networkx-a391da239af5"&gt;here&lt;/a&gt;. &lt;em&gt;What are you interested in?&lt;/em&gt;&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>twitter</category>
      <category>datascience</category>
      <category>database</category>
    </item>
    <item>
      <title>Understanding Neural Networks: What, How and Why?</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Tue, 30 Oct 2018 22:09:26 +0000</pubDate>
      <link>https://forem.com/ugis22/understanding-neural-networks-what-how-and-why-4ab4</link>
      <guid>https://forem.com/ugis22/understanding-neural-networks-what-how-and-why-4ab4</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*J4WR2GliSeq5C2YdPKAkAQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*J4WR2GliSeq5C2YdPKAkAQ.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Unraveling the &lt;em&gt;black box&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;Neural networks are among the most powerful and widely used algorithms in the subfield of machine learning called deep learning. At first glance, a neural network may seem like a &lt;em&gt;black box&lt;/em&gt;: an input layer gets the data into the “&lt;em&gt;hidden layers&lt;/em&gt;” and, after a magic trick, we can see the information provided by the &lt;em&gt;output layer&lt;/em&gt;. However, understanding what the hidden layers are doing is the key step to neural network implementation and optimization.&lt;/p&gt;

&lt;p&gt;In our path to understand neural networks, we are going to answer three questions: &lt;em&gt;What&lt;/em&gt;, &lt;em&gt;How&lt;/em&gt; and &lt;em&gt;Why&lt;/em&gt;?&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;WHAT is a Neural Network?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The neural networks that we are going to consider are strictly called artificial neural networks, and as the name suggests, they are based on what science knows about the human brain’s structure and function.&lt;/p&gt;

&lt;p&gt;Briefly, a neural network is defined as a computing system that consists of a number of simple but highly interconnected elements or nodes, called ‘neurons’, which are organized in layers and process information using dynamic state responses to external inputs. This algorithm is extremely useful, as we will explain later, for finding patterns that are too complex to be manually extracted and taught to the machine. In the context of this structure, patterns are introduced to the neural network by the &lt;em&gt;input layer&lt;/em&gt;, which has one neuron for each component present in the input data, and are communicated to one or more &lt;em&gt;hidden layers&lt;/em&gt; present in the network; they are called ‘hidden’ only because they do not constitute the input or output layer. It is in the hidden layers where all the processing actually happens, through a system of connections characterized by &lt;strong&gt;&lt;em&gt;weights and biases&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;(commonly referred to as W and b)&lt;/em&gt;: the input is received, the neuron calculates a weighted sum, adds the bias and, according to the result and a pre-set &lt;strong&gt;activation function&lt;/strong&gt; (the most common one is the sigmoid, &lt;em&gt;σ&lt;/em&gt;, even though it is almost not used anymore and there are better ones like ReLU), it decides whether it should be ‘fired’ or activated. Afterwards, the neuron transmits the information downstream to other connected neurons in a process called the ‘&lt;em&gt;forward pass&lt;/em&gt;’. At the end of this process, the last hidden layer is linked to the &lt;em&gt;output layer&lt;/em&gt;, which has one neuron for each possible desired output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*pbk9xtz7WbBwYPVATdl9Vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*pbk9xtz7WbBwYPVATdl9Vw.png" alt=""&gt;&lt;/a&gt;Basic structure of a 2-layer Neural Network. Wi: Weight of the corresponding connection. Note: The input layer is not included when counting the number of layers present in the network.&lt;/p&gt;

&lt;h4&gt;
  
  
  HOW does a Neural Network work?
&lt;/h4&gt;

&lt;p&gt;Now that we have an idea of what the basic structure of a neural network looks like, we will go ahead and explain how it works. In order to do so, we need to explain the different types of neurons that we can include in our network.&lt;/p&gt;

&lt;p&gt;The first type of neuron that we are going to explain is the &lt;strong&gt;&lt;em&gt;Perceptron&lt;/em&gt;&lt;/strong&gt;. Even though its use has declined today, understanding how it works will give us a good clue about how more modern neurons function.&lt;/p&gt;

&lt;p&gt;A perceptron uses a function to learn a binary classifier by mapping a vector of binary variables to a single binary output and it can also be used in supervised learning. In this context, the perceptron follows these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multiply all the inputs by their weights &lt;strong&gt;&lt;em&gt;w&lt;/em&gt;&lt;/strong&gt; , real numbers that express how important the corresponding inputs are to the output,&lt;/li&gt;
&lt;li&gt;Add them together referred as &lt;strong&gt;&lt;em&gt;weighted sum: ∑ wj xj&lt;/em&gt;&lt;/strong&gt; ,&lt;/li&gt;
&lt;li&gt;Apply the &lt;strong&gt;&lt;em&gt;activation function&lt;/em&gt;&lt;/strong&gt;, in other words, determine whether the weighted sum is greater than a &lt;em&gt;threshold value&lt;/em&gt; (where -threshold is equivalent to the &lt;em&gt;bias&lt;/em&gt;), and assign 1 if it is, or 0 otherwise, as the output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can also write the perceptron function in the following terms:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/464/1*7y_U0_xQv5e5EUzDUJGvtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/464/1*7y_U0_xQv5e5EUzDUJGvtw.png" alt=""&gt;&lt;/a&gt;Notes: b is the bias and is equivalent to -threshold, w.x is the dot product of w, a vector which component is the weights, and x, a vector consisting of the inputs.&lt;/p&gt;

&lt;p&gt;One of the strongest points of this algorithm is that we can vary the weights and the bias to obtain distinct models of decision-making. We can assign more weight to certain inputs so that, if they are positive, they will favor our desired output. Also, because the bias can be understood as a measure of how easy or difficult it is to output 1, we can drop or raise its value to make the desired output more or less likely. If we pay attention to the formula, we can observe that a big positive bias makes it very easy to output 1, while a very negative bias makes outputting 1 very unlikely.&lt;/p&gt;

&lt;p&gt;In consequence, a perceptron can analyze different evidence or data and make a decision according to the set preferences. It is possible, in fact, to create more complex networks by including more layers of perceptrons, where every layer takes the output of the previous one, weighs it, and makes more and more complex decisions.&lt;/p&gt;

&lt;p&gt;But wait a minute: if perceptrons can do a good job of making complex decisions, why do we need another type of neuron? One of the disadvantages of a network of perceptrons is that small changes in weights or bias, even in only one perceptron, can severely change our output, flipping it from 0 to 1 or vice versa. What we really want is to be able to gradually change the behaviour of our network by introducing small modifications in the weights or bias. Here is where a more modern type of neuron comes in handy (nowadays it has largely been replaced by other types like tanh and, lately, ReLU): &lt;strong&gt;&lt;em&gt;sigmoid neurons&lt;/em&gt;&lt;/strong&gt;. The main difference between a sigmoid neuron and a perceptron is that the input and the output can be any continuous value between 0 and 1. The output is obtained after applying the &lt;strong&gt;&lt;em&gt;sigmoid function&lt;/em&gt;&lt;/strong&gt; to the inputs, considering the weights, w, and the bias, b. To visualize it better, we can write the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/359/1*INRzr1dj-X3d0N-x7XRSpQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/359/1*INRzr1dj-X3d0N-x7XRSpQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, the formula of the output is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/237/1*OWFmaKDixkpbeAP-1hOe2Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/237/1*OWFmaKDixkpbeAP-1hOe2Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we perform a mathematical analysis of this function, we can make a graph of our function σ, shown below, and conclude that when z is large and positive the function reaches its maximum asymptotic value of 1, while if z is large and negative, the function reaches its minimum asymptotic value of 0. Here is where the sigmoid function becomes very interesting, because it is with moderate values of z that the function takes a smooth, close to linear shape. In this interval, small changes in weights (Δwj) or in bias (Δbj) will generate small changes in the output; the desired behaviour that we were looking for as an improvement over the perceptron.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We know that the derivative of a function is the measure of the rate at which a value y changes with respect to a change in a variable x. In this case, the variable y is our output and the variable x is a function of the weights and the bias. We can take advantage of this and calculate the change in the output using the derivatives, and particularly the partial derivatives (with respect to w and with respect to b). You can read &lt;a href="https://theclevermachine.wordpress.com/2014/09/08/derivation-derivatives-for-common-neural-network-activation-functions/"&gt;this post&lt;/a&gt; to follow the calculations, but in the case of the sigmoid function, the derivative reduces to calculating f(z)*(1-f(z)).&lt;/p&gt;

&lt;p&gt;Here is a simple piece of code that can be used to model a sigmoid function:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We have just explained how every neuron in our network works, but now we can examine how the rest of it works. A neural network in which the output from one layer is used as the input to the next layer is called &lt;em&gt;feedforward&lt;/em&gt;, because there are no loops involved and the information is only passed forward, never back.&lt;/p&gt;

&lt;p&gt;Suppose that we have a training set and we want to use a 3-layer neural network, in which we also use the sigmoid neuron we saw above, to predict a certain feature. Taking what we explained about the structure of a neural network, weights and biases first need to be assigned to the connections between the neurons in one layer and the next. Normally, the biases and weights are all initialized randomly in a synapse matrix. If we are coding the neural network in Python, we can use the NumPy function np.random.randn, which samples from a Gaussian distribution (with mean equal to 0 and standard deviation equal to 1), to have a place to start learning.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;After that, we will build the neural network starting with the &lt;strong&gt;&lt;em&gt;Feedforward&lt;/em&gt;&lt;/strong&gt; step to calculate the predicted output; in other words, we just need to build the different layers involved in the network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;layer0 is the input layer; our training set read as a matrix (we can call it X)&lt;/li&gt;
&lt;li&gt;layer1 is obtained by applying the activation function a’ = σ(w·X+b); in our case, performing the dot product between the input layer0 and the synapse matrix syn0&lt;/li&gt;
&lt;li&gt;layer2 is the output layer, obtained from the dot product between layer1 and its synapse matrix syn1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will also need to iterate over the training set to let the network learn (we will see this later). In order to do so, we will add a &lt;em&gt;for&lt;/em&gt; loop.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Until now, we have created the basic structure of the neural network: the different layers, the weights and biases of the connections between the neurons, and the sigmoid function. But none of this explains how the neural network can do such a good job at predicting patterns in a dataset. And this is what takes us to our last question.&lt;/p&gt;

&lt;h4&gt;
  
  
  WHY Neural Networks are able to learn?
&lt;/h4&gt;

&lt;p&gt;The main strength of machine learning algorithms is their ability to learn and improve each time they predict an output. But what does it mean that they can learn? In the context of neural networks, it implies that the weights and biases that define the connections between neurons become more precise; that is, the weights and biases are eventually selected such that the output from the network approximates the real value y(x) for all the training inputs.&lt;/p&gt;

&lt;p&gt;So, how do we quantify how far our prediction is from the real value, in order to know whether we need to keep searching for more precise parameters? For this, we need to calculate an error, or in other words, define a &lt;strong&gt;&lt;em&gt;cost function&lt;/em&gt;&lt;/strong&gt; (the cost function is nothing other than the error our network makes in predicting the correct output; in other terms, the difference between the expected and the predicted output). In neural networks, the most commonly used one is the quadratic cost function, also called the mean squared error, defined by the formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/434/1*mlXjgbRMYFcBdW5qW0NvlQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/434/1*mlXjgbRMYFcBdW5qW0NvlQ.png" alt=""&gt;&lt;/a&gt;w and b referred to all the weights and biases in the network, respectively. n is the total number of training inputs. a is the outputs when x is the input. ∑ is the sum over all training inputs.&lt;/p&gt;

&lt;p&gt;This function is preferred over a linear error because, in neural networks, small changes in weights and biases do not produce any change in the number of correct outputs; so using a quadratic function, where big differences have more effect on the cost than small ones, helps in figuring out how to modify these parameters.&lt;/p&gt;

&lt;p&gt;On the other hand, we can see that our cost function becomes smaller as the output gets closer to the real value &lt;em&gt;y&lt;/em&gt;, for all training inputs. The main goal of our algorithm is to minimize this cost function by finding a set of weights and biases that makes it as small as possible. And the main tool to achieve this goal is an algorithm called &lt;strong&gt;&lt;em&gt;Gradient Descent&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Then, the next question that we should answer is how we can minimize the cost function. From calculus, we know that a function can have a global maximum and/or minimum, that is, a point where the function achieves the maximum or minimum value it can take. We also know that one way to obtain that point is by calculating derivatives. However, this is easy when we have a function of two variables; a neural network includes so many variables that the computation becomes practically impossible.&lt;/p&gt;

&lt;p&gt;Instead, let’s take a look at the graph below of a random function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*ARnEDAhm5BAKQdqeehS7NA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*ARnEDAhm5BAKQdqeehS7NA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that this function has a global minimum. We could, as we said before, compute the derivatives to calculate where the minimum is located, or we could take another approach. We can start at a random point and try to make a small move in the direction of the arrow; mathematically speaking, we move Δx in the x direction and Δy in the y direction, and calculate the change in our function, ΔC. Because the rate of change in a direction is the derivative of the function, we can express the change in the function as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/371/1*diN96kF8ptDZOwk7Tld7Mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/371/1*diN96kF8ptDZOwk7Tld7Mg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we will take the definition from calculus of the gradient of a function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/280/1*i0viu5q_L_uA62NeQHiOtA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/280/1*i0viu5q_L_uA62NeQHiOtA.png" alt=""&gt;&lt;/a&gt;Gradient of a function: Vector with partial derivatives&lt;/p&gt;

&lt;p&gt;Now, we can rewrite the change in our function as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/302/1*hBpXuOx111_cjNt_ISe9GQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/302/1*hBpXuOx111_cjNt_ISe9GQ.png" alt=""&gt;&lt;/a&gt;Gradient of C relates the change in function C to changes in (x,y)&lt;/p&gt;

&lt;p&gt;So now, we can see what happens with the cost function when we choose a certain change in our parameters. The amount that we choose to move in any direction is called the &lt;em&gt;learning rate&lt;/em&gt;, and it is what defines how fast we move towards the global minimum. If we choose a very small number, we will need too many moves to reach this point; however, if we choose a very big number, we are at risk of overshooting the point and never being able to reach it. So the challenge is to choose a learning rate small enough, but not too small. After choosing the learning rate, we can update our weights and biases and make another move; a process that we repeat in each iteration.&lt;/p&gt;

&lt;p&gt;So, in a few words, gradient descent works by computing the gradient ∇C repeatedly and then updating the weights and biases, trying to find the correct values that minimize the cost function. And this is how the neural network learns.&lt;/p&gt;

&lt;p&gt;Sometimes, calculating the gradient can be very complex. There is, however, a way to speed up these calculations, called &lt;em&gt;stochastic gradient descent&lt;/em&gt;. It works by estimating the gradient ∇C from the gradient computed for a small sample of randomly chosen training inputs. These small samples are then averaged to get a good estimate of the true gradient, speeding up gradient descent, and thus learning faster.&lt;/p&gt;

&lt;p&gt;But wait a second: how do we compute the gradient of the cost function? Here is where another algorithm makes its entry: &lt;strong&gt;&lt;em&gt;Backpropagation&lt;/em&gt;&lt;/strong&gt;. The goal of this algorithm is to compute the partial derivatives of the cost function with respect to any weight &lt;em&gt;w&lt;/em&gt; and any bias &lt;em&gt;b&lt;/em&gt;; in practice, this means calculating the error vectors starting from the final layer and then propagating them back to update the weights and biases. The reason why we need to go back is that the cost is a function of the output of our network. The quantities we need to compute are given by the backpropagation formulas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The output error (δL), related to the element-wise (⦿) product of the gradient (▽C) and the derivative of the activation function (σ′(z)).&lt;/li&gt;
&lt;li&gt;The error of one layer (δl) in terms of the error of the next layer: the transpose of the weight matrix (Wl+1) multiplied by the error of the next layer (δl+1), element-wise multiplied by the derivative of the activation function.&lt;/li&gt;
&lt;li&gt;The rate of change of the cost with respect to any bias in the network: the partial derivative of C with respect to any bias (∂C/∂bj) is equal to the error δl.&lt;/li&gt;
&lt;li&gt;The rate of change of the cost with respect to any weight in the network: the partial derivative of C with respect to any weight (∂C/∂wj) is equal to the error (δl) multiplied by the activation of the neuron’s input.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These last two quantities constitute the gradient of the cost function. Here, we can observe the formulas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/375/1*8_a2jP209g6ypHF4EMBD1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/375/1*8_a2jP209g6ypHF4EMBD1A.png" alt=""&gt;&lt;/a&gt;Four essential formulas given by backpropagation algorithms that are useful to implement neural networks&lt;/p&gt;

&lt;p&gt;The backpropagation algorithm calculates the gradient of the cost function for a single training example. As a consequence, we need to combine backpropagation with a learning algorithm, for instance stochastic gradient descent, in order to compute the gradient over the whole training set.&lt;/p&gt;

&lt;p&gt;Now, how do we apply this to our neural network in Python? Here, we can see the calculations step by step:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;h4&gt;
  
  
  Let’s wrap up everything…
&lt;/h4&gt;

&lt;p&gt;Now we can put all of these formulas and concepts that we have seen in terms of an algorithm to see how we can implement this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;INPUT&lt;/strong&gt; : We input a set of training examples and we set the activation &lt;em&gt;a&lt;/em&gt; that corresponds to the input layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FEEDFORWARD&lt;/strong&gt; : For each layer, we compute the function z = w · a + b, with a = σ(z)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OUTPUT ERROR&lt;/strong&gt; : We compute the output error by using the formula #1 cited above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BACKPROPAGATION&lt;/strong&gt; : Now we backpropagate the error; for each layer, we compute the formula #2 cited above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OUTPUT&lt;/strong&gt; : We calculate the gradient of the cost with respect to any weight and bias by using formulas #3 and #4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, there are more concepts, implementations and improvements that can be applied to neural networks, which have become more and more widely used and powerful over the last years. But I hope this information gives you a hint of what a neural network is and how it works and learns using gradient descent and backpropagation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://neuralnetworksanddeeplearning.com/"&gt;Neural Networks and Deep Learning&lt;/a&gt;. Michael Nielsen&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=h3l4qz76JhQ"&gt;Build a Neural Network in 4 minutes&lt;/a&gt;. Siraj Raval&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>neuralnetworks</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Linear and Bayesian modeling in R: Predicting movie popularity</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Fri, 31 Aug 2018 13:01:27 +0000</pubDate>
      <link>https://forem.com/ugis22/linear-and-bayesian-modeling-in-r-predicting-movie-popularity-2j2f</link>
      <guid>https://forem.com/ugis22/linear-and-bayesian-modeling-in-r-predicting-movie-popularity-2j2f</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*tvhwsNMZ-MgfEIol59yxPw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*tvhwsNMZ-MgfEIol59yxPw.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Which movie should I choose?
&lt;/h4&gt;

&lt;p&gt;Let’s imagine a rainy day. You look outside through the window and everything is grey and cold. You grab your blanket and sit on your favourite couch; it is very cozy and comfy and you just decided it’s a perfect day to watch a movie. You want to watch a good one, or at least something that is popular. You have several options, but some of them are not even rated on the typical movie websites! Wouldn’t it be nice to be able to predict what people think of those movies? Well, maybe there is a solution for that. What about predicting a movie’s popularity according to some of its characteristics? We just need a dataset with movies, some statistical tools and RStudio.&lt;/p&gt;

&lt;p&gt;In our dataset, there are 651 &lt;strong&gt;&lt;em&gt;randomly sampled&lt;/em&gt;&lt;/strong&gt; movies released in United States movie theaters in the period 1970–2016. The data was obtained from &lt;a href="https://www.rottentomatoes.com/"&gt;Rotten Tomatoes&lt;/a&gt; and &lt;a href="https://www.imdb.com/"&gt;IMDB&lt;/a&gt;. The dataset contains 32 features of each movie, including genre, MPAA rating, production studio, and whether they received &lt;em&gt;Oscar&lt;/em&gt; nominations, among other characteristics. So now, we can ask ourselves:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the popularity of a movie be predicted by considering certain of its characteristics such as type, genre, MPAA rating, number of IMDb votes, and whether it has won an award?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before going on to develop any model, we need to answer two questions: &lt;em&gt;Can our results be generalised?&lt;/em&gt; &lt;em&gt;Which type of inference can be done with the present dataset?&lt;/em&gt; For the first question, we can notice that the movies included in this dataset were randomly sampled from the two sources mentioned above and no bias was created by the sampling method; as a consequence, we can assume that the results obtained can be &lt;strong&gt;&lt;em&gt;generalised&lt;/em&gt;&lt;/strong&gt;  &lt;strong&gt;&lt;em&gt;to all U.S. movies released between 1970 and 2016&lt;/em&gt;&lt;/strong&gt;. On the other hand, this is an &lt;em&gt;observational study&lt;/em&gt;, so the relationships that can be found in this data indicate &lt;strong&gt;&lt;em&gt;association&lt;/em&gt;&lt;/strong&gt;, but &lt;em&gt;not causation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Ready to start? Wait a second. What do we understand as the “popularity” of a movie? Our dataset includes movies sampled from two different sources and we have two variables that could potentially be used as &lt;em&gt;popularity&lt;/em&gt;: audience_score (Audience score on Rotten Tomatoes) and imdb_rating (Rating on IMDB). So let’s go ahead and analyse these two variables a little more. First of all, we will check whether these variables are correlated with each other. To do this, we will plot both variables in a scatter plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*JKeCABBDA7JAJ3_3OEpsWA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*JKeCABBDA7JAJ3_3OEpsWA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the plot shows a possible positive correlation between the two variables. We will confirm this by using the function cor to numerically calculate the correlation:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;As we can observe, there is a high correlation between both variables. If we are going to include one of them as the response variable, it is better not to include the other one as an independent variable. So we need to decide which one to select as the response variable. We can analyse their distributions by making use of histograms and summary statistics to make a wise choice. Let’s start with imdb_rating:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*GwKjZ-aSVsprqKpDafRhOg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*GwKjZ-aSVsprqKpDafRhOg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/321/1*ys4QKn6SXC5Umf7NmMGe1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/321/1*ys4QKn6SXC5Umf7NmMGe1g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we can do the same for audience_score:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*wFY_H0tBeqcGGMVO9tKNWg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*wFY_H0tBeqcGGMVO9tKNWg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/346/1*90dVUkDcgQ6QV1NugwWcSg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/346/1*90dVUkDcgQ6QV1NugwWcSg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the variable imdb_rating shows a distribution close to normal with a slight left skew, a mean of 6.493 and a median of 6.00. On the other hand, the variable audience_score shows a more uniform distribution with a mean of 62.36 and a median of 65.00. Because of its distribution, we will choose to consider only imdb_rating.&lt;/p&gt;

&lt;p&gt;After deciding which variable we will consider as our response variable, we need to &lt;em&gt;explore our explanatory variables&lt;/em&gt;. We can analyse the distribution of the variables that we are interested in including in our model by plotting a histogram for each of them and obtaining summary descriptive tables. For those variables that are &lt;strong&gt;categorical&lt;/strong&gt;, we can create a proportion table by using the built-in function table. On the other hand, we can create a data frame for the &lt;strong&gt;continuous variables&lt;/strong&gt; and apply the function summary. I will not show the entire code here, but, for example, in our dataset the analysis showed that the distribution of the number of votes was right skewed. In order to adjust this, we can apply a log transformation to the values (log_votes).&lt;/p&gt;

&lt;p&gt;After that, we can analyse the interaction between our explanatory variables and the response variable. For this task, we can plot boxplots or scatter plots according to whether the explanatory variable is categorical or numerical. I will only show the significant findings.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/672/1*x0_f8oGfkNRTzoztaLobPg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/672/1*x0_f8oGfkNRTzoztaLobPg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the plots and the summary statistics obtained, it can be seen that, in our dataset, movies that won an &lt;em&gt;Oscar&lt;/em&gt; or whose director has ever won an &lt;em&gt;Oscar&lt;/em&gt; appear to have a slightly higher rating. Moreover, the number of votes shows a weak positive association with the IMDB rating. Last, the variables best_actor_win and best_actress_win appear to have the same distribution and a similar association with imdb_rating, so we will combine these two variables into a new one called main_oscar_win.&lt;/p&gt;

&lt;p&gt;Now, we have a good idea of what our response variable looks like and also a hint of which variables could be important to predict the popularity of a movie. &lt;em&gt;It’s time to start building a model!&lt;/em&gt; We will take two approaches here: first, we will do a multiple linear regression and then, we will develop a Bayesian model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple linear regression model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multiple linear regression seeks to model the relationship between two or more independent or explanatory variables and the response variable by fitting a linear equation to the data.&lt;/p&gt;

&lt;p&gt;Our goal is to reach a &lt;strong&gt;parsimonious model&lt;/strong&gt;, that is, the simplest model with the greatest explanatory and predictive power. In order to do this, we have two options for model selection: forward selection and backwards elimination. In the first case, we start with an empty model and add one predictor at a time. We will choose the second option: backwards elimination implies starting with a model comprising all candidates and dropping one predictor at a time until the parsimonious model is reached. In our case, our first full model will include the six variables that we found before could be important for predicting movie popularity: genre, best_pic_win, best_dir_win, main_oscar_win, log_votes and mpaa_rating. In R, we can use the function lm to build a linear model:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/553/1*N0wbQeA_wZ4aemM0LibpeQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/553/1*N0wbQeA_wZ4aemM0LibpeQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the full model, there are several criteria that we can use in order to drop variables: p-value and adjusted R². We will choose the p-value as the elimination criterion because, in this case, the aim is to create a model that shows the highest predictive value using only significant variables.&lt;/p&gt;

&lt;p&gt;After running the full model with all the variables involved, we have obtained an adjusted R² of 0.3582, which means that we can still improve the model. In order to do so, we can start by removing the variable which has the highest p-value each time, until all the variables remaining in the model are significant. So the variable that has the highest p-value in our model is main_oscar_win.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/546/1*TSuArh0W0JQ4Q3NC5KZjBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/546/1*TSuArh0W0JQ4Q3NC5KZjBQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running our simpler model again, we can see that the adjusted R² is now 0.3594. We can try to improve our model further by again eliminating the variable with the highest p-value. In this case, it is best_pic_win.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/550/1*9_jhT4Hexo-3JKneHg_YgQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/550/1*9_jhT4Hexo-3JKneHg_YgQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now see that the adjusted R² is 0.3595, not different from our previous model in step 1, but this time all the variables involved are significant. I will not show it here for practicality’s sake, but removing any of the other variables decreases the adjusted R². So we consider this our final model.&lt;/p&gt;

&lt;p&gt;There is a very important concept to have in mind for linear regression: &lt;strong&gt;Collinearity.&lt;/strong&gt; Two variables are considered to be collinear when they are highly correlated with each other. The inclusion of collinear predictors complicates the model estimation.&lt;/p&gt;

&lt;p&gt;So at this point, we can look into our variables and see if the ones we are interested in show some degree of collinearity. In our dataset, we have mixed variables, that is, some variables are categorical and some are continuous, so in this case a way to measure collinearity is the &lt;strong&gt;&lt;em&gt;variance inflation factor&lt;/em&gt; (VIF)&lt;em&gt;.&lt;/em&gt;&lt;/strong&gt; The VIF, which quantifies the extent of multicollinearity in an ordinary linear regression, is calculated as the ratio between the variance of the model with multiple terms and the variance of the model with one term alone. In simple words, it tells us how much the variance of a regression coefficient increases due to collinearity existing in the model. So, let’s go ahead and calculate this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/329/1*Wac5uF9_OOXsIGKVp3-Rjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/329/1*Wac5uF9_OOXsIGKVp3-Rjg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of our predictors has a high VIF, so we can assume that multicollinearity is not playing a role in our model.&lt;/p&gt;

&lt;p&gt;Now, it’s time to run some diagnostic in our model. The multiple regression model depends on the following four assumptions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each numerical explanatory variable is linearly related to the response variable&lt;/li&gt;
&lt;li&gt;Residuals are distributed nearly normal with a mean of 0&lt;/li&gt;
&lt;li&gt;Variability of residuals is nearly constant&lt;/li&gt;
&lt;li&gt;The residuals are independent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will test one-by-one the assumptions in the context of our model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The only numerical variable that we have in our model is log_values. So we can explore the first assumption by checking the residual plots.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*F7dayZPxnB1upwvajzGq_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*F7dayZPxnB1upwvajzGq_Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot shows that the residuals are randomly scattered around 0, which indicates a linear relationship between the numerical explanatory variable and the response variable.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To check this condition, we will first plot a histogram of the residuals and then a residuals Q-Q plot.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/621/1*dEWQE6ZoNK-tDtuH8RCsQw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/621/1*dEWQE6ZoNK-tDtuH8RCsQw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see above, the histogram and the residuals Q-Q plot show a close-to-normal distribution, though they also mimic the left-hand skew that was observed in the original imdb rating variable.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now, we need to check that the residuals are equally variable for low and high values of the predicted response variable. To do so, we will check the plot of residuals vs. predicted values.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*H_SczBlKX2SGySGCC4KMgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*H_SczBlKX2SGySGCC4KMgA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The residuals are randomly scattered in a band with a constant width around 0.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lastly, we will check for the independence of the residuals:&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/480/1*ytJNSpK6v3xsf0bUiMDG4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/480/1*ytJNSpK6v3xsf0bUiMDG4w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot above does not display any particular pattern, so it is possible to assume that the residuals, and as a consequence the observations, are independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bayesian model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Usually, we are taught traditional frequentist statistics to solve a problem. However, there is another approach, sometimes dismissed as subjective, which is more intuitive, closer to how we think about probability in everyday life, and yet a very powerful tool: &lt;strong&gt;Bayesian statistics&lt;/strong&gt;. There are some key concepts on which this theory relies: &lt;em&gt;conditional probability&lt;/em&gt; and &lt;em&gt;Bayes’ theorem.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Conditional probability&lt;/em&gt; is the probability that an event will happen given that another event took place. If the event B is known or assumed to have taken place, then the conditional probability of our event of interest A given B is written as &lt;em&gt;P&lt;/em&gt;(A|B).&lt;/p&gt;

&lt;p&gt;When two events are independent, meaning that A happening does not affect whether B happens, the &lt;em&gt;conjunction probability of A and B&lt;/em&gt; (in other words, the probability of both events being true) is written as &lt;em&gt;P&lt;/em&gt;(&lt;em&gt;A and B&lt;/em&gt;) = P(A) P(B). But this is not the case if B depends on A, where the conjunction probability is &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;A and B&lt;/em&gt;) = &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;A&lt;/em&gt;) &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;B&lt;/em&gt;|&lt;em&gt;A&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Here, after some mathematical calculations, the &lt;em&gt;Bayes theorem&lt;/em&gt; can be derived and it is presented as follows:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;p&lt;/em&gt;(&lt;em&gt;A&lt;/em&gt;|&lt;em&gt;B&lt;/em&gt;) = &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;A&lt;/em&gt;) &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;B&lt;/em&gt;|&lt;em&gt;A&lt;/em&gt;) / &lt;em&gt;p&lt;/em&gt;(&lt;em&gt;B&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;To put this in words: the probability of A given that B has occurred is calculated as the unconditioned probability of A occurring, multiplied by the probability of B occurring if A happened, divided by the unconditioned probability of B.&lt;/p&gt;

&lt;p&gt;There is a powerful interpretation of this theorem called the &lt;strong&gt;diachronic interpretation&lt;/strong&gt;, meaning that something is happening over time, and it gives us a tool to update the probability of a hypothesis given new data. In this interpretation, the terms in our equation take on some other meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;p&lt;/em&gt;(&lt;em&gt;A&lt;/em&gt;) is the probability of the hypothesis before we see the data, called the prior probability, or just  &lt;strong&gt;prior&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;p&lt;/em&gt;(A|B) is our goal, this is the probability of the hypothesis after we see the data, called the &lt;strong&gt;posterior&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;p&lt;/em&gt;(&lt;em&gt;B&lt;/em&gt;|A) is the probability of the data under the hypothesis, called the &lt;strong&gt;likelihood&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;p&lt;/em&gt;(&lt;em&gt;B&lt;/em&gt;) is the probability of the data under any hypothesis, called the &lt;strong&gt;normalizing constant&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
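
&lt;p&gt;As a tiny numeric illustration of this update (the numbers are made up, not taken from the movies data): suppose a hypothesis has a prior of 0.3, and the data has likelihood 0.8 under that hypothesis and 0.4 under the alternative. The posterior then follows directly from the formula above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative values only: p(A) = 0.3, p(B|A) = 0.8, p(B|not A) = 0.4
prior_a = 0.3
like_b_given_a = 0.8
like_b_given_not_a = 0.4

# Normalizing constant p(B): total probability of the data over both hypotheses
p_b = prior_a * like_b_given_a + (1 - prior_a) * like_b_given_not_a

posterior_a = prior_a * like_b_given_a / p_b
print(round(posterior_a, 3))  # 0.24 / 0.52, roughly 0.462
&lt;/code&gt;&lt;/pre&gt;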

&lt;p&gt;There is one more key element when we want to build a model under the Bayesian approach: the &lt;strong&gt;Bayes factor.&lt;/strong&gt; The &lt;em&gt;Bayes factor&lt;/em&gt; is the ratio of the likelihoods of two competing hypotheses (usually the null and alternative hypotheses), and it helps us quantify the support for one model over another. In Bayesian modelling, the choice of prior distribution is a key component of the analysis and can modify our results; however, the prior starts to lose weight as we add more data. Non-informative priors are convenient when the analyst does not have much prior information.&lt;/p&gt;

&lt;p&gt;In R, we can conduct Bayesian regression using the BAS package. We will use Bayesian Model Averaging ( &lt;strong&gt;BMA&lt;/strong&gt; ), which provides a mechanism for accounting for model uncertainty, and we need to pass the function some parameters:&lt;/p&gt;

&lt;p&gt;Prior: &lt;strong&gt;Zellner-Siow Cauchy&lt;/strong&gt; (Uses a &lt;a href="https://en.wikipedia.org/wiki/Cauchy_distribution"&gt;Cauchy distribution&lt;/a&gt; that is extended for multivariate cases)&lt;/p&gt;

&lt;p&gt;Model prior: Uniform (assign equal probabilities to all models)&lt;/p&gt;

&lt;p&gt;Method: &lt;a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo"&gt;Markov chain Monte Carlo&lt;/a&gt; ( &lt;strong&gt;MCMC&lt;/strong&gt; )( improves the model search efficiency)&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We will now print the marginal inclusion probabilities obtained for the model:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/773/1*zvaGvmJpHpNbI7JyFpzdzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/773/1*zvaGvmJpHpNbI7JyFpzdzg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, we can use the function summary to see the top 5 models with the zero-one indicators for variable inclusion.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/681/1*2pubkrh_OQAQ-ssSRLFHSw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/681/1*2pubkrh_OQAQ-ssSRLFHSw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A column is also displayed with the Bayes factor (&lt;em&gt;BF&lt;/em&gt;) of each model relative to the highest probability model, along with the posterior probabilities of the models (&lt;em&gt;PostProbs&lt;/em&gt;), the R² of the models, the dimension of the models (&lt;em&gt;dim&lt;/em&gt;) and the log marginal likelihood (&lt;em&gt;logmarg&lt;/em&gt;) under the selected prior distribution.&lt;/p&gt;

&lt;p&gt;Last, we can make use of the function image to visualize the Log Posterior Odds and Model Rank.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/606/1*0LNC2BXgzZCb8QIX-LXZEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/606/1*0LNC2BXgzZCb8QIX-LXZEg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the picture above, each row corresponds to a variable included in the full model, plus one extra row for the intercept. In the columns, we can see all possible models (2¹⁶, because we have 16 variables included) sorted by their posterior probability, from best to worst rank (left to right).&lt;/p&gt;

&lt;p&gt;From the model and the image above, we can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feature_film has a marginal probability of 0.999 and appears in all five top models&lt;/li&gt;
&lt;li&gt;critics_score has a marginal probability of 0.999 and also appears in all five top models&lt;/li&gt;
&lt;li&gt;runtime has a marginal probability of 0.98 and appears in all five top models&lt;/li&gt;
&lt;li&gt;drama has a marginal probability of 0.57 and appears in three of the five top models&lt;/li&gt;
&lt;li&gt;imbd_num_votes has a marginal probability of 0.99 and appears in three of the five top models&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;intercept&lt;/em&gt; also has a marginal probability of 1, and appears in all five top models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to this, the best model includes the intercept, feature_film, critics_score, drama, imbd_num_votes and runtime.&lt;/p&gt;

&lt;p&gt;We can now obtain the coefficient estimates and standard deviations under BMA in order to examine the marginal distributions of the coefficients of the important variables. To do so, we will use the function coef and plot them using plot.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/752/1*X0Fk4d61HQmp00-VdN4sSw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/752/1*X0Fk4d61HQmp00-VdN4sSw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The vertical line corresponds to the posterior probability that the coefficient equals 0. The shaped curve, on the other hand, shows the density of possible values where the coefficient is non-zero. It is worth mentioning that the height of the line is scaled to its probability. Accordingly, the intercept and feature_film, critics_score, imbd_num_votes and runtime show essentially no vertical line, denoting a near-zero probability of a zero coefficient.&lt;/p&gt;

&lt;p&gt;Last, we can obtain 95% credible intervals for the coefficients (the probability that the true value is contained within a given interval is 0.95) using the confint method.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/510/1*cza2QRuEfoWZ4rpFT-5feA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/510/1*cza2QRuEfoWZ4rpFT-5feA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The BAS package provides an easy way to get graphical summaries for our model, just using the function plot and the which option.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Residual vs. fitted plot&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/629/1*xZ0NYcew6mRRqQ_drGSs7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/629/1*xZ0NYcew6mRRqQ_drGSs7g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ideally, we would expect to see no outliers and constant variance. In this case, there is a constant spread over the predictions, but there are two outliers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model probabilities&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/619/1*6sP1XKw7Dypca8MqI7aHzA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/619/1*6sP1XKw7Dypca8MqI7aHzA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This plot displays the cumulative probability of the models in the order that they are sampled. It shows that the cumulative probability starts to level off after about 300 model trials, as each additional model adds only a small increment. The model search stops at ~1400 models instead of enumerating all 2¹⁵ combinations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model complexity&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/618/1*4_jy_Cf2WLXaORbg_2CX5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/618/1*4_jy_Cf2WLXaORbg_2CX5g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This plot shows the dimension of each model, that is, the number of regression coefficients including the intercept, versus the log of the marginal likelihood of the model. In this case, we can see that the highest log marginal likelihood is reached by models of 5 to 12 dimensions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Marginal inclusion probabilities&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/628/1*8JCs9WAhUkNm5LN5XDvGMA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/628/1*8JCs9WAhUkNm5LN5XDvGMA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, we can observe the marginal posterior inclusion probability for each of the covariates, with probabilities greater than 0.5 shown in red (variables that are important for explaining the data and prediction). The graph confirms what was already shown before about which variables contribute to the final score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PREDICTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Now it’s time to test the predictive capability of&lt;/em&gt; &lt;em&gt;our two models&lt;/em&gt;! We will use the movie “&lt;em&gt;Zootropolis&lt;/em&gt;” released in 2016. The corresponding information was obtained from the &lt;a href="https://www.imdb.com/title/tt2948356/?ref_=nv_sr_1"&gt;IMDB website&lt;/a&gt; and &lt;a href="https://www.rottentomatoes.com/m/zootopia"&gt;RottenTomatoes&lt;/a&gt; to be consistent with the analysis data.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/728/1*5vfOEBYrhqFtCkqxXHAmfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/728/1*5vfOEBYrhqFtCkqxXHAmfg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/544/1*oy8ztgSP_yEJMwfrBYQuMA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/544/1*oy8ztgSP_yEJMwfrBYQuMA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we saw above, the true imdb_rating is 8, which is pretty close to what our Bayesian model predicted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So what can we conclude?&lt;/em&gt; From the linear regression and the Bayesian model we learnt that the popularity of a movie can indeed be predicted by considering characteristic data of each movie.&lt;/p&gt;

&lt;p&gt;In the linear regression analysis, it was possible to build a parsimonious, multivariable, linear model that is able, to some extent, to predict movie popularity, understood as IMDb rating, with the four statistically significant predictors chosen. However, it is important to remember that the adjusted R² of our final model is only 0.3595, meaning that 35.95% of the variability is explained by the model. In the Bayesian analysis, we also arrived at a parsimonious model that fulfilled the Bayesian assumptions.&lt;br&gt;&lt;br&gt;
Of the two models, the Bayesian model is the one whose prediction was closest to the real IMDb rating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;References:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peng, Roger D. (2016) &lt;em&gt;Exploratory Data Analysis with R.&lt;/em&gt; &lt;a href="https://leanpub.com/exdata"&gt;LeanPub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Downey Allen B. (2012) &lt;em&gt;Think Bayes. Bayesian Statistics in Python.&lt;/em&gt; &lt;a href="http://greenteapress.com/wp/think-bayes/"&gt;Green Tea Press&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>bayesianstatistics</category>
      <category>statistics</category>
      <category>linearregression</category>
    </item>
    <item>
      <title>Predicting Survival in Patients: Prediction</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Fri, 10 Aug 2018 14:57:15 +0000</pubDate>
      <link>https://forem.com/ugis22/predicting-survival-in-patients-prediction-2kl7</link>
      <guid>https://forem.com/ugis22/predicting-survival-in-patients-prediction-2kl7</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*vpaLyTILKqRSQXqIu6BT7w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*vpaLyTILKqRSQXqIu6BT7w.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Building my first Data Science project
&lt;/h4&gt;

&lt;p&gt;After getting the data, it’s very tempting to jump immediately into trying to fit several models and evaluate their performance. However, the first thing that has to be done is an exploratory data analysis (EDA), which allows us to explore the structure of our data and to understand the relationships governing the variables. Any EDA should involve creating and analysing several plots and computing summary statistics to consider the patterns present in our dataset.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to see, how I performed EDA for this particular project, you can read this&lt;/em&gt; &lt;a href="https://dev.to/ugis22/predicting-survival-in-patients-exploratory-analysis-3k77-temp-slug-5133306"&gt;&lt;em&gt;previous post&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From the EDA on this project, we have learnt some important features of our dataset. First of all, it doesn’t suffer from &lt;strong&gt;&lt;em&gt;class imbalance&lt;/em&gt;&lt;/strong&gt;, which occurs when the total number of observations in one class is significantly lower than the observations in the other class. Also, some of our variables showed skewness that was fixed by log-transforming them, and no variable showed a perfect linear relationship with another, though in some of them we could observe a trend towards an interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Machine learning predictions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the main decisions to make when performing machine learning is choosing the appropriate algorithm that fits the current problem we are dealing with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; refers to the task of inferring a function from a labeled training dataset. We fit the model to the labeled training set with the main goal of finding the optimal parameters that will predict unknown labels of new examples included in the test dataset. There are two main types of supervised learning: regression, in which we want to predict a label that is a real number, and classification, in which we want to predict a categorical label.&lt;br&gt;&lt;br&gt;
In our case, we have a labeled dataset and we want to use a classification algorithm to find the label in the categorical values: 0 and 1.&lt;/p&gt;

&lt;p&gt;We can find many classification supervised learning algorithms, some simple but efficient, such as linear classifiers or logistic regression, and other, more complex but powerful ones, such as decision trees and k-nearest neighbours.&lt;/p&gt;

&lt;p&gt;In this case, we will choose the &lt;strong&gt;Random Forest&lt;/strong&gt; algorithm. Random forest is one of the most widely used machine learning algorithms because it is simple, flexible and easy to use, yet produces reliable results.&lt;/p&gt;

&lt;p&gt;Briefly, random forest creates ‘&lt;em&gt;a forest&lt;/em&gt;’ of multiple decision trees and ensembles them in order to obtain a more accurate prediction. The advantages of random forest over a single decision tree are that the combination of the individual models improves the overall result, and that it prevents overfitting by building smaller trees from random subsets of the features.&lt;/p&gt;

&lt;p&gt;So we will first load the packages from scikit-learn that we need to perform Random Forest and also to evaluate afterwards the model. We will also replace the categorical values with 0 or 1 or NaN as well as transform all variables to float and log-transform the variables to fix skewness, like we did in the EDA. We will again check the total number of missing values in each variable:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;In the EDA, we dropped all NaN values. Here, we need to evaluate the best method to handle them.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;There are several ways to deal with missing data&lt;/em&gt;&lt;/strong&gt;, but none of them is perfect. The first step is to understand why the data went missing. In our case, we can guess that the values missing in the categorical variables could be due to the absence of the feature, which instead of being recorded as a no was left blank, or to the feature simply not being tested. Also, missing values in continuous variables could be explained by the lack of biochemical studies performed on that particular patient, or because the parameters were within the normal range and were not written down.&lt;/p&gt;

&lt;p&gt;In both cases, we could be in the presence of &lt;strong&gt;Missing at Random&lt;/strong&gt; values (the fact that the value is missing has nothing to do with the hypothetical value) or &lt;strong&gt;Missing not at Random&lt;/strong&gt; values (the missing value depends on the hypothetical value). In the first case, we could drop the NaN values safely, while in the last case it would not be safe to drop them, because the missing value tells us something about the hypothetical value. So in our case, we will impute the missing values once we are about to train our model.&lt;/p&gt;

&lt;p&gt;Feature scaling or data normalization, a method used to standardize the range of independent variables, is also a very important step before training many classifiers. Some models can perform very poorly if the data is not within the same range. Another advantage of random forest is that it does not require this step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splitting the dataset into training and test datasets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to train and test our model, we need to split our dataset into two subsets, &lt;strong&gt;&lt;em&gt;the training and the test dataset&lt;/em&gt;&lt;/strong&gt;. The model will learn from the training dataset to generalize to other data; the test dataset will be used to “test” what the model learnt in the training and fitting step.&lt;br&gt;&lt;br&gt;
It is common to use the &lt;strong&gt;80%-20%&lt;/strong&gt; rule to split the original dataset. It is important to use a reliable method to split the dataset to avoid data leakage; that is, the presence in the test set of examples that were also in the training set, which can cause overfitting.&lt;/p&gt;

&lt;p&gt;First, we will assign all the columns except our dependent variable (“Class”) to the variable X and the column “Class” to the variable Y.&lt;br&gt;&lt;br&gt;
Then we will use train_test_split from the scikit-learn library to split them into X_train, X_test, Y_train and Y_test. It is important to set random_state, because this will allow us to obtain the same results every time we run the code.&lt;/p&gt;
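
&lt;p&gt;A minimal sketch of this split, continuing from the preprocessed data frame above (the target column name is assumed to be “Class”):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.model_selection import train_test_split

# Features and target
X = df.drop(columns=["Class"])
Y = df[["Class"]]

# 80%-20% split with a fixed random_state for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;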

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Train/test splitting has some disadvantages, because some models require tuning hyperparameters, which, in this context, is also done on the training set. One way to avoid this is to create a train/validation/test split with a 60/20/20% rule. There are several effective methods to do this that we will see below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Random Forest&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is now very easy to impute missing values (using Imputer) and to create and train the basic random forest model using the package Scikit-learn. We will start by applying .ravel() to &lt;strong&gt;&lt;em&gt;Y_train&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Y_test&lt;/em&gt;&lt;/strong&gt; to flatten the arrays, as not doing so will raise warnings from our model.&lt;/p&gt;

&lt;p&gt;Then, we will impute our missing values using the function Imputer with the strategy most_frequent, which will replace the missing values with the most frequent value in each column (axis = 0). It is worth noticing that doing so can introduce errors and bias, but of course, as we stated before, there is no perfect way to handle missing data.&lt;/p&gt;

&lt;p&gt;Our basic model has now been trained and has learnt the relationship between our independent variables and the target variable. Now, we can check how good our model is by making predictions on the test set. We can then compare the prediction with our known labels.&lt;/p&gt;

&lt;p&gt;We will again impute the missing values in our test set, and use the function predict and the metric accuracy_score to evaluate the performance of our model.&lt;/p&gt;
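
&lt;p&gt;A rough sketch of these steps, using SimpleImputer (the current scikit-learn counterpart of the Imputer mentioned above) and continuing from the split made earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Flatten the target arrays to avoid shape warnings
y_train = Y_train.values.ravel()
y_test = Y_test.values.ravel()

# Impute missing values with the most frequent value of each column
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Train a basic random forest and evaluate its accuracy on the test set
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_imp, y_train)
predictions = rf.predict(X_test_imp)
print(accuracy_score(y_test, predictions))
&lt;/code&gt;&lt;/pre&gt;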


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;As we can observe above, our basic model has an accuracy of 74.19%, which tells us that it can be further improved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hyperparameters tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several ways to improve our model: gather more data, tune the hyperparameters of the model, or choose another model. We will go with the second option and tune the hyperparameters of our random forest classifier.&lt;/p&gt;

&lt;p&gt;Model parameters are normally learned during training; however hyperparameters must be set manually before training. In the case of random forest, hyperparameters include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n_estimators: number of trees in the forest&lt;/li&gt;
&lt;li&gt;max_features: maximum number of features considered when splitting a node&lt;/li&gt;
&lt;li&gt;max_depth: maximum depth of each tree&lt;/li&gt;
&lt;li&gt;bootstrap: whether to implement bootstrap or not to build trees&lt;/li&gt;
&lt;li&gt;criterion: the function used to measure the quality of a split in the decision trees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, when we implement a basic random forest, Scikit-learn uses a set of default hyperparameters, but we cannot be sure those values are optimal for our particular problem.&lt;/p&gt;

&lt;p&gt;At this point we need to consider two concepts: underfitting and overfitting. &lt;strong&gt;&lt;em&gt;Underfitting&lt;/em&gt;&lt;/strong&gt; occurs when the model is too simple and doesn’t fit the data well: it has low variance but high bias. On the other hand, &lt;strong&gt;&lt;em&gt;overfitting&lt;/em&gt;&lt;/strong&gt; occurs when the model adjusts too well to the training set and performs poorly on new examples. If we tuned the hyperparameters on the training dataset, we would be prone to overfitting our random forest classifier. So instead, we will go back to what was mentioned before: &lt;strong&gt;&lt;em&gt;cross validation&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There are many cross validation methods; the best known are &lt;strong&gt;K-Fold Cross Validation&lt;/strong&gt; and &lt;strong&gt;Leave One Out Cross Validation.&lt;/strong&gt; In our case, we will use the first one: we will split our data into K different subsets, using K-1 subsets as our training set and the last one as our test data. In order to tune our hyperparameters, we will perform many iterations of the K-fold cross validation, using different model settings each time. Afterwards, we compare all the models and select the best one; then, we train the best model on the full training set and evaluate it on the testing set. We will take advantage of the &lt;em&gt;GridSearchCV&lt;/em&gt; class in Scikit-learn to perform this task.&lt;/p&gt;

&lt;p&gt;We will define the parameters and values that we want to optimize, then perform the GridSearchCV and set the best parameters obtained on our model.&lt;/p&gt;
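
&lt;p&gt;A hedged sketch of such a search, continuing from the sketches above (the parameter grid below is an arbitrary example, not the grid actually used in the project):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Example grid; the values are illustrative only
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [3, 5, 10, None],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_imp, y_train)

print(grid.best_params_)
best_rf = grid.best_estimator_
print(accuracy_score(y_test, best_rf.predict(X_test_imp)))
&lt;/code&gt;&lt;/pre&gt;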


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;As we can see above, GridSearchCV improved our accuracy from 74% to 77%. Even though it’s not a great improvement, it has been reported that other studies using this dataset reached an accuracy of only 80%. Considering this, and the fact that the dataset has a lot of missing data and is not big (only 155 samples), we can go on and analyse other model metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test set metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have optimized our hyperparameters, we will proceed to evaluate our model. First of all, we will create a confusion matrix that will tell us the True Negative, False Positive, False Negative and True Positive counts according to our predicted values, and plot it using a seaborn heatmap:&lt;/p&gt;

&lt;p&gt;True Negative (TN)| False positive(FP)&lt;br&gt;&lt;br&gt;
 — — — — — — — — — — — — — — — —&lt;br&gt;&lt;br&gt;
False negative (FN)| True positive (TP)&lt;/p&gt;
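
&lt;p&gt;A short sketch of this step, assuming the tuned model and test split from the sketches above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

pred = best_rf.predict(X_test_imp)
cm = confusion_matrix(y_test, pred)  # rows: true class, columns: predicted class

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
&lt;/code&gt;&lt;/pre&gt;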


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Analysing the confusion matrix, we can expect our model to show a higher recall (TP/(TP+FN)) than precision (TP/(TP+FP)), and both parameters to be higher than the accuracy ((TP+TN)/Total). These three parameters can be weighted according to what we consider our model needs to solve. We will come back to them afterwards.&lt;/p&gt;

&lt;p&gt;We can further investigate the false positive and true positive rates using a ROC curve and calculating the area under the curve, which is also a metric of the predictive power of our model (a value closer to 1 means that our model does a good job of separating a random sample into the two classes).&lt;/p&gt;
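
&lt;p&gt;A minimal sketch of the ROC curve and AUC computation, with class probabilities taken from the tuned forest defined above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the positive class for each test example
probs = best_rf.predict_proba(X_test_imp)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
print(auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
&lt;/code&gt;&lt;/pre&gt;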


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;From the ROC curve, we learned that our model does not do a good job in distinguishing between both classes as the &lt;strong&gt;auc&lt;/strong&gt; is 0.60. We could improve this issue by collecting and adding more data to the model.&lt;/p&gt;

&lt;p&gt;Last, we can analyse the precision-recall curve:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe that the precision-recall relationship is fairly constant across the different threshold values, indicating that our model has good precision and recall. This is because the number of true positives is quite high compared to the true negatives, false positives and false negatives. It is important to remember that, because of the formulas of recall and precision, when one is high the other tends to be low, pushing us to find a balance where both are high enough for our model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpreting the results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last thing we can do before finishing our project is to evaluate the variable importance, that is, to quantify how useful every variable is for our model.&lt;/p&gt;
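
&lt;p&gt;A small sketch of how the importances could be pulled out of the fitted forest from the sketches above and ranked:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# X.columns comes from the feature frame defined earlier
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
&lt;/code&gt;&lt;/pre&gt;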


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe that age, protime, alk_phosphate, bilirubin, malaise, ascites are some of the most important variables for our model. This reflects what we have seen previously in our EDA and reinforces the importance of performing this exploratory analysis before starting the machine learning algorithm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So after applying random forest to our dataset, we can conclude that our best model was able to predict the survival of patients with hepatitis with an accuracy of 77% and a precision and recall of around 80%. This is not the best situation, since we want our model to perform better, especially in this case, which involves the survival of patients. However, the moderately good results could be due to the small dataset and the large number of missing values.&lt;/p&gt;




</description>
      <category>randomforest</category>
      <category>predictions</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Predicting Survival in Patients: Exploratory Analysis</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Fri, 22 Jun 2018 07:50:06 +0000</pubDate>
      <link>https://forem.com/ugis22/predicting-survival-in-patients-exploratory-analysis-5aj2</link>
      <guid>https://forem.com/ugis22/predicting-survival-in-patients-exploratory-analysis-5aj2</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*4qCvEGLb6Iqt5ECJvBAwtA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*4qCvEGLb6Iqt5ECJvBAwtA.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Building my first Data Science project
&lt;/h4&gt;

&lt;p&gt;I discovered soon enough that figuring out that I wanted to work towards a future in data science by taking classes and finishing specialisations was the easy part of my journey.&lt;/p&gt;

&lt;p&gt;One of the things you must do if you want to work in data science is to build &lt;em&gt;your portfolio&lt;/em&gt;. At the beginning, I struggled for days trying to find a good topic for a project, thinking on how to do it, what I should look into; many times I thought about something and soon dropped it because it was not exactly what I wanted to do. I felt I could not get it started. But it was then when I realised that building a project started much earlier than when you type the first lines of code. In fact, thinking on how to design it is one of the essential parts of any data science project.&lt;/p&gt;

&lt;p&gt;Because of my PhD and postdoctoral research work, I knew that liver disease has become one of the most common causes of death around the world. Due to the fact that ending stages can be different from patient to patient, establishing a method to assess the prognosis of a particular patient with liver disease still remains a challenge. So, I decided the purpose of my project to be the analysis of a dataset containing information about liver disease patients and the creation of a model to predict their survival.&lt;/p&gt;

&lt;p&gt;I chose a dataset from the &lt;a href="https://archive.ics.uci.edu/ml/datasets/hepatitis"&gt;UCI Machine Learning Repository&lt;/a&gt;, whose CSV file was downloaded from the &lt;a href="https://www.openml.org/d/55"&gt;OpenML website&lt;/a&gt;. It comprises an observational study where data was collected on 19 different features plus a class (DIE or LIVE) from 155 patients with chronic liver disease.&lt;/p&gt;

&lt;p&gt;I decided to use Python 3 in Jupyter Notebook. Python has a set of packages, such as Numpy, Pandas, Matplotlib, Scikit-learn, that are very powerful for data analysis and visualization.&lt;/p&gt;

&lt;p&gt;First of all, it is important to use the command &lt;strong&gt;%matplotlib notebook&lt;/strong&gt; in order to plot the figures interactively. We need to load the modules into our Python environment using the command &lt;strong&gt;import&lt;/strong&gt;. Because the dataset was downloaded as a CSV file, we will use the &lt;strong&gt;Pandas&lt;/strong&gt; command &lt;strong&gt;read_csv&lt;/strong&gt;, which automatically reads the file into a &lt;strong&gt;DataFrame&lt;/strong&gt;. We can check the &lt;strong&gt;shape&lt;/strong&gt; of our DataFrame to match the specifications provided for our dataset: 155 patients (rows), 19 features + 1 class (columns).&lt;/p&gt;
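
&lt;p&gt;A minimal sketch of this first step in a Jupyter notebook (the file name is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the CSV file into a DataFrame and check its dimensions
df = pd.read_csv("hepatitis.csv")
print(df.shape)  # expected: (155, 20)
&lt;/code&gt;&lt;/pre&gt;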


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An important part of doing predictions with Machine Learning techniques is to perform Exploratory Data Analysis (EDA). This is useful for getting to know your data, looking at it from different perspectives, describing and summarizing it without making any assumption in order to detect any potential problems.&lt;/p&gt;

&lt;p&gt;First, we can inspect our data to see if we need to clean it. We will start by using the &lt;strong&gt;head&lt;/strong&gt; command, which will show us the first 5 rows of our DataFrame. As we can see below, there are missing values identified with the ? symbol. Knowing the data types of the variables included in our dataset is another good piece of information. We can check this by using &lt;strong&gt;dtypes&lt;/strong&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;As we can see above, 19 of our 20 variables appear as object type. Some of these variables are categorical (with ‘no’, ‘yes’ levels) and some of them should be numerical with int or float type.&lt;/p&gt;

&lt;p&gt;Because machine learning algorithms require numerical data, we need to convert the categorical data that has values ‘no’, ‘yes’ to 0 and 1, respectively. Another important point is to convert the binary survival variable (Class), currently encoded with ‘DIE’, ‘LIVE’ levels, to numerical categories (0 and 1, respectively). For this task, we will use the function &lt;strong&gt;replace&lt;/strong&gt;. Lastly, we will convert all of the columns in the dataset to &lt;strong&gt;float&lt;/strong&gt; type.&lt;/p&gt;

&lt;p&gt;Machine learning algorithms perform well when the number of observations in each class is similar but when there is a high class imbalance, problems arise leading to misclassification. &lt;strong&gt;&lt;em&gt;Class imbalance&lt;/em&gt;&lt;/strong&gt; occurs when the total number of observations in one class is significantly lower than the observations in the other class. Typically, if a data set contains 90% in one class, and 10% in the other class, then it suffers from class imbalance. In order to check this point, we can calculate what percentage of the data belongs to each category.&lt;/p&gt;
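
&lt;p&gt;A short sketch covering both the recoding just described and the class-balance check (treating the ? marker as missing, as seen earlier; the column names are assumptions based on the dataset description):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Recode categorical levels and the missing marker, then cast to float
df = df.replace({"no": 0, "yes": 1, "DIE": 0, "LIVE": 1, "?": np.nan}).astype(float)

# Percentage of patients in each class
print(df["Class"].value_counts(normalize=True) * 100)
&lt;/code&gt;&lt;/pre&gt;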


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe above that even though our dataset is not perfectly balanced (79.35% of the patients are contained in the LIVE class, while only 20.65% are in the DIE class), it does not suffer from high class imbalance, allowing us to continue with our analysis.&lt;/p&gt;

&lt;p&gt;The first step in EDA is to generate descriptive statistics summarizing the central tendency, dispersion and shape of our dataset’s distribution. We will do that using the function &lt;strong&gt;describe&lt;/strong&gt; from &lt;strong&gt;Pandas&lt;/strong&gt;. It is important to highlight that this function excludes the NaN values present in our dataset. Because many of our variables are discrete, it does not make sense to compute central tendency parameters for them, so we will only include the numerical variables in this case. On the other hand, we will use the functions &lt;strong&gt;apply&lt;/strong&gt; and &lt;strong&gt;value_counts&lt;/strong&gt; to get the counts of every level (0 or 1, corresponding to ‘no’ and ‘yes’) for each discrete variable in our dataset.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe in the first table that the patients belong to an age bracket of 7–78 years, with a mean of 41.2 and a median of 39. There are missing values in most of the variables, particularly in PROTIME, where we only have 88 observations. If we pay attention to the means of the different variables, it is interesting to note that they display a moderate variance; the range goes from 1.42 (BILIRUBIN) to 105.35 (ALK_PHOSPHATE). Also, the variables SGOT and ALK_PHOSPHATE show a high standard deviation, and their distributions could be right-skewed given that the mean is higher than the median. The rest of the variables appear to be normally distributed (mean ~ median). The distribution of our variables is important to consider because it could later affect our machine learning algorithm, since many algorithms make assumptions about the shape of the data, particularly about how the residuals are distributed. So we could consider performing a transformation to fix the observed skewness.&lt;/p&gt;

&lt;p&gt;In the case of the categorical variables, there is a marked predominance of observations belonging to level 0 in the variable SEX, which means that the dataset includes more female than male patients. Likewise, there are more observations in class 0 than in class 1 for the variables ANOREXIA, ASCITES and VARICES. This could point out that these features are differentially present in the patients and might be interesting variables influencing their survival.&lt;/p&gt;

&lt;p&gt;The next step is to create some visuals in order to understand our dataset further by exploring the relationships existing in it. For this task, it is very useful to use the &lt;strong&gt;seaborn&lt;/strong&gt; library, which produces attractive statistical graphics that are easy to code.&lt;/p&gt;

&lt;p&gt;We will take a moment here to evaluate the variables included in our dataset, in particular some that are interesting regarding liver disease. Elevated levels of alkaline phosphatase (ALK_PHOSPHATE), aspartate aminotransferase (SGOT) and bilirubin (BILIRUBIN), as well as altered albumin (ALBUMIN) and prothrombin time (PROTIME), indicate a malfunctioning liver. Anorexia (ANOREXIA) and ascites (ASCITES) appear later in patients with liver disease and normally indicate a poor prognosis. Because all of these variables are indicators of more or less severe liver damage, we will evaluate them to see their relationships and explore whether they could be important for our predictive model.&lt;/p&gt;

&lt;p&gt;As we already observed, our dataset contains a lot of NaN values. How to handle missing values is an extensive topic that we will not address here, but it is important to note that there are several ways to overcome this issue, and the best one has to be evaluated for each situation. In our case, we are going to drop them using the Pandas function &lt;strong&gt;.dropna()&lt;/strong&gt;. We will create a new data frame by selecting only the interesting variables mentioned above.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We will continue by plotting histograms of our numerical variables to visualize and confirm their distributions. In the seaborn library, the function &lt;strong&gt;displot&lt;/strong&gt; allows us to plot a univariate distribution of observations. It is possible to plot histograms side by side using the &lt;strong&gt;subplot&lt;/strong&gt; function of matplotlib.&lt;/p&gt;
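
&lt;p&gt;A minimal sketch of the reduced data frame and two side-by-side histograms (using the axes-level histplot here so the plots can be placed on matplotlib subplots; the column selection is an assumption based on the variables mentioned above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# New data frame with only the variables of interest, dropping rows with NaN
cols = ["ALK_PHOSPHATE", "SGOT", "BILIRUBIN", "ALBUMIN",
        "PROTIME", "ANOREXIA", "ASCITES", "Class"]
liver_df = df[cols].dropna()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(liver_df["SGOT"], kde=True, ax=axes[0])
sns.histplot(liver_df["ALK_PHOSPHATE"], kde=True, ax=axes[1])
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;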


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe in the histograms that several of our variables, including ALK_PHOSPHATE and SGOT, which we had already flagged in the summary statistics, show a degree of skewness. There are several transformations that can be applied to fix that. We will use the &lt;strong&gt;Pandas&lt;/strong&gt; function &lt;strong&gt;applymap&lt;/strong&gt; and the &lt;strong&gt;Numpy&lt;/strong&gt; function &lt;strong&gt;np.log&lt;/strong&gt; to log-transform the columns corresponding to those skewed variables in our dataframe.&lt;/p&gt;
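
&lt;p&gt;A one-step sketch of that transformation, continuing from the data frame above (the list of skewed columns is an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Log-transform the skewed columns element-wise
skewed = ["ALK_PHOSPHATE", "SGOT", "BILIRUBIN"]
liver_df[skewed] = liver_df[skewed].applymap(np.log)
&lt;/code&gt;&lt;/pre&gt;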


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Then, we can make use of the &lt;strong&gt;pairplot&lt;/strong&gt; function to visualize the relationships between the different numerical variables. One nice feature of seaborn is that we can use the parameter &lt;strong&gt;hue&lt;/strong&gt; to show the different levels of a categorical variable in different colors. In our case, we are interested in identifying the patients in &lt;em&gt;Class 0&lt;/em&gt; and &lt;em&gt;1&lt;/em&gt;.&lt;/p&gt;
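
&lt;p&gt;A minimal sketch, assuming the data frame defined above and the Class column as the hue:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pairwise scatterplots and histograms, coloured by survival class
sns.pairplot(liver_df, hue="Class")
plt.show()
&lt;/code&gt;&lt;/pre&gt;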


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;Observing the plots, we can highlight several things:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the histograms, we learn that the skewness present in our data was mainly fixed
&lt;/li&gt;
&lt;li&gt;We can observe that patients tend to differentiate according to whether they belong to Class 0 or Class 1 in some of our variables; however, this distinction is not completely clear.
&lt;/li&gt;
&lt;li&gt;It appears that there is no perfect linear relationship between the variables plotted, though in some of them we can observe a trend towards an interaction (SGOT and ALK_PHOSPHATE, SGOT and BILIRUBIN, PROTIME and ALBUMIN, BILIRUBIN and ALK_PHOSPHATE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, we can analyse the relationship between our categorical variables and our numerical variables. For this part, we will take advantage of seaborn’s &lt;strong&gt;PairGrid&lt;/strong&gt;, which gives us a little more freedom to choose the x and y variables. In this case, we will use &lt;strong&gt;swarmplot&lt;/strong&gt;, a particular kind of scatterplot in which the points do not overlap.&lt;/p&gt;
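
&lt;p&gt;A rough sketch of such a grid (the choice of x and y variables here is only an example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Categorical variables on x, numerical variables on y, coloured by Class
g = sns.PairGrid(liver_df,
                 x_vars=["ANOREXIA", "ASCITES"],
                 y_vars=["BILIRUBIN", "ALBUMIN", "PROTIME"],
                 hue="Class", height=3)
g.map(sns.swarmplot)
g.add_legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;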


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;It is possible to observe that there is no difference in the variables plotted regarding ANOREXIA status. This is evidenced by the fact that not only are patients from both levels of Class distributed homogeneously, but there is also no difference in the values of the variables analysed across the levels of ANOREXIA. On the other hand, we can see a trend for patients in Class 0 to have ascites. However, there is no difference in how the variables are expressed regarding ASCITES status.&lt;/p&gt;

&lt;p&gt;The last thing we will do is deepen our analysis to see whether there is any strong correlation between our parameters. The importance of performing correlation analysis on our dataset lies in the fact that highly correlated variables can hurt some models, or in other cases provide little extra information while making the computation more expensive without any real benefit. Also, knowing whether our variables display a linear relationship can help us choose which machine learning algorithm is more suitable for our data.&lt;/p&gt;

&lt;p&gt;For this task, we will use the Pearson correlation coefficient, because it is a good parameter to evaluate the strength of the linear relationship between two variables. In order to perform the correlation analysis with all our variables, we first need to apply the function &lt;strong&gt;factorize&lt;/strong&gt; to the columns containing categorical variables in order to obtain a numeric representation of them. We will then make use of the function &lt;strong&gt;corr&lt;/strong&gt; and plot the resulting array using &lt;strong&gt;heatmap&lt;/strong&gt;, which allows us to visualise the correlation coefficients by the intensity of the colours.&lt;/p&gt;
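
&lt;p&gt;A compact sketch of this last step, assuming a data frame that still holds its categorical columns as object type:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    # factorize returns (integer codes, unique values); keep the codes
    encoded[col] = pd.factorize(encoded[col])[0]

# Pearson correlation matrix visualised as a heatmap
corr = encoded.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
&lt;/code&gt;&lt;/pre&gt;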


&lt;div class="ltag_gist-liquid-tag"&gt;  &lt;/div&gt;

&lt;p&gt;We can observe in the &lt;em&gt;heatmap&lt;/em&gt; that some of the variables show a coefficient of ~0.6 or -0.4, but most of them display a very low correlation coefficient. So we can conclude that there is no strong linear correlation between our variables.&lt;/p&gt;

&lt;p&gt;We have finished the EDA of our dataset. We got to know our data and now have a feel for it, which will become very valuable when choosing the right machine learning algorithm for our case.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to keep going on how to perform prediction for this project, you can read&lt;/em&gt; &lt;a href="https://medium.com/@meinzaugarat/building-my-first-data-science-project-part-2-prediction-cc4d971aa17a"&gt;&lt;em&gt;my next post&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;References&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dua, D. and Karra Taniskidou, E. (2017).&lt;a href="http://archive.ics.uci.edu/ml"&gt;UCI Machine Learning Repository&lt;/a&gt;. Irvine, CA: University of California, School of Information and Computer Science.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>exploratorydataana</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Transitioning to Data Science — My journey so far</title>
      <dc:creator>Eugenia </dc:creator>
      <pubDate>Wed, 06 Jun 2018 18:18:49 +0000</pubDate>
      <link>https://forem.com/ugis22/transitioning-to-data-science-my-journey-so-far-5dd4</link>
      <guid>https://forem.com/ugis22/transitioning-to-data-science-my-journey-so-far-5dd4</guid>
      <description>&lt;p&gt;&lt;a href="https://cdn-images-1.medium.com/max/1024/1*o_TwhpYzY99wL2mmyOt05w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://cdn-images-1.medium.com/max/1024/1*o_TwhpYzY99wL2mmyOt05w.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the time you’re little, you are constantly asked the question: “What do you want to be when you grow up?”. I remember changing my mind through the years, listing a lot of different professions. But all of them had one thing in common: they all involved answering questions. I always loved trying to find answers to how things worked, or how two things that I saw were connected. I became interested in biology when I was 16, thanks to a very good teacher I had in high school, and soon after, I decided to study biological sciences to eventually become a full-time researcher.&lt;/p&gt;

&lt;p&gt;Through my PhD, I encountered wonderful things: trying to dissect a complicated phenomenon by understanding it and converting it into simpler questions, gathering information about it to establish a background, planning and performing experiments to answer those questions, and finally interpreting and communicating our results in research papers, or orally at several national and international conferences, fulfilled the reason why I had wanted to be a biologist in the first place. The possibility of doing an internship in the US also allowed me to meet an amazing research group that helped me shape my character as a researcher and as a person. However, not everything was bright, and I also encountered plenty of tough times and a lot of disappointment that dampened my enthusiasm for academia.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All of the professions I could think of when I was growing up involved one thing: Answering questions about how things worked or were connected&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By the time I defended my dissertation, all the certainties with which I had started were gone. I was beginning to wonder if academia was the right setting in which to develop my career in a way that balanced personal life and work. I knew I was not alone. I had long talks with colleagues and friends who were going through the same issue. Nevertheless, I decided to go on with a postdoctoral position while I figured out where in the system I fitted in. I later received a job offer from a US colleague who was moving back to Germany, which I took while still exploring the future direction of my career.&lt;/p&gt;

&lt;p&gt;As a postdoc, I really enjoyed managing projects as well as mentoring and supervising students; it’s something I personally find rewarding, not only for getting to teach what I know, but also for learning from different experiences and people. However, the lack of collaborative spirit towards a mutual objective, the absence of rigour, and the publish-or-perish culture pushed me further away from academia.&lt;/p&gt;

&lt;p&gt;In the search for the answer to the biggest question I had had to deal with until then, &lt;em&gt;“Do I want this for myself?”,&lt;/em&gt; I started reading about bioinformatics and big data. I very much liked the concept of it being an interdisciplinary field, so out of curiosity, I enrolled in the “&lt;a href="https://www.coursera.org/learn/dna-analysis"&gt;Finding Hidden Messages in DNA&lt;/a&gt;” course on Coursera, part of a Bioinformatics Specialization offered by the University of California San Diego. It’s a well organised course where you can choose whether to go for the coding track or stay with theoretical knowledge about the algorithms. I chose the first one. Due to my background, the biological concepts involved were familiar to me. The coding was another story! It was my first approach to Python and I was already struggling with slicing a string, let alone defining a function! Anyway, I was able to finish the course, and to my surprise I discovered something: I loved coding in Python!! I enrolled in the next course, &lt;a href="https://www.coursera.org/learn/genome-sequencing"&gt;Genome Sequencing&lt;/a&gt;. I slowly improved my Python skills, but by the end of the course the algorithms were overwhelming and I realised that my trial-and-error strategy with coding would not work for long.&lt;/p&gt;

&lt;p&gt;I decided then to take a step back and learn some basic concepts by enrolling in the &lt;a href="https://www.coursera.org/specializations/genomic-data-science"&gt;Genomic Data Science specialization&lt;/a&gt;, offered by Johns Hopkins University and designed to take students from basic concepts to applying statistics and algorithms, while learning Python, the R package &lt;a href="https://www.bioconductor.org/"&gt;Bioconductor&lt;/a&gt; and command line tools, as well as the &lt;a href="https://usegalaxy.org/"&gt;Galaxy&lt;/a&gt; software. Once I was done with this, I returned to complete the &lt;a href="https://www.coursera.org/specializations/bioinformatics"&gt;Bioinformatics Specialization&lt;/a&gt;, which provided me with a variety of innovative concepts and gave me tools to understand how complex algorithms work and how they can be applied to biological questions.&lt;/p&gt;

&lt;p&gt;Though I enjoyed learning how to code and had strengthened my Python skills, I was still not completely able to separate myself from academia. Maybe because it was my comfort zone; perhaps because it is easy to apply bioinformatics to research. However, there was still something unsettling about the idea of staying. It was not until I completely immersed myself in investigating the applications of big data and bioinformatics that I discovered the unlimited potential of data science. The more I read, the more I became interested in the field. I started to read every blog I could find concerning its applications, the skills involved, and the experiences of people working in it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The more I read about the data science field, the more I became interested. I started to read every blog that I could find related to its applications and the experience of people working in it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I decided to start building my future in that direction, working on projects and doing courses in the evenings and on weekends. I enrolled in the &lt;a href="https://www.coursera.org/specializations/data-science-python"&gt;Applied Data Science with Python&lt;/a&gt; (offered by the University of Michigan) and &lt;a href="https://www.coursera.org/specializations/statistics"&gt;Statistics with R&lt;/a&gt; (offered by Duke University) specializations. The first is a very useful and comprehensive five-course specialization, where you get an overview and a first approach to working with &lt;em&gt;pandas&lt;/em&gt;, &lt;em&gt;matplotlib&lt;/em&gt; and &lt;em&gt;scikit-learn&lt;/em&gt;, as well as basic machine learning and text mining concepts. The second covers everything from basic to Bayesian statistics while doing exercises and peer-reviewed projects in R, and it is completely worth taking, as the teacher is amazingly clear in explaining all the theory. I also enrolled in Andrew Ng’s &lt;a href="https://www.coursera.org/learn/machine-learning"&gt;machine learning&lt;/a&gt; course, which, as almost every post or blog I found commented, is a must!&lt;/p&gt;

&lt;p&gt;Something very useful for me was listening to the first &lt;a href="https://www.kaggle.com/careercon/2018"&gt;Kaggle CareerCon&lt;/a&gt;, which took place in March 2018 and was aimed at people searching for their first job in data science. The talks and discussions gave tips about creating a compelling portfolio to show your &lt;a href="https://github.com/ugis22"&gt;projects&lt;/a&gt;, tailoring your &lt;a href="https://www.linkedin.com/in/meinzaugarat/"&gt;CV&lt;/a&gt;, finding opportunities and preparing for a potential interview.&lt;/p&gt;

&lt;p&gt;So, I finally made up my mind: I’m leaving academia, and most importantly, I found out that data science might be the right setting in which to apply my skills, which I’m trying to enhance, in a way that better satisfies my vision of life and work.&lt;/p&gt;

&lt;p&gt;The next step after taking this decision and finishing online specializations was to start my own data science projects that you can read about it &lt;a href="https://medium.com/@meinzaugarat/building-my-first-data-science-project-part-1-exploratory-analysis-9112684badcd"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>academia</category>
      <category>transitioning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
