Forem: Paul Apivat

Mining Thai Food Text with R

Paul Apivat — Fri, 19 Mar 2021 05:27:00 +0000

This is part 3 in the Thai Food Dishes series.

Text Mining

Which raw material(s) are most popular?

One way to answer this question is to use text mining to tokenize the dish names by word and count the words by frequency as one measure of popularity.

In the bar chart below, we see the frequency of words across all Thai dishes. Mu (หมู), which means pork in Thai, appears most frequently across all dish types and sub-groupings. Next we have kaeng (แกง), which means curry. Phat (ผัด) comes in third, suggesting "stir-fry" is a popular cooking mode.

As we can see, not all words refer to raw materials, so we may not be able to answer this question directly.

library(tidyverse)  # read_csv(), dplyr verbs, tidyr and ggplot2
library(tidytext)
library(scales)

# new csv file after data cleaning (see part 2 of this series)
df <- read_csv("../web_scraping/edit_thai_dishes.csv")

df %>%
    select(Thai_name, Thai_script) %>%
    # can substitute 'word' for ngrams, sentences, lines
    unnest_tokens(ngrams, Thai_name) %>%  
    # to reference thai spelling: group_by(Thai_script)
    group_by(ngrams) %>%  
    tally(sort = TRUE) %>%  # alt: count(sort = TRUE)
    filter(n > 9) %>%
    # visualize
    # pipe directly into ggplot2, because we are using tidy tools
    ggplot(aes(x = n, y = reorder(ngrams, n))) + 
    geom_col(aes(fill = ngrams)) +
    scale_fill_manual(values = c(
        "#c3d66b",
        "#70290a",
        "#2f1c0b",
        "#ba9d8f",
        "#dda37b",
        "#8f5e23",
        "#96b224",
        "#dbcac9",
        "#626817",
        "#a67e5f",
        "#be7825",
        "#446206",
        "#c8910b",
        "#88821b",
        "#313d5f",
        "#73869a",
        "#6f370f",
        "#c0580d",
        "#e0d639",
        "#c9d0ce",
        "#ebf1f0",
        "#50607b"
    )) +
    theme_minimal() +
    theme(legend.position = "none") +
    labs(
        x = "Frequency",
        y = "Words",
        title = "Frequency of Words in Thai Cuisine",
        subtitle = "Words appearing at least 10 times in Individual or Shared Dishes",
        caption = "Data: Wikipedia | Graphic: @paulapivat"
    )
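The tokenize-and-count idea behind the chart can also be sketched in a few lines of Python. This is only an illustration with the standard library, not the tidytext pipeline used in this post; the `dish_names` list and the `word_counts` helper are made up for the example.

```python
from collections import Counter

# Hypothetical sample of romanized dish names (illustrative only)
dish_names = [
    "Khao phat mu",
    "Kaeng phet mu",
    "Phat kaphrao mu",
    "Kaeng khiao wan",
]


def word_counts(names):
    """Lowercase each name, split on whitespace, and count every word."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


counts = word_counts(dish_names)

# Print the most frequent words, most common first
for word, n in counts.most_common(3):
    print(word, n)
```

With the real data you would feed in the Thai_name column instead of the sample list and keep only the words above some threshold, as the R code does with filter(n > 9).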

We can also see words common to both Individual and Shared Dishes. We see other words like nuea (beef), phrik (chili) and kaphrao (basil leaves).

# frequency for Thai_dishes (Major Grouping) ----

# comparing Individual and Shared Dishes (Major Grouping)
thai_name_freq <- df %>%
    select(Thai_name, Thai_script, major_grouping) %>%
    unnest_tokens(ngrams, Thai_name) %>% 
    count(ngrams, major_grouping) %>%
    group_by(major_grouping) %>%
    mutate(proportion = n / sum(n)) %>%
    select(major_grouping, ngrams, proportion) %>%
    spread(major_grouping, proportion) %>%
    gather(major_grouping, proportion, c(`Shared dishes`)) %>%
    select(ngrams, `Individual dishes`, major_grouping, proportion)


# Expect warning message about missing values
ggplot(thai_name_freq, aes(x = proportion, y = `Individual dishes`,
       color = abs(`Individual dishes` - proportion))) +
    geom_abline(color = 'gray40', lty = 2) +
    geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
    geom_text(aes(label = ngrams), check_overlap = TRUE, vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    scale_color_gradient(limits = c(0, 0.01), 
                         low = "red", high = "blue") +    # low = "darkslategray4", high = "gray75"
    theme_minimal() +
    theme(legend.position = "none",
          legend.text = element_text(angle = 45, hjust = 1)) +
    labs(y = "Individual Dishes",
         x = "Shared Dishes",
         color = NULL,
         title = "Comparing Word Frequencies in the Names of Thai Dishes",
         subtitle = "Individual and Shared Dishes",
         caption = "Data: Wikipedia | Graphics: @paulapivat")

Which raw materials are most important?

We can only learn so much from frequency, so text mining practitioners created term frequency - inverse document frequency (tf-idf) to better reflect how important a word is to a document within a corpus (further details here).

Again, the words don't necessarily refer to raw materials, so this question can't be fully answered directly here.
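To make the tf-idf idea concrete, here is a small worked sketch in Python. It assumes the common definition where tf is a word's share of its document and idf is the natural log of (number of documents / number of documents containing the word), which is my understanding of what tidytext's bind_tf_idf() computes. The `docs` groupings and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical "documents": dish-name words grouped by category
docs = {
    "Rice dishes": ["khao", "phat", "mu", "khao", "man", "kai"],
    "Curries": ["kaeng", "phet", "mu", "kaeng", "khiao", "wan"],
    "Salads": ["yam", "mu", "som", "tam"],
}


def tf_idf(docs):
    """Return {(doc, term): tf * idf}, with idf = ln(N / docs containing term)."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))

    scores = {}
    for doc, words in docs.items():
        counts = Counter(words)
        for term, n in counts.items():
            tf = n / len(words)
            idf = math.log(n_docs / doc_freq[term])
            scores[(doc, term)] = tf * idf
    return scores


scores = tf_idf(docs)
print(scores[("Rice dishes", "mu")])    # appears in every group, so idf is 0
print(scores[("Rice dishes", "khao")])  # appears only in rice dishes
```

A word like "mu" that shows up in every group gets a score of zero, while a word concentrated in one group scores higher. That is the sense in which tf-idf measures importance rather than raw popularity.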

Could you learn about Thai food just from the names of the dishes?

The short answer is "yes".

Just from frequency and "term frequency - inverse document frequency", we learned not only the most frequent words, but also their relative importance within the set of words we tokenized with tidytext. This informs us not only of popular raw materials (pork), but also of dish types (curries) and popular modes of preparation (stir-fry).

We can even examine the network of relationships between words. Darker arrows suggest a stronger relationship between pairs of words, for example "nam phrik" is a strong pairing. This means "chili sauce" in Thai and suggests the important role that it plays across many types of dishes.

We learned above that "mu" (pork) appears frequently. Now we see that "mu" and "krop" are more related than other pairings (note: "mu krop" means "crispy pork"). We also saw above that "khao" appears frequently in Rice dishes. This alone is not surprising as "khao" means rice in Thai, but we see here "khao phat" is strongly related suggesting that fried rice ("khao phat") is quite popular.

# Visualizing a network of Bi-grams with {ggraph} ----
library(igraph)
library(ggraph)
set.seed(2021)

thai_dish_bigram_counts <- df %>%
    select(Thai_name, minor_grouping) %>%
    unnest_tokens(bigram, Thai_name, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE)


# filter for relatively common combinations (n > 2)
thai_dish_bigram_graph <- thai_dish_bigram_counts %>%
    filter(n > 2) %>%
    graph_from_data_frame()


# polishing operations to make a better looking graph
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

set.seed(2021)
ggraph(thai_dish_bigram_graph, layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                   arrow = a, end_cap = circle(.07, 'inches')) +
    geom_node_point(color = "dodgerblue", size = 5, alpha = 0.7) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    labs(
        title = "Network of Relations between Word Pairs",
        subtitle = "{ggraph}: common nodes in Thai food",
        caption = "Data: Wikipedia | Graphics: @paulapivat"
    ) +
    theme_void()
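The bigram counts that feed this network can be illustrated with a short Python sketch. It pairs each word with the next word inside the same dish name and counts the pairs; the sample names and the `bigram_counts` helper are assumptions for illustration, not the data or code used for the graph.

```python
from collections import Counter

# Hypothetical sample of romanized dish names (illustrative only)
dish_names = [
    "Nam phrik kapi",
    "Nam phrik ong",
    "Khao phat mu",
    "Khao phat kai",
    "Mu krop",
]


def bigram_counts(names):
    """Count adjacent word pairs (word1, word2) within each dish name."""
    counts = Counter()
    for name in names:
        words = name.lower().split()
        # zip(words, words[1:]) yields each word paired with its successor
        counts.update(zip(words, words[1:]))
    return counts


edges = bigram_counts(dish_names)

# Each (word1, word2) pair is a directed edge; the count is its weight
for (word1, word2), n in edges.most_common():
    print(word1, "->", word2, n)
```

Each pair becomes one directed edge in the graph, and the count plays the role of the n column that controls how dark the arrow is.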

Finally, we may be interested in word relationships within individual dishes.

The graph below shows a network of word pairs with moderate-to-high correlations. We can see certain words clustered near each other with relatively dark lines: kaeng (curry), pet (spicy), wan (sweet), khiao (green), phrik (chili) and mu (pork). These words represent a collection of ingredients, modes of cooking and descriptions that are generally combined.

library(widyr)  # provides pairwise_cor()
set.seed(2021)

# Individual Dishes
individual_dish_words <- df %>%
    select(major_grouping, Thai_name) %>%
    filter(major_grouping == 'Individual dishes') %>%
    mutate(section = row_number() %/% 10) %>%
    filter(section > 0) %>%
    unnest_tokens(word, Thai_name)  # assume no stop words

individual_dish_cors <- individual_dish_words %>%
    group_by(word) %>% 
    filter(n() >= 2) %>%     # looking for co-occurring words, so must be 2 or greater
    pairwise_cor(word, section, sort = TRUE) 


individual_dish_cors %>%
    filter(correlation < -0.40) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = correlation, size = correlation), show.legend = TRUE) +
    geom_node_point(color = "green", size = 5, alpha = 0.5) +
    geom_node_text(aes(label = name), repel = TRUE) +
    labs(
        title = "Word Pairs in Individual Dishes",
        subtitle = "{ggraph}: Negatively correlated (r = -0.4)",
        caption = "Data: Wikipedia | Graphics: @paulapivat"
    ) +
    theme_void()
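As I understand it, pairwise_cor() measures how often two words appear in the same section relative to how often they appear apart, which for presence/absence data is the phi coefficient. The Python sketch below computes phi directly from the four co-occurrence counts. The `sections` data and the `phi` helper are hypothetical and only meant to show the arithmetic.

```python
import math

# Hypothetical sections: each set holds the words seen in one group of dishes
sections = [
    {"kaeng", "phet", "mu"},
    {"kaeng", "phet", "kai"},
    {"khao", "phat", "mu"},
    {"khao", "phat", "kai"},
]


def phi(word_a, word_b, sections):
    """Phi coefficient: Pearson correlation of two presence/absence indicators."""
    n11 = sum(1 for s in sections if word_a in s and word_b in s)
    n10 = sum(1 for s in sections if word_a in s and word_b not in s)
    n01 = sum(1 for s in sections if word_a not in s and word_b in s)
    n00 = sum(1 for s in sections if word_a not in s and word_b not in s)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # A word present in every section (or none) has no variance
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


print(phi("kaeng", "phet", sections))  # always together
print(phi("kaeng", "khao", sections))  # never together
print(phi("kaeng", "mu", sections))    # unrelated
```

Words that always appear together score 1, words that never share a section score -1, and unrelated words land near 0.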

Summary

We have completed an exploratory data project where we scraped, cleaned, manipulated and visualized data using a combination of Python and R. We also used the tidytext package for basic text mining tasks to see if we could gain some insights into Thai cuisine using words from dish names scraped off Wikipedia.

For more content on data science, R, Python, SQL and more, find me on Twitter.

Visualizing Thai Food with R

Paul Apivat — Fri, 19 Mar 2021 05:12:26 +0000

This post is part 2 in the Thai Food Dishes series.

Data Cleaning

Data cleaning is typically non-linear.

We'll manipulate the data to explore, learn about the data and see that certain things need cleaning or, in some cases, going back to Python to re-scrape. The columns a1 and a6 were scraped differently from other columns due to missing data found during exploration and cleaning.

For certain links, using .find(text=True) did not work as intended, so a slight adjustment was made.

For this post, R is the tool of choice for cleaning the data.

Here are other data cleaning tasks:

Changing column names (snake case)

# read data
df <- read_csv("thai_dishes.csv")

# change column name
df <- df %>%
    rename(
        Thai_name = `Thai name`,
        Thai_name_2 = `Thai name 2`,
        Thai_script = `Thai script`,
        English_name = `English name`
    )

Remove newline escape sequence (\n)

# remove  \n from all columns ----
df$Thai_name <- gsub("[\n]", "", df$Thai_name)
df$Thai_name_2 <- gsub("[\n]", "", df$Thai_name_2)
df$Thai_script <- gsub("[\n]", "", df$Thai_script)
df$English_name <- gsub("[\n]", "", df$English_name)
df$Image <- gsub("[\n]", "", df$Image)
df$Region <- gsub("[\n]", "", df$Region)
df$Description <- gsub("[\n]", "", df$Description)
df$Description2 <- gsub("[\n]", "", df$Description2)

Add/Mutate new columns (major_groupings, minor_groupings):

# Add Major AND Minor Groupings ----
df <- df %>%
    mutate(
        major_grouping = as.character(NA),
        minor_grouping = as.character(NA)
        )

Edit rows for missing data in Thai_name column: 26, 110, 157, 234-238, 240, 241, 246

Note: This was only necessary the first time round. After changing how I scraped a1 and a6, this step is no longer needed:

# If necessary; may not need to do this after scraping a1 and a6 - see above
# Edit Rows for missing Thai_name
df[26,]$Thai_name <- "Khanom chin nam ngiao"
df[110,]$Thai_name <- "Lap Lanna"
df[157,]$Thai_name <- "Kai phat khing"
df[234,]$Thai_name <- "Nam chim chaeo"
df[235,]$Thai_name <- "Nam chim kai"
df[236,]$Thai_name <- "Nam chim paesa"
df[237,]$Thai_name <- "Nam chim sate"
df[238,]$Thai_name <- "Nam phrik i-ke"
df[240,]$Thai_name <- "Nam phrik kha"
df[241,]$Thai_name <- "Nam phrik khaep mu"
df[246,]$Thai_name <- "Nam phrik pla chi"

Save to "edit_thai_dishes.csv":

# Write new csv to save edits made to data frame
write_csv(df, "edit_thai_dishes.csv")

Data Visualization

There are several ways to visualize the data. Because we want to communicate the diversity of Thai dishes, aside from Pad Thai, we want a visualization that captures the many, many options.

I opted for a dendrogram. This graph assumes hierarchy within the data, which fits our project because we can organize the dishes in grouping and sub-grouping.

How might we organize Thai dishes?

We first make a distinction between individual and shared dishes to show that Pad Thai is not even close to being the best individual dish. And, in fact, more dishes fall under the shared grouping.

To avoid cramming too much data into one visual, we'll create two separate visualizations for individual vs. shared dishes.

Here is the first dendrogram representing 52 individual dish alternatives to Pad Thai.

Creating a dendrogram requires using the ggraph and igraph libraries. First, we'll load the libraries and sub-set our data frame by filtering for Individual Dishes:

df <- read_csv("edit_thai_dishes.csv")

library(ggraph)
library(igraph)

df %>%
    select(major_grouping, minor_grouping, Thai_name, Thai_script) %>%
    filter(major_grouping == 'Individual dishes') %>%
    group_by(minor_grouping) %>%
    count()

We create edges and nodes (i.e., from and to) to create the sub-groupings within Individual Dishes (i.e., Rice, Noodles and Misc):

# Individual Dishes ----

# data: edge list
d1 <- data.frame(from="Individual dishes", to=c("Misc Indiv", "Noodle dishes", "Rice dishes"))

d2 <- df %>%
    select(minor_grouping, Thai_name) %>%
    slice(1:53) %>%
    rename(
        from = minor_grouping,
        to = Thai_name
    ) 

edges <- rbind(d1, d2)

# plot dendrogram (individual dishes)
indiv_dishes_graph <- graph_from_data_frame(edges)

ggraph(indiv_dishes_graph, layout = "dendrogram", circular = FALSE) +
    geom_edge_diagonal(aes(edge_colour = edges$from), label_dodge = NULL) +
    geom_node_text(aes(label = name, filter = leaf, color = 'red'), hjust = 1.1, size = 3) +
    geom_node_point(color = "whitesmoke") +
    theme(
        plot.background = element_rect(fill = '#343d46'),
        panel.background = element_rect(fill = '#343d46'),
        legend.position = 'none',
        plot.title = element_text(colour = 'whitesmoke', face = 'bold', size = 25),
        plot.subtitle = element_text(colour = 'whitesmoke', face = 'bold'),
        plot.caption = element_text(color = 'whitesmoke', face = 'italic')
    ) +
    labs(
        title = '52 Alternatives to Pad Thai',
        subtitle = 'Individual Thai Dishes',
        caption = 'Data: Wikipedia | Graphic: @paulapivat'
    ) +
    expand_limits(x = c(-1.5, 1.5), y = c(-0.8, 0.8)) +
    coord_flip() +
    annotate("text", x = 47, y = 1, label = "Miscellaneous (7)", color = "#7CAE00")+
    annotate("text", x = 31, y = 1, label = "Noodle Dishes (24)", color = "#00C08B") +
    annotate("text", x = 8, y = 1, label = "Rice Dishes (22)", color = "#C77CFF") +
    annotate("text", x = 26, y = 2, label = "Individual\nDishes", color = "#F8766D")
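The edge list is the key data structure here: one row per parent-to-child link, first from the root to each sub-grouping, then from each sub-grouping to its dishes. A minimal Python sketch of that structure follows. The `rows` sample and the `build_edges` helper are hypothetical, and the grouping assignments are made up for illustration.

```python
# Hypothetical rows of (minor_grouping, Thai_name); assignments are illustrative
rows = [
    ("Rice dishes", "Dish A"),
    ("Rice dishes", "Dish B"),
    ("Noodle dishes", "Dish C"),
    ("Misc Indiv", "Dish D"),
]


def build_edges(root, rows):
    """Return (from, to) pairs: root -> groups, then group -> dish."""
    groups = sorted({group for group, _ in rows})
    top_level = [(root, group) for group in groups]
    leaves = [(group, name) for group, name in rows]
    return top_level + leaves


edges = build_edges("Individual dishes", rows)
for parent, child in edges:
    print(parent, "->", child)
```

This mirrors what rbind(d1, d2) produces in the R code: the first block of rows defines the hierarchy's second level and the rest attach the leaves.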

What is the best way to organize the different dishes?

There are approximately 4X as many shared dishes as individual dishes, so the dendrogram should be circular to fit the names of all dishes in one graphic.

A wonderful resource I use regularly for these types of visuals is the R Graph Gallery. There was a slight issue in how the text angles were calculated so I submitted a PR to fix.

Perhaps distinguishing between individual and shared dishes is too crude. Within the dendrogram for 201 shared Thai dishes, we can see further sub-groupings including Curries, Sauces/Pastes, Steamed, Grilled, Deep-Fried, Fried & Stir-Fried, Salads, Soups and other Misc:

# Shared Dishes ----
df %>%
    select(major_grouping, minor_grouping, Thai_name, Thai_script) %>%
    filter(major_grouping == 'Shared dishes') %>%
    group_by(minor_grouping) %>%
    count() %>%
    arrange(desc(n))

d3 <- data.frame(from="Shared dishes", to=c("Curries", "Soups", "Salads",
                                            "Fried and stir-fried dishes", "Deep-fried dishes", "Grilled dishes",
                                            "Steamed or blanched dishes", "Stewed dishes", "Dipping sauces and pastes", "Misc Shared"))


d4 <- df %>%
    select(minor_grouping, Thai_name) %>%
    slice(54:254) %>%
    rename(
        from = minor_grouping,
        to = Thai_name
    )

edges2 <- rbind(d3, d4)

# create a vertices data.frame. One line per object of hierarchy
vertices = data.frame(
    name = unique(c(as.character(edges2$from), as.character(edges2$to)))
)

# add column with group of each name. Useful to later color points
vertices$group = edges2$from[ match(vertices$name, edges2$to)]

# Add information concerning the label we are going to add: angle, horizontal adjustment and potential flip
# calculate the ANGLE of the labels
vertices$id=NA
myleaves=which(is.na(match(vertices$name, edges2$from)))
nleaves=length(myleaves)
vertices$id[myleaves] = seq(1:nleaves)
vertices$angle = 360 / nleaves * vertices$id + 90    


# calculate the alignment of labels: right or left
vertices$hjust<-ifelse( vertices$angle < 275, 1, 0)



# flip angle BY to make them readable
vertices$angle<-ifelse(vertices$angle < 275, vertices$angle+180, vertices$angle)

# plot dendrogram (shared dishes)
shared_dishes_graph <- graph_from_data_frame(edges2)

ggraph(shared_dishes_graph, layout = "dendrogram", circular = TRUE) +
    geom_edge_diagonal(aes(edge_colour = edges2$from), label_dodge = NULL) +
    geom_node_text(aes(x = x*1.15, y=y*1.15, filter = leaf, label=name, angle = vertices$angle, hjust= vertices$hjust, colour= vertices$group), size=2.7, alpha=1) +
    geom_node_point(color = "whitesmoke") +
    theme(
        plot.background = element_rect(fill = '#343d46'),
        panel.background = element_rect(fill = '#343d46'),
        legend.position = 'none',
        plot.title = element_text(colour = 'whitesmoke', face = 'bold', size = 25),
        plot.subtitle = element_text(colour = 'whitesmoke', margin = margin(0,0,30,0), size = 20),
        plot.caption = element_text(color = 'whitesmoke', face = 'italic')
    ) +
    labs(
        title = 'Thai Food is Best Shared',
        subtitle = '201 Ways to Make Friends',
        caption = 'Data: Wikipedia | Graphic: @paulapivat'
    ) +
    #expand_limits(x = c(-1.5, 1.5), y = c(-0.8, 0.8)) +
    expand_limits(x = c(-1.5, 1.5), y = c(-1.5, 1.5)) +
    coord_flip() +
    annotate("text", x = 0.4, y = 0.45, label = "Steamed", color = "#F564E3") +
    annotate("text", x = 0.2, y = 0.5, label = "Grilled", color = "#00BA38") +
    annotate("text", x = -0.2, y = 0.5, label = "Deep-Fried", color = "#DE8C00") +
    annotate("text", x = -0.4, y = 0.1, label = "Fried &\n Stir-Fried", color = "#7CAE00") +
    annotate("text", x = -0.3, y = -0.4, label = "Salads", color = "#00B4F0") +
    annotate("text", x = -0.05, y = -0.5, label = "Soups", color = "#C77CFF") +
    annotate("text", x = 0.3, y = -0.5, label = "Curries", color = "#F8766D") +
    annotate("text", x = 0.5, y = -0.1, label = "Misc", color = "#00BFC4") +
    annotate("text", x = 0.5, y = 0.1, label = "Sauces\nPastes", color = "#B79F00")
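The label-angle arithmetic in the code above is easier to follow on a tiny example. The Python sketch below reproduces the same steps (spread the leaves over 360 degrees, offset by 90, then flip labels whose angle is under 275) for a hypothetical four-leaf tree. The `label_layout` helper is my own name, and the 275 threshold is taken from the R code.

```python
def label_layout(n_leaves):
    """Return (leaf_id, angle, hjust) for each leaf of a circular dendrogram."""
    layout = []
    for leaf_id in range(1, n_leaves + 1):
        # Spread leaves evenly around the circle, offset by 90 degrees
        angle = 360 / n_leaves * leaf_id + 90
        if angle < 275:
            # Right-align and rotate by 180 so the text is not upside down
            hjust = 1
            angle = angle + 180
        else:
            hjust = 0
        layout.append((leaf_id, angle, hjust))
    return layout


for leaf_id, angle, hjust in label_layout(4):
    print(leaf_id, angle, hjust)
```

In the real plot nleaves is the number of dish names, so each label is rotated by a small step relative to its neighbor.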

Using Python to Scrape Thai Food Data

Paul Apivat — Fri, 19 Mar 2021 05:00:48 +0000

Photo by Alyssa Kowalski on Unsplash

Overview

"Let's order Thai."

"Great, what's your go-to dish?"

"Pad Thai."

This has bugged me for years and is the genesis for this project.

People need to know they have other choices aside from Pad Thai. Pad Thai is one of 53 individual dishes, and stopping there risks missing out on at least 201 shared Thai dishes (source: Wikipedia).

This project is an opportunity to build a data set of Thai dishes by scraping tables off Wikipedia. We will use Python for web scraping and R for visualization. The data is scraped with Beautiful Soup (Python), then pre-processed with dplyr and visualized with ggplot2 (R).

Furthermore, we'll use the tidytext package in R to explore the names of Thai dishes (in English) to see if we can learn some interesting things from text data.

Finally, there is an opportunity to make an open source contribution.

The project repo is here.

Exploratory Questions

The purpose of this analysis is to generate questions.

Because exploratory analysis is iterative, these questions were generated in the process of manipulating and visualizing data. We can use these questions to structure the rest of the post:

How might we organize Thai dishes?
What is the best way to organize the different dishes?
Which raw material(s) are most popular?
Which raw materials are most important?
Could you learn about Thai food just from the names of the dishes?

Web Scraping

We scraped over 300 Thai dishes. For each dish, we got:

Thai name
Thai script
English name
Region
Description

First, we'll use the following Python libraries/modules:

import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
import ssl
import pandas as pd

We'll use requests to send an HTTP request to the Wikipedia URL we need. We'll access network sockets using 'secure sockets layer' (SSL). Then we'll read in the HTML data to parse it with Beautiful Soup.

Before using Beautiful Soup, we want to understand the structure of the page (and tables) we want to scrape under inspect element on the browser (note: I used Chrome). We can see that we want the table tag, along with class of wikitable sortable.

The main function we'll use from Beautiful Soup is findAll() and the three parameters are th (Header Cell in HTML table), tr (Row in HTML table) and td (Standard Data Cell).

First, we'll save the table headers in a list, which we'll use when creating an empty dictionary to store the data we need.

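# all_tables holds the 'wikitable sortable' tables found with Beautiful Soup (that step is not shown here)
header = [item.text.rstrip() for item in all_tables[0].findAll('th')]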

table = dict([(x, 0) for x in header])

Initially, we want to scrape one table, knowing that we'll need to repeat the process for all 16 tables. Therefore we'll use a nested loop. Because all tables have 6 columns, we'll want to create 6 empty lists.

We'll scrape through all table rows tr and check for 6 cells (which we should have for 6 columns), then we'll append the data to each empty list we created.

# loop through all 16 tables
a = list(range(16))

# 6 empty list (for 6 columns) to store data
a1 = []
a2 = []
a3 = []
a4 = []
a5 = []
a6 = []

# nested loop for looping through all 16 tables, then all tables individually
for i in a:
    for row in all_tables[i].findAll('tr'):
        cells = row.findAll('td')
        if len(cells) == 6:
            a1.append([string for string in cells[0].strings])
            a2.append(cells[1].find(text=True))
            a3.append(cells[2].find(text=True))
            a4.append(cells[3].find(text=True))
            a5.append(cells[4].find(text=True))
            a6.append([string for string in cells[5].strings])

You'll note the code for a1 and a6 is slightly different. In retrospect, I found that cells[0].find(text=True) did not yield certain texts, particularly if they were links, so a slight adjustment was made.

The .strings attribute yields NavigableString objects, while .text returns a unicode string (see stack overflow explanation).

After we've scraped the data, we'll need to store it in a dictionary before converting to a data frame:

# create dictionary
table = dict([(x, 0) for x in header])

# append dictionary with corresponding data list
table['Thai name'] = a1
table['Thai script'] = a2
table['English name'] = a3
table['Image'] = a4
table['Region'] = a5
table['Description'] = a6

# turn dict into dataframe
df_table = pd.DataFrame(table)

For a1 and a6, we need to do an extra step of joining the strings together, so I've created two additional corresponding columns, Thai name 2 and Description2:

# Need to Flatten Two Columns: 'Thai name' and 'Description'
# Create two new columns
df_table['Thai name 2'] = ""
df_table['Description2'] = ""

# join all words in the list for each of 328 rows and set to df_table['Description2'] column
# automatically flatten the list
df_table['Description2'] = [
    ' '.join(cell) for cell in df_table['Description']]

df_table['Thai name 2'] = [
    ' '.join(cell) for cell in df_table['Thai name']]

After we've scraped all the data and converted from dictionary to data frame, we'll write to CSV to prepare for data cleaning in R (note: I saved the csv as thai_dishes.csv, but you can choose a different name).

How Positive are Your Facebook Posts?

Paul Apivat — Fri, 29 Jan 2021 08:22:17 +0000

Rule-based Sentiment Analysis Using Python and R

Overview

Why Sentiment Analysis?

NLP is a subfield of linguistics, computer science and artificial intelligence (wiki), and you could spend years studying it.

However, I wanted a quick dive to get an intuition for how NLP works, and we'll do that via sentiment analysis, categorizing text by its polarity.

We can't help but feel motivated to see insights about our own social media posts, so we'll turn to a well-known platform.

How well does Facebook know us?

To find out, I downloaded 14 years of posts to apply text and sentiment analysis. We'll use Python to read and parse JSON data from Facebook.

We'll perform tasks such as tokenization and normalization aided by Python's Natural Language Toolkit, NLTK. Then, we'll use the Vader module (Hutto & Gilbert, 2014) for rule-based (lexicon) sentiment analysis.

Finally, we'll transition our work flow to R and the tidyverse for data manipulation and visualization.

Getting Data

First, you'll need to download your own Facebook data by following: Settings & Privacy > Settings > Your Facebook Information > Download Your Information > (select) Posts.

Below, I named my file your_posts_1.json, but you can change this.
We'll use Python's json module to read in the data. We can get a feel for the data with type and len.

import json

# load json into python, assign to 'data'
with open('your_posts_1.json') as file:
    data = json.load(file)

type(data)     # a list
type(data[0])  # first object in the list: a dictionary
len(data)      # my list contains 2166 dictionaries

Here are the Python libraries we use in this post:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import LancasterStemmer, WordNetLemmatizer      # OPTIONAL (more relevant for individual words)
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import re
import unicodedata
import nltk
import json
import inflect
import matplotlib.pyplot as plt

Natural Language Toolkit is a popular Python platform for working with human language data. While it has over 50 lexical resources, we'll use the Vader Sentiment Lexicon, which is specifically attuned to sentiments expressed in social media.

Regex (regular expressions) will be used to remove punctuation.

Unicode Database will be used to remove non-ASCII characters.

JSON module helps us to read in json from Facebook.

Inflect helps us to convert numbers to words.

Pandas is a powerful data manipulation and data analysis tool for when we save our text data into a data frame and write to csv.

After we have our data, we'll dig through to get actual text data (our posts).

We'll store this in a list.

Note: the data key occasionally returns an empty array and we want to skip over those by checking if len(v) > 0.

# create empty list
empty_lst = []

# multiple nested loops to store all posts in empty list
for dct in data:
    for k, v in dct.items():
        if k == 'data':
            if len(v) > 0:
                for k_i, v_i in v[0].items():
                    if k_i == 'post':
                        empty_lst.append(v_i)

print("This is the empty list: ", empty_lst)
print("\nLength of list: ", len(empty_lst))

We now have a list of strings.

Tokenization

We'll loop through our list of strings (empty_lst) to tokenize each sentence with nltk.sent_tokenize(). We want to split the text into individual sentences.

This yields a list of lists, which we'll flatten:

# - list of lists, len: 1762 (each list contains sentences)
nested_sent_token = [nltk.sent_tokenize(lst) for lst in empty_lst]

# flatten list, len: 3241
flat_sent_token = [item for sublist in nested_sent_token for item in sublist]
print("Flatten sentence token: ", len(flat_sent_token))

Normalizing Sentences

For context on the functions used in this section, check out this article by Matthew Mayo on Text Data Preprocessing.

First, we'll remove non-ASCII characters (remove_non_ascii(words)) including: #, -, ' and ?, among many others. Then we'll lowercase (to_lowercase(words)), remove punctuation (remove_punctuation(words)), replace numbers (replace_numbers(words)), and remove stopwords (remove_stopwords(words)).

Example stopwords are: your, yours, yourself, yourselves, he, him, his, himself etc.

This puts each sentence on an equal playing field.

# Remove Non-ASCII
def remove_non_ascii(words):
    """Remove non-ASCII character from List of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode(
            'ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words


# To LowerCase
def to_lowercase(words):
    """Convert all characters to lowercase from List of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words


# Remove Punctuation, then Re-Plot Frequency Graph
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words


# Replace Numbers with Textual Representations
def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

# Remove Stopwords
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    # build the stop word set once instead of reloading it for every word
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

# Combine all functions into Normalize() function
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words

The below screen cap gives us an idea of the difference between sentence normalization vs non-normalization.

sents = normalize(flat_sent_token)
print("Length of sentences list: ", len(sents))   # 3194

NOTE: The process of stemming and lemmatization makes more sense for individual words (over sentences), so we won't use them here.

Frequency

You can use the FreqDist() function to get the most common sentences. Then, you could plot a line chart for a visual comparison of the most frequent sentences.

Although simple, counting frequencies can yield some insights.

from nltk.probability import FreqDist

# Find frequency of sentence
fdist_sent = FreqDist(sents)
fdist_sent.most_common(10)   

# Plot
fdist_sent.plot(10)

Sentiment Analysis

We'll use the Vader module from NLTK. Vader stands for:

Valence Aware Dictionary and sEntiment Reasoner.

We are taking a Rule-based/Lexicon approach to sentiment analysis because we have a fairly large dataset, but lack labeled data to build a robust training set. Thus, Machine Learning would not be ideal for this task.

To get an intuition for how the Vader module works, we can visit the github repo to view vader_lexicon.txt (source). This is a dictionary that has been empirically validated. Sentiment ratings are provided by 10 independent human raters (pre-screened, trained and checked for inter-rater reliability).

Scores range from (-4) Extremely Negative to (4) Extremely Positive, with (0) as Neutral. For example, "die" is rated -2.9, while "dignified" has a 2.2 rating. For more details visit their (repo).
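To build intuition for how a lexicon turns word ratings into a sentence score, here is a deliberately simplified Python sketch. It only sums the ratings of known words and squashes the total into the range -1 to 1 with x / sqrt(x^2 + alpha), using alpha = 15, which to my understanding is the normalization behind Vader's compound score. The real module also applies rules for negation, punctuation, capitalization and intensifiers that are left out here. The two-word `LEXICON` and the `compound_score` helper are assumptions for illustration.

```python
import math

# Two ratings quoted above; the real lexicon has thousands of entries
LEXICON = {"die": -2.9, "dignified": 2.2}


def compound_score(sentence, lexicon=LEXICON, alpha=15):
    """Sum word valences and normalize the total into the range (-1, 1)."""
    total = sum(lexicon.get(word, 0.0) for word in sentence.lower().split())
    return total / math.sqrt(total * total + alpha)


print(compound_score("a dignified reply"))  # positive
print(compound_score("plants die"))         # negative
print(compound_score("nothing rated"))      # no known words, so neutral
```

Sentences with no rated words score exactly zero, which is one reason many short social media posts end up classified as neutral.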

We'll create two empty lists to store the sentences and the polarity scores, separately.

sentiment captures each sentence along with sent_scores, the dictionary returned by calling polarity_scores() on an nltk.sentiment.vader.SentimentIntensityAnalyzer instance (i.e., negative, neutral, positive and compound scores).

sentiment2 captures each polarity and value in a list of tuples.

The below screen cap should give you a sense of what we have:

After we have appended each sentence (sentiment) and their polarity scores (sentiment2, negative, neutral, positive), we'll create data frames to store these values.

Then, we'll write the data frames to CSV to transition to R. Note that we set index to false when saving for CSV. Python starts counting at 0, while R starts at 1, so we're better off re-creating the index as a separate column in R.

NOTE: There are more efficient ways for what I'm doing here. My solution is to save two CSV files and move the work flow over to R for further data manipulation and visualization. This is primarily a personal preference for handling data frames and visualizations in R, but I should point out this can be done with pandas and matplotlib.

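import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')   # run once to fetch the lexicon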

sid = SentimentIntensityAnalyzer()

sentiment = []
sentiment2 = []

for sent in sents:
    sent1 = sent
    sent_scores = sid.polarity_scores(sent1)
    for x, y in sent_scores.items():
        sentiment2.append((x, y))
    sentiment.append((sent1, sent_scores))
    # print(sentiment)

# sentiment
cols = ['sentence', 'numbers']
result = pd.DataFrame(sentiment, columns=cols)
print("First five rows of results: ", result.head())

# sentiment2
cols2 = ['label', 'values']
result2 = pd.DataFrame(sentiment2, columns=cols2)
print("First five rows of results2: ", result2.head())

# save to CSV
result.to_csv('sent_sentiment.csv', index=False)
result2.to_csv('sent_sentiment_2.csv', index=False)

Data Transformation

From this point forward, we'll be using R and the tidyverse for data manipulation and visualization. RStudio is the IDE of choice here. We'll create an R script to store all our data transformation and visualization steps. We should be in the same directory in which the above CSV files were created with pandas.

We'll load the two CSV files we saved and the tidyverse library:

library(tidyverse)

# load data
df <- read_csv("sent_sentiment.csv")       
df2 <- read_csv('sent_sentiment_2.csv')

We'll create another column that matches the index for the first data frame (sent_sentiment.csv). I save it as df1, but you could overwrite the original df if you wanted.

# create a unique identifier for each sentence
df1 <- df %>%
    mutate(row = row_number())

Then, for the second data frame (sent_sentiment_2.csv), we'll create another column matching the index, but also use pivot_wider from the tidyr package. NOTE: You'll want to group_by label first, then use mutate to create a unique identifier.

We'll then use pivot_wider to ensure that all polarity values (negative, neutral, positive) have their own columns.

By creating a unique identifier using mutate and row_number(), we'll be able to join (left_join) by row.

Finally, I save the operation to df3 which allows me to work off a fresh new data frame for visualization.

# long-to-wide for df2
# note: first, group by label; then, create a unique identifier for each label then use pivot_wider

df3 <- df2 %>%
    group_by(label) %>%
    mutate(row = row_number()) %>%
    pivot_wider(names_from = label, values_from = values) %>%
    left_join(df1, by = 'row') %>%
    select(row, sentence, neg:compound, numbers)
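As noted earlier, the same reshaping could stay in Python. Here is a minimal sketch of the group-then-number-then-pivot idea using only the standard library; the rows are hypothetical and merely shaped like sent_sentiment_2.csv:

```python
from collections import defaultdict

# Hypothetical long-format rows, shaped like sent_sentiment_2.csv:
# four (label, value) pairs per sentence, in sentence order
long_rows = [
    ("neg", 0.0), ("neu", 0.4), ("pos", 0.6), ("compound", 0.7),
    ("neg", 0.5), ("neu", 0.5), ("pos", 0.0), ("compound", -0.4),
]

seen = defaultdict(int)    # how many times each label has appeared so far
wide = defaultdict(dict)   # row number -> {label: value}

for label, value in long_rows:
    seen[label] += 1
    row = seen[label]            # 1-based, like row_number() within group_by(label)
    wide[row][label] = value     # one column per label, like pivot_wider

wide_rows = [{"row": row, **scores} for row, scores in sorted(wide.items())]
for r in wide_rows:
    print(r)
```

Numbering within each label is what lets the n-th neg, neu, pos and compound values line up on the same row, which is why the group_by step has to come first.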

Visualization

First, we'll visualize the positive and negative polarity scores separately, across all 3194 sentences (your numbers will vary).

Here are positivity scores:

Here are negativity scores:

When I sum positivity and negativity scores to get a ratio, it's approximately 568:97 or 5.8x more positive than negative according to Vader (Valence Aware Dictionary and sEntiment Reasoner).

The Vader module will take in every sentence and assign a compound valence score from -1 (most negative) to 1 (most positive). We can classify sentences as pos (positive), neu (neutral) and neg (negative), or use the composite (compound) score (i.e., a normalized, weighted composite score). For more details, see vader-sentiment documentation.
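To turn a compound score into a label, a common convention (the cutoffs suggested in the VADER README) is to treat scores of 0.05 and above as positive and -0.05 and below as negative. The helper below is a sketch under that assumption; the function name and example scores are made up for illustration:

```python
def classify_compound(compound: float, threshold: float = 0.05) -> str:
    """Label a compound score using the cutoffs suggested in the VADER README."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Hypothetical compound scores, not taken from the Facebook data
example_scores = [0.6369, -0.4404, 0.0, 0.03]
labels = [classify_compound(s) for s in example_scores]
print(labels)
```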

Here is a chart to see both positive and negative scores together (positive = blue, negative = red, neutral = black).

Finally, we can also use histograms to see the distribution of negative and positive sentiment among the sentences:

Non-Normalized Data

It turns out the Vader module is fully capable of analyzing sentences with punctuation, word-shape (capitalization for emphasis), slang and even utf-8 encoded emojis.

So to see if there would be any difference if we implemented sentiment analysis without normalization, I re-ran all the analyses above.

Here are the two versions of the data for comparison: normalized on top and non-normalized on the bottom.

While there are expected slight differences, they are only slight.

Summary

I downloaded 14 years' worth of Facebook posts to run a rule-based sentiment analysis and visualize the results, using a combination of Python and R.

I enjoyed using both for this project and sought to play to their strengths. I found parsing JSON straightforward with Python, but once we transitioned to data frames, I was itching to get back to R.

Because we lacked labeled data, using a rule-based/lexicon-approach to sentiment analysis made sense. Now that we have a label for valence scores, it may be possible to take a machine learning approach to predict the valence of future posts.

References

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

Quickly Analyze Relationships in your Twitter Data

Paul Apivat — Sat, 16 Jan 2021 16:05:20 +0000

Overview & Setup

This post uses various R libraries and functions to help you explore your Twitter Analytics data. The first thing to do is download data from analytics.twitter.com. The assumption here is that you're already a Twitter user and have been using it for at least 6 months.

Once there, you'll click on the Tweets tab, which should bring you to your Tweet activity with the option to Export data:

Once you click on Export data, you'll choose "By day", which provides your impressions and engagements metrics for every day (you'll also select the time period in the drop-down menu right next to Export data; the default is "Last 28 Days").

Note: The other option is to choose "By Tweet" and that will download the text of each Tweet along with associated metrics. We could potentially do fun text analysis with this, but we'll save that for another post.

For this post, I downloaded all available data, which goes five months back.

After downloading, you'll want to read in the data and, in our case, combine all five months into one data frame. We'll use the read_csv() function from the readr package, which is included in tidyverse. Then we'll use rbind() to combine the five data frames by rows:

library(tidyverse)

# load data from September to mid-January
df1 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20200901_20201001_en.csv")
df2 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201001_20201101_en.csv")
df3 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201101_20201201_en.csv")
df4 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201201_20210101_en.csv")
df5 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20210101_20210112_en.csv")

# combining ALL five dataframes into ONE, by rows
df <- rbind(df1, df2, df3, df4, df5)

Exploring Relationships

Twitter analytics tracks several metrics that are broadly grouped under Engagements, including: retweets, replies, likes, user profile clicks, url clicks, hashtag clicks, detail expands, media views and media engagements.

There are other metrics like "app opens" and "promoted engagements", which are services I have not used and so do not have any data available.

A Guiding Question

It's useful to have a guiding question as it helps focus your exploration. Let's say, I was interested in whether one of my tweets prompted a reader to click on my profile. The metric for this is user profile clicks.

My initial guiding question for this post is:

Which metrics are most strongly correlated with User Profile Clicks?

You could simply use the cor.test() function, which comes with base R, to go one by one between each metric and User Profile Clicks. For example, below we calculate the correlation between three pairs of variables: User Profile Clicks against retweets, replies and likes, separately. After a while, this can get tedious.

cor.test(x = df$`user profile clicks`, y = df$retweets)
cor.test(x = df$`user profile clicks`, y = df$replies)
cor.test(x = df$`user profile clicks`, y = df$likes)
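The one-by-one approach can also be looped. The sketch below is written in Python with only the standard library to show the idea of correlating one target metric against all the others in a single pass; the metric values are invented for illustration and the pearson helper is my own, not part of the original analysis:

```python
import math
from typing import Dict, List

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up daily metrics, for illustration only
metrics: Dict[str, List[float]] = {
    "user profile clicks": [1, 3, 2, 5, 4],
    "retweets":            [0, 2, 1, 4, 3],
    "replies":             [2, 1, 2, 1, 2],
    "likes":               [5, 9, 6, 14, 10],
}

target = "user profile clicks"
correlations = {
    name: pearson(metrics[target], values)
    for name, values in metrics.items() if name != target
}

# Strongest relationships first, regardless of sign
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {r:.2f}")
```

Unlike cor.test(), this only returns the coefficient and does not test significance.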

A quicker way to explore the relationship between pairs of metrics throughout a dataset is to use a correlogram.

We'll start with base R. You'll want to limit the number of variables you visualize so the correlogram doesn't become too cluttered. Here are four variables that correlate the highest with User Profile Clicks:

# four columns are selected along with user profile clicks to plot
df %>%
    select(8, 12, 19:20, `user profile clicks`) %>%
    plot(pch = 20, cex = 1.5, col="#69b3a2")

Here's a visual:

Here are another four metrics with moderate relationships:

df %>%
    select(6:7, 10:11, `user profile clicks`) %>%
    plot(pch = 20, cex = 1.5, col="#69b3a2")

Visually, you can see the moderate relationship scatter plots are more dispersed, with a less identifiable direction.

While base R is dependable, we can get more informative plots with the GGally package. Here are the four highly correlated variables with User Profile Clicks:

library(GGally)

# GGally, Strongest Related
df %>%
    select(8, 12, 19:20, `user profile clicks`) %>%
    ggpairs(
        diag = NULL,
        title = "Strongest Relationships with User Profile Clicks: Sep 2020 - Jan 2021",
        axisLabels = c("internal"),
        xlab = "Value"
    )

Here's the correlogram between the four most highly correlated variables and user profile clicks:

Here are the moderately correlated variables with User Profile Clicks:

As you can see, not only do these provide scatter plots, but they also show the numerical values of the correlation between each pair of variables, which is much more informative than base R.

Now, it's entirely possible that the pattern of correlation in your data is different, as the initial patterns we're seeing here are not meant to generalize to a different dataset.


Grasping Gradient Descent using Python

Paul Apivat — Sat, 26 Dec 2020 14:38:59 +0000

Photo by Fineas Anton on Unsplash

Overview

In this post, we'll explore Gradient Descent from the ground up starting conceptually, then using code to build up our intuition brick by brick.

While this post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus, for this post I am drawing on external sources, including Aurélien Géron's Hands-On Machine Learning, to provide context for why and when gradient descent is used.

We'll also be using external libraries such as numpy, that are generally avoided in Data Science from Scratch, to help highlight concepts.

While the book introduces gradient descent as a standalone topic, I find it more intuitive to reason about it within the context of a regression problem.

Setup

In any modeling task, there is error, and our objective is to minimize the errors so that when we develop models from our training data, we'll have some confidence that the predictions will work on test data and completely new data.

We'll train a linear regression model. Our dataset will only have three data points. To create the model, we'll set up parameters (slope and intercept) that best "fit" the data (i.e., best-fitting line), for example:

We know the values for both x and y, so we can calculate the slope and intercept directly through the normal equation, which is the analytical approach to finding regression coefficients (slope and intercept):

# Normal Equation

import numpy as np
import matplotlib.pyplot as plt

x = np.array([2, 4, 5])
y = np.array([45, 85, 105])

# computing Normal Equation
x_b = np.c_[np.ones((3, 1)), x]       # add x0 = 1 to each of three instances
theta = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)

# array([ 5., 20.])
theta

The key line is np.linalg.inv() which computes the multiplicative inverse of a matrix.

Our slope is 20 and intercept is 5 (i.e., theta).
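With a single feature, the normal equation reduces to a covariance-over-variance formula, so the result can be cross-checked without numpy. This is a small sketch; the function name simple_least_squares is mine, not from the book:

```python
from typing import List, Tuple

def simple_least_squares(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form slope and intercept for a one-feature linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # the fitted line passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = simple_least_squares([2, 4, 5], [45, 85, 105])
print(round(slope, 6), round(intercept, 6))
```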

We could also have used the more familiar "rise over run" ((85 - 45) / (4 - 2)) or (40/2) or 20, but we want to illustrate the normal equation which should come in handy when we go beyond the simplistic three data point example.

We could also solve the same problem with least squares. The LinearRegression class from sklearn is built on a least squares solver, and we can call NumPy's version (np.linalg.lstsq()) directly:

# Least Squares

from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([2, 4, 5])
y = np.array([45, 85, 105])

x = x.reshape(-1, 1)              # reshape into a 2D column, the shape sklearn expects

x_b = np.c_[np.ones((3, 1)), x]   # add x0 = 1 to each of three instances

theta, residuals, rank, s = np.linalg.lstsq(x_b, y, rcond=1e-6)

# array([ 5., 20.])
print("theta:", theta)

This approach also yields the slope (20) and intercept (5) directly.

We know the parameters of x and y in our example, but we want to see how learning from data would work. Here's the equation we're working with:

y = 20 * x + 5

And here's what it looks like (intercept = 5, slope = 20)

Gradient Descent

Why?

The normal equation and the least squares approach can handle large training sets efficiently, but when your model has a large number of features or too many training instances to fit into memory, gradient descent is an often used alternative.

Moreover, linear least squares assumes the errors have a normal distribution and the relationship in the data is linear (this is where closed-form solutions like the normal equation excel). When the data is non-linear, an iterative solution (gradient descent) can be used.

With linear regression we seek to minimize the sum-of-squares differences between the observed data and the predicted values (aka the error), in a non-iterative fashion.

Alternatively, we use gradient descent to find the slope and intercept that minimize the average squared error, but in an iterative fashion.

Using Gradient Descent to Fit a Model

The process for gradient descent is to start with a random slope and intercept, then compute the gradient of the mean squared error, while adjusting the slope/intercept (theta) in the direction that continues to minimize the error. This is repeated iteratively until we find a point where errors are most minimized.

NOTE: This section builds heavily on a previous post on linear algebra. You'll want to read this post to get a feel for the functions used to construct the functions we see in this post.

from typing import Callable, Iterator, List, TypeVar
import math
import random
import matplotlib.pyplot as plt
import numpy as np

# type alias used in the function signatures below
Vector = List[float]

x = np.array([2, 4, 5])

# instead of putting y directly, we'll use the equation: 20 * x + 5, which is a direct representation of its relationship to x

# y = np.array([45, 85, 105])   

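# both x and y are represented in inputs
# note: range(2, 6) yields x = 2, 3, 4, 5, so inputs holds four points on the line (one more than the x array above)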
inputs = [(x, 20 * x + 5) for x in range(2, 6)]

First, we'll start with random values for the slope and intercept; we'll also establish a learning rate, which controls how much a change in the model is warranted in response to the estimated error each time the model parameters (slope and intercept) are updated.

# 1. start with a random value for slope and intercept
theta = [random.uniform(-1, 1), random.uniform(-1, 1)]

learning_rate = 0.001
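To get a feel for what the learning rate does, here is a self-contained sketch that runs the same kind of update with three different rates. The fixed starting point of [0, 0] and the helper names are assumptions made so the runs are comparable; they are not from the book:

```python
from typing import List, Tuple

Point = Tuple[float, float]
points: List[Point] = [(x, 20 * x + 5) for x in range(2, 6)]

def mean_squared_error(theta: List[float]) -> float:
    slope, intercept = theta
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

def descend(learning_rate: float, epochs: int = 100) -> List[float]:
    """Run plain gradient descent from a fixed start of [0, 0]."""
    theta = [0.0, 0.0]   # fixed start (an assumption) so runs are comparable
    for _ in range(epochs):
        errors = [(theta[0] * x + theta[1] - y, x) for x, y in points]
        grad_slope = sum(2 * e * x for e, x in errors) / len(points)
        grad_intercept = sum(2 * e for e, _ in errors) / len(points)
        theta = [theta[0] - learning_rate * grad_slope,
                 theta[1] - learning_rate * grad_intercept]
    return theta

start_error = mean_squared_error([0.0, 0.0])
for rate in (0.001, 0.01, 0.1):
    print(rate, mean_squared_error(descend(rate)))
```

On this data, the smallest rate reduces the error slowly, the middle rate reduces it faster, and the largest rate overshoots on every step so the error grows instead of shrinking.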

Next, we'll compute the mean of the gradients, then adjust the slope/intercept in the direction of minimizing the gradient, which is based on the error.

You'll note that this for-loop has 100 iterations. The more iterations we go through, the more that errors are minimized and the more we approach a slope/intercept where the model "fits" the data better.

You can see in this list, [linear_gradient(x, y, theta) for x, y in inputs], that our linear_gradient function is applied to the known x and y values in the list of tuples, inputs, along with random values for slope/intercept (theta).

We multiply each x value with a random value for slope, then add a random value for intercept. This yields the initial prediction. Error is the gap between the initial prediction and actual y values. We minimize the squared error by using its gradient.

# start with a function that determines the gradient based on the error from a single data point
def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
    slope, intercept = theta
    predicted = slope * x + intercept   # model prediction
    error = (predicted - y)             # error is (predicted - actual)
    squared_error = error ** 2          # minimize squared error
    grad = [2 * error * x, 2 * error]   # using its gradient
    return grad
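The two entries of grad come from differentiating the squared error of a single point with respect to each parameter. Writing m for the slope and b for the intercept (my notation, not the book's), the chain rule gives:

```latex
\begin{aligned}
E(m, b) &= (m x + b - y)^2 \\
\frac{\partial E}{\partial m} &= 2\,(m x + b - y)\,x \\
\frac{\partial E}{\partial b} &= 2\,(m x + b - y)
\end{aligned}
```

Here (m x + b - y) is the error variable in the code, so the two partial derivatives are exactly 2 * error * x and 2 * error.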

The linear_gradient function along with initial parameters are then passed to vector_mean, which utilizes scalar_multiply and vector_sum:


def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sum all corresponding elements (componentwise sum)"""
    # Check that vectors is not empty
    assert vectors, "no vectors provided!"
    # Check the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

This yields the gradient. Then, each gradient_step is determined as our function adjusts the initial random theta values (slope/intercept) in the direction that minimizes the error.

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Moves `step_size` in the `gradient` direction from `v`"""
    assert len(v) == len(gradient)
    step = scalar_multiply(step_size, gradient)
    return add(v, step)

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

All this comes together in this for-loop to print out how the slope and intercept change with each iteration (we start with 100):

for epoch in range(100):     # start with 100 <--- change this figure to try different iterations
    # compute the mean of the gradients
    grad = vector_mean([linear_gradient(x, y, theta) for x, y in inputs])
    # take a step in that direction
    theta = gradient_step(theta, grad, -learning_rate)
    print(epoch, grad, theta)

slope, intercept = theta

#assert 19.9 < slope < 20.1,  "slope should be about 20"
#assert 4.9 < intercept < 5.1, "intercept should be about 5"
print("slope", slope)
print("intercept", intercept)

Iterative Descent

At 100 iterations, the slope is 18.87 and intercept is 4.87 and the gradient is -32.48 (error for the slope) and -8.45 (error for the intercept). Because both gradients are negative, these numbers suggest that we need to keep increasing the slope and intercept from our random starting point, with the larger adjustment going to the slope.

At 200 iterations, the slope is 19.97 and intercept is 4.86 and the gradient is -1.76 (error for the slope) and -0.48 (error for the intercept). Our errors have been reduced significantly.

At 1000 iterations, the slope is 19.97 (not much difference from 200 iterations) and intercept is 5.09 and the gradients are markedly lower at -0.004 (error for the slope) and 0.02 (error for the intercept). Here the errors may not be much different from zero and we are near our optimal point.

In summary, the normal equation and regression approaches gave us a slope of 20 and intercept of 5. With gradient descent, we approached these values with each successive iteration, with 1000 iterations yielding less error than 100 or 200 iterations.
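The effect of the iteration count can be checked with a compact, self-contained version of the loop. This sketch folds the helper functions into one fit function and seeds the random start so it is repeatable; the seed, the function names and the mse helper are assumptions for illustration:

```python
import random
from typing import List, Tuple

inputs: List[Tuple[int, int]] = [(x, 20 * x + 5) for x in range(2, 6)]

def mse(theta: List[float]) -> float:
    slope, intercept = theta
    return sum((slope * x + intercept - y) ** 2 for x, y in inputs) / len(inputs)

def fit(epochs: int, learning_rate: float = 0.001, seed: int = 0) -> List[float]:
    """Gradient descent from a seeded random start; returns [slope, intercept]."""
    rng = random.Random(seed)   # seed is an assumption, for repeatability
    theta = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for _ in range(epochs):
        grads = [(2 * (theta[0] * x + theta[1] - y) * x,
                  2 * (theta[0] * x + theta[1] - y)) for x, y in inputs]
        mean_grad = [sum(g[i] for g in grads) / len(grads) for i in range(2)]
        theta = [t - learning_rate * g for t, g in zip(theta, mean_grad)]
    return theta

errors_by_epochs = {n: mse(fit(n)) for n in (100, 200, 1000)}
for n, err in errors_by_epochs.items():
    print(n, err)
```

The exact values depend on the seed and learning rate, so they will not match the figures quoted above, but the error should shrink as the number of iterations grows.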

From Scratch

As mentioned above, the functions used to compute the gradients and adjust the slope/intercept build on functions we explored in this post. Here's a visual showing how the functions we used to iteratively arrive at the slope and intercept through gradient descent were built:

Take Away

Gradient descent is an optimization technique often used in machine learning and in this post, we built some intuition around how it works by applying it to a simple linear regression problem, favoring code over math (which we'll return to in a later post). Gradient Descent is useful if you are expecting computational complexity due to the number of features or training instances.

We placed gradient descent in context, in comparison to a more analytical approach, normal equation and the least squares method, both of which are non-iterative.

Furthermore, we saw how the functions used in this post can be traced back to a previous post on linear algebra, giving us a big-picture view of how the building blocks of data science fit together and an intuition for areas we'll need to explore at a deeper, perhaps more mathematical, level.

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus.


Exploring nested data with sunburst plots in R

Paul Apivat — Fri, 18 Dec 2020 13:56:43 +0000

Overview

This is a quick walk through of using the sunburstR package to create sunburst plots in R. The original document is written in RMarkdown, which is an interactive version of markdown.

The following code can be run in RMarkdown or an R script. For interactive visuals, you'll want to use RMarkdown.

Load Libraries

The two main libraries are tidyverse (mostly dplyr so you can just load that if you want) and sunburstR. There are other packages for sunburst plots including: plotly and ggsunburst (of ggplot), but we'll explore sunburstR in this post.

library(tidyverse)
library(sunburstR)

Load Data & Explore

The data is from week 50 of TidyTuesday, exploring the BBC's top 100 influential women of 2020.

The head() function presents the first six rows in a dataframe.

women <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-12-08/women.csv')

head(women)

Add Continents

The original dataset organized 100 women by category, country, role and description. I found that for employing the sunburst plot, I would want to group countries together by continents.

I manually added country names to continent vectors, then added a new column to the women dataframe to conditionally add continent name.

We could then focus on six continents rather than 65 separate countries.

# add continent as character vector
asia <-  c('Afghanistan', 'Bangladesh', 'China', 'Exiled Uighur from Ghulja (in Chinese, Yining)', 'Hong Kong', 'India', 'Indonesia', 'Iran', 'Iraq/UK', 'Japan', 'Kyrgyzstan', 'Lebanon', 'Malaysia', 'Myanmar', 'Nepal', 'Pakistan', 'Singapore', 'South Korea', 'Syria', 'Thailand', 'UAE', 'Vietnam', 'Yemen')

south_america <- c('Argentina', 'Brazil', 'Colombia', 'Ecuador', 'Peru', 'Venezuela')
oceania <- c('Australia')
europe <- c('Belarus', 'Finland', 'France', 'Germany', 'Italy', 'Netherlands', 'Northern Ireland', 'Norway', 'Republic of Ireland', 'Russia', 'Turkey', 'UK', 'Ukraine', 'Wales, UK')
africa <- c('Benin', 'DR Congo', 'Egypt', 'Ethiopia', 'Kenya', 'Morocco', 'Mozambique', 'Nigeria', 'Sierra Leone', 'Somalia', 'Somaliland', 'South Africa', 'Tanzania', 'Uganda', 'Zambia', 'Zimbabwe')
north_america <- c('El Salvador', 'Jamaica', 'Mexico', 'US')

# add new column for continent
women <- women %>%
    mutate(continent = NA) 

# add continents to women dataframe
women$continent <- ifelse(women$country %in% asia, 'Asia', women$continent)
women$continent <- ifelse(women$country %in% south_america, 'South America', women$continent)
women$continent <- ifelse(women$country %in% oceania, 'Oceania', women$continent)
women$continent <- ifelse(women$country %in% europe, 'Europe', women$continent)
women$continent <- ifelse(women$country %in% africa, 'Africa', women$continent)
women$continent <- ifelse(women$country %in% north_america, 'North America', women$continent)

women

Data Wrangling

The key to using the sunburstR package with this specific dataset is the wrangling needed to filter by the continents we created above. We'll also want to strip dashes from the role and name columns with mutate_at, because dashes are reserved as the separator between levels when rendering the sunburst plots.

Below, I've filtered the women data frame into Africa and Asia (the same could be done for North and South America and Europe as well).

The two most important operations here are the creation of the path and V2 columns that will later be parameters for rendering the sunburst plots.


# Filter for Africa
africa_name <- women %>%
    select(continent, category, role, name) %>%
    # remove dash within dplyr pipe
    mutate_at(vars(3, 4), ~ gsub("-", "", .)) %>%   # formula syntax; funs() is deprecated in dplyr
    filter(continent=='Africa') %>%
    mutate(
        path = paste(continent, category, role, name, sep = "-")
    ) %>%
    slice(2:100) %>%
    mutate(
        V2 = 1
    )

# Filter for Asia
asia_name <- women %>%
    select(continent, category, role, name) %>%
    # remove dash within dplyr pipe
    mutate_at(vars(3, 4), ~ gsub("-", "", .)) %>%   # formula syntax; funs() is deprecated in dplyr
    filter(continent=='Asia') %>%
    mutate(
        path = paste(continent, category, role, name, sep = "-")
    ) %>%
    slice(2:100) %>%
    mutate(
        V2 = 1
    )
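To see why the dashes have to go, here is a small Python sketch of how the path string is assembled. The records and the build_path helper are hypothetical; only the field names mirror the women data frame:

```python
# Hypothetical records with the same fields as the `women` data frame
records = [
    {"continent": "Africa", "category": "Leadership",
     "role": "Vice-president", "name": "Jane Doe"},
    {"continent": "Asia", "category": "Knowledge",
     "role": "Scientist", "name": "Mary-Ann Lee"},
]

def build_path(record: dict) -> str:
    """Join the four levels with '-', after stripping dashes from role and name."""
    role = record["role"].replace("-", "")
    name = record["name"].replace("-", "")
    return "-".join([record["continent"], record["category"], role, name])

paths = [build_path(r) for r in records]
for path in paths:
    # every path should contain exactly three separators (four levels)
    print(path, path.count("-"))
```

If a dash were left inside a role or name, the path would split into more than four levels and the rings would no longer line up.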

Sunburst: Africa

Ultimately, I found the information best presented by continent as the base of the sunburst plot, followed by category, specific roles and the names of each of the 100 women honored by the BBC.

Moreover, by presenting the data by continent, you can focus on just five specific colors as you decide on a palette.

I wouldn't recommend trying to pick a color for each role or name; it becomes too unwieldy. Just pick five colors for the two innermost rings of the sunburst plot and it'll shuffle the rest of the colors.

# Africa
sunburst(data = data.frame(xtabs(V2~path, africa_name)), legend = FALSE,
         colors = c("#D99527", "#6F7239", "#CE4B3C", "#C8AC70", "#018A9D"))   # hex codes need the leading '#'

Sunburst: Asia

# Asia
sunburst(data = data.frame(xtabs(V2~path, asia_name)), legend = FALSE,
         colors = c("#e6e0ae", "#dfbc5e", "#ee6146", "#d73c37", "#b51f09"))

Here's what the plot would look like on RMarkdown as you hover over it:

And that's it for visualizing the BBC's top 100 influential women in 2020 with the sunburstR package.


Explore Hypothesis Testing using Python

Paul Apivat — Wed, 16 Dec 2020 12:49:17 +0000

Cover Photo by Nasonov Aleksandr on Unsplash

Overview

This is a continuation of my progress through Data Science from Scratch by Joel Grus. We'll use a classic coin-flipping example in this post because it is simple to illustrate with both concept and code. The goal of this post is to connect the dots between several concepts including the Central Limit Theorem, hypothesis testing, p-Values and confidence intervals, using python to build our intuition.

Central Limit Theorem

Terms like "null" and "alternative" hypothesis are used quite frequently, so let's set some context. The "null" is the default position. The "alternative", alt for short, is something we're comparing to the default (null).

The classic coin-flipping exercise is to test the fairness of a coin. If a coin is fair, it'll land on heads 50% of the time (and tails 50% of the time). Let's translate into hypothesis testing language:

Null Hypothesis: Probability of landing on Heads = 0.5.

Alt Hypothesis: Probability of landing on Heads != 0.5.

Each coin flip is a Bernoulli trial, which is an experiment with two outcomes: outcome 1, "success" (probability p), and outcome 0, "fail" (probability 1 - p). It's a Bernoulli trial because there are only two outcomes with a coin flip (heads or tails). Read more about Bernoulli here.

Here's the code for a single Bernoulli Trial:

import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1-p"""
    return 1 if random.random() < p else 0
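As a quick sanity check, we can flip the simulated coin many times and confirm that the share of heads lands near p. This is a sketch with an arbitrary seed, and the function is repeated so the snippet runs on its own:

```python
import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1-p"""
    return 1 if random.random() < p else 0

random.seed(42)   # arbitrary seed, only so the run is repeatable
flips = [bernoulli_trial(0.5) for _ in range(10_000)]

# the mean of 0/1 outcomes is the observed share of heads
share_of_heads = sum(flips) / len(flips)
print(share_of_heads)
```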

When you sum the independent Bernoulli trials, you get a Binomial(n,p) random variable, a variable whose possible values have a probability distribution. The central limit theorem says that as n, the number of independent Bernoulli trials, gets large, the Binomial distribution approaches a normal distribution.

Here's the code for when you sum all the Bernoulli Trials to get a Binomial random variable:

def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials"""
    return sum(bernoulli_trial(p) for _ in range(n))

Note: A single 'success' in a Bernoulli trial is 'x'. Summing up all those x's into X, is a Binomial random variable. Success doesn't imply desirability, nor does "failure" imply undesirability. They're just terms to count the cases we're looking for (i.e., number of heads in multiple coin flips to assess a coin's fairness).

Given that our null is (p = 0.5) and alt is (p != 0.5), we can run some independent bernoulli trials, then sum them up to get a binomial random variable.

Each bernoulli_trial is an experiment with either 0 or 1 as outcomes. The binomial function sums up n bernoulli(0.5) trials. We ran both twice and got different results. Each Bernoulli experiment can be a success (1) or fail (0); summing them into a binomial random variable means we take the probability p (0.5) that a coin lands on heads and run the experiment 1,000 times to get a binomial random variable.

The first 1,000 flips we got 510. The second 1,000 flips we got 495. We can repeat this process many times to get a distribution. We can plot this distribution to reinforce our understanding. To do this we'll use the binomial_histogram function. This function picks points from a Binomial(n,p) random variable and plots their histogram.

from collections import Counter
import math
import matplotlib.pyplot as plt

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2


def binomial_histogram(p: float, n: int, num_points: int) -> None:
    """Picks points from a Binomial(n, p) and plots their histogram"""
    data = [binomial(n, p) for _ in range(num_points)]
    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))
    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) -
          normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()

# call function   
binomial_histogram(0.5, 1000, 10000)

This plot is then rendered:

What we did was sum up independent bernoulli_trial(s) of 1,000 coin flips, where the probability of heads is p = 0.5, to create a binomial random variable. We then repeated this a large number of times (N = 10,000) and plotted a histogram of the distribution of all binomial random variables. Because we did it so many times, the histogram approximates a normal distribution (a smooth bell-shaped curve) centered at 500.

Just to demonstrate how this works, we can generate several binomial random variables:
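A minimal sketch of that step, with the two functions repeated so the snippet runs on its own; the seed and the choice of five draws are arbitrary:

```python
import random

def bernoulli_trial(p: float) -> int:
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    return sum(bernoulli_trial(p) for _ in range(n))

random.seed(0)    # arbitrary seed so the draws are repeatable

# each draw is the number of heads in 1,000 flips of a fair coin
draws = [binomial(1000, 0.5) for _ in range(5)]
print(draws)
```

Each value is one binomial random variable; they differ from run to run but cluster around 500.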

If we do this 10,000 times, we'll generate the above histogram. You'll notice that because we are testing whether the coin is fair, the probability of heads (success) should be 0.5 and, from 1,000 coin flips, the mean (mu) should be 500.
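
As a minimal sketch of what drawing a few binomial random variables looks like, the block below re-states bernoulli_trial and binomial so it runs on its own. These definitions are assumed to match the behavior described earlier in this post and may differ in detail from the ones actually used; the seed is arbitrary.

```python
import random


def bernoulli_trial(p: float) -> int:
    """Return 1 with probability p and 0 with probability 1 - p."""
    return 1 if random.random() < p else 0


def binomial(n: int, p: float) -> int:
    """Return the number of successes in n independent bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))


random.seed(42)  # arbitrary seed, only so that reruns give the same draws
samples = [binomial(1000, 0.5) for _ in range(5)]
print(samples)   # five head counts, each somewhere near 500
```

Each entry in samples is one "test" of 1,000 flips; the histogram above is what we get from 10,000 such entries.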

We have another function that can help us calculate normal_approximation_to_binomial:

import random
from typing import Tuple
import math


def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """Return mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

# call function
# (500.0, 15.811388300841896)
normal_approximation_to_binomial(1000, 0.5)

When calling the function with our parameters, we get a mean mu of 500 (from 1,000 coin flips) and a standard deviation sigma of 15.8114. This means that about 68% of the time the binomial random variable will fall within 500 +/- 15.8114, and about 95% of the time within 500 +/- 31.6228 (see the 68-95-99.7 rule).
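
The 68% and 95% figures can be checked numerically. This sketch re-states normal_cdf for self-containment and measures the probability within one and two standard deviations of the mean; the variable names within_1 and within_2 are just for illustration.

```python
import math


def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2


mu = 1000 * 0.5
sigma = math.sqrt(1000 * 0.5 * 0.5)

# probability mass within one and two standard deviations of the mean
within_1 = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
within_2 = normal_cdf(mu + 2 * sigma, mu, sigma) - normal_cdf(mu - 2 * sigma, mu, sigma)

print(round(within_1, 4))  # 0.6827
print(round(within_2, 4))  # 0.9545
```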

Hypothesis Testing

Now that we have seen the results of our "coin fairness" experiment plotted as a binomial distribution (approximately normal), we will, for the purpose of testing our hypothesis, be interested in the probability that its realized value (a binomial random variable) lies within or outside a particular interval.

This means we'll be interested in questions like:

What's the probability that the binomial(n,p) is below a threshold?
Above a threshold?
Between an interval?
Outside an interval?

First, the normal_cdf (normal cumulative distribution function), which we learned about in a previous post, gives the probability of a variable being below a certain threshold.

Here, the probability of heads for a 'fair coin' is 0.5 (mu = 500, sigma = 15.8113), and we want to find the probability that X, the number of heads, falls below 490, which comes out to roughly 26%.

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2


normal_probability_below = normal_cdf

# probability that binomial random variable, mu = 500, sigma = 15.8113, is below 490

# 0.26354347477247553
normal_probability_below(490, 500, 15.8113)

On the other hand, normal_probability_above gives the probability that X falls above 490, which is 1 - 0.2635 = 0.7365, or roughly 74%.

def normal_probability_above(lo: float,
                             mu: float = 0,
                             sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is greater than lo."""
    return 1 - normal_cdf(lo, mu, sigma)

# 0.7364565252275245
normal_probability_above(490, 500, 15.8113)

To make sense of this we need to recall the binomial distribution, which is approximated by the normal distribution, but we'll draw a vertical line at 490.

We're asking, given the binomial distribution with mu at 500 and sigma at 15.8113, what is the probability that a binomial random variable falls below the threshold (left of the line); the answer is approximately 26%, and correspondingly the probability of falling above the threshold (right of the line) is approximately 74%.

Between an Interval

We may also wonder what the probability is of a binomial random variable falling between 490 and 520:

Here is the function to calculate this probability and it comes out to approximately 63%. note: Bear in mind the full area under the curve is 1.0 or 100%.

def normal_probability_between(lo: float,
                               hi: float,
                               mu: float = 0,
                               sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# 0.6335061861416337
normal_probability_between(490, 520, 500, 15.8113)

Finally, the area outside of the interval should be 1 - 0.6335 = 0.3665:

def normal_probability_outside(lo: float,
                               hi: float,
                               mu: float = 0,
                               sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is not between lo and hi."""
    return 1 - normal_probability_between(lo, hi, mu, sigma)

# 0.3664938138583663
normal_probability_outside(490, 520, 500, 15.8113)

In addition to the above, we may also be interested in finding (symmetric) intervals around the mean that account for a certain level of likelihood, for example, 60% probability centered around the mean.

For this operation we would use the inverse_normal_cdf:

def inverse_normal_cdf(p: float,
                       mu: float = 0,
                       sigma: float = 1,
                       tolerance: float = 0.00001) -> float:
    """Find approximate inverse using binary search"""
    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z = -10.0     # normal_cdf(-10) is (very close to) 0
    hi_z = 10.0       # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2      # Consider the midpoint
        mid_p = normal_cdf(mid_z)       # and the CDF's value there
        if mid_p < p:
            low_z = mid_z               # Midpoint too low, search above it
        else:
            hi_z = mid_z                # Midpoint too high, search below it
    return mid_z

First we'd have to find the cutoffs where the upper and lower tails each contain 20% of the probability. We calculate normal_upper_bound and normal_lower_bound and use those to calculate the normal_two_sided_bounds.

def normal_upper_bound(probability: float,
                       mu: float = 0,
                       sigma: float = 1) -> float:
    """Returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)


def normal_lower_bound(probability: float,
                       mu: float = 0,
                       sigma: float = 1) -> float:
    """Returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)


def normal_two_sided_bounds(probability: float,
                            mu: float = 0,
                            sigma: float = 1) -> Tuple[float, float]:
    """
    Returns the symmetric (about the mean) bounds
    that contain the specified probability
    """
    tail_probability = (1 - probability) / 2
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound

So if we wanted to know the cutoff points for a 60% probability interval centered on the mean (mu = 500, sigma = 15.8113), they would be 486.69 and 513.31.

Said differently, this means roughly 60% of the time, we can expect the binomial random variable to fall between 486 and 513.

# (486.6927811021805, 513.3072188978196)
normal_two_sided_bounds(0.60, 500, 15.8113)

Significance and Power

Now that we have a handle on the normal approximation to the binomial distribution, thresholds (left and right of the mean), and cut-off points, we want to make a decision about significance. Probably the most important thing about statistical significance is that it is a decision to be made, not a standard that is externally set.

Significance is a decision about how willing we are to make a type 1 error (false positive), which we explored in a previous post. The convention is to set it to a 5% or 1% willingness to make a type 1 error. Suppose we say 5%.

We would say that out of 1,000 coin flips, 95% of the time, we'd get between 469 and 531 heads on a "fair coin" and 5% of the time, outside of this 469-531 range.

# (469.0104394712448, 530.9895605287552)
normal_two_sided_bounds(0.95, 500, 15.8113)

If we recall our hypotheses:

Null Hypothesis: Probability of landing on Heads = 0.5 (fair coin)

Alt Hypothesis: Probability of landing on Heads != 0.5 (biased coin)

Each test consists of 1,000 bernoulli trials, and whenever the number of heads in a test falls outside the range of 469-531, we'll reject the null hypothesis that the coin is fair. If the coin really is fair, we'll be wrong (false positive) 5% of the time. It's a false positive when we incorrectly reject the null hypothesis when it's actually true.
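
A rough simulation can make the false positive rate tangible. This is a sketch under stated assumptions: instead of summing bernoulli_trial calls, it counts the 1-bits in 1,000 random bits (a faster stand-in that only works for a fair coin), and the helper name fair_coin_heads, the seed, and the number of tests are arbitrary choices for illustration.

```python
import random


def fair_coin_heads(n: int, rng: random.Random) -> int:
    """Count heads in n fair flips by counting the 1-bits in n random bits."""
    return bin(rng.getrandbits(n)).count("1")


rng = random.Random(0)   # arbitrary seed
num_tests = 2000

# a test "rejects" when the head count falls outside 469-531
rejections = sum(
    1 for _ in range(num_tests)
    if not 469 <= fair_coin_heads(1000, rng) <= 531
)
false_positive_rate = rejections / num_tests
print(false_positive_rate)  # expected to land in the neighborhood of 0.05
```

Because the coin in this simulation is fair by construction, every rejection counted here is a type 1 error.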

We also want to avoid making a type-2 error (false negative), where we fail to reject the null hypothesis, when it's actually false.

Note: It's important to keep in mind that terms like significance and power are used to describe tests, in our case, the test of whether a coin is fair or not. Each test is the sum of 1,000 independent bernoulli trials.

For a "test" with a 5% significance level, we expect that out of 1,000 flips of a fair coin, it'll land on heads between 469 and 531 times 95% of the time, and in those cases we won't reject the idea that the coin is fair. The other 5% of the time it lands outside of this range, and we'll determine the coin to be "unfair", but we'll be wrong because it actually is fair.

To calculate the power of the test, we'll take the assumed mu and sigma with a 95% bounds (based on the assumption that the probability of the coin landing on heads is 0.5 or 50% - a fair coin). We'll determine the lower and upper bounds:

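# mu and sigma under the null hypothesis (fair coin, p = 0.5)
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
lo # 469.01026640487555
hi # 530.9897335951244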

A type-2 error happens when the coin is actually biased, so we should reject the null, but we fail to. Let's suppose the actual probability that the coin lands on heads is 55% (biased towards heads):

mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
mu_1    # 550.0
sigma_1 # 15.732132722552274

Using the same range 469 - 531, where the coin is assumed 'fair' with mu at 500 and sigma at 15.8113:

If the coin, in fact, had a bias towards heads (p = 0.55), the distribution would shift right, but if our test at the 5% significance level remains the same, we get:

The probability of making a type-2 error is 11.345%. This is the probability that the number of heads lands within the previous interval 469-531, leading us to fail to reject the null hypothesis (that the coin is fair) when, in actuality, the distribution has shifted because the coin has a bias towards heads.

# 0.11345199870463285
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)

The other way to arrive at this is to find the probability, under the new mu and sigma (new distribution), that X (number of successes) will fall below 531.

# 0.11357762975476304
normal_probability_below(531, mu_1, sigma_1)

So the probability of making a type-2 error or the probability that the new distribution falls below 531 is approximately 11.3%.

The power of the test, the probability of correctly rejecting the null when it is false, is 1.00 minus the probability of a type-2 error (1 - 0.113 = 0.887), or 88.7%.

power = 1 - type_2_probability # 0.8865480012953671

Finally, we may be interested in increasing the power of the test, that is, lowering the probability of a type-2 error. Instead of using the normal_two_sided_bounds function to find the cut-off points (i.e., 469 and 531), we could use a one-sided test that rejects the null hypothesis ('fair coin') when X (number of heads) is much larger than 500.

Here's the code, using normal_upper_bound:

# 526.0073585242053
hi = normal_upper_bound(0.95, mu_0, sigma_0)

This means shifting the upper bound from 531 to 526, which puts all 5% of the rejection probability in the upper tail. As a result, the probability of a type-2 error goes down from 11.3% to 6.3%.

# previous probability of type-2 error
# 0.11357762975476304
normal_probability_below(531, mu_1, sigma_1)


# new probability of type-2 error
# 0.06356221447122662
normal_probability_below(526, mu_1, sigma_1)

And the new (stronger) power of the test is 1.0 - 0.064 = 0.936 or 93.6% (up from 88.7% above).
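
As a cross-check on the two power figures, here is a sketch that swaps the from-scratch functions for statistics.NormalDist from Python's standard library (3.8+). The variable names are illustrative, and the cut-offs are not rounded to 531 and 526, so the results can differ slightly in the later decimal places.

```python
import math
from statistics import NormalDist

n, p_null, p_alt = 1000, 0.5, 0.55

# normal approximations under the null (fair) and alternative (biased) coin
null = NormalDist(n * p_null, math.sqrt(n * p_null * (1 - p_null)))
alt = NormalDist(n * p_alt, math.sqrt(n * p_alt * (1 - p_alt)))

# two-sided 5% test: reject when X falls outside (lo, hi)
lo, hi = null.inv_cdf(0.025), null.inv_cdf(0.975)
power_two_sided = 1 - (alt.cdf(hi) - alt.cdf(lo))

# one-sided 5% test: reject only when X is above the 95th percentile
cutoff = null.inv_cdf(0.95)
power_one_sided = 1 - alt.cdf(cutoff)

print(round(power_two_sided, 3))  # 0.887
print(round(power_one_sided, 3))  # 0.936
```

The one-sided test gains power here only because the assumed bias is towards heads; it would have almost no power against a coin biased towards tails.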

p-Values

p-Values represent another way of deciding whether or not to reject the Null Hypothesis. Instead of choosing bounds, thresholds or cut-off points, we could compute the probability, assuming the Null Hypothesis is true, that we would see a value at least as extreme as the one we just observed.

Here is the code:

def two_sided_p_values(x: float, mu: float = 0, sigma: float = 1) -> float:
    """
    How likely are we to see a value at least as extreme as x (in either
    direction) if our values are from an N(mu, sigma)?
    """
    if x >= mu:
        # x is greater than the mean, so the tail is everything greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # x is less than the mean, so the tail is everything less than x
        return 2 * normal_probability_below(x, mu, sigma)

Assuming we have a "fair coin" (mu = 500, sigma = 15.8113), what is the probability of seeing a value at least as extreme as 530? (note: We use 529.5 instead of 530 below due to the continuity correction)

Answer: approximately 6.2%

# 0.06207721579598835
two_sided_p_values(529.5, mu_0, sigma_0)
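
To see why the continuity correction helps, this sketch compares the normal approximation, with and without the 0.5 adjustment, against the exact two-sided binomial tail computed with math.comb (Python 3.8+). The variable names are illustrative, and normal_cdf is re-stated so the block runs on its own.

```python
import math


def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2


n = 1000
mu, sigma = n * 0.5, math.sqrt(n * 0.5 * 0.5)

# exact P(X >= 530) + P(X <= 470) for Binomial(1000, 0.5);
# the two tails are equal because the fair-coin distribution is symmetric
upper_tail = sum(math.comb(n, k) for k in range(530, n + 1))
exact = 2 * upper_tail / 2 ** n

approx_plain = 2 * (1 - normal_cdf(530, mu, sigma))        # no correction
approx_corrected = 2 * (1 - normal_cdf(529.5, mu, sigma))  # with correction

print(exact, approx_plain, approx_corrected)
```

The corrected value should sit closer to the exact tail than the uncorrected one, because 529.5 is the boundary between the discrete outcomes 529 and 530.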

The p-value, 6.2%, is higher than our (hypothetical) 5% significance level, so we don't reject the null. On the other hand, if X were slightly more extreme, say 532, the p-value would be approximately 4.3%, which is below the 5% significance level, so we would reject the null.

# 0.04298479507085862
two_sided_p_values(532, mu_0, sigma_0)

For one-sided tests, we would use the normal_probability_above and normal_probability_below functions created above:

upper_p_value = normal_probability_above
lower_p_value = normal_probability_below

Under the two_sided_p_values test, the extreme value of 529.5 had a p-value of 6.2%, which is not low enough to reject the null hypothesis.

However, with a one-sided test, upper_p_value for the same threshold is now 3.1% and we would reject the null hypothesis.

# 0.031038607897994175
upper_p_value(529.5, mu_0, sigma_0)

Confidence Intervals

A third approach to deciding whether or not to reject the null is to use confidence intervals. We'll use the same 530 heads as we did in the p-Values example.

p_hat = 530/1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000) # 0.015782902141241326

# (0.4990660982192851, 0.560933901780715)
normal_two_sided_bounds(0.95, mu, sigma)

The 95% confidence interval for a coin landing on heads 530 times (out of 1,000) is (0.4991, 0.5609). Since this interval contains p = 0.5 (the probability of heads for a fair coin), we do not reject the null.

If the extreme value were more extreme at 540, we would arrive at a different conclusion:

p_hat = 540/1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)

# (0.5091095927295919, 0.5708904072704082)
normal_two_sided_bounds(0.95, mu, sigma)

Here we would be 95% confident that the true probability of heads lies between 0.5091 and 0.5709; this interval does not contain 0.500 (albeit by a slim margin), so we reject the null hypothesis that this is a fair coin.

note: Confidence statements are about the interval, not about the parameter p. We interpret the confidence interval as: if you were to repeat the experiment many times, about 95% of the resulting intervals would contain the "true" parameter, in our example p = 0.5.
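
The "repeat the experiment many times" reading can be sketched as a simulation. The assumptions here are mine: a truly fair coin, 2,000 repeated experiments, an arbitrary seed, a hypothetical helper confidence_interval, and the bit-counting shortcut for fair flips in place of summing bernoulli trials.

```python
import math
import random
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)   # about 1.96


def confidence_interval(heads: int, n: int):
    """95% interval for p based on the observed share of heads."""
    p_hat = heads / n
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - Z_95 * sigma, p_hat + Z_95 * sigma


rng = random.Random(1)   # arbitrary seed
n, true_p, num_experiments = 1000, 0.5, 2000

covered = 0
for _ in range(num_experiments):
    heads = bin(rng.getrandbits(n)).count("1")   # n fair coin flips
    lo, hi = confidence_interval(heads, n)
    if lo <= true_p <= hi:
        covered += 1

coverage = covered / num_experiments
print(confidence_interval(530, 1000))  # roughly (0.4991, 0.5609)
print(coverage)                        # share of intervals containing 0.5
```

The coverage should come out close to 95%, typically a little under, since head counts are discrete and the interval is itself based on a normal approximation.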

Connecting the Dots

We used several Python functions to build intuition around statistical hypothesis testing. To highlight this "from scratch" aspect of the book, here is a diagram tying together the various Python functions used in this post:

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

Permutation in Python

Paul Apivat — Thu, 10 Dec 2020 08:57:57 +0000

Overview

Itertools is a core set of fast, memory-efficient tools in Python's standard library for creating iterators for efficient looping (read the documentation here).

Itertools Permutations

One (of many) uses for itertools is to create a permutations() function that will return all possible orderings (permutations) of the items in a list.

I was working on a project that involved user funnels with different stages and we were wondering how many different "paths" a user could take, so this was naturally a good fit for using permutations.

Sample Funnel

In our hypothetical example, we're looking at a funnel with three stages for a total of 6 permutations. Here is the formula:
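
The number of ways to arrange r items chosen from n is n! / (n - r)!, which reduces to n! when r equals n. A small sketch of that count follows; count_permutations is a hypothetical helper name, not something from itertools.

```python
import math


def count_permutations(n: int, r: int = None) -> int:
    """Number of ordered arrangements of r items chosen from n."""
    r = n if r is None else r
    if r > n:
        return 0   # no arrangement uses more items than exist
    return math.factorial(n) // math.factorial(n - r)


print(count_permutations(3))     # 6   (three-stage funnel)
print(count_permutations(4))     # 24
print(count_permutations(4, 2))  # 12
```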

If you're using a sales/marketing funnel, you'll have in mind what your funnel would look like so you may not want all possible paths, but if you're interested in exploring potentially overlooked paths, read on.

Here's the python documentation for itertools, and permutations specifically. We'll break down the code to better understand what's going on in this function.

note: I found a clearer alternative after the fact. Feel free to skip to the final section below, although there is value in comparing the two versions.

We'll start off with the iterable, which is a list of three strings. The permutations function takes two parameters: the iterable, and r, which is the number of items from the list to arrange in each permutation. If we have three items in the list, we generally want to find all possible orderings of those three items.

Here is the code, and subsequent breakdown:

# list of length 3
list1 = ['stage 1', 'stage 2', 'stage 3']

# iterable is the list
# r = number of items from the list to find combinations of


def permutations(iterable, r=None):
    """Find all possible orderings of a list of elements"""
    # permutations('ABCD',2)--> AB AC AD BA BC BD CA CB CD DA DB DC
    # permutations(range(3))--> 012 021 102 120 201 210
    # permutations(range(6))--> ...720 permutations
    pool = tuple(iterable)
    n = len(pool)
    r = n if r is None else r
    if r > n:
        return
    indices = list(range(n))                     # [0, 1, 2]
    cycles = list(range(n, n-r, -1))             # [3, 2, 1]
    yield tuple(pool[i] for i in indices[:r])
    while n:
        for i in reversed(range(r)):
            cycles[i] -= 1
            if cycles[i] == 0:
                indices[i:] = indices[i+1:] + indices[i:i+1]
                cycles[i] = n - i
            else:
                j = cycles[i]
                indices[i], indices[-j] = indices[-j], indices[i]
                yield tuple(pool[i] for i in indices[:r])
                break
        else:
            return


# permutations(list1, 6) would yield nothing, because r > n

perm = permutations(list1, 3)
count = 0

for p in perm:
    count += 1
    print(p)
print("there are:", count, "permutations.")

The first thing we do is take the iterable input parameter and turn it from a list into a tuple.

pool = tuple(iterable)

There are a couple of reasons to do this. First, tuples are generally a little faster than lists to create and iterate over, and the permutations() function indexes into the input many times. Second, because tuples are immutable, we can perform many operations without fear that we might inadvertently change the original list.

We then create n from the length of pool (in our case it's 3). The additional r parameter defaults to None, in which case it is set to n, so it is also 3, as we're interested in seeing all orderings of a list of three elements.

We also have a guard for the case where r is greater than the number of elements in the iterable (list): the generator returns immediately without yielding anything.

if r > n:
    return

Next, we create indices and cycles. Indices are basically the index of each item, starting with 0 to 2, for three items. Cycles uses range(n, n-r, -1), which in our case is range(3, 3-3, -1); this means start at three and end at zero, in -1 steps.

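The next chunk of code is a while-loop. Since n is non-zero, while n keeps looping until the function returns. The break at the bottom exits only the inner for-loop after each new permutation is yielded; the for-loop's else clause, which runs when the for-loop finishes without hitting break, is what ends the generator.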

After each pass through the if-else, a new arrangement of indices is created, which is then used to index into pool (the iterable input, converted to a tuple), changing the order in which the elements are yielded.

You'll note in the commented code above, cycles starts off at [3,2,1] and indices starts off at [0,1,2]. Each loop through the code changes the indices, where the slice indices[i:] covers one, then two, then all three positions as i counts down from 2 to 0. Meanwhile cycles counts down until it reaches [1,1,1], at which point the next pass finishes the for-loop without a break and the function returns.

while n:
        for i in reversed(range(r)):
            cycles[i] -= 1
            if cycles[i] == 0:
                indices[i:] = indices[i+1:] + indices[i:i+1]
                cycles[i] = n - i
            else:
                j = cycles[i]
                indices[i], indices[-j] = indices[-j], indices[i]
                yield tuple(pool[i] for i in indices[:r])
                break
        else:
            return

The permutations(iterable, r) function actually creates a generator so we need to loop through it again to print out all the permutations of the list.

<generator object permutations at 0x7fe19400fdd0>

We add another for-loop at the bottom to print out all the permutations:

perm = permutations(list1, 3)
count = 0

for p in perm:
    count += 1
    print(p)
print("there are:", count, "permutations.")

Here is our result:
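
The rendered output is not reproduced in this text. As a stand-in, this sketch uses the standard library's itertools.permutations, whose documentation gives the hand-written function above as a rough equivalent, so the ordering of the results should match.

```python
from itertools import permutations

list1 = ['stage 1', 'stage 2', 'stage 3']

paths = list(permutations(list1, 3))
for path in paths:
    print(path)
print("there are:", len(paths), "permutations.")

# ('stage 1', 'stage 2', 'stage 3')
# ('stage 1', 'stage 3', 'stage 2')
# ('stage 2', 'stage 1', 'stage 3')
# ('stage 2', 'stage 3', 'stage 1')
# ('stage 3', 'stage 1', 'stage 2')
# ('stage 3', 'stage 2', 'stage 1')
# there are: 6 permutations.
```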

A Clearer Alternative: Permutation Using Recursion

As is often the case, I found a better way in retrospect, from this Stack Overflow answer (h/t to Eric O Lebigot):

def all_perms(elements):
    if len(elements) <= 1:
        yield elements  # Only permutation possible = no permutation
    else:
        # Iteration over the first element in the result permutation:
        for (index, first_elmt) in enumerate(elements):
            other_elmts = elements[:index] + elements[index+1:]
            for permutation in all_perms(other_elmts):
                yield [first_elmt] + permutation

The enumerate built-in function obviates the need to separately create cycles and indices. The local variable other_elmts separates the other elements in the list from the first_elmt, then the second for-loop recursively finds the permutation of the other elements before adding with the first_elmt on the final line, yielding all possible permutations of a list. As with the previous case, the result of this function is a generator which requires looping through and printing the permutations.
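
A short usage sketch for the recursive version follows, with all_perms re-stated so the block runs on its own. The stages list is the same hypothetical funnel as before. Note that this version expects a list: the final line concatenates lists, so passing a tuple would raise a TypeError.

```python
def all_perms(elements):
    if len(elements) <= 1:
        yield elements  # a list of 0 or 1 items has only one ordering
    else:
        # pick each element in turn as the first item of the permutation
        for (index, first_elmt) in enumerate(elements):
            other_elmts = elements[:index] + elements[index+1:]
            for permutation in all_perms(other_elmts):
                yield [first_elmt] + permutation


stages = ['stage 1', 'stage 2', 'stage 3']
results = list(all_perms(stages))   # exhaust the generator
for path in results:
    print(path)
print("there are:", len(results), "permutations.")   # 6
```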

I found this much easier to digest than the documentation version.

Permutations can be useful when you have varied user journeys through your product and you want to figure out all the possible paths. With this short python script, you can easily print out all options for consideration.

Take Aways

From the perspective of a user funnel, permutations allow us to explore all possible paths a user might take. For our hypothetical example, a three-step funnel yields six possible paths a user could navigate from start to finish.

Knowing permutations should also give us pause when deciding whether to add another "step" to a funnel. Going from a three-step funnel to a four-step funnel increases the number of possible paths from six to 24 - a fourfold increase.

Not only does this increase friction between your user and the 'end goal' (conversion), whatever that may be for your product, but it also increases complexity (and potentially confusion) in the user experience.

Probability Distributions with Python: Discrete & Continuous

Paul Apivat — Tue, 08 Dec 2020 03:15:48 +0000

Context

There are several posts that could serve as context (as needed) for the concepts discussed in this post, including these posts on:

Distributions

In this post, we'll cover probability distributions. This is a broad topic so we'll sample a few concepts to get a feel for it. Borrowing from the previous post, we'll chart our medical diagnostic outcomes.

You'll recall that each outcome is the combination of whether someone has a disease, P(D), or not, P(not D). Then, they're given a diagnostic test that returns positive, P(P) or negative, P(not P).

These are discrete outcomes so they can be represented with the probability mass function, as opposed to a probability density function, which represents a continuous distribution.

Let's take another hypothetical scenario of a city where 1 in 10 people have a disease and a diagnostic test has a True Positive rate of 95% and a True Negative rate of 90%. In the simulation below, the probability that a test-positive person actually has the disease comes out to 46.50%.

Here's the code:

from random import random, seed

seed(0)
pop = 1000  # 1000 people
counts = {}
for i in range(pop):
    has_disease = i % 10 == 0  # one in 10 people have disease
    # assuming that every person gets tested regardless of any symptoms
    if has_disease:
        tests_positive = True       # True Positive  95%
        if random() < 0.05:
            tests_positive = False  # False Negative 5%
    else:
        tests_positive = False      # True Negative  90%
        if random() < 0.1:
            tests_positive = True   # False Positive 10%
    outcome = (has_disease, tests_positive)
    counts[outcome] = counts.get(outcome, 0) + 1

for (has_disease, tested_positive), n in counts.items():
    print('Has Disease: %6s, Test Positive: %6s, count: %d' %
          (has_disease, tested_positive, n))

n_positive = counts[(True, True)] + counts[(False, True)]
print('Number of people who tested positive:', n_positive)
print('Probability that a test-positive person actually has disease: %.2f' %
      (100.0 * counts[(True, True)] / n_positive),)

The probability that someone has the disease (1 in 10) is also called the 'prior' in Bayesian terms. Given this prior, we modeled four scenarios where people were given a diagnostic test. Again, the big assumption here is that people get tested at random, regardless of symptoms. With the true positive and true negative rates stated above, here are the outcomes:
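
It is worth comparing the simulated 46.50% with the value Bayes' Theorem gives for the same rates, since a simulation of only 1,000 people is noisy. The sketch below is a restructured version of the loop above, not the original code: simulate_posterior is a hypothetical helper, and the seed and population sizes are arbitrary.

```python
import random


def simulate_posterior(pop: int, rng: random.Random) -> float:
    """Estimate P(disease | positive test) by simulating pop people."""
    true_pos = false_pos = 0
    for i in range(pop):
        has_disease = i % 10 == 0          # prior: 1 in 10 have the disease
        if has_disease:
            if rng.random() >= 0.05:       # 95% true positive rate
                true_pos += 1
        elif rng.random() < 0.10:          # 10% false positive rate
            false_pos += 1
    return true_pos / (true_pos + false_pos)


# exact value from Bayes' Theorem: P(P|D)P(D) / [P(P|D)P(D) + P(P|not D)P(not D)]
exact = (0.95 * 0.1) / (0.95 * 0.1 + 0.10 * 0.9)
print(round(exact, 4))   # 0.5135

rng = random.Random(0)
for pop in (1_000, 100_000):
    print(pop, round(simulate_posterior(pop, rng), 4))
```

With a larger population the simulated estimate should settle near the exact 51.35%; the 46.50% above is one noisy draw around that value.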

Probability Mass Function

Given these discrete events, we can chart a probability mass function, also known as discrete density function. We'll import pandas to help us create DataFrames and matplotlib to chart the probability mass function.

We first need to turn the counts of events into a DataFrame and change the column to item_counts. Then, we'll calculate the probability of each event by dividing the count by the total number of people in our hypothetical city (i.e., population: 1000).

Optional: Create another column with abbreviations for test outcome (i.e., "True True" becomes "TT"). We'll call this column item2.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame.from_dict(counts, orient='index')
df = df.rename(columns={0: 'item_counts'})
df['probability'] = df['item_counts']/1000
df['item2'] = ['TT', 'FF', 'FT', 'TF']

Here is the DataFrame we have so far:

You'll note that the numbers in the probability column add up to 1.0 and that the item_counts numbers are the same as the counts above, from when we calculated the probability of a test-positive person actually having the disease.

We'll use a simple bar chart to chart out the diagnostic probabilities, and this is how we'd visually represent the probability mass function: the probability of each discrete event. Each 'discrete event' here is a joint outcome (e.g., someone has the disease and tests positive - TT, or someone doesn't have the disease and tests negative - FF, and so on).

Here's the code:

df = pd.DataFrame.from_dict(counts, orient='index')
df = df.rename(columns={0: 'item_counts'})
df['probability'] = df['item_counts']/1000
df['item2'] = ['TT', 'FF', 'FT', 'TF']
plt.bar(df['item2'], df['probability'])
plt.title("Probability Mass Function")
plt.show()

Cumulative Distribution Function

While the probability mass function can tell us the probability of each discrete event (i.e., TT, FF, FT, and TF) we can also represent the same information as a cumulative distribution function which allows us to see how the probability changes as we add events together.

The cumulative distribution function simply adds the probability from the previous row in a DataFrame in a cumulative fashion, like in the column probability2:

We use the cumsum() function to create the cumsum column which is simply adding the item_counts, with each successive row. When we create the corresponding probability column, probability2, it gets larger until we reach 1.0.
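
The running total can be sketched without pandas. This block swaps in itertools.accumulate for the cumsum() call (the pandas equivalent would be df['item_counts'].cumsum()), and the counts are the expected values for 1,000 people under the stated rates, not the simulated counts from the post.

```python
from itertools import accumulate

# expected counts under the stated rates (illustrative, not the simulated output):
# 100 with disease -> 95 TT, 5 TF; 900 without -> 810 FF, 90 FT
item_counts = {'TT': 95, 'FF': 810, 'FT': 90, 'TF': 5}

cumsum = list(accumulate(item_counts.values()))   # running total of counts
probability2 = [c / 1000 for c in cumsum]         # running total as a share

for label, c, p in zip(item_counts, cumsum, probability2):
    print(label, c, p)

# TT 95 0.095
# FF 905 0.905
# FT 995 0.995
# TF 1000 1.0
```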

Here's the chart:

This chart tells us that the probability of getting either TT or FF (True, True = True Positive, and False, False = True Negative) is 88.6%, which indicates that 11.4% (100 - 88.6) of the time, the diagnostic test will let us down.

Normal Distribution

More often than not, you'll be interested in continuous distributions, where you can better see how the cumulative distribution function works.

You're probably familiar with the bell-shaped curve or the normal distribution, defined solely by its mean (mu) and standard deviation (sigma). For the standard normal distribution, the mean is 0 and the standard deviation is 1.

Code:

import math
SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (SQRT_TWO_PI * sigma))

# plot
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_pdf(x, sigma=1) for x in xs], '-', label='mu=0, sigma=1')
plt.show()

With the standard normal distribution curve, you see the density peaks at around 0.4 at the mean. But if you add up the area under the curve (i.e., all probabilities of every possible outcome), you would get 1.0, just like with the medical diagnostic example.
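
Both numbers can be checked with a few lines. This sketch re-states normal_pdf and approximates the area with a simple Riemann sum over [-5, 5]; the step size and range are arbitrary choices that happen to be fine enough here.

```python
import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)


def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (SQRT_TWO_PI * sigma))


peak = normal_pdf(0)   # height of the curve at the mean

# approximate the area under the curve by summing thin rectangles
step = 0.01
xs = [-5 + i * step for i in range(1001)]   # -5.00, -4.99, ..., 5.00
area = sum(normal_pdf(x) * step for x in xs)

print(round(peak, 4))  # 0.3989
print(round(area, 4))  # 1.0
```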

And if you split the bell in half, then flip over the left half, you'll (visually) get the cumulative distribution function:

Code:

import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

# plot
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
plt.show()

In both cases the total probability is 1.0: the area under the curve of the standard normal distribution is 1.0, and the cumulative distribution function climbs toward 1.0 as x increases, so summing the probabilities of all events gives one.

This post is part of my ongoing progress through Data Science from Scratch by Joel Grus:

Bayes Application & Code: A Medical Diagnostic Scenario

Paul Apivat — Wed, 02 Dec 2020 07:00:45 +0000

Applying Bayes Theorem

note: This article presents a hypothetical situation and is not intended as medical advice.

Now that we have a basic understanding of Bayes' Theorem (please refer to these posts on conditional probability and Bayes' Theorem for context), let's extend the application to a slightly more complex example. This section was inspired by this tweet from Grant Sanderson (of 3Blue1Brown fame):

This is a classic application of Bayes Theorem - the medical diagnostic scenario. The above tweet can be re-stated:

What is the probability of you actually having the disease, given that you tested positive?

This happens to be even more relevant as we're living through a generational pandemic.

Let's start off with a conceptual understanding, using the tools we learned previously. First, we have to keep in mind testing and actually having the disease are not independent events. Therefore, we will use conditional probability to express their joint outcomes.

The intuitive visual to illustrate this is the tree diagram:

The initial information provided is as follows:

P(D): Probability of having the disease (covid-19)
P(P): Probability of testing positive
P(D|P): Our objective is to find the probability of having the disease, given a positive test
1 in 1,000 actively have covid-19, P(D), this implies...
999 in 1,000 do not actively have covid-19, P(not D)
1% or 0.01 false positive (given)
10% or 0.1 false negative (given)

The false positive is when you don't have the disease, but your test (in error) shows up positive. False negative is when you have the disease, but your test (in error) shows up negative. We are provided this information and have to calculate other values to fill in the tree.

We know that all possible events have to add up to 1, so if 1 in 1,000 actively have the disease, we know that 999 in 1,000 do not have it. If the false negative is 10%, then the true positive is 90%. If the false positive is 1%, then the true negative is 99%. From our calculations, the tree can be updated:
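The filled-in tree can be sketched in a few lines of Python. This is only an illustration, not part of the original thread; the variable names are my own, and the numbers are the ones given above. Each leaf of the tree is a joint probability, obtained by multiplying along the branches.

```python
# Given values from the scenario above
p_d = 0.001             # P(D): 1 in 1,000 have the disease
false_positive = 0.01   # P(P|not D)
false_negative = 0.10   # P(not P|D)

# The complements fill in the rest of the tree
p_not_d = 1 - p_d                    # P(not D) = 0.999
true_positive = 1 - false_negative   # P(P|D) = 0.9
true_negative = 1 - false_positive   # P(not P|not D) = 0.99

# Each leaf is a joint probability: first branch * second branch
leaves = {
    "disease, tests positive": p_d * true_positive,
    "disease, tests negative": p_d * false_negative,
    "no disease, tests positive": p_not_d * false_positive,
    "no disease, tests negative": p_not_d * true_negative,
}

for label, p in leaves.items():
    print(f"{label:28s} {p:.5f}")

# The four leaves cover every possible outcome, so they sum to 1
print("total:", round(sum(leaves.values()), 10))
```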

Now that we've filled out the tree, we can use Bayes' Theorem to find P(D|P). Below are Bayes' Theorem as discussed in the previous section, its denominator (the probability of testing positive, P(P)), and the second version of Bayes' Theorem for cases where we do not know the probability of testing positive (as in the present case):

Then we can plug in the denominator to get the alternative version of Bayes' Theorem:

Here's how the numbers add up:

P(D|P) = P(P|D) * P(D) / ( P(P|D) * P(D) + P(P|not D) * P(not D) )
P(D|P) = (0.9 * 0.001) / (0.9 * 0.001 + 0.01 * 0.999)
P(D|P) = 0.0009 / (0.0009 + 0.00999)
P(D|P) = 0.0009 / 0.01089
P(D|P) ~ 0.08264 or 8.26%
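The same arithmetic can be checked in code. This is a minimal sketch; the function name posterior and its parameter names are my own choices, not something from the tweet thread.

```python
def posterior(prior, true_positive, false_positive):
    """P(D|P) using Bayes' Theorem with the expanded denominator."""
    # P(P) = P(P|D) * P(D) + P(P|not D) * P(not D)
    p_positive = true_positive * prior + false_positive * (1 - prior)
    return true_positive * prior / p_positive

p_d_given_p = posterior(prior=0.001, true_positive=0.9, false_positive=0.01)
print(f"P(D|P) = {p_d_given_p:.5f}")  # 0.08264
```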

Interestingly, Andrej Karpathy actually responded in the thread and provided an intuitive way to arrive at the same result using Python.

Here's his code (with added comments):

from random import random, seed
seed(0)

pop = 10000000 # 10M people
counts = {}

for i in range(pop):
    has_covid = i % 1000 == 0 # one in 1,000 people have covid (priors or prevalence of disease)
    # The major assumption is that every person gets tested regardless of any symptoms
    if has_covid:                  # Has disease
        tests_positive = True      # True positive
        if random() < 0.1:     
            tests_positive = False # False negative
    else:                          # Does not have disease
        tests_positive = False     # True negative
        if random() < 0.01:    
            tests_positive = True  # False positive
    outcome = (has_covid, tests_positive)
    counts[outcome] = counts.get(outcome, 0) + 1

for (has_covid, tested_positive), n in counts.items():
    print('has covid: %6s, tests positive: %6s, count: %d' % (has_covid, tested_positive, n))

n_positive = counts[(True, True)] + counts[(False, True)]

print('number of people who tested positive:', n_positive)
print('probability that a test-positive person actually has covid: %.2f' % (100.0 * counts[(True, True)] / n_positive))

We first build a hypothetical population of 10 million. If the prior of disease is 1 in 1,000, a population of 10 million should contain 10,000 people with covid. You can see how this works with this short snippet:

pop = 10000000
counts = 0

for i in range(pop):
    has_covid = i % 1000 == 0
    if has_covid:
        counts = counts + 1
print(counts, "people have the disease in a population of 10 million")

Nested in the for-loop are if-statements that sort each member of the population (10M) into one of four categories: True Positive, False Negative, True Negative or False Positive. Each category is counted and stored in a dict called counts. Then another for-loop goes through this dictionary to print out all the categories:

has covid:   True, tests positive:   True, count: 9033
has covid:  False, tests positive:  False, count: 9890133
has covid:  False, tests positive:   True, count: 99867
has covid:   True, tests positive:  False, count: 967

number of people who tested positive: 108900
probability that a test-positive person actually has covid: 8.29

Finally, we want the number of people who have the disease and tested positive (True Positive, 9,033) divided by the number of people who tested positive, regardless of whether they actually have the disease (True Positive (9,033) + False Positive (99,867) = 108,900). This comes out to approximately 8.29%.

Although the code was billed as "simple code to build intuition", I found that Bayes' Theorem is the intuition.

What about symptoms?

The key to Bayes' Theorem is that it encourages us to update our beliefs when presented with new evidence. But what if there's evidence we missed in the first place?

If you look back at the original tweet, there are important details about symptoms that, if we wanted to be more realistic, should be accounted for.

You feel fatigued and have a slight sore throat.

Here, instead of assuming that the prevalence of the disease (1 in 1,000 people have covid-19) is the prior, we might ask: what is the probability that someone who is symptomatic has the disease?

Let's suppose we change from 1 in 1,000 to 1 in 100. We could change just one line of code (while everything else remains the same):

for i in range(pop):
    has_covid = i % 100 == 0 # update info: 1/1000 have covid, but 1/100 with symptoms have covid

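The probability that someone with a positive test actually has the disease jumps from 8.29% to 47.61%. (The counts below sum to 20 million, so this run appears to have used a population of 20 million rather than 10 million; apart from sampling noise, the percentage does not depend on the population size.)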

has covid:   True, tests positive:   True, count: 180224
has covid:  False, tests positive:  False, count: 19601715
has covid:  False, tests positive:   True, count: 198285
has covid:   True, tests positive:  False, count: 19776
number of people who tested positive: 378509
probability that a test-positive person with symptoms actually has covid: 47.61

Thus, being symptomatic means our priors should be adjusted and our beliefs about the likelihood that a positive test means we have the disease (P(D|P)) should be updated accordingly (in this case, it goes way up).
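The effect of changing the prior can also be checked analytically, without running the simulation. This is a rough sketch; the helper name p_disease_given_positive is my own, and the 1 in 100 figure is the supposition made above, not a measured value.

```python
def p_disease_given_positive(prior, true_positive=0.9, false_positive=0.01):
    """P(D|P) for a given prior, using the test characteristics from this post."""
    p_positive = true_positive * prior + false_positive * (1 - prior)
    return true_positive * prior / p_positive

priors = {
    "1 in 1,000 (general prevalence)": 1 / 1000,
    "1 in 100 (supposed for symptomatic people)": 1 / 100,
}

results = {label: p_disease_given_positive(prior) for label, prior in priors.items()}

for label, p in results.items():
    print(f"{label}: {100 * p:.2f}%")
```

The exact values differ slightly from the simulated percentages because the simulation draws random numbers, while the formula gives the expected value.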

Take Aways

Hypothetically, if we have family or friends living in an area where 1 in 1,000 people have covid-19 and they (god forbid) got tested and got a positive result, we could tell them that their probability of actually having the disease, given a positive test, is around 8.26–8.29%.

However, what’s useful about the Bayesian approach is that it encourages us to incorporate new information and update our beliefs accordingly. So if we find out our family or friend is also symptomatic, we could advise them of the higher probability (~47.61%).

Finally, we may also advise our family or friends to get tested again, because as much as a test-positive person would hope they got a ‘false positive’, a false positive is uncommon for any single test (1% here), and getting a false positive twice is rarer still.
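To see what a second positive test would do, we can feed the first posterior back in as the new prior. This sketch rests on an assumption that is not in the original scenario: that the errors of the two tests are independent of each other. The helper name update is my own.

```python
def update(prior, true_positive=0.9, false_positive=0.01):
    """One Bayesian update after a positive test result."""
    evidence = true_positive * prior + false_positive * (1 - prior)
    return true_positive * prior / evidence

prior = 0.001                        # 1 in 1,000 prevalence
after_first = update(prior)          # posterior after one positive test
after_second = update(after_first)   # the first posterior becomes the new prior

# Chance that a person WITHOUT the disease tests positive twice
# (assumes independent test errors)
p_two_false_positives = 0.01 * 0.01

print(f"after one positive test:  {100 * after_first:.2f}%")
print(f"after two positive tests: {100 * after_second:.2f}%")
print(f"two false positives in a row: {p_two_false_positives:.4f}")
```

Under that independence assumption, a second positive result moves the probability of having the disease from about 8% to roughly 89%.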

Bayes' Theorem: Concepts and Code

Paul Apivat — Sat, 28 Nov 2020 03:51:54 +0000

Overview

This post is a continuation of my coverage of Data Science from Scratch by Joel Grus.

It picks up from the previous post, so be sure to check that out for proper context.

Building on our understanding of conditional probability we'll get into Bayes' Theorem. We'll spend some time understanding the concept before we implement an example in code.

Bayes Theorem

Previously, we established an understanding of conditional probability by building up from marginal and joint probabilities. We explored the conditional probabilities of two outcomes:

Outcome 1: What is the probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G)?

The probability for outcome one is roughly 50% or (1/2).

Outcome 2: What is the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L)?

The probability for outcome two is roughly 33% or (1/3).

Bayes' Theorem is simply an alternate way of calculating conditional probability.

Previously, we used the joint probability to calculate the conditional probability.

Outcome 1

Here's the conditional probability for outcome 1, using a joint probability:

P(G) = 'Probability that first child is a girl' (1/2)
P(B) = 'Probability that both children are girls' (1/4)
P(B|G) = P(B,G) / P(G)
P(B|G) = (1/4) / (1/2) = 1/2 or roughly 50%

Technically, we can't compute the joint probability as the product P(B) * P(G), because the two events are not independent.

To clarify, the gender of the older child and the gender of the younger child are independent of each other, but the events 'both children are girls' (B) and 'the older child is a girl' (G) are not independent; hence we express their relationship as a conditional probability.

So the joint probability P(B,G) is just the probability of event B, P(B), because whenever both children are girls the older child is necessarily a girl.

Here's an alternate way to calculate the conditional probability (without joint probability):

P(B|G) = P(G|B) * P(B) / P(G) This is Bayes Theorem
P(B|G) = 1 * (1/4) / (1/2)
P(B|G) = (1/4) * (2/1)
P(B|G) = 1/2 = 50%

note: P(G|B), 'the probability that the first child is a girl, given that both children are girls', is a certainty (1.0).

The reverse conditional probability can also be calculated without the joint probability:

What is the probability of the older child being a girl, given that both children are girls?

P(G|B) = P(B|G) * P(G) / P(B) This is Bayes Theorem (reverse case)
P(G|B) = (1/2) * (1/2) / (1/4)
P(G|B) = (1/4) / (1/4)
P(G|B) = 1 = 100%

This is consistent with what we already derived above, namely that P(G|B) is a certainty (probability = 1.0), that the older child is a girl, given that both children are girls.

We can point out two additional observations / rules:

While joint probabilities are symmetrical: P(B,G) == P(G,B),
Conditional probabilities are not symmetrical: P(B|G) != P(G|B)
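Both observations can be verified by enumerating the four equally likely families. This is a small sketch of my own (the helper names prob, both_girls and older_girl are illustrative), using exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Sample space as (older, younger): four equally likely outcomes
families = list(product(["girl", "boy"], repeat=2))

def prob(event):
    """Share of the equally likely outcomes in which the event holds."""
    return Fraction(sum(1 for f in families if event(f)), len(families))

def both_girls(f):   # event B
    return f == ("girl", "girl")

def older_girl(f):   # event G
    return f[0] == "girl"

# Joint probabilities: the order of the events does not matter
p_b_and_g = prob(lambda f: both_girls(f) and older_girl(f))
p_g_and_b = prob(lambda f: older_girl(f) and both_girls(f))

# Conditional probabilities: the order does matter
p_b_given_g = p_b_and_g / prob(older_girl)
p_g_given_b = p_g_and_b / prob(both_girls)

print(p_b_and_g == p_g_and_b)    # True
print(p_b_given_g, p_g_given_b)  # 1/2 1
```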

Bayes' Theorem: Alternative Expression

Bayes Theorem is a way of calculating conditional probability without the joint probability, summarized here:

P(B|G) = P(G|B) * P(B) / P(G) This is Bayes Theorem
P(G|B) = P(B|G) * P(G) / P(B) This is Bayes Theorem (reverse case)

You'll note that P(G) is the denominator in the former, and P(B) is the denominator in the latter.

What if, for some reason, we don't have access to the denominator?

We could derive both P(G) and P(B) in another way using the NOT operator:

P(G) = P(G,B) + P(G,not B) = P(G|B) * P(B) + P(G|not B) * P(not B)
P(B) = P(B,G) + P(B,not G) = P(B|G) * P(G) + P(B|not G) * P(not G)
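As a quick check, both expansions can be evaluated with exact fractions. The values for P(G|not B) = 1/3 and P(B|not G) = 0 come from the two-child example: of the three families that are not girl-girl, only one has an older girl, and if the older child is not a girl, both children cannot be girls. The variable names are my own.

```python
from fractions import Fraction

# P(G) = P(G|B) * P(B) + P(G|not B) * P(not B)
p_b = Fraction(1, 4)
p_g_given_b = Fraction(1)
p_g_given_not_b = Fraction(1, 3)
p_g_expanded = p_g_given_b * p_b + p_g_given_not_b * (1 - p_b)

# P(B) = P(B|G) * P(G) + P(B|not G) * P(not G)
p_g = Fraction(1, 2)
p_b_given_g = Fraction(1, 2)
p_b_given_not_g = Fraction(0)
p_b_expanded = p_b_given_g * p_g + p_b_given_not_g * (1 - p_g)

print("P(G) =", p_g_expanded)  # 1/2
print("P(B) =", p_b_expanded)  # 1/4
```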

Therefore, the alternative expression of Bayes Theorem for the probability of both children being girls, given that the first child is a girl ( P(B|G) ) is:

P(B|G) = P(G|B) * P(B) / ( P(G|B) * P(B) + P(G|not B) * P(not B) )
P(B|G) = (1 * 1/4) / (1 * 1/4 + 1/3 * 3/4)
P(B|G) = (1/4) / (1/4 + 3/12)
P(B|G) = (1/4) / (2/4) = 1/4 * 4/2
P(B|G) = 1/2 or roughly 50%

We can check the result in code:

def bayes_theorem(p_b, p_g_given_b, p_g_given_not_b):
   # calculate P(not B)
   not_b = 1 - p_b
   # calculate P(G)
   p_g = p_g_given_b * p_b + p_g_given_not_b * not_b
   # calculate P(B|G)
   p_b_given_g = (p_g_given_b * p_b) / p_g
   return p_b_given_g

#P(B)
p_b = 1/4

# P(G|B)
p_g_given_b = 1

# P(G|notB)
p_g_given_not_b = 1/3

# calculate P(B|G)
result = bayes_theorem(p_b, p_g_given_b, p_g_given_not_b)

# print result
print('P(B|G) = %.2f%%' % (result * 100))

The probability that the first child is a girl, given that both children are girls ( P(G|B) ), is:

P(G|B) = P(B|G) * P(G) / ( P(B|G) * P(G) + P(B|not G) * P(not G) )
P(G|B) = (1/2 * 1/2) / ((1/2 * 1/2) + (0 * 1/2))
P(G|B) = (1/4) / (1/4)
P(G|B) = 1

Let's unpack Outcome 2.

Outcome 2

Outcome 2: What is the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L)?

The probability for outcome two is roughly 33% or (1/3).

We'll go through the same process as above.

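We could use joint probability to calculate the conditional probability. As with the previous outcome, the joint probability P(B,L) is just the probability of event B, P(B), because whenever both children are girls, at least one of them is a girl.

P(B|L) = P(B,L) / P(L) = (1/4) / (3/4) = 1/3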

Or, we could use Bayes' Theorem to figure out the conditional probability without joint probability:

P(B|L) = P(L|B) * P(B) / P(L)
P(B|L) = (1 * 1/4) / (3/4)
P(B|L) = 1/3

And, if we don't have P(L), we can calculate it indirectly, using the same expansion with the NOT operator as before:

P(L) = P(L|B) * P(B) + P(L|not B) * P(not B)
P(L) = 1 * (1/4) + (2/3) * (3/4)
P(L) = (1/4) + (2/4)
P(L) = 3/4

Then, we can use P(L) in the way Bayes' Theorem is commonly expressed, when we don't have the denominator:

P(B|L) = P(L|B) * P(B) / ( P(L|B) * P(B) + P(L|not B) * P(not B) )
P(B|L) = 1 * (1/4) / (3/4)
P(B|L) = 1/3

Now that we've gone through the calculation for two conditional probabilities, P(B|G) and P(B|L), using Bayes Theorem, and implemented code for one of the scenarios, let's take a step back and assess what this means.

Bayesian Terminology

I think it's useful to understand that probability in general shines when we want to describe uncertainty, and that Bayes' Theorem allows us to quantify how much the data we observe should change our beliefs.

We have two posteriors, P(B|G) and P(B|L), both with equal priors and likelihood, but with different evidence.

Said differently, we want to know the 'probability that both children are girls', given different conditions.

In the first case, our condition is 'the first child is a girl' and in the second case, our condition is 'at least one of the children is a girl'. The question is: which condition increases the probability that both children are girls more?

Bayes' Theorem allows us to update our belief about the probability in these two cases, as we incorporate varied data into our framework.

What the calculations tell us is that the evidence that 'the older child is a girl' increases the probability that both children are girls (from the prior of 1/4 to 1/2) more than the evidence that 'at least one child is a girl' does (from 1/4 to 1/3).

And our beliefs should be updated accordingly.

At the end of the day, understanding conditional probability (and Bayes Theorem) comes down to counting. For our hypothetical scenarios, we only need one hand:

When we look at the probability table for outcome one, P(B|G), we can see how the posterior probability comes out to 1/2:

When we look at the probability table for outcome two, P(B|L), we can see how the posterior probability comes out to 1/3:
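The counting itself can be reproduced in a few lines. This is a sketch of my own, not code from the book: it lists the four equally likely families, keeps only those consistent with each condition, and counts how many of the remaining ones are girl-girl.

```python
from itertools import product

# All four equally likely families as (older, younger)
families = list(product(["girl", "boy"], repeat=2))

def both_girls(f):          # event B
    return f == ("girl", "girl")

def older_girl(f):          # event G
    return f[0] == "girl"

def at_least_one_girl(f):   # event L
    return "girl" in f

conditions = {
    "G (older child is a girl)": older_girl,
    "L (at least one child is a girl)": at_least_one_girl,
}

tallies = {}
for name, condition in conditions.items():
    remaining = [f for f in families if condition(f)]  # rows left in the table
    hits = [f for f in remaining if both_girls(f)]     # rows that are girl-girl
    tallies[name] = (len(hits), len(remaining))
    print(f"given {name}: {len(hits)} of {len(remaining)} remaining families are girl-girl")
```

Conditioning on G leaves two families, one of which is girl-girl (1/2); conditioning on L leaves three, one of which is girl-girl (1/3).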

This is part of an ongoing series documenting my progress through Data Science from Scratch by Joel Grus:
